[3.14] gh-138860: Lazy import rlcompleter in pdb to avoid deadlock in subprocess (GH-139185) #139280
…subprocess (pythonGH-139185)
(cherry picked from commit c8624cd)
Co-authored-by: Tian Gao <gaogaotiantian@hotmail.com>
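
One way to see whether a given interpreter still imports `rlcompleter` eagerly from `pdb` is to diff `sys.modules` in a fresh process. The following is a minimal sketch, not part of this PR; `PROBE` and `modules_loaded_by` are hypothetical names, and the result depends on the Python version and platform.

```python
import json
import subprocess
import sys

# Code run in a fresh interpreter: record sys.modules, import the target
# module (named in sys.argv[1]), and print the newly loaded module names.
PROBE = """
import importlib, json, sys
before = set(sys.modules)
importlib.import_module(sys.argv[1])
print(json.dumps(sorted(set(sys.modules) - before)))
"""


def modules_loaded_by(name: str, timeout: float = 30.0) -> set[str]:
    """Return the modules that importing `name` adds to sys.modules.

    Hypothetical helper for illustration only.
    """
    result = subprocess.run(
        [sys.executable, "-c", PROBE, name],
        stdin=subprocess.DEVNULL,  # the child never needs terminal input
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return set(json.loads(result.stdout))


if __name__ == "__main__":
    loaded = modules_loaded_by("pdb")
    for mod in ("rlcompleter", "readline"):
        state = "imported eagerly" if mod in loaded else "not imported"
        print(f"{mod}: {state}")
```

On an interpreter that includes the lazy import, `rlcompleter` should no longer appear in the set returned for `pdb`.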

We need to wait for 3.14.0 for this PR because it's not critical for rc.

The original commit to

Just to add that even though the refleak is triggered by this PR, the changes in the PR are valid. If we fixed the issue for

Updated with a backport of the fix-up, GH-139305

🤖 New build scheduled with the buildbot fleet by @encukou for commit 6c10f12 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F139280%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

…171818)

Fixes #159645

Makes `torch/distributed/__init__.py` only import `pdb` when needed, because we should avoid debugging-specific dependencies in production code.

In Python 3.13.1 through 3.13.7, this also avoids the following chain of imports:

`torch` -> `torch.distributed` -> `pdb` -> `rlcompleter` -> `readline`

Importing `readline`, in turn, attempts to access stdin. This deadlocks when run from a subprocess launched with `process_group=0` or `preexec_fn=setpgrp`, because such a subprocess doesn't have access to stdin.

Python 3.13.8 [fixed the `pdb` -> `rlcompleter` -> `readline` dependency](python/cpython#139280), but it's still good to import `pdb` only when necessary.

## Testing

(All tests below on Mac.)

### Test script

`deadlock_minimal.py`:

```
import sys
import subprocess

if __name__ == "__main__":
    # Code run in the child: import torch, then fail if pdb came along with it.
    code = """
print('importing torch...')
import sys
import torch
print('imported torch.')
if "pdb" in sys.modules:
    print("ERROR: pdb imported")
    exit(1)
"""
    # Put the child in its own process group so it has no access to stdin.
    kwargs = dict(process_group=0)
    proc = subprocess.Popen([sys.executable, "-c", code], **kwargs)
    try:
        proc.communicate(timeout=20)
        if proc.returncode == 0:
            print("PASS")
        else:
            print("FAIL")
    except subprocess.TimeoutExpired:
        print("FAIL: Process deadlocked after 20 seconds")
        proc.kill()
```

### Failure repro: python 3.13.7, old pytorch

Deadlocks:

```
% conda create -n "pytorch-pdb-3.13.7" python=3.13.7 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.7
% python deadlock_minimal.py
importing torch...
FAIL: Process deadlocked after 10 seconds
```

### Failure repro: python 3.13.8, old pytorch

Does not deadlock due to underlying python fix, but still imports pdb:

```
% conda create -n "pytorch-pdb-3.13.8" python=3.13.8 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.8
% python deadlock_minimal.py
imported torch.
ERROR: pdb imported
FAIL
```

### Fix confirmation: python 3.13.7 and 3.13.11, new pytorch

No longer deadlocks, does not import pdb.

```
% conda create -n "pytorch-3.13.7" python=3.13.7
% conda activate pytorch-3.13.7
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
```

```
% conda create -n "pytorch-3.13.11" python=3.13.11
% conda activate pytorch-3.13.11
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
```

### Test that `torch.distributed.breakpoint()` still works

`torch_breakpoint.py`:

```
import sys
import torch.distributed as dist

print(f"is available: {dist.is_available()}")
dist.init_process_group()
dist.breakpoint(rank=0)
print(f"pdb imported after breakpoint: {'pdb' in sys.modules}")
```

Then built with distributed on Mac and did a basic test:

```
% USE_DISTRIBUTED=1 python setup.py build --cmake
% RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=49999 python torch_breakpoint.py
is available: True
# snipped some errors due to not actually setting up a full scenario
> /Users/kelu/kelu-wandb/pytorch/torch/distributed/__init__.py(121)breakpoint()
-> pdb.set_trace()
(Pdb)
```

Pull Request resolved: #171818
Approved by: https://github.com/ezyang
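
The deferred-import approach described in this commit message can be sketched as follows. This is a minimal illustration, not the code from `torch/distributed/__init__.py`: `debug_breakpoint`, `CHILD_CODE`, and `pdb_is_lazy` are hypothetical names, and the check runs in a fresh interpreter so that modules already loaded in the parent do not affect the result.

```python
import subprocess
import sys

# Code run in a fresh interpreter. `debug_breakpoint` is a hypothetical
# stand-in for a debugging helper that needs pdb.
CHILD_CODE = '''
import sys

def debug_breakpoint():
    # Deferred import: pdb (and anything it pulls in) is loaded only when
    # the helper is called, not when the enclosing module is imported.
    import pdb
    # A real helper would call pdb.set_trace() here; returning the module
    # keeps this sketch non-interactive.
    return pdb

# Defining the helper must not import pdb.
assert "pdb" not in sys.modules
# Calling it does.
debug_breakpoint()
assert "pdb" in sys.modules
print("ok")
'''


def pdb_is_lazy(timeout: float = 30.0) -> bool:
    """Run CHILD_CODE in a new interpreter and report whether it passed."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.DEVNULL,  # the child never reads from stdin
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0 and proc.stdout.strip() == "ok"


if __name__ == "__main__":
    print("PASS" if pdb_is_lazy() else "FAIL")
```

The trade-off is that the import cost, and any import-time side effects, move to the first call of the helper instead of module import.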