Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
bpo-38323: Fix rare MultiLoopChildWatcher hangs.
This changes asyncio.MultiLoopChildWatcher's attach_loop() method
to call loop.add_signal_handler() instead of calling only
signal.signal().  This should eliminate some rare hangs since
loop.add_signal_handler() calls signal.set_wakeup_fd().  Without
this, the main thread sometimes wasn't getting awakened if a
signal occurred during an await.
  • Loading branch information
cjerdonek committed May 17, 2020
commit 5d1013256c133b61587b6a80a0f9d509ac11d123
4 changes: 3 additions & 1 deletion Doc/library/asyncio-eventloop.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1036,7 +1036,9 @@ Unix signals
The callback will be invoked by *loop*, along with other queued callbacks
and runnable coroutines of that event loop. Unlike signal handlers
registered using :func:`signal.signal`, a callback registered with this
function is allowed to interact with the event loop.
method is allowed to interact with the event loop. Using
:func:`signal.signal` instead of this method can also cause the event
loop not to awaken in rare situations when a signal is received.

Raise :exc:`ValueError` if the signal number is invalid or uncatchable.
Raise :exc:`RuntimeError` if there is a problem setting up the handler.
Expand Down
13 changes: 12 additions & 1 deletion Doc/library/asyncio-policy.rst
Original file line number Diff line number Diff line change
Expand Up @@ -219,7 +219,7 @@ implementation used by the asyncio event loop:

This implementation registers a :py:data:`SIGCHLD` signal handler on
instantiation. That can break third-party code that installs a custom handler for
`SIGCHLD`. signal).
the :py:data:`SIGCHLD` signal).

The watcher avoids disrupting other code spawning processes
by polling every process explicitly on a :py:data:`SIGCHLD` signal.
Expand All @@ -233,6 +233,17 @@ implementation used by the asyncio event loop:

.. versionadded:: 3.8

.. method:: attach_loop(loop)

Registers the :py:data:`SIGCHLD` signal handler. Like
:meth:`loop.add_signal_handler`, this method can only be invoked
from the main thread.

.. versionchanged:: 3.9

The method now calls :func:`signal.set_wakeup_fd` as part of the
handler initialization.

.. class:: SafeChildWatcher

This implementation uses active event loop from the main thread to handle
Expand Down
34 changes: 25 additions & 9 deletions Lib/asyncio/unix_events.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,8 @@ def _process_self_data(self, data):
def add_signal_handler(self, sig, callback, *args):
"""Add a handler for a signal. UNIX only.

This method can only be called from the main thread.

Raise ValueError if the signal number is invalid or uncatchable.
Raise RuntimeError if there is a problem setting up the handler.
"""
Expand Down Expand Up @@ -1232,10 +1234,15 @@ def close(self):
self._callbacks.clear()
if self._saved_sighandler is not None:
handler = signal.getsignal(signal.SIGCHLD)
if handler != self._sig_chld:
# add_signal_handler() sets the handler to _sighandler_noop.
if handler != _sighandler_noop:
logger.warning("SIGCHLD handler was changed by outside code")
else:
loop = self._loop
# This clears the wakeup file descriptor if necessary.
loop.remove_signal_handler(signal.SIGCHLD)
signal.signal(signal.SIGCHLD, self._saved_sighandler)

self._saved_sighandler = None

def __enter__(self):
Expand Down Expand Up @@ -1263,15 +1270,24 @@ def attach_loop(self, loop):
# The reason to do it here is that attach_loop() is called from
# unix policy only for the main thread.
# Main thread is required for subscription on SIGCHLD signal
if loop is None or self._saved_sighandler is not None:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aeros I would think through what should happen if either loop is None or self._saved_sighandler is not None. For example, the previous implementation didn't use the loop argument, so there is a risk that code will stop working for people who were passing loop=None.

return

self._loop = loop
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of why I was saying to remove this line is to honor the code comment 7 lines above:

# Don't save the loop but initialize itself if called first time

self._saved_sighandler = signal.getsignal(signal.SIGCHLD)
if self._saved_sighandler is None:
self._saved_sighandler = signal.signal(signal.SIGCHLD, self._sig_chld)
if self._saved_sighandler is None:
logger.warning("Previous SIGCHLD handler was set by non-Python code, "
"restore to default handler on watcher close.")
self._saved_sighandler = signal.SIG_DFL
logger.warning("Previous SIGCHLD handler was set by non-Python code, "
"restore to default handler on watcher close.")
self._saved_sighandler = signal.SIG_DFL

# Set SA_RESTART to limit EINTR occurrences.
signal.siginterrupt(signal.SIGCHLD, False)
if self._callbacks:
warnings.warn(
'A loop is being detached '
'from a child watcher with pending handlers',
RuntimeWarning)

# This also sets up the wakeup file descriptor.
loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

def _do_waitpid_all(self):
for pid in list(self._callbacks):
Expand Down Expand Up @@ -1314,7 +1330,7 @@ def _do_waitpid(self, expected_pid):
expected_pid, returncode)
loop.call_soon_threadsafe(callback, pid, returncode, *args)

def _sig_chld(self, signum, frame):
def _sig_chld(self, *args):
try:
self._do_waitpid_all()
except (SystemExit, KeyboardInterrupt):
Expand Down
3 changes: 2 additions & 1 deletion Lib/test/test_asyncio/test_subprocess.py
Original file line number Diff line number Diff line change
Expand Up @@ -672,12 +672,13 @@ def setUp(self):
policy.set_child_watcher(watcher)

def tearDown(self):
super().tearDown()
policy = asyncio.get_event_loop_policy()
watcher = policy.get_child_watcher()
policy.set_child_watcher(None)
watcher.attach_loop(None)
watcher.close()
# Since setUp() does super().setUp() first, do tearDown() last.
super().tearDown()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change causes the CI to fail.

Copy link
Copy Markdown
Contributor

@aeros aeros Oct 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling super().tearDown() last here should not be causing dangling threads, so I think this could be pointing to a more fundamental issue.

In a previous PR, I added an internal _join_threads() method to ThreadedChildWatcher, which specifically cleans up all non-daemon threads. This was there to ensure all threads are joined when the watcher is closed. However, in add_child_handler() the _do_waitpid() threads that are created are set as daemon, meaning they aren't joined in _join_threads() since it checks if they're not daemon.

Based on the error message in Travis, it definitely seems to be those waitpid threads created in ThreadedChildWatcher causing the hangs:

0:11:41 load avg: 3.89 [375/424/1] test_asyncio failed (env changed) (2 min 3 sec) -- running: test_multiprocessing_spawn (1 min 2 sec)
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>

As a result, I think a viable fix might be just removing the and not thread.daemon check in ThreadedChildWatcher._join_threads(), to have it also clean up the waitpid threads when the watcher is closed. Based on my local testing, this seems to resolve the issue, but test-with-buildbots should probably be used to make sure it doesn't cause other side effects on other platforms.

A viable alternative might be changing add_child_handler() to instead spawn non-daemon threads. However, I suspect changing the internal _join_threads() method would be less likely to be disruptive to user code and causing side effects because of reliance on those threads being daemon.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall exactly why I made this particular tearDown() change for the purposes of this PR. (There was probably some failure I was trying to see if I could get by quickly.) In any case, it should be incidental to the MultiLoopChildWatcher hang. So if any changes are needed for the asyncio tests related to tearDown(), it should be done in a separate PR IMO independent of the MultiLoopChildWatcher hang issue. In other words, this eventual PR shouldn't contain this change.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aeros Just to clarify what I said above, if you feel your ThreadedChildWatcher change would help the test harness in general, I would recommend making that change in a separate PR to reduce the number of moving parts in the changes needed to address the MultiLoopChildWatcher hang.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I'll look into opening a separate PR to address it. However, if this PR is continued upon (either by yourself or me), I'd suggest reverting the super().tearDown() location change made.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I guess there are a few steps to be done. First, you can do your ThreadedChildWatcher change in a separate PR, and super().tearDown() can be reverted in this PR. If however, after resuming work on this PR, it looks like a change to the teardown logic might be needed for some reason, I would still suggest that the tearDown() change also be done in a separate PR (since it affects many of the watcher tests and not just MultiLoopChildWatcher, and so would be good to isolate that change).


class SubprocessThreadedWatcherTests(SubprocessWatcherMixin,
test_utils.TestCase):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Fix rare cases with ``MultiLoopChildWatcher`` where the event loop can
fail to awaken in response to a :py:data:`SIGCHLD` signal.