Skip to content

Comments

Fix AsyncioConnection race conditions causing EBADF errors (#614)#697

Open
dkropachev wants to merge 1 commit intomasterfrom
fix/asyncio-close-race-614
Open

Fix AsyncioConnection race conditions causing EBADF errors (#614)#697
dkropachev wants to merge 1 commit intomasterfrom
fix/asyncio-close-race-614

Conversation

@dkropachev
Copy link
Collaborator

@dkropachev dkropachev commented Feb 12, 2026

Summary

  • Fix close() to wait for _close() completion from non-event-loop threads, eliminating the window where is_closed=True but the socket fd is still open (root cause of EBADF)
  • Set last_error on server EOF in handle_read() so factory() detects dead connections instead of returning them
  • Add is_closed/is_defunct guard in push() to reject writes on closed connections
  • Treat BrokenPipeError/ConnectionResetError in handle_write() as clean peer disconnections instead of defuncting, and skip defunct() in both I/O handlers if the connection is already shutting down

Test plan

  • New unit tests in tests/unit/io/test_asyncio_race_614.py cover all four race conditions using real socket pairs
  • Full unit test suite passes with no regressions
  • Integration tests with TLS and node restart scenarios

@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch 3 times, most recently from 46524ac to 7b815c8 Compare February 16, 2026 13:09
@dkropachev dkropachev self-assigned this Feb 16, 2026
@dkropachev dkropachev marked this pull request as ready for review February 16, 2026 13:26
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch from 2b9d555 to 85d08e1 Compare February 17, 2026 04:52
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch from 85d08e1 to 776d2d3 Compare February 17, 2026 12:55
Fix race conditions in AsyncioConnection that cause "[Errno 9] Bad
file descriptor" errors during node restarts, especially with TLS:

1. close() now waits for _close() to complete when called from outside
   the event loop thread, eliminating the window where is_closed=True
   but the socket fd is still open.

2. handle_read() sets last_error on server EOF so factory() detects dead
   connections instead of returning them to callers.

3. handle_write() treats peer disconnections as clean close instead of
   defuncting, and both I/O handlers skip defunct() if the connection
   is already shutting down.

Peer-disconnect detection is extracted into _is_peer_disconnect() helper
covering platform-specific behaviors:

- Windows: ProactorEventLoop raises plain OSError with winerror 10054
  (WSAECONNRESET) or 10053 (WSAECONNABORTED) instead of
  ConnectionResetError. Detection uses ConnectionError base class plus
  winerror check.

- macOS: Raises OSError(57) ENOTCONN when writing to a
  peer-disconnected socket, which is not a ConnectionError subclass.
  Detection uses errno-based checks for ENOTCONN, ESHUTDOWN,
  ECONNRESET, and ECONNABORTED.

- Windows _close(): ProactorEventLoop does not support
  remove_reader/remove_writer (raises NotImplementedError). These calls
  are wrapped so the socket is always closed regardless, and try/finally
  ensures connected_event is always set even if cleanup fails.
@dkropachev dkropachev force-pushed the fix/asyncio-close-race-614 branch from d73e85d to 9915064 Compare February 17, 2026 20:40
@dkropachev dkropachev requested a review from Lorak-mmk February 17, 2026 20:40
@fruch
Copy link

fruch commented Feb 22, 2026

@dkropachev would recommand testing SSL with this fix, we know SSL was failing/stuck with asyncio connection, hence all the fuss with it being a default or not

@Lorak-mmk
Copy link

we know SSL was failing/stuck with asyncio connection

It was not failing because of race conditions, but because setting up SSL socket has a totally different API in asyncio, and it was just never implemented here.

Comment on lines +205 to +209
try:
self._socket.close()
except OSError:
# Ignore if socket is already closed
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the socket be already closed? self._socket.close() is only called in this one place. _close is only called from close, which checks and sets is_closed, so _close has no way of running multiple times.

Comment on lines +199 to +204
except Exception:
# It is not critical if it fails, driver can keep working,
# but it should not be happening, so logged as error
log.error("Unexpected error removing reader for %s",
self.endpoint, exc_info=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In previous version you were catching OsError, and explained that it is thrown when the socket is closed. Again, I don't see how this is possible, because the only place where the socket is being closed is after this code.

Comment on lines +174 to +175
fd = self._socket.fileno()
if fd >= 0:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? When can fd be 0?

Comment on lines +16 to +21
# Errno values that indicate the remote peer has disconnected.
_PEER_DISCONNECT_ERRNOS = frozenset((
errno.ENOTCONN, errno.ESHUTDOWN,
errno.ECONNRESET, errno.ECONNABORTED,
errno.EBADF,
))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be a ConnectionError in all those cases?

Comment on lines +224 to +226
if self.is_closed or self.is_defunct:
raise ConnectionShutdown(
"Connection to %s is already closed" % self.endpoint)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, push is only called from Connection::send_msg. send_msg already starts with:

        if self.is_defunct:
            raise ConnectionShutdown("Connection to %s is defunct" % self.endpoint)
        elif self.is_closed:
            raise ConnectionShutdown("Connection to %s is closed" % self.endpoint)
        elif not self._socket_writable:
            raise ConnectionBusy("Connection %s is overloaded" % self.endpoint)

So what does this check give us?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants