fix: MCP tools intermittently unavailable after hibernation (#928) #996
Open
threepointone wants to merge 1 commit into main
Fix a race where MCP tools could be unavailable in onChatMessage after Durable Object hibernation by tracking background connection work and providing a way to await it.
- Add MCPClientManager._pendingConnections, _trackConnection(), and waitForConnections({ timeout? }) to await all in-flight connection/discovery promises (uses allSettled and an optional timeout). Pending entries are removed when they settle or when the connection is closed.
- Agent now tracks the background establishConnection promise started by the OAuth callback via this.mcp._trackConnection, so callers can wait for that connection as well.
- Add waitForMcpConnections config to AIChatAgent (false by default). When enabled (true or { timeout }), AIChatAgent waits for mcp.waitForConnections() before calling onChatMessage.
- Add tests and E2E coverage (new test agents, wait-connections tests, and wrangler test bindings) to validate behavior and timeouts.
This change preserves prior non-blocking behavior by default while offering opt-in safety for callers that require MCP tools to be ready.
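The track-and-wait behavior described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: ConnectionTracker, track, untrack, size, and demo are hypothetical names, and the only behavior taken from the description is that pending promises are kept in a map, removed when they settle, and awaited with Promise.allSettled plus an optional timeout.

```typescript
// Hypothetical stand-in for the tracking logic; not the actual MCPClientManager source.
class ConnectionTracker {
  private pending = new Map<string, Promise<void>>();

  // Register background work. The entry removes itself once the work settles.
  track(serverId: string, work: Promise<unknown>): void {
    const tracked: Promise<void> = work
      .then(() => undefined, () => undefined) // swallow failures: waiters only need "done"
      .finally(() => {
        // Delete only if a newer promise has not replaced this entry.
        if (this.pending.get(serverId) === tracked) this.pending.delete(serverId);
      });
    this.pending.set(serverId, tracked);
  }

  // Forget a server, for example when its connection is closed.
  untrack(serverId: string): void {
    this.pending.delete(serverId);
  }

  get size(): number {
    return this.pending.size;
  }

  // Resolves when all tracked work has settled or the timeout elapses. Never rejects.
  async waitForConnections(options?: { timeout?: number }): Promise<void> {
    if (this.pending.size === 0) return;
    const settled = Promise.allSettled(Array.from(this.pending.values())).then(() => undefined);
    const timeout = options?.timeout;
    if (timeout === undefined) {
      await settled;
      return;
    }
    let timerId: ReturnType<typeof setTimeout> | undefined;
    const timer = new Promise<void>((resolve) => {
      timerId = setTimeout(resolve, timeout);
    });
    try {
      await Promise.race([settled, timer]);
    } finally {
      clearTimeout(timerId); // avoid leaving a timer behind when connections win the race
    }
  }
}

// Usage: one connection succeeds, one fails, and the wait still resolves.
async function demo(): Promise<string[]> {
  const tracker = new ConnectionTracker();
  const log: string[] = [];
  const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
  tracker.track("fast", delay(10).then(() => { log.push("fast ready"); }));
  tracker.track("broken", delay(5).then(() => { throw new Error("connect failed"); }));
  await tracker.waitForConnections({ timeout: 1000 });
  log.push(`pending after wait: ${tracker.size}`);
  return log;
}
```

Declaring timerId as possibly undefined and clearing it in a finally block is one way to avoid a non-null assertion; whether that fits the real implementation is a judgment call for the author.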
🦋 Changeset detected. Latest commit: fcd7544. The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
Summary
Fixes #928: MCP tools are intermittently unavailable in onChatMessage after Durable Object hibernation.
The problem
When an AIChatAgent wakes from hibernation, onStart() calls restoreConnectionsFromStorage(), which fires MCP server connections in the background (deliberately not awaited, to avoid blocking the DO). If a WebSocket message arrives before those connections finish, getAITools() returns an incomplete or empty tool set because the connections are still in the "connecting" state.
The user sees this as:
This is a race condition: sometimes connections finish before onChatMessage runs, sometimes they do not.
API
MCPClientManager.waitForConnections(options?)
Package: agents
New method on MCPClientManager that awaits all in-flight connection and discovery operations.
- Uses Promise.allSettled internally, so it never rejects, even if individual connections fail
- Applies the optional timeout via Promise.race with clearTimeout cleanup, so no timers leak
AIChatAgent.waitForMcpConnections
Package: @cloudflare/ai-chat
New opt-in property on AIChatAgent that automatically waits before processing chat messages.
For lower-level control, call this.mcp.waitForConnections() directly inside onChatMessage instead.
Design decisions
1. Opt-in, not default
The wait is off by default (waitForMcpConnections = false). This preserves existing behavior: agents without MCP servers, and agents that manage timing themselves, are unaffected. Making it default-on would add latency to every chat message for all users, even those without MCP servers.
2. Track-and-wait pattern instead of blocking restore
An alternative was to await the connections directly in restoreConnectionsFromStorage. We rejected this because:
- It would block onStart(), delaying the entire DO wake-up
- The caller can instead call waitForConnections() when it needs tools to be ready
The implementation uses a _pendingConnections Map that tracks promises with automatic cleanup:
- _trackConnection(serverId, promise): wraps the promise with .finally() that removes it from the map when settled
- closeConnection() and closeAllConnections() clean up the map so waitForConnections() does not block on closed servers
3. Wait only on chat messages, not all WebSocket messages
The waitForMcpConnections wait runs only for CF_AGENT_USE_CHAT_REQUEST messages, not for state updates, RPC calls, or other WebSocket traffic. This avoids unnecessary latency on non-chat interactions after reconnect.
4. OAuth servers are excluded from tracking
Servers with auth_url set (OAuth flow in progress) are placed in the "authenticating" state and are not tracked as pending connections. They require user interaction to complete, so waiting on them would block indefinitely. The handleMcpOAuthCallback path tracks its own establishConnection call separately.
Changes by file
Core fix (packages/agents)
- src/mcp/client.ts: _pendingConnections map, _trackConnection(), waitForConnections(). Restore now tracks via _trackConnection. Cleanup in closeConnection/closeAllConnections.
- src/index.ts: handleMcpOAuthCallback now tracks its establishConnection call via _trackConnection.
AIChatAgent integration (packages/ai-chat)
- src/index.ts: waitForMcpConnections property. Wait logic placed inside the chat-request branch of onMessage (not on all messages).
Tests
- tests/mcp/client-manager.test.ts
- tests/mcp/wait-connections-e2e.test.ts: … onStart().
- tests/agents/wait-connections.ts: … hibernationRoundTrip() and hibernationRoundTripNoWait() that exercise onStart → restoreConnectionsFromStorage → waitForConnections.
- ai-chat/tests/wait-mcp-connections.test.ts: … true, { timeout }, false variants all process messages correctly.
- ai-chat/tests/worker.ts + wrangler.jsonc: … (WaitMcpTrueAgent, WaitMcpTimeoutAgent, WaitMcpFalseAgent) + DO bindings.
Reviewer notes
- _trackConnection is semi-public (no private keyword) because Agent.handleMcpOAuthCallback in index.ts calls this.mcp._trackConnection() from outside the class. Marked @internal in JSDoc. Open to making this a proper public method if preferred.
- timerId! non-null assertion in waitForConnections: the Promise constructor runs synchronously, so timerId is always assigned before clearTimeout. TypeScript cannot verify this. Could restructure to avoid the assertion if desired.
- The E2E tests call onStart() directly on the stub rather than going through an actual WebSocket disconnect/reconnect. This is because @cloudflare/vitest-pool-workers does not support triggering real hibernation cycles. The tests prove the onStart → _trackConnection → waitForConnections pipeline works, which is the critical integration seam.
- Changesets: agents (patch) and @cloudflare/ai-chat (patch).
How users fix issue #928
Before (intermittent failures):
After (option A — declarative):
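A minimal sketch of what the declarative option might look like, assuming the waitForMcpConnections property behaves as described under API. To keep the example runnable on its own, McpStub and AIChatAgentStub are hypothetical stand-ins for this.mcp and AIChatAgent; real code would extend AIChatAgent from @cloudflare/ai-chat, and handleChatRequest is an invented name for the internal chat-request path.

```typescript
// Hypothetical stand-in for this.mcp: one server that finishes connecting after 20 ms.
class McpStub {
  private tools: Record<string, string> = {};
  private readonly connecting: Promise<void>;

  constructor() {
    this.connecting = new Promise<void>((resolve) => {
      setTimeout(() => {
        this.tools = { search: "search tool" };
        resolve();
      }, 20);
    });
  }

  getAITools(): Record<string, string> {
    return this.tools;
  }

  async waitForConnections(options?: { timeout?: number }): Promise<void> {
    const timeout = options?.timeout;
    if (timeout === undefined) {
      await this.connecting;
      return;
    }
    let timerId: ReturnType<typeof setTimeout> | undefined;
    const timer = new Promise<void>((resolve) => {
      timerId = setTimeout(resolve, timeout);
    });
    try {
      await Promise.race([this.connecting, timer]);
    } finally {
      clearTimeout(timerId);
    }
  }
}

// Hypothetical stand-in for AIChatAgent: honors waitForMcpConnections before onChatMessage.
abstract class AIChatAgentStub {
  mcp = new McpStub();
  waitForMcpConnections: boolean | { timeout: number } = false;

  abstract onChatMessage(): Promise<string[]>;

  // Invented name for the chat-request branch of message handling.
  async handleChatRequest(): Promise<string[]> {
    const cfg = this.waitForMcpConnections;
    if (cfg) {
      await this.mcp.waitForConnections(cfg === true ? undefined : { timeout: cfg.timeout });
    }
    return this.onChatMessage();
  }
}

// The part a user would write: opt in with a single property.
class ChatAgent extends AIChatAgentStub {
  waitForMcpConnections = { timeout: 10000 };

  async onChatMessage(): Promise<string[]> {
    // By the time this runs, the stub connection has had a chance to finish.
    return Object.keys(this.mcp.getAITools());
  }
}
```

With the property left at its default of false, the same onChatMessage would run immediately and could see an empty tool set, which is the race this PR addresses.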
After (option B — imperative):
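A minimal sketch of the imperative option, assuming this.mcp exposes waitForConnections({ timeout }) and getAITools() as described above. The McpLike interface and createFakeMcp helper are hypothetical and exist only so the example runs without the agents package; onChatMessage is written as a free function for the same reason, whereas in a real agent it would be a method that uses this.mcp.

```typescript
// Hypothetical minimal shape of this.mcp, limited to what the sketch uses.
interface McpLike {
  waitForConnections(options?: { timeout?: number }): Promise<void>;
  getAITools(): Record<string, unknown>;
}

// Body of an onChatMessage override, written as a free function so it runs standalone.
async function onChatMessage(mcp: McpLike): Promise<string[]> {
  // Wait up to 5 s for restored MCP connections; on timeout, continue with what is ready.
  await mcp.waitForConnections({ timeout: 5000 });
  const tools = mcp.getAITools();
  // A real agent would pass `tools` to its model call here.
  return Object.keys(tools);
}

// Hypothetical fake manager: a single server that becomes ready after connectMs.
function createFakeMcp(connectMs: number): McpLike {
  let tools: Record<string, unknown> = {};
  const connecting = new Promise<void>((resolve) => {
    setTimeout(() => {
      tools = { search: {} };
      resolve();
    }, connectMs);
  });
  return {
    getAITools: () => tools,
    waitForConnections: async (options) => {
      const timeout = options?.timeout;
      if (timeout === undefined) {
        await connecting;
        return;
      }
      let timerId: ReturnType<typeof setTimeout> | undefined;
      const timer = new Promise<void>((resolve) => {
        timerId = setTimeout(resolve, timeout);
      });
      try {
        await Promise.race([connecting, timer]);
      } finally {
        clearTimeout(timerId);
      }
    },
  };
}
```

Calling the wait inside onChatMessage keeps the decision local to the handler, so an agent can skip it for messages that do not need tools.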