fix: MCP tools intermittently unavailable after hibernation (#928) #996

Open
threepointone wants to merge 1 commit into main from chat-wait-for-mcp

@threepointone
Contributor

Summary

Fixes #928 — MCP tools are intermittently unavailable in onChatMessage after Durable Object hibernation.

The problem

When an AIChatAgent wakes from hibernation, onStart() calls restoreConnectionsFromStorage(), which starts the MCP server connections in the background (deliberately not awaited, to avoid blocking the DO). If a WebSocket message arrives before those connections finish, getAITools() returns an incomplete or empty tool set because the connections are still in the "connecting" state.

The user sees this as:

[getAITools] WARNING: Reading tools from connection aik32tRf in state "connecting". Tools may not be loaded yet.

This is a race condition: sometimes connections finish before onChatMessage runs, sometimes they do not.
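
The race can be reproduced in isolation with a small simulation. This is only a sketch: FakeMcpManager, restore(), getTools(), and the tool names are hypothetical stand-ins, not the real MCPClientManager API, and the 20 ms delay is arbitrary.

```typescript
// Hypothetical stand-in for an MCP client manager (illustrative names only).
class FakeMcpManager {
  private tools: string[] = [];
  state: "connecting" | "ready" = "connecting";

  // Starts a simulated connection and returns its promise.
  restore(delayMs: number): Promise<void> {
    return new Promise<void>((resolve) => {
      setTimeout(() => {
        this.tools = ["search", "fetch"];
        this.state = "ready";
        resolve();
      }, delayMs);
    });
  }

  // Returns whatever tools have been discovered so far.
  getTools(): string[] {
    return [...this.tools];
  }
}

async function demoRace(): Promise<{ early: number; late: number }> {
  const manager = new FakeMcpManager();
  // Fired in the background, as onStart() does with the restore.
  const pending = manager.restore(20);
  // A message that arrives now reads the tools before the connection finishes.
  const early = manager.getTools().length;
  await pending;
  // The same read after the connection has settled.
  const late = manager.getTools().length;
  return { early, late };
}
```

In this simulation the early read sees no tools and the late read sees both, which is the same ordering problem a chat message hits when it arrives right after wake-up.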


API

MCPClientManager.waitForConnections(options?)

Package: agents

New method on MCPClientManager that awaits all in-flight connection and discovery operations.

// Wait indefinitely for all connections to settle
await this.mcp.waitForConnections();

// Wait up to 10 seconds, then proceed regardless
await this.mcp.waitForConnections({ timeout: 10_000 });

  • Uses Promise.allSettled internally, so it never rejects, even if individual connections fail
  • Returns once every pending connection has connected+discovered, failed, or timed out
  • Resolves immediately if there are no pending connections
  • Safe to call concurrently from multiple callers (each snapshots the same pending promises)
  • Timeout uses Promise.race with clearTimeout cleanup — no leaked timers
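
The properties listed above can be sketched as a standalone helper. This is a minimal illustration under stated assumptions, not the code in src/mcp/client.ts: waitForAll and its parameter shape are hypothetical, and only the use of Promise.allSettled, Promise.race, and clearTimeout comes from the description.

```typescript
// Hypothetical helper that mirrors the described behavior of waitForConnections.
async function waitForAll(
  pending: Iterable<Promise<unknown>>,
  options: { timeout?: number } = {}
): Promise<void> {
  // Snapshot the pending promises so later additions do not extend this wait.
  const snapshot = [...pending];
  if (snapshot.length === 0) return; // nothing pending: resolve immediately

  // allSettled never rejects, so one failed connection cannot fail the wait.
  const settled = Promise.allSettled(snapshot).then(() => undefined);

  if (options.timeout === undefined) {
    await settled;
    return;
  }

  let timerId: ReturnType<typeof setTimeout>;
  const timer = new Promise<void>((resolve) => {
    // The executor runs synchronously, so timerId is assigned before use.
    timerId = setTimeout(resolve, options.timeout);
  });

  try {
    await Promise.race([settled, timer]);
  } finally {
    // Clear the timer whichever side wins, so nothing is left scheduled.
    clearTimeout(timerId!);
  }
}
```

Because the helper only reads a snapshot, several callers can await it at the same time without interfering with one another.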

AIChatAgent.waitForMcpConnections

Package: @cloudflare/ai-chat

New opt-in property on AIChatAgent that automatically waits before processing chat messages.

class MyAgent extends AIChatAgent<Env> {
  // Pick exactly one of the following forms.

  // Wait indefinitely for all MCP connections before onChatMessage
  waitForMcpConnections = true;

  // Or: wait up to 10 seconds
  // waitForMcpConnections = { timeout: 10_000 };

  // Default: false (non-blocking, existing behavior preserved)
  // waitForMcpConnections = false;
}

For lower-level control, call this.mcp.waitForConnections() directly inside onChatMessage instead.


Design decisions

1. Opt-in, not default

The wait is off by default (waitForMcpConnections = false). This preserves existing behavior — agents without MCP servers or agents that manage timing themselves are unaffected. Making it default-on would add latency to every chat message for all users, even those without MCP servers.

2. Track-and-wait pattern instead of blocking restore

An alternative was to await the connections directly in restoreConnectionsFromStorage. We rejected this because:

  • It would block onStart(), delaying the entire DO wake-up
  • Multiple callers might need the connections at different times
  • The track-and-wait pattern is more composable — any code path can call waitForConnections() when it needs tools to be ready

The implementation uses a _pendingConnections Map that tracks promises with automatic cleanup:

  • _trackConnection(serverId, promise) — wraps the promise with .finally() that removes it from the map when settled
  • If a server is re-tracked (e.g., reconnect), the old promise's cleanup checks identity before deleting, preventing a newer promise from being orphaned
  • closeConnection() and closeAllConnections() clean up the map so waitForConnections() does not block on closed servers
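
The identity check in the cleanup step is the subtle part, so here is a minimal sketch of the pattern. The class PendingTracker and its method names are hypothetical. Only the idea of a map keyed by server ID, a .finally() cleanup, and the identity comparison are taken from the description above.

```typescript
// Hypothetical tracker illustrating the track-and-wait pattern.
class PendingTracker {
  private pending = new Map<string, Promise<void>>();

  track(serverId: string, work: Promise<unknown>): void {
    const tracked: Promise<void> = work
      // Swallow the outcome: the tracker only cares that the work settled.
      .then(() => undefined, () => undefined)
      .finally(() => {
        // Delete only if this promise is still the current entry, so an
        // older promise settling late cannot remove a newer one.
        if (this.pending.get(serverId) === tracked) {
          this.pending.delete(serverId);
        }
      });
    this.pending.set(serverId, tracked);
  }

  // Called when a server is closed so waiters do not block on it.
  untrack(serverId: string): void {
    this.pending.delete(serverId);
  }

  get size(): number {
    return this.pending.size;
  }

  async waitForAll(): Promise<void> {
    await Promise.allSettled([...this.pending.values()]);
  }
}
```

Without the identity comparison, a reconnect that re-tracks the same server ID could have its entry deleted when the earlier attempt finally settles, and a later wait would return before the newer connection is ready.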

3. Wait only on chat messages, not all WebSocket messages

The waitForMcpConnections wait runs only for CF_AGENT_USE_CHAT_REQUEST messages, not for state updates, RPC calls, or other WebSocket traffic. This avoids unnecessary latency on non-chat interactions after reconnect.
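
A rough sketch of that branching follows. The function handleMessage, its parameters, and the string value assigned to CF_AGENT_USE_CHAT_REQUEST are assumptions made for illustration; the real onMessage handler in packages/ai-chat may be structured differently.

```typescript
type WaitOption = boolean | { timeout: number };
type IncomingMessage = { type: string };

// Assumed wire value, used here only as a placeholder.
const CF_AGENT_USE_CHAT_REQUEST = "cf_agent_use_chat_request";

async function handleMessage(
  message: IncomingMessage,
  waitOption: WaitOption,
  waitForConnections: (options?: { timeout?: number }) => Promise<void>,
  log: string[]
): Promise<void> {
  // State updates, RPC calls, and other traffic skip the wait entirely.
  if (message.type !== CF_AGENT_USE_CHAT_REQUEST) {
    log.push(`handled ${message.type} without waiting`);
    return;
  }

  // Only the chat-request branch waits, and only when opted in.
  if (waitOption) {
    await waitForConnections(
      typeof waitOption === "object" ? { timeout: waitOption.timeout } : undefined
    );
    log.push("waited for MCP connections");
  }

  log.push("onChatMessage");
}
```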

4. OAuth servers are excluded from tracking

Servers with auth_url set (OAuth flow in progress) are placed in "authenticating" state and are not tracked as pending connections. They require user interaction to complete, so waiting on them would block indefinitely. The handleMcpOAuthCallback path tracks its own establishConnection call separately.
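
The exclusion can be pictured as a filter in the restore loop. This sketch assumes a simplified stored-server record and hypothetical connect and track callbacks; only the auth_url field and the "authenticating" and "connecting" state names come from the text above.

```typescript
// Simplified, assumed shape of a stored server record.
type StoredServer = { id: string; auth_url: string | null };
type RestoreState = "authenticating" | "connecting";

function restoreConnections(
  servers: StoredServer[],
  connect: (serverId: string) => Promise<void>,
  track: (serverId: string, work: Promise<void>) => void
): Map<string, RestoreState> {
  const states = new Map<string, RestoreState>();
  for (const server of servers) {
    if (server.auth_url) {
      // Needs user interaction to finish, so it is never tracked as pending.
      states.set(server.id, "authenticating");
      continue;
    }
    states.set(server.id, "connecting");
    // Fire in the background and hand the promise to the tracker.
    track(server.id, connect(server.id));
  }
  return states;
}
```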


Changes by file

Core fix (packages/agents)

| File | Change |
| --- | --- |
| src/mcp/client.ts | Added _pendingConnections map, _trackConnection(), waitForConnections(). Restore now tracks via _trackConnection. Cleanup in closeConnection/closeAllConnections. |
| src/index.ts | handleMcpOAuthCallback now tracks its establishConnection call via _trackConnection. |

AIChatAgent integration (packages/ai-chat)

| File | Change |
| --- | --- |
| src/index.ts | Added waitForMcpConnections property. Wait logic placed inside the chat-request branch of onMessage (not on all messages). |

Tests

| File | What it tests |
| --- | --- |
| tests/mcp/client-manager.test.ts | 8 new unit tests: immediate resolve, tracked settle, mixed success/failure, cleanup, timeout, early finish, concurrent callers, promise replacement identity. |
| tests/mcp/wait-connections-e2e.test.ts | 8 E2E tests against real DO stubs with SQLite: no-servers, restore-wait, race-condition demo, OAuth skip, timeout, 3 true hibernation round-trip tests through onStart(). |
| tests/agents/wait-connections.ts | Test agent with hibernationRoundTrip() and hibernationRoundTripNoWait() that exercise onStart → restoreConnectionsFromStorage → waitForConnections. |
| ai-chat/tests/wait-mcp-connections.test.ts | 3 config plumbing tests: true, { timeout }, false variants all process messages correctly. |
| ai-chat/tests/worker.ts + wrangler.jsonc | 3 new test agents (WaitMcpTrueAgent, WaitMcpTimeoutAgent, WaitMcpFalseAgent) + DO bindings. |

Reviewer notes

  • _trackConnection is semi-public (no private keyword) because Agent.handleMcpOAuthCallback in index.ts calls this.mcp._trackConnection() from outside the class. Marked @internal in JSDoc. Open to making this a proper public method if preferred.
  • timerId! non-null assertion in waitForConnections — the Promise constructor runs synchronously so timerId is always assigned before clearTimeout. TypeScript cannot verify this. Could restructure to avoid the assertion if desired.
  • The hibernation round-trip E2E tests use onStart() directly on the stub rather than going through actual WebSocket disconnect/reconnect. This is because @cloudflare/vitest-pool-workers does not support triggering real hibernation cycles. The tests prove the onStart → _trackConnection → waitForConnections pipeline works, which is the critical integration seam.
  • The changeset covers both agents (patch) and @cloudflare/ai-chat (patch).
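
On the timerId! note, one possible restructuring that avoids the assertion is to wrap the timer in a small helper that returns both the promise and a cancel function. This is only a sketch of that option: createTimeout and raceWithTimeout are hypothetical names, not part of this PR.

```typescript
// Hypothetical helper: the timer ID lives in a closure, typed as possibly undefined.
function createTimeout(ms: number): { promise: Promise<void>; cancel: () => void } {
  let id: ReturnType<typeof setTimeout> | undefined;
  const promise = new Promise<void>((resolve) => {
    id = setTimeout(resolve, ms);
  });
  // clearTimeout accepts undefined, so no non-null assertion is needed.
  return { promise, cancel: () => clearTimeout(id) };
}

async function raceWithTimeout(
  work: Promise<unknown>,
  ms: number
): Promise<"settled" | "timeout"> {
  const timeout = createTimeout(ms);
  try {
    return await Promise.race([
      // Fulfilment and rejection both count as settled.
      work.then(() => "settled" as const, () => "settled" as const),
      timeout.promise.then(() => "timeout" as const),
    ]);
  } finally {
    // Always cancel, so an early finish does not leave a timer scheduled.
    timeout.cancel();
  }
}
```

The trade-off is one extra small function in exchange for code that TypeScript can check without an assertion.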

How users fix issue #928

Before (intermittent failures):

async onChatMessage(onFinish, options) {
  const mcpTools = this.mcp.getAITools(); // ← sometimes empty after hibernation
}

After (option A — declarative):

class MyAgent extends AIChatAgent<Env> {
  waitForMcpConnections = true; // or { timeout: 10_000 }

  async onChatMessage(onFinish, options) {
    const mcpTools = this.mcp.getAITools(); // ← read after every pending connection has settled (or the timeout elapsed)
  }
}

After (option B — imperative):

async onChatMessage(onFinish, options) {
  await this.mcp.waitForConnections();
  const mcpTools = this.mcp.getAITools(); // ← read after every pending connection has settled
}

Fix a race where MCP tools could be unavailable in onChatMessage after Durable Object hibernation by tracking background connection work and providing a way to await it.

- Add MCPClientManager._pendingConnections, _trackConnection(), and waitForConnections({ timeout? }) to await all in-flight connection/discovery promises (uses allSettled and optional timeout). Pending entries are cleaned when settled or removed on close.
- Agent now tracks the background establishConnection promise from the OAuth callback via this.mcp._trackConnection so callers can wait for that connection as well.
- Add waitForMcpConnections config to AIChatAgent (false by default). When enabled (true or { timeout }), AIChatAgent waits for mcp.waitForConnections() before calling onChatMessage.
- Add tests and E2E coverage (new test agents, wait-connections tests, and wrangler test bindings) to validate behavior and timeouts.

This change preserves prior non-blocking behavior by default while offering opt-in safety for callers that require MCP tools to be ready.
@changeset-bot

changeset-bot bot commented Feb 26, 2026

🦋 Changeset detected

Latest commit: fcd7544

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
| Name | Type |
| --- | --- |
| agents | Patch |
| @cloudflare/ai-chat | Patch |

@pkg-pr-new

pkg-pr-new bot commented Feb 26, 2026

npm i https://pkg.pr.new/cloudflare/agents@996
npm i https://pkg.pr.new/cloudflare/agents/@cloudflare/ai-chat@996
npm i https://pkg.pr.new/cloudflare/agents/@cloudflare/codemode@996
npm i https://pkg.pr.new/cloudflare/agents/hono-agents@996

commit: fcd7544

Development

Successfully merging this pull request may close these issues.

MCP tools intermittently unavailable in onChatMessage

1 participant