Summary
Sessions can become permanently wedged because events.jsonl ends up containing an assistant.message with a tool_use block whose matching tool.execution_complete was never written. On the next resume / send, the CLI reconstructs the API request from events.jsonl, includes the orphan tool_use, and Anthropic's API rejects the request with HTTP 400 forever:
CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"messages.NNN: `tool_use` ids were found without `tool_result`
blocks immediately after: toolu_vrtx_XXXXXXXXXXXXX..."
There is no in-CLI recovery path. The user is stuck unless they manually truncate events.jsonl. This bit a real user — full day of failed retries before they noticed they were wedged, then manual file surgery to recover.
There are actually two bugs in one report, both observable in the same session, and either one being fixed would close the user-facing wedge. Filing them together because the evidence is shared.
Evidence — real wedged session
Session id (local, redacted): e8e06ede-.... Forensics from events.jsonl:
line 2447: assistant.message → contains tool_use toolu_vrtx_014C85KjmZVR5L9od4V7wNUg
line 2448: tool.execution_start (for the same toolCallId)
line 2449: hook.start (preToolUse)
line 2450: hook.end
line 2451: assistant.message ← ❌ NEW assistant message, no tool.execution_complete
for toolCallId 014C85KjmZVR5L9od4V7wNUg ever fired
... and ~360 subsequent lines, every one a failed retry against the API
Every API call from line 2451 onward replays the same orphan and gets rejected. User worked around manually by:
cp events.jsonl events.jsonl.backup
head -n 2446 events.jsonl.backup > events.jsonl
After truncation the session resumed cleanly.
Bug A — write side: orphan tool_use in events.jsonl
A tool.execution_start is not guaranteed to be followed by a matching tool.execution_complete for the same toolCallId. The agent loop somehow continues to the next iteration without recording the completion. From the evidence, plausible causes (don't know which applies — pick your poison):
- Tool execution silently failed / was killed. Tool process crashed, PTY died, or the runner aborted, but the agent loop didn't notice and proceeded.
- Race between event-writer and agent loop. Agent advances before the writer flushes
execution_complete for the in-flight tool.
- Hook abort swallowed.
hook.end on line 2450 looks clean, but if a later hook or guard rejects the result emit, you'd see this exact pattern (start written, complete never reaches disk).
Acceptance criteria
- Every
tool.execution_start is GUARANTEED to be followed by exactly one tool.execution_complete for the same toolCallId, regardless of what happens to the tool (success, failure, timeout, kill, hook abort, exception in agent loop).
- Failure paths write
execution_complete with an explicit error status + message in the data payload, rather than nothing.
- Probably needs a watchdog /
finally-block in the tool-runner so any exit path writes the completion.
Suggested tests
- Inject a synchronous tool failure mid-execution →
execution_complete still fires with error status.
- Kill the tool subprocess mid-execution →
execution_complete fires (watchdog timeout).
- Throw from a hook in
postToolUse → execution_complete still fires.
Bug B — read side: API request includes orphan tool_use without synthesized tool_result
Defense in depth. Even if Bug A's invariant slips once in the future, the CLI should never send Anthropic a message array where a tool_use block has no matching tool_result immediately after. The current read path trusts events.jsonl and ships the orphan straight through, which is why a single bad write wedges the session permanently instead of just losing one turn.
Acceptance criteria
When reconstructing the API message array from events.jsonl, every tool_use block in an assistant message that does not have a matching tool_result in the following turn boundary gets a synthetic tool_result injected, e.g.:
{
"type": "tool_result",
"tool_use_id": "toolu_vrtx_XXX",
"is_error": true,
"content": "Tool execution was interrupted; result was not recorded."
}
This makes the API call succeed (or fail informatively to the model) even on a malformed events.jsonl. The next turn will tell the model "your tool got interrupted, decide what to do" instead of locking the user out of their session.
Suggested test
- Pre-create an
events.jsonl with an orphan tool_use (no matching complete). Open the session. Verify the next API call goes through and contains a synthetic tool_result for the orphan.
Why both matter
- Bug A alone prevents future wedges but doesn't help anyone already wedged.
- Bug B alone unwedges everyone but lets the underlying data-loss bug persist silently.
- Together they're complete: write-side correctness + read-side resilience.
Workaround for affected users
Until either fix lands, the recovery is manual events.jsonl surgery:
# 1. From the API error, get the orphan tool_use id (toolu_vrtx_XXX).
# 2. Locate the session:
grep -l "toolu_vrtx_XXX" ~/.copilot/session-state/*/events.jsonl
# 3. Back up + truncate to one line before the orphan tool.execution_start:
SESSION_DIR=~/.copilot/session-state/<id>
EVENTS="$SESSION_DIR/events.jsonl"
cp "$EVENTS" "$EVENTS.backup"
ORPHAN_LINE=$(grep -n '"tool.execution_start".*"toolCallId":"toolu_vrtx_XXX"' "$EVENTS" | cut -d: -f1)
head -n $((ORPHAN_LINE - 1)) "$EVENTS.backup" > "$EVENTS"
# If anything goes wrong: mv "$EVENTS.backup" "$EVENTS"
You lose the orphaned turn (and any subsequent failed-retry turns) but the session resumes.
Environment
agency copilot CLI (PTY + ACP modes both potentially affected — write-side bug is in the agent loop, common to both)
- Anthropic API backend via Vertex (
toolu_vrtx_* prefix)
- macOS, but on-disk format is platform-agnostic
Related downstream tracking
- IDE-side recovery UI is being built so users don't have to drop to a shell (separate effort, doesn't replace the upstream fix).
Happy to share the full wedged events.jsonl privately if useful for repro.
Summary
Sessions can become permanently wedged because
events.jsonlends up containing anassistant.messagewith atool_useblock whose matchingtool.execution_completewas never written. On the next resume / send, the CLI reconstructs the API request fromevents.jsonl, includes the orphantool_use, and Anthropic's API rejects the request with HTTP 400 forever:There is no in-CLI recovery path. The user is stuck unless they manually truncate
events.jsonl. This bit a real user — full day of failed retries before they noticed they were wedged, then manual file surgery to recover.There are actually two bugs in one report, both observable in the same session, and either one being fixed would close the user-facing wedge. Filing them together because the evidence is shared.
Evidence — real wedged session
Session id (local, redacted):
e8e06ede-.... Forensics fromevents.jsonl:Every API call from line 2451 onward replays the same orphan and gets rejected. User worked around manually by:
cp events.jsonl events.jsonl.backup head -n 2446 events.jsonl.backup > events.jsonlAfter truncation the session resumed cleanly.
Bug A — write side: orphan
tool_useinevents.jsonlA
tool.execution_startis not guaranteed to be followed by a matchingtool.execution_completefor the sametoolCallId. The agent loop somehow continues to the next iteration without recording the completion. From the evidence, plausible causes (don't know which applies — pick your poison):execution_completefor the in-flight tool.hook.endon line 2450 looks clean, but if a later hook or guard rejects the result emit, you'd see this exact pattern (start written, complete never reaches disk).Acceptance criteria
tool.execution_startis GUARANTEED to be followed by exactly onetool.execution_completefor the sametoolCallId, regardless of what happens to the tool (success, failure, timeout, kill, hook abort, exception in agent loop).execution_completewith an explicit error status + message in the data payload, rather than nothing.finally-block in the tool-runner so any exit path writes the completion.Suggested tests
execution_completestill fires with error status.execution_completefires (watchdog timeout).postToolUse→execution_completestill fires.Bug B — read side: API request includes orphan
tool_usewithout synthesizedtool_resultDefense in depth. Even if Bug A's invariant slips once in the future, the CLI should never send Anthropic a message array where a
tool_useblock has no matchingtool_resultimmediately after. The current read path trustsevents.jsonland ships the orphan straight through, which is why a single bad write wedges the session permanently instead of just losing one turn.Acceptance criteria
When reconstructing the API message array from
events.jsonl, everytool_useblock in an assistant message that does not have a matchingtool_resultin the following turn boundary gets a synthetictool_resultinjected, e.g.:{ "type": "tool_result", "tool_use_id": "toolu_vrtx_XXX", "is_error": true, "content": "Tool execution was interrupted; result was not recorded." }This makes the API call succeed (or fail informatively to the model) even on a malformed
events.jsonl. The next turn will tell the model "your tool got interrupted, decide what to do" instead of locking the user out of their session.Suggested test
events.jsonlwith an orphantool_use(no matching complete). Open the session. Verify the next API call goes through and contains a synthetictool_resultfor the orphan.Why both matter
Workaround for affected users
Until either fix lands, the recovery is manual events.jsonl surgery:
You lose the orphaned turn (and any subsequent failed-retry turns) but the session resumes.
Environment
agency copilotCLI (PTY + ACP modes both potentially affected — write-side bug is in the agent loop, common to both)toolu_vrtx_*prefix)Related downstream tracking
Happy to share the full wedged
events.jsonlprivately if useful for repro.