Skip to content

Orphan tool_use in events.jsonl wedges sessions permanently (write-side + read-side) #3366

@shachaf-ashkenazi

Description

@shachaf-ashkenazi

Summary

Sessions can become permanently wedged because events.jsonl ends up containing an assistant.message with a tool_use block whose matching tool.execution_complete was never written. On the next resume / send, the CLI reconstructs the API request from events.jsonl, includes the orphan tool_use, and Anthropic's API rejects the request with HTTP 400 forever:

CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"messages.NNN: `tool_use` ids were found without `tool_result`
blocks immediately after: toolu_vrtx_XXXXXXXXXXXXX..."

There is no in-CLI recovery path. The user is stuck unless they manually truncate events.jsonl. This bit a real user — full day of failed retries before they noticed they were wedged, then manual file surgery to recover.

There are actually two bugs in one report, both observable in the same session, and either one being fixed would close the user-facing wedge. Filing them together because the evidence is shared.

Evidence — real wedged session

Session id (local, redacted): e8e06ede-.... Forensics from events.jsonl:

line 2447: assistant.message  → contains tool_use toolu_vrtx_014C85KjmZVR5L9od4V7wNUg
line 2448: tool.execution_start (for the same toolCallId)
line 2449: hook.start (preToolUse)
line 2450: hook.end
line 2451: assistant.message  ← ❌ NEW assistant message, no tool.execution_complete
                                  for toolCallId 014C85KjmZVR5L9od4V7wNUg ever fired
... and ~360 subsequent lines, every one a failed retry against the API

Every API call from line 2451 onward replays the same orphan and gets rejected. User worked around manually by:

cp events.jsonl events.jsonl.backup
head -n 2446 events.jsonl.backup > events.jsonl

After truncation the session resumed cleanly.

Bug A — write side: orphan tool_use in events.jsonl

A tool.execution_start is not guaranteed to be followed by a matching tool.execution_complete for the same toolCallId. The agent loop somehow continues to the next iteration without recording the completion. From the evidence, plausible causes (don't know which applies — pick your poison):

  1. Tool execution silently failed / was killed. Tool process crashed, PTY died, or the runner aborted, but the agent loop didn't notice and proceeded.
  2. Race between event-writer and agent loop. Agent advances before the writer flushes execution_complete for the in-flight tool.
  3. Hook abort swallowed. hook.end on line 2450 looks clean, but if a later hook or guard rejects the result emit, you'd see this exact pattern (start written, complete never reaches disk).

Acceptance criteria

  • Every tool.execution_start is GUARANTEED to be followed by exactly one tool.execution_complete for the same toolCallId, regardless of what happens to the tool (success, failure, timeout, kill, hook abort, exception in agent loop).
  • Failure paths write execution_complete with an explicit error status + message in the data payload, rather than nothing.
  • Probably needs a watchdog / finally-block in the tool-runner so any exit path writes the completion.

Suggested tests

  • Inject a synchronous tool failure mid-execution → execution_complete still fires with error status.
  • Kill the tool subprocess mid-execution → execution_complete fires (watchdog timeout).
  • Throw from a hook in postToolUseexecution_complete still fires.

Bug B — read side: API request includes orphan tool_use without synthesized tool_result

Defense in depth. Even if Bug A's invariant slips once in the future, the CLI should never send Anthropic a message array where a tool_use block has no matching tool_result immediately after. The current read path trusts events.jsonl and ships the orphan straight through, which is why a single bad write wedges the session permanently instead of just losing one turn.

Acceptance criteria

When reconstructing the API message array from events.jsonl, every tool_use block in an assistant message that does not have a matching tool_result in the following turn boundary gets a synthetic tool_result injected, e.g.:

{
  "type": "tool_result",
  "tool_use_id": "toolu_vrtx_XXX",
  "is_error": true,
  "content": "Tool execution was interrupted; result was not recorded."
}

This makes the API call succeed (or fail informatively to the model) even on a malformed events.jsonl. The next turn will tell the model "your tool got interrupted, decide what to do" instead of locking the user out of their session.

Suggested test

  • Pre-create an events.jsonl with an orphan tool_use (no matching complete). Open the session. Verify the next API call goes through and contains a synthetic tool_result for the orphan.

Why both matter

  • Bug A alone prevents future wedges but doesn't help anyone already wedged.
  • Bug B alone unwedges everyone but lets the underlying data-loss bug persist silently.
  • Together they're complete: write-side correctness + read-side resilience.

Workaround for affected users

Until either fix lands, the recovery is manual events.jsonl surgery:

# 1. From the API error, get the orphan tool_use id (toolu_vrtx_XXX).
# 2. Locate the session:
grep -l "toolu_vrtx_XXX" ~/.copilot/session-state/*/events.jsonl

# 3. Back up + truncate to one line before the orphan tool.execution_start:
SESSION_DIR=~/.copilot/session-state/<id>
EVENTS="$SESSION_DIR/events.jsonl"
cp "$EVENTS" "$EVENTS.backup"
ORPHAN_LINE=$(grep -n '"tool.execution_start".*"toolCallId":"toolu_vrtx_XXX"' "$EVENTS" | cut -d: -f1)
head -n $((ORPHAN_LINE - 1)) "$EVENTS.backup" > "$EVENTS"
# If anything goes wrong: mv "$EVENTS.backup" "$EVENTS"

You lose the orphaned turn (and any subsequent failed-retry turns) but the session resumes.

Environment

  • agency copilot CLI (PTY + ACP modes both potentially affected — write-side bug is in the agent loop, common to both)
  • Anthropic API backend via Vertex (toolu_vrtx_* prefix)
  • macOS, but on-disk format is platform-agnostic

Related downstream tracking

  • IDE-side recovery UI is being built so users don't have to drop to a shell (separate effort, doesn't replace the upstream fix).

Happy to share the full wedged events.jsonl privately if useful for repro.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions