Orphan tool_use in events.jsonl wedges sessions permanently (write-side + read-side)

## Summary

Sessions can become permanently wedged because `events.jsonl` ends up containing an `assistant.message` with a `tool_use` block whose matching `tool.execution_complete` was never written. On the next resume / send, the CLI reconstructs the API request from `events.jsonl`, includes the orphan `tool_use`, and Anthropic's API rejects the request with HTTP 400 forever:

```
CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"messages.NNN: `tool_use` ids were found without `tool_result`
blocks immediately after: toolu_vrtx_XXXXXXXXXXXXX..."
```

There is no in-CLI recovery path. The user is stuck unless they manually truncate `events.jsonl`. This bit a real user — full day of failed retries before they noticed they were wedged, then manual file surgery to recover.

There are actually **two bugs in one report**, both observable in the same session, and either one being fixed would close the user-facing wedge. Filing them together because the evidence is shared.

## Evidence — real wedged session

Session id (local, redacted): `e8e06ede-...`. Forensics from `events.jsonl`:

```
line 2447: assistant.message  → contains tool_use toolu_vrtx_014C85KjmZVR5L9od4V7wNUg
line 2448: tool.execution_start (for the same toolCallId)
line 2449: hook.start (preToolUse)
line 2450: hook.end
line 2451: assistant.message  ← ❌ NEW assistant message, no tool.execution_complete
                                  for toolCallId 014C85KjmZVR5L9od4V7wNUg ever fired
... and ~360 subsequent lines, every one a failed retry against the API
```

Every API call from line 2451 onward replays the same orphan and gets rejected. User worked around manually by:

```bash
cp events.jsonl events.jsonl.backup
head -n 2446 events.jsonl.backup > events.jsonl
```

After truncation the session resumed cleanly.

## Bug A — write side: orphan `tool_use` in `events.jsonl`

A `tool.execution_start` is not guaranteed to be followed by a matching `tool.execution_complete` for the same `toolCallId`. The agent loop somehow continues to the next iteration without recording the completion. From the evidence, plausible causes (don't know which applies — pick your poison):

1. **Tool execution silently failed / was killed.** Tool process crashed, PTY died, or the runner aborted, but the agent loop didn't notice and proceeded.
2. **Race between event-writer and agent loop.** Agent advances before the writer flushes `execution_complete` for the in-flight tool.
3. **Hook abort swallowed.** `hook.end` on line 2450 looks clean, but if a later hook or guard rejects the result emit, you'd see this exact pattern (start written, complete never reaches disk).

### Acceptance criteria

- Every `tool.execution_start` is GUARANTEED to be followed by exactly one `tool.execution_complete` for the same `toolCallId`, regardless of what happens to the tool (success, failure, timeout, kill, hook abort, exception in agent loop).
- Failure paths write `execution_complete` with an explicit error status + message in the data payload, rather than nothing.
- Probably needs a watchdog / `finally`-block in the tool-runner so any exit path writes the completion.

### Suggested tests

- Inject a synchronous tool failure mid-execution → `execution_complete` still fires with error status.
- Kill the tool subprocess mid-execution → `execution_complete` fires (watchdog timeout).
- Throw from a hook in `postToolUse` → `execution_complete` still fires.

## Bug B — read side: API request includes orphan `tool_use` without synthesized `tool_result`

Defense in depth. Even if Bug A's invariant slips once in the future, the CLI should never send Anthropic a message array where a `tool_use` block has no matching `tool_result` immediately after. The current read path trusts `events.jsonl` and ships the orphan straight through, which is why a single bad write wedges the session **permanently** instead of just losing one turn.

### Acceptance criteria

When reconstructing the API message array from `events.jsonl`, every `tool_use` block in an assistant message that does not have a matching `tool_result` in the following turn boundary gets a synthetic `tool_result` injected, e.g.:

```json
{
  "type": "tool_result",
  "tool_use_id": "toolu_vrtx_XXX",
  "is_error": true,
  "content": "Tool execution was interrupted; result was not recorded."
}
```

This makes the API call succeed (or fail informatively to the model) even on a malformed `events.jsonl`. The next turn will tell the model "your tool got interrupted, decide what to do" instead of locking the user out of their session.

### Suggested test

- Pre-create an `events.jsonl` with an orphan `tool_use` (no matching complete). Open the session. Verify the next API call goes through and contains a synthetic `tool_result` for the orphan.

## Why both matter

- **Bug A alone** prevents future wedges but doesn't help anyone already wedged.
- **Bug B alone** unwedges everyone but lets the underlying data-loss bug persist silently.
- **Together** they're complete: write-side correctness + read-side resilience.

## Workaround for affected users

Until either fix lands, the recovery is manual events.jsonl surgery:

```bash
# 1. From the API error, get the orphan tool_use id (toolu_vrtx_XXX).
# 2. Locate the session:
grep -l "toolu_vrtx_XXX" ~/.copilot/session-state/*/events.jsonl

# 3. Back up + truncate to one line before the orphan tool.execution_start:
SESSION_DIR=~/.copilot/session-state/<id>
EVENTS="$SESSION_DIR/events.jsonl"
cp "$EVENTS" "$EVENTS.backup"
ORPHAN_LINE=$(grep -n '"tool.execution_start".*"toolCallId":"toolu_vrtx_XXX"' "$EVENTS" | cut -d: -f1)
head -n $((ORPHAN_LINE - 1)) "$EVENTS.backup" > "$EVENTS"
# If anything goes wrong: mv "$EVENTS.backup" "$EVENTS"
```

You lose the orphaned turn (and any subsequent failed-retry turns) but the session resumes.

## Environment

- `agency copilot` CLI (PTY + ACP modes both potentially affected — write-side bug is in the agent loop, common to both)
- Anthropic API backend via Vertex (`toolu_vrtx_*` prefix)
- macOS, but on-disk format is platform-agnostic

## Related downstream tracking

- IDE-side recovery UI is being built so users don't have to drop to a shell (separate effort, doesn't replace the upstream fix).

Happy to share the full wedged `events.jsonl` privately if useful for repro.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Orphan tool_use in events.jsonl wedges sessions permanently (write-side + read-side) #3366

Summary

Evidence — real wedged session

Bug A — write side: orphan `tool_use` in `events.jsonl`

Acceptance criteria

Suggested tests

Bug B — read side: API request includes orphan `tool_use` without synthesized `tool_result`

Acceptance criteria

Suggested test

Why both matter

Workaround for affected users

Environment

Related downstream tracking

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Orphan tool_use in events.jsonl wedges sessions permanently (write-side + read-side) #3366

Description

Summary

Evidence — real wedged session

Bug A — write side: orphan tool_use in events.jsonl

Acceptance criteria

Suggested tests

Bug B — read side: API request includes orphan tool_use without synthesized tool_result

Acceptance criteria

Suggested test

Why both matter

Workaround for affected users

Environment

Related downstream tracking

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Bug A — write side: orphan `tool_use` in `events.jsonl`

Bug B — read side: API request includes orphan `tool_use` without synthesized `tool_result`