Skip to content

feat(coder/modules/agentapi): add state persistence#736

Open
mafredri wants to merge 9 commits intomainfrom
agentapi-state-4nzp
Open

feat(coder/modules/agentapi): add state persistence#736
mafredri wants to merge 9 commits intomainfrom
agentapi-state-4nzp

Conversation

@mafredri
Copy link
Member

AgentAPI can save and restore conversation state across workspace restarts.
The base module exports env vars (AGENTAPI_STATE_FILE, AGENTAPI_SAVE_STATE,
AGENTAPI_LOAD_STATE, AGENTAPI_PID_FILE) that the binary reads directly.
No consumer module start scripts need changes.

New variables:

  • enable_state_persistence (bool, default true)
  • state_file_path (string, defaults to $HOME/<module_dir_name>/state.json)
  • pid_file_path (string, defaults to $HOME/<module_dir_name>/agentapi.pid)

State persistence requires agentapi >= v0.12.0. A shared version_at_least
function in scripts/lib.sh gates both the env var exports in main.sh and
SIGUSR1 in the shutdown script. Old binaries get a warning and graceful
skip instead of breakage.

Shutdown script now performs a three-phase shutdown:

  1. SIGUSR1 to trigger state save (gated on version + persistence enabled)
  2. Log snapshot capture (existing behavior, now fault-tolerant via subshell)
  3. SIGTERM for graceful termination with wait loop

Also bumps agentapi module version to 2.2.0 and claude-code to 4.7.6.

Closes coder/internal#1257
Refs coder/internal#1256
Refs #696

log "Sending SIGUSR1 to AgentAPI (pid $agentapi_pid) to save state"
kill -USR1 "$agentapi_pid" || true
# Allow time for state save to complete before proceeding.
sleep 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know the path at which the state file should be written, we can sleep a bit longer and validate that it was written.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be an old one so we'd have to check timestamp.

But really, we don't care about confirming that it was written here as the TERM + wait covers that. It's more of an early invoke in case the workspace gets killed before we complete all the work here. The sleep is there to allow the serialization to initiate before we grab the messages via API call below.

@mafredri mafredri force-pushed the agentapi-state-4nzp branch 3 times, most recently from 636bc8c to 955441d Compare February 20, 2026 11:50
@mafredri mafredri changed the title feat(registry/coder/modules/agentapi): add state persistence feat(coder/modules/agentapi): add state persistence Feb 20, 2026
@mafredri mafredri force-pushed the agentapi-state-4nzp branch 5 times, most recently from 001c89d to 8a32b14 Compare February 20, 2026 12:00
@mafredri mafredri marked this pull request as ready for review February 20, 2026 12:00

# Source shared utilities (written by the coder_script wrapper).
# shellcheck source=lib.sh
source /tmp/agentapi-lib.sh
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Following existing pattern here but why don't we write all scripts to the provided module_dir?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mafredri It honestly might be better to do this. As for why I am not entirely sure. I know that some modules do write logs and config files to the module dir.

It might be worth me going through and standardizing this across all of the modules.

agentapi_pid=$(cat "$PID_FILE_PATH" 2> /dev/null || echo "")
fi

# State persistence is only enabled when the binary supports it (>= v0.12.0).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: We'll have to confirm this ends up being the version (cc: @35C4n0r)

AgentAPI can save and restore conversation state across workspace restarts.
The base module exports env vars (AGENTAPI_STATE_FILE, AGENTAPI_SAVE_STATE,
AGENTAPI_LOAD_STATE, AGENTAPI_PID_FILE) that the binary reads directly.
No consumer module start scripts need changes.

New variables:
- enable_state_persistence (bool, default true)
- state_file_path (string, defaults to $HOME/<module_dir_name>/state.json)
- pid_file_path (string, defaults to $HOME/<module_dir_name>/agentapi.pid)

State persistence requires agentapi >= v0.12.0. A shared version_at_least
function in scripts/lib.sh gates both the env var exports in main.sh and
SIGUSR1 in the shutdown script. The version is queried from the real binary
(agentapi --version) rather than the Terraform variable, so it works
correctly when install_agentapi is false.

Shutdown script now performs a three-phase shutdown:
1. SIGUSR1 to trigger state save (gated on version + persistence enabled)
2. Log snapshot capture (existing behavior, now fault-tolerant via subshell)
3. SIGTERM for graceful termination with wait loop

Also bumps agentapi module version to 2.2.0.

Refs: internal#1257, internal#1256, registry#696
Copy link
Contributor

@DevelopmentCats DevelopmentCats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from what has already been pointed out.

@35C4n0r You might have some better insight on this since agentapi.

Copy link
Collaborator

@35C4n0r 35C4n0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM!
I'd test it once before approving.

default = true
}

variable "state_file_path" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, it'd be nice if we set the default for state_file_path and pid_file_path in the locals block below, rather than in the main.sh.

Copy link
Member Author

@mafredri mafredri Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but we currently build the module path dynamically in scripts (module_path="$HOME/${MODULE_DIR_NAME}"). To avoid inconsistency or broken changes in the future, we should have only one way to construct these paths, either script (maybe lib.sh?) or move to terraform. Might be a better fit for a refactor.

It's unfortunate that in terraform we lack the ability to do runtime evaluation, nor do we generally have a "safe" way to expose these to the shell script (quoting edge cases, etc, something I'd like to improve one day).

No consumer modules ship agentapi >= v0.12.0 yet, so the feature
was silently skipped at runtime while emitting a misleading warning
on every workspace start. Modules can opt in explicitly when ready.
@mafredri
Copy link
Member Author

FYI I flipped enable_state_persistence to default false. The feature would be silently skipped anyway while emitting a confusing warning on every start. I think it's better if modules opt in when they bump (as claude-code does in #749).

Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work!

mafredri and others added 5 commits February 24, 2026 12:40
The subshell EXIT trap references $tmpdir which is local to
capture_task_log_snapshot. When the trap fires during subshell
exit, the variable is out of scope, causing a nounset error
after the snapshot posts successfully.
It seems that agentapi can block for longer than 5s sometimes:

```
Shutting down AgentAPI
Sending SIGUSR1 to AgentAPI (pid 44876) to save state
Fetching messages from AgentAPI on port 3284
curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received
Error: Failed to fetch messages from AgentAPI (may not be running)
Error: Cannot capture log snapshot without messages
Log snapshot capture failed, continuing shutdown
Sending SIGTERM to AgentAPI (pid 44876)
Warning: AgentAPI (pid 44876) still running after 5s
Shutdown complete
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Registry: AgentAPI base module shutdown script and state persistence

4 participants