feat(coder/modules/agentapi): add state persistence#736
feat(coder/modules/agentapi): add state persistence#736
Conversation
| log "Sending SIGUSR1 to AgentAPI (pid $agentapi_pid) to save state" | ||
| kill -USR1 "$agentapi_pid" || true | ||
| # Allow time for state save to complete before proceeding. | ||
| sleep 1 |
There was a problem hiding this comment.
If we know the path at which the state file should be written, we can sleep a bit longer and validate that it was written.
There was a problem hiding this comment.
It could be an old one so we'd have to check timestamp.
But really, we don't care about confirming that it was written here as the TERM + wait covers that. It's more of an early invoke in case the workspace gets killed before we complete all the work here. The sleep is there to allow the serialization to initiate before we grab the messages via API call below.
636bc8c to
955441d
Compare
001c89d to
8a32b14
Compare
|
|
||
| # Source shared utilities (written by the coder_script wrapper). | ||
| # shellcheck source=lib.sh | ||
| source /tmp/agentapi-lib.sh |
There was a problem hiding this comment.
Review: Following existing pattern here but why don't we write all scripts to the provided module_dir?
There was a problem hiding this comment.
@mafredri It honestly might be better to do this. As for why I am not entirely sure. I know that some modules do write logs and config files to the module dir.
It might be worth me going through and standardizing this across all of the modules.
| agentapi_pid=$(cat "$PID_FILE_PATH" 2> /dev/null || echo "") | ||
| fi | ||
|
|
||
| # State persistence is only enabled when the binary supports it (>= v0.12.0). |
There was a problem hiding this comment.
Review: We'll have to confirm this ends up being the version (cc: @35C4n0r)
AgentAPI can save and restore conversation state across workspace restarts. The base module exports env vars (AGENTAPI_STATE_FILE, AGENTAPI_SAVE_STATE, AGENTAPI_LOAD_STATE, AGENTAPI_PID_FILE) that the binary reads directly. No consumer module start scripts need changes. New variables: - enable_state_persistence (bool, default true) - state_file_path (string, defaults to $HOME/<module_dir_name>/state.json) - pid_file_path (string, defaults to $HOME/<module_dir_name>/agentapi.pid) State persistence requires agentapi >= v0.12.0. A shared version_at_least function in scripts/lib.sh gates both the env var exports in main.sh and SIGUSR1 in the shutdown script. The version is queried from the real binary (agentapi --version) rather than the Terraform variable, so it works correctly when install_agentapi is false. Shutdown script now performs a three-phase shutdown: 1. SIGUSR1 to trigger state save (gated on version + persistence enabled) 2. Log snapshot capture (existing behavior, now fault-tolerant via subshell) 3. SIGTERM for graceful termination with wait loop Also bumps agentapi module version to 2.2.0. Refs: internal#1257, internal#1256, registry#696
8a32b14 to
1435939
Compare
DevelopmentCats
left a comment
There was a problem hiding this comment.
LGTM aside from what has already been pointed out.
@35C4n0r You might have some better insight on this since agentapi.
35C4n0r
left a comment
There was a problem hiding this comment.
Code LGTM!
I'd test it once before approving.
| default = true | ||
| } | ||
|
|
||
| variable "state_file_path" { |
There was a problem hiding this comment.
For consistency, it'd be nice if we set the default for state_file_path and pid_file_path in the locals block below, rather than in the main.sh.
There was a problem hiding this comment.
I considered this, but we currently build the module path dynamically in scripts (module_path="$HOME/${MODULE_DIR_NAME}"). To avoid inconsistency or broken changes in the future, we should have only one way to construct these paths, either script (maybe lib.sh?) or move to terraform. Might be a better fit for a refactor.
It's unfortunate that in terraform we lack the ability to do runtime evaluation, nor do we generally have a "safe" way to expose these to the shell script (quoting edge cases, etc, something I'd like to improve one day).
No consumer modules ship agentapi >= v0.12.0 yet, so the feature was silently skipped at runtime while emitting a misleading warning on every workspace start. Modules can opt in explicitly when ready.
|
FYI I flipped |
The subshell EXIT trap references $tmpdir which is local to capture_task_log_snapshot. When the trap fires during subshell exit, the variable is out of scope, causing a nounset error after the snapshot posts successfully.
It seems that agentapi can block for longer than 5s sometimes: ``` Shutting down AgentAPI Sending SIGUSR1 to AgentAPI (pid 44876) to save state Fetching messages from AgentAPI on port 3284 curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received Error: Failed to fetch messages from AgentAPI (may not be running) Error: Cannot capture log snapshot without messages Log snapshot capture failed, continuing shutdown Sending SIGTERM to AgentAPI (pid 44876) Warning: AgentAPI (pid 44876) still running after 5s Shutdown complete ```
AgentAPI can save and restore conversation state across workspace restarts.
The base module exports env vars (AGENTAPI_STATE_FILE, AGENTAPI_SAVE_STATE,
AGENTAPI_LOAD_STATE, AGENTAPI_PID_FILE) that the binary reads directly.
No consumer module start scripts need changes.
New variables:
State persistence requires agentapi >= v0.12.0. A shared version_at_least
function in scripts/lib.sh gates both the env var exports in main.sh and
SIGUSR1 in the shutdown script. Old binaries get a warning and graceful
skip instead of breakage.
Shutdown script now performs a three-phase shutdown:
Also bumps agentapi module version to 2.2.0 and claude-code to 4.7.6.
Closes coder/internal#1257
Refs coder/internal#1256
Refs #696