diff --git a/database_api/model_planitem.py b/database_api/model_planitem.py index 6bb39b2a..6e417511 100644 --- a/database_api/model_planitem.py +++ b/database_api/model_planitem.py @@ -37,6 +37,7 @@ class PlanState(enum.Enum): processing = 2 completed = 3 failed = 4 + stopped = 5 class PlanItem(db.Model): diff --git a/docs/mcp/autonomous_agent_guide.md b/docs/mcp/autonomous_agent_guide.md index 55294e95..6ac7d8bd 100644 --- a/docs/mcp/autonomous_agent_guide.md +++ b/docs/mcp/autonomous_agent_guide.md @@ -68,6 +68,7 @@ Poll every 5 minutes. State transitions: - `processing` → Pipeline is running - `completed` → Report is ready - `failed` → Terminal error (use `plan_resume` or `plan_retry`) +- `stopped` → User called `plan_stop` (use `plan_resume` to continue or `plan_retry` to restart) ### Step 6: Handle failures diff --git a/docs/mcp/planexe_mcp_interface.md b/docs/mcp/planexe_mcp_interface.md index bbc74b01..76245eb5 100644 --- a/docs/mcp/planexe_mcp_interface.md +++ b/docs/mcp/planexe_mcp_interface.md @@ -15,7 +15,7 @@ The plan is a **project plan**: a DAG of steps (Luigi pipeline stages) that prod Implementors should expose the following to agents so they understand what PlanExe does: - **What:** PlanExe turns a plain-English goal into a strategic project-plan draft (20+ sections) in ~10–20 min. Sections include executive summary, interactive Gantt charts, investor pitch, SWOT, governance, team profiles, work breakdown, scenario comparison, expert criticism, and adversarial sections (premortem, self-audit, premise attacks) that stress-test the plan. The output is a draft to refine, not an executable or final document — but it surfaces hard questions the prompter may not have considered. -- **Required interaction order:** Call `example_plans` (optional) and `example_prompts` first. Optional before `plan_create`: call `model_profiles` to inspect profile guidance and available models in each profile. Then complete a non-tool step: formulate a detailed prompt as flowing prose (not structured markdown), typically ~300-800 words, using the examples as a baseline; include objective, scope, constraints, timeline, stakeholders, budget/resources, and success criteria; get user approval. Only after approval, call `plan_create`. Then poll `plan_status` (about every 5 minutes); use `plan_download` (mcp_local helper) or `plan_file_info` (mcp_cloud tool) when complete (`pending`/`processing` = keep polling, `completed` = download now, `failed` = terminal). If a plan fails before completing all steps, call `plan_resume` to continue from where it left off without discarding completed work. Use `plan_retry` for a full restart (plan must be in failed state). Both accept the failed `plan_id` and optional `model_profile` (default `baseline`). To stop, call `plan_stop` with the `plan_id` from `plan_create`. +- **Required interaction order:** Call `example_plans` (optional) and `example_prompts` first. Optional before `plan_create`: call `model_profiles` to inspect profile guidance and available models in each profile. Then complete a non-tool step: formulate a detailed prompt as flowing prose (not structured markdown), typically ~300-800 words, using the examples as a baseline; include objective, scope, constraints, timeline, stakeholders, budget/resources, and success criteria; get user approval. Only after approval, call `plan_create`. Then poll `plan_status` (about every 5 minutes); use `plan_download` (mcp_local helper) or `plan_file_info` (mcp_cloud tool) when complete (`pending`/`processing` = keep polling, `completed` = download now, `failed` = terminal error, `stopped` = user called plan_stop). If a plan fails or is stopped before completing all steps, call `plan_resume` to continue from where it left off without discarding completed work. Use `plan_retry` for a full restart (plan must be in failed or stopped state). Both accept the `plan_id` and optional `model_profile` (default `baseline`). To stop, call `plan_stop` with the `plan_id` from `plan_create`. - **Output:** Self-contained interactive HTML report (~700KB) with collapsible sections and interactive Gantt charts — open in a browser. The zip contains the intermediary pipeline files (md, json, csv) that fed the report. ### 1.3 Scope of this document @@ -71,10 +71,10 @@ The interface is designed to support: The MCP specification defines two different mechanisms: -- **MCP tools** (e.g. plan_create, plan_status, plan_stop, plan_retry, plan_resume): the server exposes named tools; the client calls them and receives a response. PlanExe's interface is **tool-based**: the agent calls plan_create → receives plan_id → polls plan_status → optionally calls plan_resume or plan_retry on failed → uses plan_file_info (and optionally plan_download via mcp_local). This document specifies those tools. +- **MCP tools** (e.g. plan_create, plan_status, plan_stop, plan_retry, plan_resume): the server exposes named tools; the client calls them and receives a response. PlanExe's interface is **tool-based**: the agent calls plan_create → receives plan_id → polls plan_status → optionally calls plan_resume or plan_retry on failed/stopped → uses plan_file_info (and optionally plan_download via mcp_local). This document specifies those tools. - **MCP tasks protocol** ("Run as task" in some UIs): a separate mechanism where the client can run a tool "as a task" using RPC methods such as tasks/run, tasks/get, tasks/result, tasks/cancel, tasks/list, so the tool runs in the background and the client polls for results. -PlanExe **does not** use or advertise the MCP tasks protocol. Implementors and clients should use the **tools only**. Do not enable "Run as task" for PlanExe; many clients (e.g. Cursor) and the Python MCP SDK do not support the tasks protocol properly. Intended flow: optionally call `example_plans`; call `example_prompts`; optionally call `model_profiles`; perform the non-tool prompt drafting/approval step; call `plan_create`; poll `plan_status`; if failed call `plan_resume` to continue or `plan_retry` for a full restart (optional); then call `plan_file_info` (or `plan_download` via mcp_local) when completed. +PlanExe **does not** use or advertise the MCP tasks protocol. Implementors and clients should use the **tools only**. Do not enable "Run as task" for PlanExe; many clients (e.g. Cursor) and the Python MCP SDK do not support the tasks protocol properly. Intended flow: optionally call `example_plans`; call `example_prompts`; optionally call `model_profiles`; perform the non-tool prompt drafting/approval step; call `plan_create`; poll `plan_status`; if failed or stopped call `plan_resume` to continue or `plan_retry` for a full restart (optional); then call `plan_file_info` (or `plan_download` via mcp_local) when completed. --- @@ -99,7 +99,7 @@ A single execution attempt inside a plan. **Key properties** -- state: pending | processing | completed | failed +- state: pending | processing | completed | failed | stopped - progress_percentage: computed progress percentage (float) - started_at, ended_at @@ -137,13 +137,16 @@ The public MCP `state` field is aligned with `PlanItem.state`: - processing (picked up by a worker) - completed - failed +- stopped (user called `plan_stop`) ### 5.2 Allowed transitions - pending → processing when picked up by a worker - processing → completed via normal success - processing → failed via error +- processing → stopped via `plan_stop` - failed → pending when `plan_retry` or `plan_resume` is accepted +- stopped → pending when `plan_retry` or `plan_resume` is accepted ### 5.3 Invalid transitions @@ -332,10 +335,11 @@ Returns plan status and progress. Used for progress bars and UI states. **Pollin - `processing`: picked up by a worker and in progress. Keep polling. - `completed`: terminal success. Download artifacts now. - `failed`: terminal error. Do not keep polling for completion. +- `stopped`: user called `plan_stop`. Consider `plan_resume` to continue or `plan_retry` to restart. **Terminal states** -- `completed`, `failed` +- `completed`, `failed`, `stopped` **Response** @@ -441,7 +445,7 @@ Retries a plan that is currently in `failed` state. Resume a failed plan without discarding completed intermediary files. Plan generation restarts from the first incomplete step, skipping all steps that already produced output files. -Use `plan_resume` when `plan_status` shows `failed` and plan generation was interrupted before completing all steps (network drop, timeout, `plan_stop`, worker crash). For a full restart or to change `model_profile`, use `plan_retry` instead. +Use `plan_resume` when `plan_status` shows `failed` or `stopped` and plan generation was interrupted before completing all steps (network drop, timeout, `plan_stop`, worker crash). For a full restart or to change `model_profile`, use `plan_retry` instead. **Request** @@ -566,7 +570,7 @@ Recommended practice for MCP clients: Additional semantics: - Every `plan_create` call creates a new independent plan with a new `plan_id`. -- `plan_retry` and `plan_resume` reuse the existing failed `plan_id` (they do not create a new plan id). +- `plan_retry` and `plan_resume` reuse the existing failed or stopped `plan_id` (they do not create a new plan id). - The server does not deduplicate “same prompt” requests into a single shared plan. - Keep your own plan registry/client state if you run multiple plans concurrently. @@ -610,8 +614,8 @@ Cloud/core tool codes: - `INVALID_TOOL`: unknown MCP tool name. - `INTERNAL_ERROR`: uncaught server error. - `PLAN_NOT_FOUND`: plan_id not found. -- `PLAN_NOT_FAILED`: plan_retry called for a plan that is not in failed state. -- `PLAN_NOT_RESUMABLE`: plan_resume called for a plan that is not in failed state. +- `PLAN_NOT_FAILED`: plan_retry called for a plan that is not in failed or stopped state. +- `PLAN_NOT_RESUMABLE`: plan_resume called for a plan that is not in failed or stopped state. - `PIPELINE_VERSION_MISMATCH`: plan_resume snapshot was created by a different pipeline version; use plan_retry instead. - `INVALID_USER_API_KEY`: provided user_api_key is invalid. - `USER_API_KEY_REQUIRED`: deployment requires user_api_key for plan_create. @@ -636,8 +640,8 @@ Local proxy specific codes: - `USER_API_KEY_REQUIRED` - `INSUFFICIENT_CREDITS` - `INVALID_TOOL` -- For `PLAN_NOT_FAILED`: call `plan_retry` only after `plan_status.state == failed`. -- For `PLAN_NOT_RESUMABLE`: call `plan_resume` only after `plan_status.state == failed`. +- For `PLAN_NOT_FAILED`: call `plan_retry` only after `plan_status.state` is `failed` or `stopped`. +- For `PLAN_NOT_RESUMABLE`: call `plan_resume` only after `plan_status.state` is `failed` or `stopped`. - For `PIPELINE_VERSION_MISMATCH`: the snapshot is incompatible with the current pipeline; use `plan_retry` for a clean restart. - For `PLAN_NOT_FOUND`: verify plan_id source and stop polling that id. - For `generation_failed`: treat as terminal failure and surface plan progress_message to user. diff --git a/docs/proposals/111-promising-directions.md b/docs/proposals/111-promising-directions.md index e3f4a7e0..fbfdcb8d 100644 --- a/docs/proposals/111-promising-directions.md +++ b/docs/proposals/111-promising-directions.md @@ -24,9 +24,10 @@ Agents need PlanExe runs to complete reliably without human intervention. A fail | **103** | Pipeline Hardening for Local Models | Fix silent truncation and context-window overflows. Critical for agents running local models where failures are subtle | | **113** | LLM Error Traceability | ✅ **Implemented (PR #237)**. `LLMChatError` replaces generic `ValueError` across 38 call sites. Root cause preserved for error classification; `error_id` UUID enables log-to-metrics cross-referencing. Agents can programmatically diagnose failures | | **101** | Luigi Resume Enhancements | Webhook hooks on task completion/failure — agents can subscribe to events instead of polling | -| **114-I1** | Stopped vs Failed State | `plan_stop` and worker crashes both produce `failed` — agents can't distinguish user-initiated stops from actual errors. Add `stop_reason` field or a new `stopped` state | -| **114-I2** | Failure Diagnostics in `plan_status` | When a plan fails, no `failure_reason`, `failed_step`, or `last_error` is returned. Biggest observability gap — agents can only say "it failed" without explaining why. Extends #113 to the MCP consumer surface | +| **114-I1** | Stopped vs Failed State | ✅ **Implemented**. Dedicated `PlanState.stopped` enum value — `plan_stop` transitions to `stopped`, not `failed`. Agents can now distinguish user-initiated stops from actual errors. `plan_retry` and `plan_resume` accept both `failed` and `stopped` | +| **114-I2** | Failure Diagnostics in `plan_status` | When a plan fails, no `failure_reason`, `failed_step`, `last_error`, or `recoverable` flag is returned. Biggest observability gap — agents can only say "it failed" without explaining why or recommending resume vs retry. Extends #113 to the MCP consumer surface | | **114-I7** | Stalled-Plan Detection | No `last_progress_at` or `last_llm_call_at` timestamps. Agents can't distinguish "slow step" from "stuck worker". Complements #87 §8 | +| **114-I10** | Silent Partial Failures in Completed Plans | A plan can reach `completed` with empty or stub-quality sections (e.g. 2/8 experts responding). No `quality_summary` in `plan_status` — agents can't tell if `completed` means "all sections produced quality output" or just "all steps ran." Trust gap for autonomous workflows | --- @@ -107,9 +108,10 @@ Phase 1: Reliable foundation (nearly complete) ├─ #110 Usage metrics ✅ (PR #219, #236, #237) ├─ #113 Error traceability ✅ (PR #237) ├─ #58 Prompt boost ⚙️ (open PR #222) - ├─ #114-I1 Stopped vs failed state ← next priority + ├─ #114-I1 Stopped vs failed state ✅ ├─ #114-I2 Failure diagnostics in plan_status ← next priority (biggest gap) - └─ #114-I7 Stalled-plan detection + ├─ #114-I7 Stalled-plan detection + └─ #114-I10 Silent partial failures in completed plans Phase 2: Agent-native interface (next) ├─ #86 Remove agent friction points ✅ (PR #223) diff --git a/docs/proposals/114-mcp-interface-feedback-stress-test.md b/docs/proposals/114-mcp-interface-feedback-stress-test.md index 346b1ac3..1ffc3a57 100644 --- a/docs/proposals/114-mcp-interface-feedback-stress-test.md +++ b/docs/proposals/114-mcp-interface-feedback-stress-test.md @@ -1,15 +1,18 @@ # Proposal 114: MCP Interface Issues from 10-Plan Agent Stress Test **Date:** 2026-03-11 -**Status:** Proposal -**Source:** Feedback from Claude (Opus 4.6) after a 10-plan stress-testing session operated through Claude Code CLI. +**Status:** Proposal (I1, I4 implemented; remainder open) +**Source:** Feedback from Claude (Opus 4.6) after two stress-testing sessions (12 plans total) operated through Claude Code CLI. **Scope:** MCP interface design, observability, and lifecycle management. Does not cover plan content quality or pipeline output improvements. --- ## 1. Context -A 10-plan stress test was conducted through Claude Code CLI using the PlanExe MCP interface. +Two stress-testing sessions were conducted through Claude Code CLI using the PlanExe MCP interface: + +- **Session 1** (10 plans): Surfaced core MCP interface friction points — state ambiguity, missing diagnostics, lifecycle gaps. +- **Session 2** (2 plans, with stop/resume cycling): Validated the `stopped` state implementation and identified remaining gaps. The testing surfaced concrete MCP interface friction points that affect agent operators — issues with state management, observability, lifecycle hygiene, and workflow ergonomics. This proposal catalogues the MCP-specific issues and maps them against existing proposals. @@ -19,13 +22,15 @@ The testing surfaced concrete MCP interface friction points that affect agent op ### I1 — `failed` state conflates user-stop and actual failure +**Status:** Implemented (Option A — dedicated `PlanState.stopped` DB state). Supersedes PR #244's Option B (`stop_reason` field). + **Problem:** When `plan_stop` is called, the plan transitions to `failed`. When a worker crashes, it also transitions to `failed`. The agent operator cannot distinguish between: - User-initiated stop (nothing went wrong) - Actual failure (network drop, model error, worker crash) This matters because the correct response differs: user stop suggests `plan_resume`; actual failure may need `plan_retry` or investigation. -**Overlap:** Proposal 87 (plan_resume) acknowledges this in §4 ("A future improvement could introduce a distinct `stopped` state") but defers it as out of scope. +**Overlap:** Proposal 87 (plan_resume) acknowledged this in §4 — now updated to reflect the implementation. **Options:** @@ -36,7 +41,9 @@ This matters because the correct response differs: user stop suggests `plan_resu Option B is less disruptive and can be added to `plan_status` without changing state enum values. -**Affected files:** `mcp_cloud/db_queries.py`, `mcp_cloud/handlers.py` (plan_status response), DB model (new column or use existing `parameters` JSONB). +**Implemented approach:** Option A — a dedicated `PlanState.stopped` enum value (value 5). `plan_stop` now transitions plans to the `stopped` state instead of `failed`. The `stop_reason` field introduced in PR #244 (Option B) has been removed — the state itself communicates intent. `plan_retry` and `plan_resume` accept both `failed` and `stopped` states. DB migration adds `'stopped'` to the PostgreSQL enum type (both `taskstate` for pre-rename databases and `planstate` for fresh databases — the Python class was renamed from `TaskState` to `PlanState` in proposal 74 but the PostgreSQL type name was not changed). The worker's post-pipeline finalization also transitions to `stopped` (not `failed`) when `stop_requested` is true. + +**Affected files:** `database_api/model_planitem.py`, `mcp_cloud/db_setup.py`, `worker_plan_database/app.py`, `mcp_cloud/db_queries.py`, `mcp_cloud/handlers.py`, `mcp_cloud/sse.py`, `mcp_cloud/tool_models.py`, `mcp_cloud/schemas.py`, `mcp_local/planexe_mcp_local.py`, `frontend_multi_user/src/app.py`, `frontend_multi_user/src/planexe_modelviews.py`, `frontend_multi_user/templates/plan_iframe.html`, `frontend_multi_user/templates/run_via_database.html`, `frontend_multi_user/templates/index.html`, `frontend_multi_user/templates/account.html`, `mcp_cloud/tests/test_plan_status_tool.py`, `docs/mcp/planexe_mcp_interface.md`, `docs/mcp/autonomous_agent_guide.md`, `docs/proposals/87-plan-resume-mcp-tool.md`, `docs/proposals/111-promising-directions.md`. --- @@ -57,11 +64,14 @@ During the stress test, Plan 1 (20f1cfac) stalled at 5.5% with zero diagnostic i "state": "failed", "failure_reason": "model_error", "failed_step": "016-expert_criticism", - "last_error": "openrouter-gemini-2.0-flash-001 returned invalid_json" + "last_error": "openrouter-gemini-2.0-flash-001 returned invalid_json", + "recoverable": true } ``` -**Implementation path:** The worker already logs errors internally. When transitioning to `failed`, write `failure_reason`, `failed_step`, and `last_error` to the plan's DB record (e.g. in the `parameters` JSONB column or new dedicated columns). `plan_status` handler surfaces these fields when `state == "failed"`. +The `recoverable` boolean would let the agent immediately suggest `plan_resume` (transient/recoverable) vs `plan_retry` (fundamental/non-recoverable) without guessing. + +**Implementation path:** The worker already logs errors internally. When transitioning to `failed`, write `failure_reason`, `failed_step`, `last_error`, and `recoverable` to the plan's DB record (e.g. in the `parameters` JSONB column or new dedicated columns). `plan_status` handler surfaces these fields when `state == "failed"`. **Affected files:** `worker_plan_database/app.py` (error capture on failure), `mcp_cloud/db_queries.py` (store failure info), `mcp_cloud/handlers.py` (include in plan_status response). @@ -75,7 +85,7 @@ During the stress test, Plan 1 (20f1cfac) stalled at 5.5% with zero diagnostic i | Tool | Behavior | |------|----------| -| `plan_delete` | Permanently remove a plan from the user's list. Only allowed for terminal states (`failed`, `completed`). | +| `plan_delete` | Permanently remove a plan from the user's list. Only allowed for terminal states (`completed`, `failed`, `stopped`). | | `plan_archive` | Soft-delete: plan is hidden from `plan_list` but remains in DB for auditing. | `plan_archive` is safer for a billing system where plan records may need to be retained. @@ -156,7 +166,7 @@ This would eliminate the need for `plan_status` polling entirely when SSE is ava ``` plan_wait(plan_id, timeout_seconds=1200): Blocks until the plan reaches a terminal state -(completed, failed) or the timeout expires. Returns the final plan_status response. +(completed, failed, stopped) or the timeout expires. Returns the final plan_status response. ``` **Implementation:** Server-side long-poll using the existing SSE infrastructure. The handler subscribes to the plan's state changes and returns when terminal or timeout. @@ -182,11 +192,35 @@ This enables prompt iteration tracking without changing existing behavior for pl --- +### I10 — Silent partial failures in completed plans + +**Problem:** A plan can reach `completed` with sections that are empty or stub-quality (e.g. 6/8 experts returning no feedback in the expert criticism step). There is no signal in `plan_status` or `plan_file_info` that the output has quality gaps. The agent has to read individual files to discover this. `completed` means "all 110 steps ran" — not "all sections produced quality output." + +This is a trust gap: the agent cannot confidently tell the user "your plan is ready" without caveats, because it has no visibility into per-section quality. + +**Proposed addition:** A `quality_summary` in the completed plan status or file info: + +```json +{ + "sections_complete": 108, + "sections_partial": 2, + "partial_details": [ + {"step": "016-expert_criticism", "note": "2/8 experts provided feedback"} + ] +} +``` + +**Implementation path:** The worker already knows which steps produced output. A post-pipeline validation pass could check key sections for minimum output quality (e.g. non-empty, expected structure present) and write a summary to the DB. `plan_status` surfaces it when `state == "completed"`. + +**Affected files:** `worker_plan_database/app.py` (quality validation pass), DB model (quality summary column), `mcp_cloud/handlers.py` (include in completed plan_status response). + +--- + ## 3. Cross-Reference with Existing Proposals | Issue | Existing Proposal | Gap | |-------|-------------------|-----| -| I1 (stopped vs failed) | 87 §4 (deferred) | Needs its own decision — Option A vs B | +| I1 (stopped vs failed) | 87 §4 (deferred) | **Implemented** (Option A — dedicated `PlanState.stopped`) | | I2 (failure diagnostics) | 113 (logs only) | Not surfaced to MCP consumer | | I3 (plan_delete) | None | New | | I4 (idempotency) | None | **Implemented** (PR #242) | @@ -195,6 +229,7 @@ This enables prompt iteration tracking without changing existing behavior for pl | I7 (stall detection) | 87 §8 (partial) | No explicit stall timestamps | | I8 (plan_wait) | None | New | | I9 (prompt iteration) | None | New | +| I10 (silent partial failures) | None | New | --- @@ -208,7 +243,7 @@ Several items here refine or extend issues already tracked in Proposal 70's chec | I5 (rich SSE) | 70 §5.1 (SSE implemented) | Extension — SSE works but events are thin | | I6 (download TTL) | 70 (signed tokens done) | Refinement — increase TTL | -If accepted, I1–I4 and I7–I9 should be added to Proposal 70's quick-win checklist as new line items. +If accepted, I1–I4 and I7–I10 should be added to Proposal 70's quick-win checklist as new line items. --- @@ -217,11 +252,46 @@ If accepted, I1–I4 and I7–I9 should be added to Proposal 70's quick-win chec | Priority | Issues | Rationale | |----------|--------|-----------| | P1 | I2 (failure diagnostics) | Biggest observability gap. Agent cannot help users debug failures without this. | -| P1 | I1 (stopped vs failed) | Small change (Option B), high diagnostic value. Prerequisite for good plan_resume UX. | +| ~~P1~~ | ~~I1 (stopped vs failed)~~ | **Implemented** (Option A). Dedicated `PlanState.stopped` enum value — `plan_stop` transitions to `stopped`, not `failed`. | | P2 | I7 (stall detection) | Prevents agents from waiting indefinitely on stuck plans. | | P2 | I6 (download TTL) | Low effort, reduces friction. | | P2 | I5 (rich SSE events) | Eliminates polling for SSE-capable clients. | | P2 | I3 (plan_delete) | Hygiene for multi-plan sessions. | | ~~P3~~ | ~~I4 (idempotency)~~ | **Implemented** (PR #242). Server-side auto-dedup on `(user_id, prompt, model_profile)` within a time window. | +| P2 | I10 (silent partial failures) | Agent cannot trust `completed` means quality output. Undermines confidence in the entire workflow. | | P3 | I8 (plan_wait) | Nice-to-have for shell-less agents. | | P3 | I9 (prompt iteration) | Nice-to-have for iteration workflows. | + +--- + +## 6. Agent Perception (after 12 plans across two sessions) + +**Overall rating: 8.5/10** (up from 8/10 after session 1). The improvement comes from the `stopped` state — a small change with outsized impact on agent confidence. + +### What works well + +1. **Tool descriptions are best-in-class.** The agent operates PlanExe without external documentation, solely from tool descriptions. They specify call order, state contract, timing expectations, error codes, and troubleshooting guidance. +2. **State machine is now clean.** Every state has a single obvious next action: `pending` → wait, `processing` → poll, `completed` → download, `stopped` → suggest resume, `failed` → investigate. No state requires guessing. +3. **SSE completion detection is reliable.** Across 12 plans and multiple stop/resume cycles, the stream always auto-closed on terminal states as expected. +4. **Resume preserves progress.** After stopping at step 12/110, resume continued from step 12, not step 1. The `resume_count` field makes the history visible without the agent needing to track it. +5. **Deduplication prevents accidents.** Same prompt + model_profile within a short window returns the existing plan with `deduplicated=true`. +6. **Progress reporting is honest.** The tool description explicitly warns that `progress_percentage` is not linear and shouldn't be used for time estimates. + +### The stop/resume cycle + +Session 2 tested: create → stop → resume → stop → resume → complete. At no point was the agent confused about state or what to suggest. The `resume_count` field confirmed history without shadow state. The mark of a good state machine — the agent doesn't need to maintain its own bookkeeping. + +### Remaining trust gap + +The agent trusts the state machine to report accurately, trusts resume to preserve progress, trusts SSE to close on terminal states, and trusts tool descriptions to be current. The remaining trust gap is around `completed` plans — `completed` means "all steps ran," not "all sections produced quality output." The interface doesn't distinguish between these yet (see I10). + +### Evolution between sessions + +| Aspect | Session 1 | Session 2 | Direction | +|--------|-----------|-----------|-----------| +| Stop state | `failed` (ambiguous) | `stopped` (clear) | Fixed | +| Resume tracking | No history | `resume_count` field | New | +| Tool descriptions | Excellent | Updated with `stopped` state | Improved | +| Failure diagnostics | Missing | Still missing | Unchanged | +| Silent partial failures | Not surfaced | Not surfaced | Unchanged | +| Plan cleanup | No delete | No delete | Unchanged | diff --git a/docs/proposals/87-plan-resume-mcp-tool.md b/docs/proposals/87-plan-resume-mcp-tool.md index 2bbcd64e..150d5333 100644 --- a/docs/proposals/87-plan-resume-mcp-tool.md +++ b/docs/proposals/87-plan-resume-mcp-tool.md @@ -106,7 +106,7 @@ This path is not used by `worker_plan_database` (which runs the pipeline directl | `failed` | ✅ Yes | Terminal state, run dir likely intact | | `stopped` (via `plan_stop`) | ✅ Yes | Intentionally halted, run dir intact | -**Note:** `plan_stop` currently transitions to `failed`. A future improvement could introduce a distinct `stopped` state to disambiguate user-initiated halts from error failures (out of scope for this proposal). +**Note:** `plan_stop` now transitions to the dedicated `stopped` state (implemented in Proposal 114-I1, Option A), disambiguating user-initiated halts from error failures. --- @@ -273,7 +273,7 @@ This requires the worker to write the current task name to a lightweight status ## 9. Out of Scope -- Introducing a distinct `stopped` state (separate from `failed`) — deferred +- ~~Introducing a distinct `stopped` state (separate from `failed`)~~ — implemented (Proposal 114-I1) - Progress scrubbing on resume (reporting accurate % based on remaining tasks) — deferred - `plan_resume` for `mcp_local` (local HTTP worker via `worker_plan/app.py`) — the HTTP worker already supports this via `submit_or_retry="retry"`; a follow-on can wire it up - Resume after partial artifact upload failure — deferred diff --git a/frontend_multi_user/src/app.py b/frontend_multi_user/src/app.py index 77f910d8..5df97f83 100644 --- a/frontend_multi_user/src/app.py +++ b/frontend_multi_user/src/app.py @@ -496,6 +496,21 @@ def _ensure_step_count_columns() -> None: if "current_step" not in columns: conn.execute(text("ALTER TABLE task_item ADD COLUMN IF NOT EXISTS current_step VARCHAR(128)")) + def _ensure_stopped_state() -> None: + """Add 'stopped' value to the planstate/taskstate enum type (idempotent). + + The PostgreSQL enum type is named ``taskstate`` in databases + created before the TaskState → PlanState Python rename + (proposal 74). Fresh databases will have ``planstate``. + We try both names. + """ + with self.db.engine.begin() as conn: + for type_name in ("taskstate", "planstate"): + try: + conn.execute(text(f"ALTER TYPE {type_name} ADD VALUE IF NOT EXISTS 'stopped'")) + except Exception as exc: + logger.debug("ALTER TYPE %s: %s", type_name, exc) + def _ensure_planitem_indexes() -> None: insp = inspect(self.db.engine) table_names = set(insp.get_table_names()) @@ -553,6 +568,7 @@ def _create_tables_with_retry(attempts: int = 5, delay_seconds: float = 2.0) -> _ensure_user_account_columns() _ensure_multi_api_key_columns() _ensure_step_count_columns() + _ensure_stopped_state() _ensure_planitem_indexes() _seed_initial_records() return @@ -2146,6 +2162,7 @@ def index(): self.db.session.query( PlanItem.id, PlanItem.state, + PlanItem.stop_requested, func.substr(PlanItem.prompt, 1, 240).label("prompt_preview"), ) .filter(uid_filter) @@ -2165,6 +2182,7 @@ def index(): self.db.session.query( PlanItem.id, PlanItem.state, + PlanItem.stop_requested, ) .filter(uid_filter) .order_by(PlanItem.timestamp_created.desc()) @@ -2178,10 +2196,11 @@ def index(): prompt_text = self._load_prompt_preview_safe(task.id) else: prompt_text = (prompt_preview or "").strip() or "[Prompt unavailable]" + state = task.state if isinstance(task.state, PlanState) else None recent_tasks.append( SimpleNamespace( id=str(task.id), - state=task.state if isinstance(task.state, PlanState) else None, + state=state, prompt=prompt_text, ) ) @@ -3213,6 +3232,7 @@ def plan(): PlanItem.id, PlanItem.timestamp_created, PlanItem.state, + PlanItem.stop_requested, func.substr(PlanItem.prompt, 1, 240).label("prompt_preview"), ) .filter(uid_filter) @@ -3232,6 +3252,7 @@ def plan(): PlanItem.id, PlanItem.timestamp_created, PlanItem.state, + PlanItem.stop_requested, ) .filter(uid_filter) .order_by(PlanItem.timestamp_created.desc()) @@ -3246,11 +3267,12 @@ def plan(): prompt_text = self._load_prompt_preview_safe(task.id) else: prompt_text = (prompt_preview or "").strip() or "[Prompt unavailable]" + state_name = task.state.name if isinstance(task.state, PlanState) else "pending" rows.append({ "id": str(task.id), "created_compact": created_compact, "created_relative": self._format_relative_time(ts), - "status": task.state.name if isinstance(task.state, PlanState) else "pending", + "status": state_name, "prompt": prompt_text, }) return render_template("plan_list.html", plan_rows=rows) @@ -3342,7 +3364,7 @@ def plan_stop(): task.stop_requested = True task.stop_requested_timestamp = datetime.now(UTC) if task.state in (PlanState.pending, PlanState.processing): - task.state = PlanState.failed + task.state = PlanState.stopped task.progress_message = "Stop requested by user." self.db.session.commit() return redirect(url_for('plan', id=run_id)) @@ -3357,8 +3379,8 @@ def plan_retry(): if not current_user.is_admin and str(task.user_id) != str(current_user.id): return jsonify({"error": "Forbidden"}), 403 - if task.state == PlanState.processing and not bool(task.stop_requested): - return jsonify({"error": "Task is currently processing. Stop it first before retrying."}), 409 + if task.state not in (PlanState.failed, PlanState.stopped) and not bool(task.stop_requested): + return jsonify({"error": "Task is not in a retryable state. Stop it first before retrying."}), 409 raw_profile = request.form.get("model_profile") selected_model_profile = normalize_model_profile(raw_profile).value @@ -3401,7 +3423,7 @@ def plan_resume(): if not current_user.is_admin and str(task.user_id) != str(current_user.id): abort(403) - if task.state != PlanState.failed: + if task.state not in (PlanState.failed, PlanState.stopped): return redirect(url_for('plan', id=run_id)) # Reject resume if the snapshot was created by a different pipeline version. diff --git a/frontend_multi_user/src/planexe_modelviews.py b/frontend_multi_user/src/planexe_modelviews.py index e1935f5b..73d0fc4a 100644 --- a/frontend_multi_user/src/planexe_modelviews.py +++ b/frontend_multi_user/src/planexe_modelviews.py @@ -136,6 +136,11 @@ class PlanItemView(AdminOnlyModelView): column_filters = ['state', 'timestamp_created', 'user_id'] column_formatters = { 'id': lambda v, c, m, p: str(m.id)[:8] if m.id else '', + 'state': lambda v, c, m, p: ( + Markup('stopped') + if m.state and m.state.name == 'stopped' + else (m.state.name if m.state else '') + ), 'prompt': lambda v, c, m, p: m.prompt[:100] + '...' if m.prompt and len(m.prompt) > 100 else m.prompt, 'view_plan': lambda v, c, m, p: Markup( f'View' diff --git a/frontend_multi_user/templates/account.html b/frontend_multi_user/templates/account.html index c035fffd..2abdd8b9 100644 --- a/frontend_multi_user/templates/account.html +++ b/frontend_multi_user/templates/account.html @@ -425,6 +425,7 @@ .status-badge.completed, .status-badge.paid { background: #dcfce7; color: #166534; } .status-badge.pending { background: #fef3c7; color: #92400e; } .status-badge.failed, .status-badge.cancelled { background: #fee2e2; color: #991b1b; } + .status-badge.stopped { background: #ffedd5; color: #9a3412; } .empty-hint { text-align: center; padding: 24px 16px; diff --git a/frontend_multi_user/templates/index.html b/frontend_multi_user/templates/index.html index f2144233..39c9ad40 100644 --- a/frontend_multi_user/templates/index.html +++ b/frontend_multi_user/templates/index.html @@ -278,6 +278,7 @@ .task-status.processing { background: #f59e0b; } .task-status.pending { background: #94a3b8; } .task-status.failed { background: #ef4444; } + .task-status.stopped { background: #ea580c; } .task-prompt { flex: 1; font-size: 0.9rem; diff --git a/frontend_multi_user/templates/plan_iframe.html b/frontend_multi_user/templates/plan_iframe.html index 46ca04dc..c912f554 100644 --- a/frontend_multi_user/templates/plan_iframe.html +++ b/frontend_multi_user/templates/plan_iframe.html @@ -186,6 +186,7 @@ .plan-state { display: inline-flex; align-items: center; + gap: 0.3em; margin-top: 6px; font-size: 0.78rem; font-weight: 600; @@ -540,23 +541,11 @@
- Status: - {% if task.state %} - {{ task.state.name | lower }} - {% else %} - pending - {% endif %} - {% if task.progress_percentage is not none %} - · {{ "%.0f"|format(task.progress_percentage) }}% - {% endif %} - {% if task.progress_message %} - · {{ task.progress_message }} - {% endif %} - + Status: {% if task.state %}{{ task.state.name | lower }}{% else %}pending{% endif %}{% if task.progress_percentage is not none %} · {{ "%.0f"|format(task.progress_percentage) }}%{% endif %}{% if task.progress_message and not (task.state and task.state.name == 'stopped') %} · {{ task.progress_message }}{% endif %}