feat(ce-work-beta): add beta Codex delegation mode #476
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4304970c34
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Reviewed commit: d3f7069e77
@codex review
💡 Codex Review
Reviewed commit: d3f7069e77
7c9dbb8 to 21dabf4
Adds optional `delegate:codex` mode to ce:work that delegates code implementation to the Codex CLI (`codex exec`) using concrete bash templates. Replaces ce-work-beta's prose-based delegation which caused non-deterministic CLI invocations.
Key additions:
- Argument parsing with `delegate:codex`/`delegate:local` tokens and resolution chain (argument > local.md > default off)
- Pre-delegation gates: environment guard, availability check, one-time consent flow with sandbox mode selection (yolo/full-auto)
- XML-tagged prompt template following gpt-5-4-prompting best practices
- Multi-signal result classification (CLI fail/task fail/partial/verify fail/success) with rollback-to-HEAD safety
- Circuit breaker: 3 consecutive failures -> standard mode fallback
- Serial execution enforced, swarm mode mutual exclusion
- Frontend Design Guidance ported from ce-work-beta
- ce-work-beta delegation section marked superseded
- `Execution target: external-delegate` removed from ce:plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
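The resolution chain in this commit (argument > local.md > default off) can be sketched as below. This is a minimal illustration, not the skill's actual logic: the function names are hypothetical, and "last token wins" for repeated tokens is an assumption.

```python
from typing import Optional


def parse_delegate_token(args: str) -> Optional[bool]:
    """Return True for delegate:codex, False for delegate:local, None if absent."""
    result: Optional[bool] = None
    for token in args.split():
        if token == "delegate:codex":
            result = True
        elif token == "delegate:local":
            result = False
    return result  # assumption: the last matching token wins


def resolve_delegation(args: str, config_value: Optional[bool]) -> bool:
    """Apply the precedence: argument, then config file, then default off."""
    from_args = parse_delegate_token(args)
    if from_args is not None:
        return from_args
    if config_value is not None:
        return config_value
    return False  # hard default: delegation is off
```

For example, `resolve_delegation("delegate:local", True)` yields `False`, because the argument overrides the config value.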
1. Remove stale "external delegation" per-unit posture from ce:plan Execution note examples; ce:work reads delegation from the global resolution chain, not unit metadata.
2. Fix delegation fallback to re-enter standard strategy selection. Pre-delegation checks now run inside the routing gate before strategy choice, so disabling delegation falls through to the normal inline/serial/parallel table instead of silently defaulting to inline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore stable ce:work as the non-delegating execution path during the Codex beta rollout and move the active delegation contract back to ce:work-beta. Also add a promotion checklist doc covering the workflow and contract changes required when ce:work-beta is later promoted to stable.
…acked files
Preflight now uses `git diff --quiet HEAD` instead of `git status --short` so untracked workspace dirs and .context/ scratch don't falsely block delegation. Rollback uses path-scoped `git clean -fd -- <unit files>` instead of bare `git clean -fd` which would nuke all untracked files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
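As a rough sketch of the two commands named in this commit, the helpers below only build argument lists and do not run git. The function names are hypothetical, and refusing to build a clean command for an empty file list is an assumption about how an unscoped clean would be avoided.

```python
from typing import List, Optional


def preflight_command() -> List[str]:
    # Exits non-zero only when tracked files differ from HEAD,
    # so untracked directories and scratch files do not block delegation.
    return ["git", "diff", "--quiet", "HEAD"]


def scoped_clean_command(unit_files: List[str]) -> Optional[List[str]]:
    # Path-scoped clean: only untracked files at the unit's own paths are removed.
    if not unit_files:
        return None  # assumption: never degrade to a bare `git clean -fd`
    return ["git", "clean", "-fd", "--", *unit_files]
```

Building the command as a list keeps the `--` separator between options and paths explicit, which is what limits the clean to the listed files.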
…antics
The old `test -n || test -n` returned exit code 1 on the happy path (neither var set), which a literal agent could misread as a failed pre-check and disable delegation in eligible environments. Rewrote as an explicit if/else so pass/fail lives in the variable value, not the exit code. Also refined the AGENTS.md shell-chaining rule to distinguish action chaining (bad) from boolean conditions in if/while guards (fine).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves ~270 lines of delegation workflow (pre-checks, prompt template, execution loop, result classification) to references/codex-delegation-workflow.md and ~25 lines of swarm mode to references/swarm-mode.md. SKILL.md body drops from ~776 to ~514 lines, a 34% reduction in per-tool-call context cost for non-delegation runs.
New in the delegation reference:
- Batched execution model (all units in one batch, split at ~5)
- Codex owns VERIFY (test-fix loop inside delegation)
- Platform gate (Claude Code only)
- Run-ID namespaced scratch files for concurrent safety
- work_delegation_decision setting (auto/ask) with user-facing prompts
- Between-batch checkpoints (flow through by default)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
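The batched execution model can be illustrated with a small sketch. It assumes each unit carries a phase label and that larger plans split wherever the phase changes; the `(phase, unit_id)` tuple shape and the function name are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Unit = Tuple[str, str]  # (phase, unit_id); a hypothetical shape for illustration


def plan_batches(units: List[Unit], single_batch_limit: int = 5) -> List[List[str]]:
    """Small plans go out as one batch; larger plans split at phase boundaries."""
    if not units:
        return []
    if len(units) <= single_batch_limit:
        return [[unit_id for _, unit_id in units]]
    # groupby only merges consecutive units, so plan order is preserved.
    return [
        [unit_id for _, unit_id in group]
        for _, group in groupby(units, key=lambda unit: unit[0])
    ]
```

With six units spread over two phases this produces two batches, so the number of Codex calls follows the number of batches, not the number of units.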
Provides a discoverable template with all available settings (work_delegate, work_codex_consent, work_codex_sandbox, work_delegation_decision) so users can copy to .claude/compound-engineering.local.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… prompt
Adds a <testing> section to the delegation prompt template that carries the Test Scenario Completeness guidance to Codex (cover happy path, edge cases, error paths, integration). Closes the test quality gap observed in evals (Codex produced 57-85% as many tests without this guidance). Also updates <verify> to require running ALL test files in a single command rather than per-file; this catches cross-file contamination like mocked globals leaking between test files in the same bun process.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compounded learning from 6 iterations of delegation evals covering token crossover points, batching strategy, prompt engineering, skill body size as multiplicative cost driver, and user choice considerations. Key findings: delegation breaks even at ~5-7 units and becomes cheaper at 10+. Skill body size dominates cost (multiplicative across all tool calls). Extract conditional content >50 lines to reference files. Also fixes verify section line break in contract test assertion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents how plan structure directly enables delegation decisions — file lists enable batching rules, test scenarios feed Codex prompts, verification commands enable Codex's self-check loop. Delegation works with unstructured plans but makes conservative choices without signals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… net
Expands the "delegate verify to Codex" pattern with the reasoning: trust the delegate's self-report, protect against systematic failure with the circuit breaker (3 consecutive failures -> standard mode), and verify the whole at Phase 3 before shipping. Three layered catches replace the redundant per-batch orchestrator verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
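The circuit breaker described here (3 consecutive failures -> standard mode) amounts to a counter that resets on success. A minimal sketch, with a hypothetical class name:

```python
class CircuitBreaker:
    """Trips after `threshold` consecutive failures; any success resets the count."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # Only an unbroken run of failures counts toward the threshold.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def tripped(self) -> bool:
        return self.consecutive_failures >= self.threshold
```

Once `tripped` is true, the orchestrator would stop delegating and fall back to standard mode for the remaining work.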
The doc covers economics, architecture, prompt engineering, plan quality, safety model, and user choice — not just economics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds actual token counts from all iterations (not just percentages), wall clock time comparison, test coverage cost, and the iteration evolution table showing the body-size regression and recovery. The economics section now tells the complete story with raw numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Renames for consistency across all delegation settings:
- work_codex_consent -> work_delegate_consent
- work_codex_sandbox -> work_delegate_sandbox
- work_delegation_decision -> work_delegate_decision
All four settings now share the work_delegate_* prefix: work_delegate, work_delegate_consent, work_delegate_sandbox, work_delegate_decision.
Updated across: SKILL.md, delegation workflow reference, example local.md, best practices doc, requirements, plan, and contract tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unrecognized setting values (e.g., work_delegate: gemini) now fall through to hard defaults instead of producing undefined behavior. Each setting documents its recognized values inline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
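One way to picture the fall-through behaviour is a table of recognized values with a default per setting. The sketch below is illustrative only: the recognized values for `work_delegate` (`codex`, `false`) are an assumption, and the effort values and `high` default are taken from a later commit in this PR.

```python
from typing import Any, Dict, Set, Tuple

# setting name -> (recognized values, hard default)
RECOGNIZED: Dict[str, Tuple[Set[str], str]] = {
    "work_delegate": ({"codex", "false"}, "false"),  # assumed value set
    "work_delegate_effort": ({"minimal", "low", "medium", "high", "xhigh"}, "high"),
}


def resolve_settings(raw: Dict[str, Any]) -> Dict[str, str]:
    """Keep recognized values; anything else falls through to the hard default."""
    resolved: Dict[str, str] = {}
    for key, (allowed, default) in RECOGNIZED.items():
        value = str(raw.get(key, default)).strip().lower()
        resolved[key] = value if value in allowed else default
    return resolved
```

With this shape, `work_delegate: gemini` resolves to `false` instead of leaving the behaviour undefined.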
Shorter, more natural. Users think in terms of "codex mode" not "delegate to codex." The mode: prefix is generic enough for future delegates. Deactivation becomes mode:local. Fuzzy activation phrases unchanged (use codex, codex mode, etc.). Updated across: SKILL.md, contract tests, requirements, and plan docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This reverts commit 4c5da21.
Adds work_delegate_model (default: gpt-5.4) and work_delegate_effort (default: high) to local.md settings. Both are passed explicitly to codex exec via -m and -c 'model_reasoning_effort="..."' flags. Model is passthrough (any valid codex model name). Effort is validated against 5 values: minimal, low, medium, high, xhigh. Invalid values fall through to defaults. Also fixes --yolo to --dangerously-bypass-approvals-and-sandbox (the documented flag name) and adds quoting guidance for the -c flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
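To make the flag wiring concrete, here is a hedged sketch of assembling the `codex exec` arguments from already-resolved settings. The function name is hypothetical, mapping the `full-auto` sandbox choice to a `--full-auto` flag is an assumption, and how the prompt is supplied is left out because the commit message does not say.

```python
from typing import List


def build_codex_argv(model: str = "gpt-5.4", effort: str = "high",
                     sandbox: str = "full-auto") -> List[str]:
    """Assemble the argument list; the prompt itself is intentionally omitted."""
    argv = [
        "codex", "exec",
        "-m", model,
        # One argv element, so the inner double quotes need no shell escaping.
        "-c", f'model_reasoning_effort="{effort}"',
    ]
    if sandbox == "yolo":
        argv.append("--dangerously-bypass-approvals-and-sandbox")
    else:
        argv.append("--full-auto")  # assumed mapping for the full-auto choice
    return argv
```

Passing the `-c` value as a single list element sidesteps the quoting problem that arises when the same override is written inside a shell string.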
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex exec
Launches codex exec with run_in_background (no timeout ceiling) then polls every 10 seconds in a foreground bash loop to keep the agent's turn active. User sees "Waiting for Codex..." during execution and cannot interfere with the working tree. Fixes the 10-minute Bash timeout ceiling that would kill long-running batches where Codex is iterating on test fixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In testing, agents merged the background launch and polling loop into one Bash call, hitting the 2-minute timeout. The skill now explicitly labels Step A (launch, run_in_background: true on the tool parameter) and Step B (poll, separate foreground calls) and warns that shell & is not equivalent to the tool parameter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
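The polling half (Step B) can be sketched as a bounded loop that is separate from the launch. This is an illustration under assumptions: the function name, the injected `is_done` check, and the default round limit are all hypothetical; only the 10-second interval comes from the commits above.

```python
import time
from typing import Callable


def poll_until_done(is_done: Callable[[], bool], interval_s: float = 10.0,
                    max_rounds: int = 180,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Check for completion, sleeping between checks; give up after max_rounds."""
    for _ in range(max_rounds):
        if is_done():
            return True
        sleep(interval_s)
    return is_done()  # one final check after the last sleep
```

Keeping the loop bounded means a crashed or hung background process ends in a definite "not done" answer, not an open-ended wait.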
…tion
Adds verification_summary to the result schema and output contract so Codex reports what tests it ran. Adds a result handoff display step so the orchestrator shows the user a summary, files changed, verification outcome, and issues before committing or rolling back, addressing feedback that delegation completion was opaque.
…ocal.yaml
Migrates delegation settings from .claude/compound-engineering.local.md (YAML frontmatter in markdown) to .compound-engineering/config.local.yaml (plain YAML). The new path is platform-agnostic and aligns with the planned config storage redesign.
Use explicit CODEX_AVAILABLE/CODEX_NOT_FOUND sentinel output instead of relying on command -v's path-or-empty result, which agents misread. Include install hints (npm, brew) in the fallback message.
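The sentinel idea is that the check always prints one of two fixed strings, so nothing depends on interpreting an empty result. A minimal sketch, using the sentinel names from the commit and a hypothetical function name:

```python
import shutil


def codex_availability(binary: str = "codex") -> str:
    """Return an explicit sentinel instead of a path-or-empty result."""
    return "CODEX_AVAILABLE" if shutil.which(binary) else "CODEX_NOT_FOUND"


if __name__ == "__main__":
    status = codex_availability()
    print(status)
    if status == "CODEX_NOT_FOUND":
        # Install hints as given in the PR description.
        print("Install with: npm install -g @openai/codex  or  brew install codex")
```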
955f5f8 to 3018848
💡 Codex Review
Reviewed commit: 3018848ada
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
Outdated
- Bind sandbox_mode resolved state into $SANDBOX_MODE before launch snippet
- Add --output-schema compatibility warning to work_delegate_model description
- Add create-directory/create-file/merge instructions to decline branch
…m reading
Use backtick pre-resolution to inline config.local.yaml at skill load time in Claude Code (zero tool calls). Include explicit fallback for other platforms to use their native file-read tool. Agents previously had vague "open and parse" instructions that led to inconsistent reads.
💡 Codex Review
Reviewed commit: a042b23b27
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
- Update example config model comment to match --output-schema constraint
- Add polling failure termination for background process crash or timeout
Tested gpt-5.3-codex with --output-schema and it works fine — the limitation from Codex CLI bug #4181 is no longer accurate. Reverts model setting to passthrough for any valid model name.
💡 Codex Review
Reviewed commit: 6b63691938
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
The example file is a stopgap that will be superseded by the config storage redesign (/ce-setup creates the file interactively). Removing to avoid maintaining a file that drifts from the skill's own docs.
Swarm mode is no longer a direction we want to pursue. Removes the section from SKILL.md, deletes the swarm-mode reference file, and drops the contract test. ce:work stable still has its copy — that will be removed separately.
Removes swarm mode from ce:work, deletes the orchestrating-swarms and slfg skills, and updates README tables and contract tests. Swarm mode added coordination overhead without delivering proportional value.
💡 Codex Review
Reviewed commit: 46e36e7d58
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
Outdated
💡 Codex Review
Reviewed commit: 90c81b0321
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
- Add missing work_delegate key to config keys list
- Replace chained availability check with if/else per AGENTS.md
- Fix polling timeout math (30 rounds × 60s = 30min, not 5min)
Move the availability probe from a runtime if/else to a ! backtick pre-resolution that runs at skill load time. Breaks the pre-resolution exception out into its own AGENTS.md checklist item with an example, since it was buried at the end of a long bullet and easy to miss.
💡 Codex Review
Reviewed commit: df61f93809
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
Outdated
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
Outdated
- Handle zero-exit missing-result case in polling termination
- Add explicit state transition for ask-mode delegation opt-out
💡 Codex Review
Reviewed commit: 99f32219cf
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
Outdated
- Warn about untracked file collisions at planned paths before delegation
💡 Codex Review
Reviewed commit: 6971fab189
plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
Outdated
- Set delegation_active false when auto mode routes trivial units locally
Config pre-resolution now uses git rev-parse --show-toplevel instead of CWD-relative paths, with a fallback to the main repo root via git-common-dir for worktrees where gitignored files don't exist. Added "Reading Config Files from Skills" section to AGENTS.md documenting both gotchas (CWD resolution, worktree gitignore).
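The lookup order described in this commit can be sketched as a pure function over the two git outputs. This is an illustration under assumptions: the function name is hypothetical, both inputs are assumed to be absolute POSIX paths, and the main repo root is assumed to be the parent of the common git directory (true for a standard non-bare checkout).

```python
from pathlib import PurePosixPath
from typing import List

CONFIG_RELATIVE = ".compound-engineering/config.local.yaml"


def candidate_config_paths(toplevel: str, common_dir: str) -> List[str]:
    """Order in which to look for the config file.

    toplevel:   output of `git rev-parse --show-toplevel`
    common_dir: output of `git rev-parse --git-common-dir`, assumed absolute
    """
    primary = str(PurePosixPath(toplevel) / CONFIG_RELATIVE)
    # In a linked worktree the gitignored config may exist only in the main checkout.
    fallback = str(PurePosixPath(common_dir).parent / CONFIG_RELATIVE)
    return [primary] if fallback == primary else [primary, fallback]
```

In the main checkout both candidates coincide, so only one path is returned; in a linked worktree the second entry points back at the main repo root.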
Summary
Adds Codex delegation as a beta-only execution path in `ce:work-beta`. When activated, implementation units from a plan are batched and sent to the Codex CLI (`codex exec`) instead of being implemented inline by Claude. The orchestrating Claude agent retains control of planning, review, git operations, and result classification.
Also removes swarm mode (`orchestrating-swarms` skill, `slfg` skill, and swarm sections from `ce:work`); the coordination overhead wasn't delivering proportional value.
Credit to @mvanhorn for the original push on delegation and @huntharo for permissioning thought partnership.
Delegation Architecture
The skill is structured as an orchestrator/executor split. Claude handles everything except writing code; Codex handles code implementation within a structured prompt.
flowchart TD
    A[ce:work-beta invoked] --> B{delegation_active?}
    B -- no --> C[Standard execution]
    B -- yes --> D[Pre-checks: platform, env, CLI, consent]
    D -- any fail --> C
    D -- all pass --> E[Batch units]
    E --> F[Write prompt + result schema]
    F --> G[codex exec in background]
    G --> H[Poll for result]
    H --> I{Result classification}
    I -- success --> J[Display handoff summary]
    J --> K[Commit batch]
    I -- partial --> L[Display summary, finish locally]
    I -- failure --> M[Display reason, rollback]
    M --> N{Circuit breaker: 3 failures?}
    N -- yes --> C
    N -- no --> F
    K --> O{More batches?}
    O -- yes --> F
    O -- no --> P[Cleanup, Phase 3]
Batched execution: All units in one batch for plans <=5 units; split at phase boundaries for larger plans. Reduces orchestration from O(N) codex calls to O(batches).
Codex owns verification: Codex runs tests and fixes failures within the delegation. The orchestrator classifies the structured result but does not re-verify independently. Safety net: circuit breaker (3 consecutive failures triggers standard mode fallback).
Result handoff: After each batch, the orchestrator displays a summary block showing what Codex did, files changed, verification outcome, and issues — so delegation isn't a black box.
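The classification step that feeds this handoff can be pictured as a short decision chain over the available signals. The sketch below is only a guess at its shape: the five category names follow the commit history, but the `status` and `verification_passed` fields, and treating a zero-exit run with no result file as a CLI failure, are assumptions.

```python
from typing import Any, Dict, Optional


def classify(exit_code: int, result: Optional[Dict[str, Any]]) -> str:
    """Map the CLI exit code and the parsed result JSON to one outcome label."""
    if exit_code != 0 or result is None:
        return "cli_fail"  # includes the zero-exit, missing-result case (assumed)
    status = result.get("status")  # hypothetical field name
    if status == "partial":
        return "partial"
    if status != "completed":
        return "task_fail"
    if not result.get("verification_passed", False):  # hypothetical field name
        return "verify_fail"
    return "success"
```

Checking the process-level signal first means a missing or unreadable result never gets mistaken for a successful batch.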
Prompt template: Includes `<testing>` (test scenario completeness guidance), `<verify>` (run all tests in one command to catch cross-file contamination), `<constraints>` (no git commits, scoped changes), and `<output_contract>` (structured JSON result schema with `verification_summary`).
Configuration
Settings live in `.compound-engineering/config.local.yaml` (platform-agnostic, plain YAML). The skill pre-resolves the config file at load time via `!` backtick in Claude Code, with a native file-read fallback for other platforms.
Activation: argument flag (`delegate:codex`) > config file > hard default (false). First-time use triggers a one-time consent flow with sandbox mode selection.
Evaluation Results
Crossover at ~5-7 units. Users may still choose delegation below crossover for cost arbitrage (Codex tokens are cheaper than Claude tokens).
How to Test
Beta skill: manual invocation only; does not affect stable `ce:work`.
Prerequisites: Codex CLI installed (`npm install -g @openai/codex` or `brew install codex`).
Test Plan
- `bun test`: 634 pass, 0 fail
- `bun run release:validate`: 51 agents, 42 skills
- `--output-schema` verified working with `gpt-5.3-codex` (previously documented as unsupported)
Post-Deploy Monitoring & Validation
No additional operational monitoring required. Beta skill requires manual invocation and does not affect the stable `ce:work` path.