feat(ce-work-beta): add beta Codex delegation mode #476

Open
tmchow wants to merge 37 commits into main from feat/codex-delegation-work

tmchow (Collaborator) commented Apr 1, 2026

Summary

Adds Codex delegation as a beta-only execution path in ce:work-beta. When activated, implementation units from a plan are batched and sent to the Codex CLI (codex exec) instead of being implemented inline by Claude. The orchestrating Claude agent retains control of planning, review, git operations, and result classification.

Also removes swarm mode (orchestrating-swarms skill, slfg skill, and swarm sections from ce:work) — the coordination overhead wasn't delivering proportional value.

Credit to @mvanhorn for the original push on delegation and @huntharo for permissioning thought partnership.

Delegation Architecture

The skill is structured as an orchestrator/executor split. Claude handles everything except writing code; Codex handles code implementation within a structured prompt.

flowchart TD
  A[ce:work-beta invoked] --> B{delegation_active?}
  B -- no --> C[Standard execution]
  B -- yes --> D[Pre-checks: platform, env, CLI, consent]
  D -- any fail --> C
  D -- all pass --> E[Batch units]
  E --> F[Write prompt + result schema]
  F --> G[codex exec in background]
  G --> H[Poll for result]
  H --> I{Result classification}
  I -- success --> J[Display handoff summary]
  J --> K[Commit batch]
  I -- partial --> L[Display summary, finish locally]
  I -- failure --> M[Display reason, rollback]
  M --> N{Circuit breaker: 3 failures?}
  N -- yes --> C
  N -- no --> F
  K --> O{More batches?}
  O -- yes --> F
  O -- no --> P[Cleanup, Phase 3]

Batched execution: Plans with <=5 units run as a single batch; larger plans are split at phase boundaries. This reduces orchestration from O(N) codex exec calls (one per unit) to O(batches).
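
A minimal sketch of the batching rule, assuming each unit carries a phase label. The `Unit` type, its `phase` field, and `batch_units` are hypothetical names for illustration; the skill expresses this rule in prose rather than code.

```python
from dataclasses import dataclass
from itertools import groupby

MAX_SINGLE_BATCH_UNITS = 5  # threshold stated in the PR description


@dataclass(frozen=True)
class Unit:
    name: str
    phase: str  # hypothetical field: the plan phase this unit belongs to


def batch_units(units: list[Unit]) -> list[list[Unit]]:
    """Group implementation units into delegation batches."""
    if not units:
        return []
    if len(units) <= MAX_SINGLE_BATCH_UNITS:
        # Small plans: one batch, so one codex exec call.
        return [list(units)]
    # Larger plans: start a new batch whenever the phase changes.
    return [list(group) for _, group in groupby(units, key=lambda u: u.phase)]
```

With seven units spread over three phases this yields three batches, so the orchestrator makes three delegation calls instead of seven.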

Codex owns verification: Codex runs tests and fixes failures within the delegation. The orchestrator classifies the structured result but does not re-verify independently. Safety net: circuit breaker (3 consecutive failures triggers standard mode fallback).
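
The circuit breaker can be sketched as a counter of consecutive failures. This is an illustration only: the class name is hypothetical, and treating every non-failure outcome (including partial) as a reset is an assumption, not something the PR states.

```python
class CircuitBreaker:
    """Trips after a fixed number of consecutive delegation failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # 3 per the PR description
        self.consecutive_failures = 0

    @property
    def tripped(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def record(self, outcome: str) -> bool:
        """Record a batch outcome; return True when delegation should fall back."""
        if outcome == "failure":
            self.consecutive_failures += 1
        else:
            # Assumption: any non-failure outcome resets the streak.
            self.consecutive_failures = 0
        return self.tripped
```

Once `record` returns True, the orchestrator would stop delegating and continue in standard mode.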

Result handoff: After each batch, the orchestrator displays a summary block showing what Codex did, files changed, verification outcome, and issues — so delegation isn't a black box.

Prompt template: Includes <testing> (test scenario completeness guidance), <verify> (run all tests in one command to catch cross-file contamination), <constraints> (no git commits, scoped changes), and <output_contract> (structured JSON result schema with verification_summary).
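
A rough sketch of how such a tagged prompt could be assembled. The tag names come from the description above; the section bodies are illustrative paraphrases, and `render_prompt` is a hypothetical helper, not the skill's actual template.

```python
# Section bodies are illustrative paraphrases, not the real template text.
SECTIONS = {
    "testing": "Cover happy path, edge cases, error paths, and integration scenarios.",
    "verify": "Run all test files in one command to catch cross-file contamination.",
    "constraints": "Do not create git commits. Keep changes scoped to the listed units.",
    "output_contract": "Return JSON matching the result schema, including verification_summary.",
}


def render_prompt(sections: dict[str, str]) -> str:
    """Wrap each section body in its XML-style tag, preserving insertion order."""
    return "\n\n".join(f"<{tag}>\n{body}\n</{tag}>" for tag, body in sections.items())


if __name__ == "__main__":
    print(render_prompt(SECTIONS))
```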

Configuration

Settings live in .compound-engineering/config.local.yaml (platform-agnostic, plain YAML). The skill pre-resolves the config file at load time via ! backtick in Claude Code, with a native file-read fallback for other platforms.

work_delegate: codex
work_delegate_consent: true
work_delegate_sandbox: yolo       # yolo | full-auto
work_delegate_decision: auto      # auto | ask
work_delegate_model: gpt-5.4     # any valid codex model
work_delegate_effort: high        # minimal | low | medium | high | xhigh

Activation: argument flag (delegate:codex) > config file > hard default (false). First-time use triggers a one-time consent flow with sandbox mode selection.
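
The resolution chain can be sketched as follows. The function name is hypothetical, and the sketch assumes `delegate:local` is the explicit opt-out token and that only `work_delegate: codex` activates delegation from the config file.

```python
from typing import Optional


def resolve_delegation(argument: Optional[str], config: dict) -> bool:
    """Return True when Codex delegation should be active."""
    # 1. An explicit argument token wins over everything else.
    if argument == "delegate:codex":
        return True
    if argument == "delegate:local":
        return False
    # 2. Otherwise consult the config file; unrecognized values fall through.
    if config.get("work_delegate") == "codex":
        return True
    # 3. Hard default: delegation is off.
    return False
```

For example, `resolve_delegation(None, {"work_delegate": "gemini"})` returns False because the unrecognized value falls through to the default.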

Evaluation Results

| Plan size | Units | Delegate tokens | Standard tokens | Overhead |
|---|---|---|---|---|
| Small | 1-3 | 51-63k | 38-42k | +34-50% |
| Medium | 4 | 54k | 53k | +2% |
| Large | 7 | 62k | 62k | +1% |
| Extra-large | 10 | 54k | 62k | -13% |

Crossover at ~5-7 units. Users may still choose delegation below crossover for cost arbitrage (Codex tokens are cheaper than Claude tokens).
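
The overhead column appears to be the relative difference between delegate and standard tokens. A small check under that assumption, using the rounded token counts as printed (the Large row prints 62k for both, so its +1% presumably reflects the unrounded counts):

```python
def overhead_pct(delegate_tokens: float, standard_tokens: float) -> float:
    """Relative token overhead of delegation versus standard execution, in percent."""
    return (delegate_tokens - standard_tokens) / standard_tokens * 100


# Token counts as printed in the results (rounded to the nearest thousand).
print(round(overhead_pct(54_000, 53_000)))  # Medium: 2
print(round(overhead_pct(54_000, 62_000)))  # Extra-large: -13
```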

How to Test

Beta skill — manual invocation only, does not affect stable ce:work.

claude --plugin-dir /path/to/compound-engineering-plugin/plugins/compound-engineering

/ce:work-beta delegate:codex path/to/your-plan.md

Prerequisites: Codex CLI installed (npm install -g @openai/codex or brew install codex).

Test Plan

  • bun test — 634 pass, 0 fail
  • bun run release:validate — 51 agents, 42 skills
  • Contract tests cover delegation argument parsing, settings resolution, routing gate, result handoff, and config path
  • --output-schema verified working with gpt-5.3-codex (previously documented as unsupported)

Post-Deploy Monitoring & Validation

No additional operational monitoring required. Beta skill requires manual invocation and does not affect the stable ce:work path.


chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4304970c34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

tmchow marked this pull request as draft April 1, 2026 03:24
tmchow changed the title from "feat(ce-work): add Codex delegation mode" to "feat(ce-work-beta): add beta Codex delegation mode" Apr 1, 2026

tmchow (Collaborator, Author) commented Apr 1, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: d3f7069e77

tmchow (Collaborator, Author) commented Apr 1, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: d3f7069e77

tmchow force-pushed the feat/codex-delegation-work branch 6 times, most recently from 7c9dbb8 to 21dabf4, April 8, 2026 19:23
tmchow and others added 17 commits April 8, 2026 14:14
Adds optional `delegate:codex` mode to ce:work that delegates code
implementation to the Codex CLI (`codex exec`) using concrete bash
templates. Replaces ce-work-beta's prose-based delegation which caused
non-deterministic CLI invocations.

Key additions:
- Argument parsing with `delegate:codex`/`delegate:local` tokens and
  resolution chain (argument > local.md > default off)
- Pre-delegation gates: environment guard, availability check, one-time
  consent flow with sandbox mode selection (yolo/full-auto)
- XML-tagged prompt template following gpt-5-4-prompting best practices
- Multi-signal result classification (CLI fail/task fail/partial/verify
  fail/success) with rollback-to-HEAD safety
- Circuit breaker: 3 consecutive failures -> standard mode fallback
- Serial execution enforced, swarm mode mutual exclusion
- Frontend Design Guidance ported from ce-work-beta
- ce-work-beta delegation section marked superseded
- `Execution target: external-delegate` removed from ce:plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Remove stale "external delegation" per-unit posture from ce:plan
   Execution note examples — ce:work reads delegation from the global
   resolution chain, not unit metadata.

2. Fix delegation fallback to re-enter standard strategy selection.
   Pre-delegation checks now run inside the routing gate before strategy
   choice, so disabling delegation falls through to the normal
   inline/serial/parallel table instead of silently defaulting to inline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore stable ce:work as the non-delegating execution path during the
Codex beta rollout and move the active delegation contract back to
ce:work-beta.

Also add a promotion checklist doc covering the workflow and contract
changes required when ce:work-beta is later promoted to stable.
…acked files

Preflight now uses `git diff --quiet HEAD` instead of `git status --short`
so untracked workspace dirs and .context/ scratch don't falsely block
delegation. Rollback uses path-scoped `git clean -fd -- <unit files>`
instead of bare `git clean -fd` which would nuke all untracked files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
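
The preflight and rollback commands from the commit above, sketched as Python wrappers. The function names are hypothetical; the skill itself issues these git commands from bash templates, and only the two git invocations shown are taken from the commit message.

```python
import subprocess


def tracked_tree_is_clean(repo_dir: str) -> bool:
    """True when tracked files match HEAD; untracked files are ignored."""
    # `git diff --quiet HEAD` exits 0 with no tracked changes and 1 when there are some.
    # Any other exit code (for example, not a git repository) is treated as "not clean".
    result = subprocess.run(["git", "diff", "--quiet", "HEAD"], cwd=repo_dir)
    return result.returncode == 0


def scoped_clean_command(unit_files: list[str]) -> list[str]:
    """Build a `git clean` argv limited to the files a batch was allowed to create."""
    if not unit_files:
        # With no pathspec, `git clean -fd` would remove every untracked file.
        raise ValueError("refusing to build an unscoped git clean")
    return ["git", "clean", "-fd", "--", *unit_files]
```

Refusing an empty file list keeps the rollback from silently degrading into the bare `git clean -fd` that the commit replaced.
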
…antics

The old `test -n || test -n` returned exit code 1 on the happy path
(neither var set), which a literal agent could misread as a failed
pre-check and disable delegation in eligible environments.

Rewrote as an explicit if/else so pass/fail lives in the variable
value, not the exit code. Also refined the AGENTS.md shell-chaining
rule to distinguish action chaining (bad) from boolean conditions
in if/while guards (fine).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves ~270 lines of delegation workflow (pre-checks, prompt template,
execution loop, result classification) to references/codex-delegation-workflow.md
and ~25 lines of swarm mode to references/swarm-mode.md. SKILL.md body
drops from ~776 to ~514 lines — a 34% reduction in per-tool-call context
cost for non-delegation runs.

New in the delegation reference:
- Batched execution model (all units in one batch, split at ~5)
- Codex owns VERIFY (test-fix loop inside delegation)
- Platform gate (Claude Code only)
- Run-ID namespaced scratch files for concurrent safety
- work_delegation_decision setting (auto/ask) with user-facing prompts
- Between-batch checkpoints (flow through by default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provides a discoverable template with all available settings
(work_delegate, work_codex_consent, work_codex_sandbox,
work_delegation_decision) so users can copy to
.claude/compound-engineering.local.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… prompt

Adds a <testing> section to the delegation prompt template that carries
the Test Scenario Completeness guidance to Codex (cover happy path, edge
cases, error paths, integration). Closes the test quality gap observed in
evals (Codex produced 57-85% as many tests without this guidance).

Also updates <verify> to require running ALL test files in a single
command rather than per-file — catches cross-file contamination like
mocked globals leaking between test files in the same bun process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compounded learning from 6 iterations of delegation evals covering
token crossover points, batching strategy, prompt engineering, skill
body size as multiplicative cost driver, and user choice considerations.

Key findings: delegation breaks even at ~5-7 units and becomes cheaper
at 10+. Skill body size dominates cost (multiplicative across all tool
calls). Extract conditional content >50 lines to reference files.

Also fixes verify section line break in contract test assertion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents how plan structure directly enables delegation decisions —
file lists enable batching rules, test scenarios feed Codex prompts,
verification commands enable Codex's self-check loop. Delegation works
with unstructured plans but makes conservative choices without signals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… net

Expands the "delegate verify to Codex" pattern with the reasoning:
trust the delegate's self-report, protect against systematic failure
with the circuit breaker (3 consecutive failures -> standard mode),
and verify the whole at Phase 3 before shipping. Three layered catches
replace the redundant per-batch orchestrator verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The doc covers economics, architecture, prompt engineering, plan quality,
safety model, and user choice — not just economics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds actual token counts from all iterations (not just percentages),
wall clock time comparison, test coverage cost, and the iteration
evolution table showing the body-size regression and recovery. The
economics section now tells the complete story with raw numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Renames for consistency across all delegation settings:
- work_codex_consent  -> work_delegate_consent
- work_codex_sandbox  -> work_delegate_sandbox
- work_delegation_decision -> work_delegate_decision

All four settings now share the work_delegate_* prefix:
  work_delegate, work_delegate_consent,
  work_delegate_sandbox, work_delegate_decision

Updated across: SKILL.md, delegation workflow reference, example
local.md, best practices doc, requirements, plan, and contract tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unrecognized setting values (e.g., work_delegate: gemini) now fall
through to hard defaults instead of producing undefined behavior.
Each setting documents its recognized values inline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shorter, more natural. Users think in terms of "codex mode" not
"delegate to codex." The mode: prefix is generic enough for future
delegates. Deactivation becomes mode:local.

Fuzzy activation phrases unchanged (use codex, codex mode, etc.).

Updated across: SKILL.md, contract tests, requirements, and plan docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tmchow and others added 7 commits April 8, 2026 14:14
Adds work_delegate_model (default: gpt-5.4) and work_delegate_effort
(default: high) to local.md settings. Both are passed explicitly to
codex exec via -m and -c 'model_reasoning_effort="..."' flags.

Model is passthrough (any valid codex model name). Effort is validated
against 5 values: minimal, low, medium, high, xhigh. Invalid values
fall through to defaults.

Also fixes --yolo to --dangerously-bypass-approvals-and-sandbox (the
documented flag name) and adds quoting guidance for the -c flag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
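
A sketch of the model and effort resolution described in the commit above, assuming the settings have already been parsed into a dict. `codex_model_args` is a hypothetical helper; the `-m` and `-c` flags and the default values are the ones named in the commit message.

```python
VALID_EFFORTS = {"minimal", "low", "medium", "high", "xhigh"}
DEFAULT_MODEL = "gpt-5.4"
DEFAULT_EFFORT = "high"


def codex_model_args(config: dict) -> list[str]:
    """Build the model-related argv fragment for codex exec."""
    # Model is passthrough: any non-empty name is forwarded unchanged.
    model = config.get("work_delegate_model") or DEFAULT_MODEL
    # Effort is validated; unrecognized values fall through to the default.
    effort = config.get("work_delegate_effort")
    if effort not in VALID_EFFORTS:
        effort = DEFAULT_EFFORT
    # In an argv list the inner double quotes reach the CLI as-is,
    # so no extra shell quoting is needed here.
    return ["-m", model, "-c", f'model_reasoning_effort="{effort}"']
```

When the same value is written into a bash template instead, the whole `-c` argument needs single quotes around it, which is the quoting guidance the commit mentions.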
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex exec

Launches codex exec with run_in_background (no timeout ceiling) then
polls every 10 seconds in a foreground bash loop to keep the agent's
turn active. User sees "Waiting for Codex..." during execution and
cannot interfere with the working tree.

Fixes the 10-minute Bash timeout ceiling that would kill long-running
batches where Codex is iterating on test fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
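
The polling half of this pattern can be sketched as a bounded loop. `poll_until`, its parameters, and the default round count are hypothetical; in the skill the loop is a foreground bash loop, and the background launch is a tool parameter that has no Python equivalent here.

```python
import time
from typing import Callable


def poll_until(
    is_done: Callable[[], bool],
    interval_s: float = 10.0,  # poll interval from the commit message
    max_rounds: int = 180,     # assumed budget: 180 rounds x 10 s = 30 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once is_done() reports completion, False on timeout."""
    for _ in range(max_rounds):
        if is_done():
            return True
        sleep(interval_s)
    # One final check after the last sleep before giving up.
    return is_done()
```

A caller might pass `lambda: os.path.exists(result_path)` as `is_done`, where `result_path` is wherever the delegated run writes its result. The total timeout is `max_rounds * interval_s`, so the two values have to be changed together.
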
In testing, agents merged the background launch and polling loop into
one Bash call, hitting the 2-minute timeout. The skill now explicitly
labels Step A (launch, run_in_background: true on the tool parameter)
and Step B (poll, separate foreground calls) and warns that shell &
is not equivalent to the tool parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Adds verification_summary to the result schema and output contract so
Codex reports what tests it ran. Adds a result handoff display step so
the orchestrator shows the user a summary, files changed, verification
outcome, and issues before committing or rolling back — addressing
feedback that delegation completion was opaque.
…ocal.yaml

Migrates delegation settings from .claude/compound-engineering.local.md
(YAML frontmatter in markdown) to .compound-engineering/config.local.yaml
(plain YAML). The new path is platform-agnostic and aligns with the
planned config storage redesign.
Use explicit CODEX_AVAILABLE/CODEX_NOT_FOUND sentinel output instead of
relying on command -v's path-or-empty result, which agents misread.
Include install hints (npm, brew) in the fallback message.
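
The sentinel idea, sketched in Python with `shutil.which` standing in for the shell's `command -v`. The function name and the injectable `which` parameter are illustrative assumptions.

```python
import shutil
from typing import Callable, Optional

INSTALL_HINT = "Install with: npm install -g @openai/codex or brew install codex"


def codex_availability(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return an explicit sentinel instead of a path-or-empty result."""
    return "CODEX_AVAILABLE" if which("codex") else "CODEX_NOT_FOUND"


if __name__ == "__main__":
    status = codex_availability()
    print(status)
    if status == "CODEX_NOT_FOUND":
        print(INSTALL_HINT)
```

A fixed token is harder to misread than an empty string, which is the failure mode the commit describes.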
tmchow force-pushed the feat/codex-delegation-work branch from 955f5f8 to 3018848, April 8, 2026 21:14
tmchow marked this pull request as ready for review April 8, 2026 21:15

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 3018848ada

tmchow added 2 commits April 8, 2026 19:05
- Bind sandbox_mode resolved state into $SANDBOX_MODE before launch snippet
- Add --output-schema compatibility warning to work_delegate_model description
- Add create-directory/create-file/merge instructions to decline branch
…m reading

Use backtick pre-resolution to inline config.local.yaml at skill load
time in Claude Code (zero tool calls). Include explicit fallback for
other platforms to use their native file-read tool. Agents previously
had vague "open and parse" instructions that led to inconsistent reads.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: a042b23b27

tmchow added 2 commits April 8, 2026 19:57
- Update example config model comment to match --output-schema constraint
- Add polling failure termination for background process crash or timeout
Tested gpt-5.3-codex with --output-schema and it works fine —
the limitation from Codex CLI bug #4181 is no longer accurate.
Reverts model setting to passthrough for any valid model name.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 6b63691938

tmchow added 3 commits April 8, 2026 20:22
The example file is a stopgap that will be superseded by the config
storage redesign (/ce-setup creates the file interactively). Removing
to avoid maintaining a file that drifts from the skill's own docs.
Swarm mode is no longer a direction we want to pursue. Removes the
section from SKILL.md, deletes the swarm-mode reference file, and
drops the contract test. ce:work stable still has its copy — that
will be removed separately.
Removes swarm mode from ce:work, deletes the orchestrating-swarms and
slfg skills, and updates README tables and contract tests. Swarm mode
added coordination overhead without delivering proportional value.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 46e36e7d58

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 90c81b0321

tmchow added 2 commits April 8, 2026 22:45
- Add missing work_delegate key to config keys list
- Replace chained availability check with if/else per AGENTS.md
- Fix polling timeout math (30 rounds × 60s = 30min, not 5min)
Move the availability probe from a runtime if/else to a ! backtick
pre-resolution that runs at skill load time. Breaks the pre-resolution
exception out into its own AGENTS.md checklist item with an example,
since it was buried at the end of a long bullet and easy to miss.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: df61f93809

- Handle zero-exit missing-result case in polling termination
- Add explicit state transition for ask-mode delegation opt-out

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 99f32219cf

- Warn about untracked file collisions at planned paths before delegation

chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 6971fab189

tmchow added 2 commits April 8, 2026 23:35
- Set delegation_active false when auto mode routes trivial units locally
Config pre-resolution now uses git rev-parse --show-toplevel instead of
CWD-relative paths, with a fallback to the main repo root via
git-common-dir for worktrees where gitignored files don't exist.
Added "Reading Config Files from Skills" section to AGENTS.md
documenting both gotchas (CWD resolution, worktree gitignore).
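
A sketch of the two-step lookup described in the commit above, assuming a standard non-bare repository where the main working tree is the parent of the common git directory. `resolve_config_path` and its injectable `run_git` and `exists` parameters are hypothetical; the skill performs this resolution in a shell pre-resolution command.

```python
import os
import subprocess
from typing import Callable, Optional

CONFIG_RELATIVE_PATH = os.path.join(".compound-engineering", "config.local.yaml")


def _run_git(args: list[str], cwd: str) -> str:
    """Run a git subcommand and return its trimmed stdout."""
    out = subprocess.run(["git", *args], cwd=cwd, capture_output=True, text=True, check=True)
    return out.stdout.strip()


def resolve_config_path(
    cwd: str,
    run_git: Callable[[list[str], str], str] = _run_git,
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[str]:
    """Locate config.local.yaml from any directory inside a repo or worktree."""
    # Step 1: resolve against the working tree root, not the current directory.
    toplevel = run_git(["rev-parse", "--show-toplevel"], cwd)
    candidate = os.path.join(toplevel, CONFIG_RELATIVE_PATH)
    if exists(candidate):
        return candidate
    # Step 2: gitignored files are absent from linked worktrees, so fall back
    # to the main repo root, taken as the parent of the common git directory.
    common_dir = run_git(["rev-parse", "--git-common-dir"], cwd)
    common_dir = os.path.normpath(os.path.join(cwd, common_dir))  # may be relative to cwd
    fallback = os.path.join(os.path.dirname(common_dir), CONFIG_RELATIVE_PATH)
    return fallback if exists(fallback) else None
```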