Skip to content

feat(review): retry transient acpx failures + pass --prompt-retries#88

Open
coletebou wants to merge 1 commit into
openclaw:mainfrom
coletebou:pr/retry-transient
Open

feat(review): retry transient acpx failures + pass --prompt-retries#88
coletebou wants to merge 1 commit into
openclaw:mainfrom
coletebou:pr/retry-transient

Conversation

@coletebou
Copy link
Copy Markdown

Summary

reviewFeature has a single attempt — any provider.review() failure marks the feature errored, even when the cause is a transient stream cut. acpx itself supports --prompt-retries <n> but clawpatch never sets it.

This PR adds two cheap retry layers, both opt-out via env var:

  1. acpx-level: runAcpxJson passes --prompt-retries <n> (default 1; env CLAWPATCH_ACPX_PROMPT_RETRIES; 0 omits the flag).
  2. clawpatch-level: runProviderReviewWithRetry wraps provider.review and retries ONCE on ClawpatchError && code === "malformed-output". Default 1; env CLAWPATCH_REVIEW_RETRIES.

Excluded codes (deterministic, won't retry): provider-auth, unsupported-provider, agent-refused, agent-cancelled.

The new upstream validateReviewOutput() step runs AFTER the retry succeeds — validation failures are NOT retried (they're per-finding bugs, not transient).

Why

Run 20260517T190759-3c9e9e had 37 malformed-output + 15 timeout errors. Most are recoverable on a single retry. With these layers landed, conservative estimate is ~50% of the transient errors clear without operator intervention.

Files

  • src/provider.tsacpxPromptRetries() helper + --prompt-retries arg in runAcpxJson; both exported via __testing
  • src/app.tsrunProviderReviewWithRetry, reviewRetries() knob, feature-retry progress event between attempts
  • src/provider.test.ts + src/app.test.ts — 26 new cases (5 + 4 + 7 + 4 + 6)

Validation

  • pnpm format:check — clean
  • pnpm typecheck — clean
  • pnpm lint — clean
  • pnpm build — clean
  • pnpm test — 567 passed, 1 skipped

Notes

Worst-case latency on a stuck feature with both retry layers maxed: 2 acpx tries × 2 clawpatch tries × 3 min default timeout = 12 min. CLAWPATCH_ACPX_TIMEOUT_MS is the safety valve. With default 10-jobs concurrency the bound is acceptable.

Two-layer retry for transient acpx review failures.

Layer 1 (acpx): runAcpxJson now passes --prompt-retries <n> (default 1) so
the underlying agent can recover stream cuts and partial JSON internally
without the whole feature failing. Override via
CLAWPATCH_ACPX_PROMPT_RETRIES; 0 disables.

Layer 2 (clawpatch): reviewFeature wraps provider.review() in a single
retry that fires only on ClawpatchError with code === "malformed-output".
Override via CLAWPATCH_REVIEW_RETRIES; 0 disables. Deterministic failures
(provider-auth, unsupported-provider, agent-refused, agent-cancelled) and
provider-failure (already covered by layer 1) are not retried at this
layer.

The retry is inside reviewFeature so the work-stealing loop in runReview
does not double-claim the feature lock. Emits a "feature-retry" progress
event between attempts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant