fix(provider): partition invalid review findings instead of failing the whole feature#84
Open
coletebou wants to merge 2 commits into
Open
fix(provider): partition invalid review findings instead of failing the whole feature#84coletebou wants to merge 2 commits into
coletebou wants to merge 2 commits into
Conversation
…g the whole feature
When a single finding in a provider review response had a `category`
outside the `findingCategories` enum (e.g. a hallucinated "quality"
label), `reviewOutputSchema.parse(output)` in each provider's `review()`
threw and `reviewFeature`'s outer try/catch marked the whole feature
errored. All sibling findings on that feature were lost. Six features
in run 20260517T184420-d52a72 lost every finding this way.
This change introduces `parseReviewOutput(output)` in `src/provider.ts`
which:
1. Tries `reviewOutputSchema.safeParse` on the whole document (fast
path; common case).
2. On per-element failure, validates the container shape loosely
(findings: unknown[], inspected strict) and partitions the
`findings` array element-by-element using the new
`reviewFindingSchema`. Successes go through; failures are
recorded as `{path, message, sample}` entries in
`droppedFindings`.
3. Throws `ClawpatchError("malformed-output")` only when the
container shape itself is invalid (e.g. `findings` is not an
array) — preserving today's fail-loud contract for genuinely
malformed provider output.
`reviewFindingSchema` and `reviewInspectedSchema` are now exported
from `src/types.ts` so the partitioner can reuse them; the composed
`reviewOutputSchema` keeps its existing shape (no breaking change to
consumers).
The four provider `review()` methods (codex, opencode, acpx, grok)
now call `parseReviewOutput` instead of `reviewOutputSchema.parse`.
The provider `Provider.review` return type is now
`PartitionedReviewOutput`, which is `ReviewOutput` plus an additive
`droppedFindings` field.
`reviewFeature` in `src/app.ts` threads `droppedFindings` back to the
review-command worker loop, which appends each as a non-fatal
`{message, code: "schema-drop"}` entry on the run's `errors[]`. The
run-failure threshold is now gated on `errors.some(e => e.code !==
"schema-drop")` so successful runs with dropped findings still
complete (and surface the drops in `run.errors[]` for triage)
instead of being silently flipped to "failed".
Tests:
- parseReviewOutput preserves all findings when every category is
valid (fast path)
- parseReviewOutput keeps valid siblings when one finding has an
invalid category
- parseReviewOutput throws ClawpatchError when findings is not an
array
- parseReviewOutput truncates oversized samples to 200 characters
`validateReviewOutput` (added in 0.3.0) throws `ClawpatchError("malformed-output")`
on the FIRST evidence violation it finds — line range out of bounds, quote
not present, evidence file not in the prompt's included files. The outer
try/catch in `reviewFeature` then loses every sibling finding on the
feature and marks the whole feature errored.
Same problem the schema-layer partition (parseReviewOutput) addresses,
just one layer later. This commit extends the partition pattern:
- Add `validateReviewOutputPartitioned` in `src/review-validation.ts`.
It runs the same per-evidence checks (now factored into a private
`validateFinding` helper that both APIs call) but catches
malformed-output ClawpatchErrors per finding and accumulates them
into a `droppedFindings` array. The original `validateReviewOutput`
keeps its throw-on-first-failure contract for any caller that wants
it.
- Add a `layer?: "schema" | "validation"` discriminator to
`DroppedFinding` so operators can tell the two failure modes apart
in `run.errors`. parseReviewOutput stamps `"schema"`; the validation
partition stamps `"validation"`.
- Wire reviewFeature in `src/app.ts` to call
validateReviewOutputPartitioned and concatenate its drops onto the
schema-layer drops already plumbed through. The orchestrator now
emits `code: "validation-drop"` (distinct from `"schema-drop"`) for
the new bucket and the fatal-errors filter excludes both.
Behavior change for callers: the `Provider.review` return type and
the schema-layer partition are unchanged. Adding new error codes is
additive — anyone filtering on `"schema-drop"` keeps working but now
silently misses validation drops. Update the filter to also exclude
`"validation-drop"` to recover the prior "non-fatal drops" semantics.
Tests: four new cases in `src/review-validation.test.ts` covering the
all-clean, partial-drop, included-paths, and empty-evidence flows.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Today, when an LLM returns a single review finding whose
categoryis outside the 10-value enum (e.g. "quality", "correctness"),reviewOutputSchema.parse(output)throws and the whole feature loses every finding it produced — even the valid sibling findings. We hit this 6 times in a 100-feature run and 25+ times in a 1000-feature run.Upstream now also runs
validateReviewOutput()after parse, which throws on the FIRST bad evidence (line range / quote / file inclusion). Same all-or-nothing failure mode at a different layer.This PR partitions at both layers so one bad finding can no longer sink its siblings.
Layer 1 (commit 1) —
parseReviewOutput:reviewOutputSchema.safeParse.reviewFindingSchema.{validFindings, droppedFindings: [{path, message, sample}]}.review()providers (codex/opencode/acpx/grok) now use this helper.Layer 2 (commit 2) —
validateReviewOutputPartitioned:validateReviewOutputper-finding, catchingClawpatchError("malformed-output").layer: "schema" | "validation"discriminator onDroppedFinding.Drops are persisted to the run's
errorsarray as{message, code: "schema-drop" | "validation-drop"}. The run-level fatal-error gate filters these out, so a dropped finding never marks a run "failed".Why
Run
20260517T190759-3c9e9e(1000 features, 78 errors). 26 of those errors were single bad findings sinking their whole feature. After this PR, they'd land ascode: "schema-drop"or"validation-drop"and the valid siblings would persist normally.Files
src/types.ts— extractedreviewFindingSchema+reviewInspectedSchemasrc/provider.ts— newparseReviewOutput+DroppedFindingtypes; threaded through 4 providers + 2 mockssrc/app.ts—reviewFeaturecarriesdroppedFindings; worker loop persists as non-fatal errorssrc/review-validation.ts— newvalidateReviewOutputPartitionedsrc/provider.test.ts+src/review-validation.test.ts— 8 new casesValidation
pnpm format:check— cleanpnpm typecheck— cleanpnpm lint— cleanpnpm build— cleanpnpm test— 549 passed, 1 skipped (was 545 pre-PR)Coordination