From 86c03af018cebf8de386f488ee2c6f698903559c Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 19:49:17 +0200 Subject: [PATCH 01/65] test: add compaction benchmark harness Add a Docker-runnable offline benchmark for compaction behavior with pressure-style synthetic scenarios, scoped current/history/recall assertions, and assertion mode for selected compactors. This creates RED probes for exact state recovery, recall recovery, stale-current leakage, bulk offloading, and cache-churn signals before broader cache-aware compaction work. Validation: node --check on benchmark files; git diff --check; docker build -t pi-vcc-bench .; docker benchmark descriptive/jsonl/assertion runs. --- .dockerignore | 10 + Dockerfile | 22 + README.md | 44 ++ bench/compaction/README.md | 161 ++++++++ bench/compaction/offline-runner.ts | 610 ++++++++++++++++++++++++++++ bench/compaction/synthetic-cases.ts | 256 ++++++++++++ scripts/bench-compaction.ts | 66 +++ 7 files changed, 1169 insertions(+) create mode 100644 .dockerignore create mode 100644 Dockerfile create mode 100644 bench/compaction/README.md create mode 100644 bench/compaction/offline-runner.ts create mode 100644 bench/compaction/synthetic-cases.ts create mode 100644 scripts/bench-compaction.ts diff --git a/.dockerignore b/.dockerignore new file mode 100644 index 0000000..4e20976 --- /dev/null +++ b/.dockerignore @@ -0,0 +1,10 @@ +.git +node_modules +dist +*.tsbuildinfo +bun.lock +bench-results*.jsonl +bench-results*.json +.pi* +research +docs diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000..8e00cfa --- /dev/null +++ b/Dockerfile @@ -0,0 +1,22 @@ +# syntax=docker/dockerfile:1 + +# renovate: datasource=docker depName=oven/bun versioning=semver +ARG BUN_VERSION=1.3.13 + +FROM oven/bun:${BUN_VERSION} AS source +WORKDIR /app + +COPY --link package.json README.md ./ +COPY --link src ./src +COPY --link bench ./bench +COPY --link scripts ./scripts + +FROM oven/bun:${BUN_VERSION} AS final +ENV 
NODE_ENV=production + +COPY --link --from=source --chown=1000:1000 /app /app +WORKDIR /app +USER bun + +ENTRYPOINT ["bun", "scripts/bench-compaction.ts"] +CMD ["--jsonl"] diff --git a/README.md b/README.md index 66c184a..d6e92a7 100644 --- a/README.md +++ b/README.md @@ -191,6 +191,50 @@ Typical workflow: **search → find relevant entry indices → expand those indi 5. **Format** — render into bracketed sections + transcript 6. **Merge** — if previous summary exists: sticky sections merge, volatile sections replace, transcript rolls +## Compaction benchmark + +An offline benchmark harness lives under `bench/compaction`. It replays pressure-style synthetic long-session scenarios through multiple compactors and records continuation-oriented metrics: exact state recovery, current-state recovery, recall recovery, prompt size, layer churn, longest common prefix, stale-fact leakage, and recall-only offload leakage. + +Run all offline compactors: + +```bash +bun scripts/bench-compaction.ts +``` + +Emit one JSON record per compaction cycle: + +```bash +bun scripts/bench-compaction.ts --jsonl > bench-results.jsonl +``` + +Limit the comparison to selected compactors: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc,cache-aware-layered +``` + +Run the same benchmark in Docker: + +```bash +docker build -t pi-vcc-bench . +docker run --rm pi-vcc-bench +``` + +Pass benchmark arguments after the image name: + +```bash +docker run --rm pi-vcc-bench --compactors pi-vcc,cache-aware-layered +``` + +Use assertion mode when checking a selected compactor against the current benchmark gates: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc --assert +docker run --rm pi-vcc-bench --compactors pi-vcc --assert +``` + +Assertion failures are expected for current baselines while these RED scenarios document known gaps. The default benchmark is deterministic and does not call model providers. 
Provider-reported cached-token and latency measurements should be added as an opt-in benchmark because they require credentials and can create billable requests. + ## Config Config lives at `~/.pi/agent/pi-vcc-config.json` (auto-scaffolded on first load with safe defaults): diff --git a/bench/compaction/README.md b/bench/compaction/README.md new file mode 100644 index 0000000..f739e2e --- /dev/null +++ b/bench/compaction/README.md @@ -0,0 +1,161 @@ +# Compaction Benchmark + +This benchmark evaluates conversation compaction as a continuation system, not only as a compression routine. It focuses on whether a compacted agent state preserves recoverable work while keeping cacheable prompt prefixes stable. + +The design borrows the pressure-test loop used for skill validation: first make the current behavior fail in a controlled scenario, then implement the smallest compaction change that fixes the observed failure, and rerun the same scenario plus nearby variants. + +## Evaluation loop + +Use the benchmark as a RED-GREEN-REFACTOR loop for compaction behavior: + +1. **RED**: run the current compactor and record exact failures such as missing identifiers, stale current facts, bulky active text, or unstable early layers. +2. **GREEN**: add the smallest targeted compaction change that fixes the observed failure. +3. **REFACTOR**: pressure-test adjacent cases so the fix does not only satisfy one string probe. +4. **ITERATE**: keep the failing scenario in the benchmark and repeat until the desired compactor passes or the intended semantics need to change. + +Do not implement broad cache-aware layering only from design intuition. Add or keep a failing probe for each behavior the implementation is meant to improve. + +## Compactors under comparison + +The runner uses a common offline interface: + +- `pi-vcc`: current deterministic `compile()` output. 
+- `full-rewrite-checkpoint`: deterministic stand-in for a regenerated structured summary plus transcript, without external recall. +- `cache-aware-layered`: deterministic layered prototype that separates stable schema, durable memory, structured checkpoint, rolling transcript, raw tail, and recall pointers. + +LLM-backed compactors can be added behind the same interface. Live model calls should be kept separate from the default offline run so local validation remains cheap and deterministic. + +## Benchmark levels + +The current harness covers the first level and some cache-churn signals. Later levels should be added before using benchmark results to claim end-to-end agent quality. + +1. **Offline state probes** + - exact active terms + - current-state terms + - recall-only terms + - forbidden current-state terms + - terms that must stay out of active prompt text + - layer churn and longest common prefix + +2. **Micro-continuation probes** + - compacted context plus a tiny disposable fixture + - agent gets a one-to-three action budget + - pass/fail by expected command, file, or decision + +3. **Hermetic Pi replay** + - isolated `PI_CODING_AGENT_DIR` + - actual compaction hook and session context construction + - optional default-model and small-model continuation probes + +4. 
**Live provider cache probes** + - provider-reported cached and uncached tokens + - latency to first token and total latency + - effective input cost over the next few turns + +## Scenario shape + +Each synthetic case contains: + +- an ordered message transcript +- one or more compaction points to replay repeated compactions +- exact terms that should remain somewhere in active prompt state +- exact terms that should be in current-state layers, not only historical transcript or raw tail +- exact terms that may be absent from active state but must be recoverable from recall +- terms that must not appear in current-state layers after corrections or branch-sensitive updates +- terms that must stay out of active prompt text because recall should carry them +- continuation terms that indicate the agent can resume the next action + +Real Pi sessions can be added later as fixtures or sampled from local session JSONL files, but synthetic cases provide gold expectations for regressions. + +## Scoped assertions + +The runner distinguishes scopes so historical fidelity is not confused with current state: + +- `activeTerms`: must appear anywhere in the active compacted prompt. +- `currentTerms`: must appear in current-state layers. +- `recallTerms`: must be recoverable from recall corpus search. +- `forbiddenTerms`: must not appear anywhere in the active compacted prompt. +- `forbiddenCurrentTerms`: must not appear in current-state layers, but may exist in historical transcript/tail or recall corpus. +- `activeAbsentTerms`: must not appear in active prompt text; they are expected to live in recall only. + +This matters for corrections. For example, an old preference may remain in historical transcript, but it must not remain in durable memory or the current checkpoint after a user correction. 
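The scope rules above can be condensed into a small set of predicates over the compacted state. This is an illustrative sketch with hypothetical names, not the runner's actual implementation:

```typescript
// Minimal sketch of the scoped-assertion semantics (hypothetical helper,
// not part of bench/compaction/offline-runner.ts).
interface ScopeInputs {
	activeText: string; // full compacted prompt
	currentText: string; // current-state layers only
	recallHits: string[]; // top-k recall search results
}

const has = (haystack: string, term: string): boolean =>
	haystack.toLowerCase().includes(term.toLowerCase());

// For each scope, true means the term satisfies that scope's rule.
const checkScopes = (s: ScopeInputs, term: string) => ({
	activeTerms: has(s.activeText, term),
	currentTerms: has(s.currentText, term),
	recallTerms: s.recallHits.some((doc) => has(doc, term)),
	forbiddenTerms: !has(s.activeText, term),
	forbiddenCurrentTerms: !has(s.currentText, term),
	// recall-only: absent from active prompt but recoverable via recall
	activeAbsentTerms: !has(s.activeText, term) && s.recallHits.some((doc) => has(doc, term)),
});
```

For the correction example above, "yarn" should fail `currentTerms` but pass `forbiddenCurrentTerms` and `activeAbsentTerms` once the preference has been offloaded.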
+
+## Metrics
+
+Each compaction cycle records:
+
+- active state size in characters and approximate tokens
+- current-state size in characters and approximate tokens
+- compaction latency
+- longest common prefix with the previous compacted prompt
+- first changed layer and changed layer names when a compactor exposes layers
+- active exact-term recall against gold terms
+- current-state exact-term recall against gold terms
+- forbidden active and current-state leakage
+- active leakage of terms expected to be recall-only
+- recall top-k recovery for externalized terms
+- continuation-term recovery
+
+The cache-oriented metrics are offline approximations. They do not replace provider-reported cached-token accounting, but they highlight prompt churn that is likely to hurt prefix-based caching.
+
+## Running
+
+Run all offline compactors:
+
+```bash
+bun scripts/bench-compaction.ts
+```
+
+Emit one JSON record per compaction cycle:
+
+```bash
+bun scripts/bench-compaction.ts --jsonl > bench-results.jsonl
+```
+
+Limit the comparison to selected compactors:
+
+```bash
+bun scripts/bench-compaction.ts --compactors pi-vcc,cache-aware-layered
+```
+
+Run assertion mode. This exits non-zero if any selected compactor misses active/current/recall/continuation expectations or leaks forbidden/offloaded terms:
+
+```bash
+bun scripts/bench-compaction.ts --compactors pi-vcc --assert
+```
+
+Run the same checks in Docker:
+
+```bash
+docker build -t pi-vcc-bench .
+docker run --rm pi-vcc-bench
+docker run --rm pi-vcc-bench --compactors pi-vcc --assert
+```
+
+Assertion failures are expected for current baselines while the RED scenarios document known gaps. Use `--compactors` to check one implementation at a time.
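The `--jsonl` stream is convenient to post-process. A minimal sketch, assuming each line is one per-cycle record carrying the `compactor` and `activeTermRecall` fields listed under Metrics:

```typescript
// Group JSONL benchmark records by compactor and average active-term recall.
// Record shape is assumed from the per-cycle metrics list, not a stable API.
interface BenchRecord {
	compactor: string;
	activeTermRecall: number | null;
}

const meanActiveRecallByCompactor = (jsonl: string): Record<string, number> => {
	const sums = new Map<string, { total: number; count: number }>();
	for (const line of jsonl.split("\n")) {
		if (!line.trim()) continue;
		const record = JSON.parse(line) as BenchRecord;
		if (record.activeTermRecall === null) continue; // no applicable gold terms this cycle
		const bucket = sums.get(record.compactor) ?? { total: 0, count: 0 };
		bucket.total += record.activeTermRecall;
		bucket.count += 1;
		sums.set(record.compactor, bucket);
	}
	return Object.fromEntries([...sums].map(([name, { total, count }]) => [name, total / count]));
};
```

Null recall values are skipped rather than counted as zero, matching how the aggregate report treats cycles with no applicable gold terms.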
+ +## Interpreting results + +A useful compactor should: + +- preserve exact identifiers, file paths, evidence handles, constraints, blockers, and next actions +- keep current state separate from historical transcript and raw tail +- avoid retaining corrected stale facts in current-state layers +- keep stable layers byte-identical across ordinary compactions +- move bulky re-fetchable details behind recall pointers without losing top-k recoverability +- reduce active prompt size without shifting too much cost into uncached post-compaction turns + +Shorter output is not sufficient if continuation or recall probes fail. + +## Future live-provider extension + +A live cache probe should replay the same compacted prompts against providers that report cache usage and capture: + +- cached input tokens +- uncached input tokens +- cache-write tokens +- latency to first token +- total request latency +- effective input cost over the next few turns + +That extension should be opt-in because it depends on credentials, provider-specific cache semantics, and billable requests. 
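The "effective input cost" figure that extension would report can be computed offline once usage is captured. A sketch, with illustrative placeholder prices rather than real provider rates:

```typescript
// Sketch of effective input cost over the next few turns after compaction.
// Field names and prices are assumptions for illustration only.
interface TurnUsage {
	cachedInputTokens: number;
	uncachedInputTokens: number;
	cacheWriteTokens: number;
}

interface PricePerMTok {
	input: number; // $ per million uncached input tokens
	cachedInput: number; // $ per million cached input tokens
	cacheWrite: number; // $ per million cache-write tokens
}

const effectiveInputCost = (turns: TurnUsage[], price: PricePerMTok): number =>
	turns.reduce(
		(sum, t) =>
			sum +
			(t.uncachedInputTokens * price.input +
				t.cachedInputTokens * price.cachedInput +
				t.cacheWriteTokens * price.cacheWrite) /
				1_000_000,
		0,
	);
```

A compactor that churns early layers pushes tokens from the cheap cached bucket into the uncached bucket, which this sum makes visible even when total prompt size is unchanged.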
diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts new file mode 100644 index 0000000..25ff733 --- /dev/null +++ b/bench/compaction/offline-runner.ts @@ -0,0 +1,610 @@ +import { performance } from "node:perf_hooks"; +import type { Message } from "@mariozechner/pi-ai"; +import { compile } from "../../src/core/summarize"; +import { buildSections } from "../../src/core/build-sections"; +import { RECALL_NOTE } from "../../src/core/format"; +import { normalize } from "../../src/core/normalize"; +import { renderMessage } from "../../src/core/render-entries"; +import { clip, textOf } from "../../src/core/content"; +import { summarizeToolResultForPrompt } from "../../src/core/tool-result-summary"; +import { syntheticCompactionCases, type CompactionBenchmarkCase, type ExpectedTerm } from "./synthetic-cases"; + +export type LayerRole = "static" | "current" | "history" | "recall"; + +export interface LayerSnapshot { + name: string; + role: LayerRole; + text: string; +} + +export interface RecallDocument { + id: string; + text: string; +} + +export interface CompactorResult { + activePromptState: string; + layers: LayerSnapshot[]; + recallCorpus: RecallDocument[]; + stats: { + compactionMs: number; + estimatedInputTokens?: number; + estimatedOutputTokens?: number; + }; +} + +export interface CompactorContext { + /** Messages newly summarized in this compaction cycle. */ + messages: Message[]; + /** Full replay prefix available up to this compaction point. 
 */
+	allMessages: Message[];
+	previous?: CompactorResult;
+	cycle: number;
+}
+
+export interface OfflineCompactor {
+	name: string;
+	compact(context: CompactorContext): CompactorResult;
+}
+
+export interface TermProbeResult {
+	label: string;
+	term: string;
+	applicable: boolean;
+	found: boolean;
+}
+
+export interface RecallProbeResult extends TermProbeResult {
+	query: string;
+	topHitIds: string[];
+}
+
+export interface CycleMetrics {
+	caseId: string;
+	compactor: string;
+	cycle: number;
+	compactionPoint: number;
+	activeChars: number;
+	activeTokensEst: number;
+	currentChars: number;
+	currentTokensEst: number;
+	compactionMs: number;
+	lcpTokensWithPrevious: number | null;
+	lcpTokenRatioWithPrevious: number | null;
+	firstChangedLayer: string | null;
+	changedLayers: string[];
+	activeTermRecall: number | null;
+	currentTermRecall: number | null;
+	recallTermHitRate: number | null;
+	continuationTermRecall: number | null;
+	forbiddenLeakCount: number;
+	forbiddenCurrentLeakCount: number;
+	activeAbsentLeakCount: number;
+	missingActiveTerms: string[];
+	missingCurrentTerms: string[];
+	missingRecallTerms: string[];
+	leakedForbiddenTerms: string[];
+	leakedForbiddenCurrentTerms: string[];
+	leakedActiveAbsentTerms: string[];
+	layerSizes: Record<string, number>;
+}
+
+export interface BenchmarkRunResult {
+	cycles: CycleMetrics[];
+	aggregate: Record<string, Record<string, number | null>>;
+}
+
+const SEPARATOR = "\n\n---\n\n";
+
+const tokenize = (text: string): string[] =>
+	text.match(/[\p{L}\p{N}_./:-]+|[^\s]/gu) ??
[]; + +const estimateTokens = (text: string): number => Math.ceil(text.length / 4); + +const lowerIncludes = (haystack: string, needle: string): boolean => + haystack.toLowerCase().includes(needle.toLowerCase()); + +const lcpTokens = (a: string, b: string): number => { + const aa = tokenize(a); + const bb = tokenize(b); + const limit = Math.min(aa.length, bb.length); + let i = 0; + while (i < limit && aa[i] === bb[i]) i += 1; + return i; +}; + +const renderedDocuments = (messages: Message[]): RecallDocument[] => + messages.map((message, index) => { + const rendered = renderMessage(message, index, true); + return { + id: `${index}:${rendered.role}`, + text: `#${index} [${rendered.role}] ${rendered.summary}`, + }; + }); + +const sourceTextOf = (messages: Message[]): string => + renderedDocuments(messages).map((doc) => doc.text).join("\n"); + +const textForRoles = (result: CompactorResult, roles: LayerRole[]): string => { + const selected = result.layers.filter((layer) => roles.includes(layer.role)); + if (selected.length === 0) return ""; + return selected.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); +}; + +const termProbe = (terms: ExpectedTerm[] = [], sourceText: string, targetText: string): TermProbeResult[] => + terms.map((term) => { + const applicable = lowerIncludes(sourceText, term.term); + return { + label: term.label, + term: term.term, + applicable, + found: applicable && lowerIncludes(targetText, term.term), + }; + }); + +const leakProbe = (terms: ExpectedTerm[] = [], sourceText: string, targetText: string): TermProbeResult[] => + terms.map((term) => { + const applicable = lowerIncludes(sourceText, term.term); + return { + label: term.label, + term: term.term, + applicable, + found: applicable && lowerIncludes(targetText, term.term), + }; + }); + +const scoreDocument = (doc: string, query: string): number => { + const terms = query + .toLowerCase() + .split(/\s+/) + .map((part) => part.trim()) + .filter(Boolean); + const hay = 
doc.toLowerCase(); + return terms.reduce((score, term) => score + (hay.includes(term) ? 1 : 0), 0); +}; + +const recallProbe = ( + terms: ExpectedTerm[] = [], + sourceText: string, + corpus: RecallDocument[], +): RecallProbeResult[] => + terms.map((term) => { + const query = term.query ?? term.term; + const applicable = lowerIncludes(sourceText, term.term); + const ranked = corpus + .map((doc) => ({ doc, score: scoreDocument(doc.text, query) })) + .filter((entry) => entry.score > 0) + .sort((a, b) => b.score - a.score) + .slice(0, 5); + const found = applicable && ranked.some((entry) => lowerIncludes(entry.doc.text, term.term)); + return { + label: term.label, + term: term.term, + query, + applicable, + found, + topHitIds: ranked.map((entry) => entry.doc.id), + }; + }); + +const ratioOf = (probes: TermProbeResult[]): number | null => { + const applicable = probes.filter((probe) => probe.applicable); + if (applicable.length === 0) return null; + return applicable.filter((probe) => probe.found).length / applicable.length; +}; + +const summarizeChangedLayers = ( + previous: CompactorResult | undefined, + current: CompactorResult, +): { firstChangedLayer: string | null; changedLayers: string[] } => { + if (!previous) return { firstChangedLayer: null, changedLayers: [] }; + const prevByName = new Map(previous.layers.map((layer) => [layer.name, layer.text])); + const changedLayers = current.layers + .filter((layer) => prevByName.get(layer.name) !== layer.text) + .map((layer) => layer.name); + return { + firstChangedLayer: changedLayers[0] ?? null, + changedLayers, + }; +}; + +const lines = (items: string[]): string => + items.length === 0 ? 
"- (none)" : items.map((item) => `- ${item}`).join("\n"); + +const stableUnique = (items: string[], limit = 12): string[] => + [...new Set(items.map((item) => item.trim()).filter(Boolean))].sort().slice(0, limit); + +const regexTerms = (text: string, regex: RegExp, limit = 12): string[] => + stableUnique([...text.matchAll(regex)].map((match) => match[0]), limit); + +const recentHumanLines = (messages: Message[], maxLines = 10): string[] => { + const out: string[] = []; + for (const message of messages.slice(-8)) { + if (message.role !== "user" && message.role !== "assistant") continue; + const text = textOf(message.content); + for (const line of text.split("\n")) { + const trimmed = line.trim(); + if (!trimmed) continue; + if (/\b(next step|current blocker|blocker update|continue|correction|hard constraint|decision)\b/i.test(trimmed)) { + out.push(trimmed); + } + } + } + return out.slice(-maxLines); +}; + +const bulkyPointers = (messages: Message[]): string[] => { + const out: string[] = []; + messages.forEach((message, index) => { + if (message.role !== "toolResult") return; + const text = textOf(message.content); + if (text.length < 500) return; + const paths = regexTerms(text, /\/(?:tmp|var|home|workspace)\/[\w./-]+/g, 4); + const signatures = regexTerms(text, /\b[A-Z][A-Z0-9_]{4,}\b(?:\s+request_id=[\w-]+)?/g, 4); + const details = [...paths, ...signatures].join("; ") || clip(text, 120); + out.push(`#${index} ${message.toolName}: ${details}`); + }); + return out; +}; + +const extractDurableMemory = (messages: Message[]): string[] => { + const memory: string[] = []; + for (const message of messages) { + if (message.role !== "user") continue; + const text = textOf(message.content); + for (const line of text.split("\n")) { + const trimmed = line.trim(); + if (!trimmed) continue; + if (/\b(correction|never|always|prefer|use npm test|node --test)\b/i.test(trimmed)) { + memory.push(trimmed); + } + } + } + + const hasNeverYarn = memory.some((item) => /never use 
yarn/i.test(item)); + const filtered = hasNeverYarn + ? memory.filter((item) => !/prefer yarn test/i.test(item)) + : memory; + return stableUnique(filtered, 10); +}; + +const makeLayeredCheckpoint = (messages: Message[]): LayerSnapshot[] => { + const blocks = normalize(messages); + const data = buildSections({ blocks }); + const source = sourceTextOf(messages); + const paths = regexTerms(source, /(?:^|[\s"'`])(?:\.?\/?[\w.-]+\/)+[\w.-]+(?:\.[\w.-]+)?/g) + .map((path) => path.trim().replace(/^["'`\s]+/, "")); + const identifiers = regexTerms(source, /\b(?:ERR|CACHE|CRITICAL|req|spn|cache|commit)[\w:-]{3,}\b/g, 16); + const commits = regexTerms(source, /\b[0-9a-f]{7,40}\b/g, 8); + + const stableCheckpoint = [ + "Objective:", + lines(data.sessionGoal), + "Hard constraints and decisions:", + lines(regexTerms(source, /(?:Hard constraint|Decision):[^\n]+/gi, 8)), + "Active files and artifacts:", + lines(stableUnique([...data.filesAndChanges, ...paths], 16)), + "Identifiers and evidence handles:", + lines(stableUnique([...identifiers, ...commits], 20)), + ].join("\n"); + + const volatileState = [ + "Outstanding context:", + lines(data.outstandingContext), + "Recent continuation cues:", + lines(recentHumanLines(messages)), + ].join("\n"); + + const transcriptLines = data.briefTranscript.split("\n").filter(Boolean).slice(-50).join("\n"); + const rawTail = messages.slice(-2).map((message, offset) => { + const index = messages.length - 2 + offset; + const rendered = renderMessage(message, index, true); + if (message.role === "toolResult") { + return `#${index} [${rendered.role}] ${summarizeToolResultForPrompt(textOf(message.content))}`; + } + return `#${index} [${rendered.role}] ${clip(rendered.summary, 700)}`; + }).join("\n"); + + const recallPointers = bulkyPointers(messages); + + return [ + { + name: "Layer 0 Static Prefix Contract", + role: "static", + text: [ + "Compacted state schema v1.", + "Keep section names and order stable.", + "Stable facts appear before volatile 
facts.", + ].join("\n"), + }, + { + name: "Layer 1 Durable Memory", + role: "current", + text: lines(extractDurableMemory(messages)), + }, + { + name: "Layer 2A Stable Checkpoint", + role: "current", + text: stableCheckpoint, + }, + { + name: "Layer 2B Volatile State", + role: "current", + text: volatileState, + }, + { + name: "Layer 3 Rolling Brief Transcript", + role: "history", + text: transcriptLines || "- (none)", + }, + { + name: "Layer 4 Raw Recent Tail", + role: "history", + text: rawTail || "- (none)", + }, + { + name: "Layer 5 Recall Pointers", + role: "recall", + text: lines(recallPointers), + }, + ]; +}; + +const renderLayers = (layers: LayerSnapshot[]): string => + layers.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); + +const splitPiVccSummary = (summary: string): LayerSnapshot[] => { + if (!summary.trim()) return []; + const parts = summary.split(SEPARATOR).map((part) => part.trim()).filter(Boolean); + if (parts.length === 0) return [{ name: "Pi VCC Current Sections", role: "current", text: summary }]; + + const layers: LayerSnapshot[] = []; + const last = parts[parts.length - 1]; + const hasRecallNote = last === RECALL_NOTE; + const bodyParts = hasRecallNote ? parts.slice(0, -1) : parts; + const current = bodyParts[0] ?? ""; + const history = bodyParts.slice(1).join(SEPARATOR); + + if (current) layers.push({ name: "Pi VCC Current Sections", role: "current", text: current }); + if (history) layers.push({ name: "Pi VCC Brief Transcript", role: "history", text: history }); + if (hasRecallNote) layers.push({ name: "Pi VCC Recall Note", role: "recall", text: RECALL_NOTE }); + return layers.length > 0 ? 
layers : [{ name: "Pi VCC Current Sections", role: "current", text: summary }]; +}; + +export const offlineCompactors: OfflineCompactor[] = [ + { + name: "pi-vcc", + compact: ({ messages, allMessages, previous }) => { + const start = performance.now(); + const summary = compile({ messages, previousSummary: previous?.activePromptState }); + const elapsed = performance.now() - start; + return { + activePromptState: summary, + layers: splitPiVccSummary(summary), + recallCorpus: renderedDocuments(allMessages), + stats: { + compactionMs: elapsed, + estimatedInputTokens: estimateTokens(sourceTextOf(messages)), + estimatedOutputTokens: estimateTokens(summary), + }, + }; + }, + }, + { + name: "full-rewrite-checkpoint", + compact: ({ allMessages }) => { + const start = performance.now(); + const data = buildSections({ blocks: normalize(allMessages) }); + const current = [ + "Objective:", + lines(data.sessionGoal), + "Files and artifacts:", + lines(data.filesAndChanges), + "Outstanding context:", + lines(data.outstandingContext), + "User preferences:", + lines(data.userPreferences), + ].join("\n"); + const history = data.briefTranscript || "- (none)"; + const layers: LayerSnapshot[] = [ + { name: "Regenerated Current Checkpoint", role: "current", text: current }, + { name: "Regenerated Transcript", role: "history", text: history }, + ]; + const summary = renderLayers(layers); + const elapsed = performance.now() - start; + return { + activePromptState: summary, + layers, + recallCorpus: [], + stats: { + compactionMs: elapsed, + estimatedInputTokens: estimateTokens(sourceTextOf(allMessages)), + estimatedOutputTokens: estimateTokens(summary), + }, + }; + }, + }, + { + name: "cache-aware-layered", + compact: ({ allMessages }) => { + const start = performance.now(); + const layers = makeLayeredCheckpoint(allMessages); + const activePromptState = renderLayers(layers); + const elapsed = performance.now() - start; + return { + activePromptState, + layers, + recallCorpus: 
renderedDocuments(allMessages),
+				stats: {
+					compactionMs: elapsed,
+					estimatedInputTokens: estimateTokens(sourceTextOf(allMessages)),
+					estimatedOutputTokens: estimateTokens(activePromptState),
+				},
+			};
+		},
+	},
+];
+
+const forbiddenLeaksOf = (
+	terms: Array<ExpectedTerm & { afterTerm?: string }> = [],
+	sourceText: string,
+	targetText: string,
+): string[] =>
+	terms
+		.filter((term) => {
+			const enforce = !term.afterTerm || lowerIncludes(sourceText, term.afterTerm);
+			return enforce && lowerIncludes(targetText, term.term);
+		})
+		.map((term) => term.label);
+
+const cycleMetrics = (
+	testCase: CompactionBenchmarkCase,
+	compactor: OfflineCompactor,
+	cycle: number,
+	compactionPoint: number,
+	sourceMessages: Message[],
+	result: CompactorResult,
+	previous: CompactorResult | undefined,
+): CycleMetrics => {
+	const sourceText = sourceTextOf(sourceMessages);
+	const activeText = result.activePromptState;
+	const currentText = textForRoles(result, ["current"]);
+	const activeProbes = termProbe(testCase.gold.activeTerms, sourceText, activeText);
+	const currentProbes = termProbe(testCase.gold.currentTerms ?? [], sourceText, currentText);
+	const recallProbes = recallProbe(testCase.gold.recallTerms, sourceText, result.recallCorpus);
+	const continuationProbes = termProbe(testCase.gold.continuationTerms ?? [], sourceText, activeText);
+	const activeAbsentLeaks = leakProbe(testCase.gold.activeAbsentTerms ?? [], sourceText, activeText)
+		.filter((probe) => probe.applicable && probe.found);
+	const leakedForbiddenTerms = forbiddenLeaksOf(testCase.gold.forbiddenTerms, sourceText, activeText);
+	const leakedForbiddenCurrentTerms = forbiddenLeaksOf(testCase.gold.forbiddenCurrentTerms, sourceText, currentText);
+	const changed = summarizeChangedLayers(previous, result);
+	const previousTokens = previous ? tokenize(previous.activePromptState).length : 0;
+	const currentTokens = tokenize(activeText).length;
+	const lcp = previous ?
lcpTokens(previous.activePromptState, activeText) : null; + const denominator = Math.min(previousTokens, currentTokens); + + return { + caseId: testCase.id, + compactor: compactor.name, + cycle, + compactionPoint, + activeChars: activeText.length, + activeTokensEst: estimateTokens(activeText), + currentChars: currentText.length, + currentTokensEst: estimateTokens(currentText), + compactionMs: Number(result.stats.compactionMs.toFixed(3)), + lcpTokensWithPrevious: lcp, + lcpTokenRatioWithPrevious: lcp === null || denominator === 0 ? null : Number((lcp / denominator).toFixed(4)), + firstChangedLayer: changed.firstChangedLayer, + changedLayers: changed.changedLayers, + activeTermRecall: ratioOf(activeProbes), + currentTermRecall: ratioOf(currentProbes), + recallTermHitRate: ratioOf(recallProbes), + continuationTermRecall: ratioOf(continuationProbes), + forbiddenLeakCount: leakedForbiddenTerms.length, + forbiddenCurrentLeakCount: leakedForbiddenCurrentTerms.length, + activeAbsentLeakCount: activeAbsentLeaks.length, + missingActiveTerms: activeProbes.filter((probe) => probe.applicable && !probe.found).map((probe) => probe.label), + missingCurrentTerms: currentProbes.filter((probe) => probe.applicable && !probe.found).map((probe) => probe.label), + missingRecallTerms: recallProbes.filter((probe) => probe.applicable && !probe.found).map((probe) => probe.label), + leakedForbiddenTerms, + leakedForbiddenCurrentTerms, + leakedActiveAbsentTerms: activeAbsentLeaks.map((term) => term.label), + layerSizes: Object.fromEntries(result.layers.map((layer) => [layer.name, layer.text.length])), + }; +}; + +const mean = (values: number[]): number | null => { + if (values.length === 0) return null; + return values.reduce((sum, value) => sum + value, 0) / values.length; +}; + +const meanRounded = (values: number[]): number => + Number((values.reduce((sum, value) => sum + value, 0) / Math.max(values.length, 1)).toFixed(3)); + +const aggregate = (cycles: CycleMetrics[]): 
BenchmarkRunResult["aggregate"] => {
+	const byCompactor = new Map<string, CycleMetrics[]>();
+	for (const cycle of cycles) {
+		const bucket = byCompactor.get(cycle.compactor) ?? [];
+		bucket.push(cycle);
+		byCompactor.set(cycle.compactor, bucket);
+	}
+
+	return Object.fromEntries([...byCompactor].map(([name, items]) => {
+		const nullableMean = (selector: (item: CycleMetrics) => number | null): number | null => {
+			const values = items.map(selector).filter((value): value is number => value !== null);
+			const result = mean(values);
+			return result === null ? null : Number(result.toFixed(4));
+		};
+		return [name, {
+			cycles: items.length,
+			meanActiveTokensEst: meanRounded(items.map((item) => item.activeTokensEst)),
+			meanCurrentTokensEst: meanRounded(items.map((item) => item.currentTokensEst)),
+			meanCompactionMs: meanRounded(items.map((item) => item.compactionMs)),
+			meanActiveTermRecall: nullableMean((item) => item.activeTermRecall),
+			meanCurrentTermRecall: nullableMean((item) => item.currentTermRecall),
+			meanRecallTermHitRate: nullableMean((item) => item.recallTermHitRate),
+			meanContinuationTermRecall: nullableMean((item) => item.continuationTermRecall),
+			totalForbiddenLeaks: items.reduce((sum, item) => sum + item.forbiddenLeakCount, 0),
+			totalForbiddenCurrentLeaks: items.reduce((sum, item) => sum + item.forbiddenCurrentLeakCount, 0),
+			totalActiveAbsentLeaks: items.reduce((sum, item) => sum + item.activeAbsentLeakCount, 0),
+			meanLcpTokenRatio: nullableMean((item) => item.lcpTokenRatioWithPrevious),
+		}];
+	}));
+};
+
+export const failedGatesOf = (cycle: CycleMetrics): string[] => {
+	const failures: string[] = [];
+	if (cycle.activeTermRecall !== null && cycle.activeTermRecall < 1) failures.push("active-term-recall");
+	if (cycle.currentTermRecall !== null && cycle.currentTermRecall < 1) failures.push("current-term-recall");
+	if (cycle.recallTermHitRate !== null && cycle.recallTermHitRate < 1) failures.push("recall-hit-rate");
+	if (cycle.continuationTermRecall !== null &&
cycle.continuationTermRecall < 1) failures.push("continuation-term-recall"); + if (cycle.forbiddenLeakCount > 0) failures.push("forbidden-active-leak"); + if (cycle.forbiddenCurrentLeakCount > 0) failures.push("forbidden-current-leak"); + if (cycle.activeAbsentLeakCount > 0) failures.push("active-absent-leak"); + return failures; +}; + +export const runOfflineCompactionBenchmark = (options: { + cases?: CompactionBenchmarkCase[]; + compactors?: OfflineCompactor[]; +} = {}): BenchmarkRunResult => { + const cases = options.cases ?? syntheticCompactionCases; + const compactors = options.compactors ?? offlineCompactors; + const cycles: CycleMetrics[] = []; + + for (const testCase of cases) { + for (const compactor of compactors) { + let previous: CompactorResult | undefined; + let previousPoint = 0; + testCase.compactionPoints.forEach((point, index) => { + const sourceMessages = testCase.messages.slice(0, point); + const cycleMessages = testCase.messages.slice(previousPoint, point); + const result = compactor.compact({ + messages: cycleMessages, + allMessages: sourceMessages, + previous, + cycle: index + 1, + }); + cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous)); + previous = result; + previousPoint = point; + }); + } + } + + return { cycles, aggregate: aggregate(cycles) }; +}; diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts new file mode 100644 index 0000000..d6c453b --- /dev/null +++ b/bench/compaction/synthetic-cases.ts @@ -0,0 +1,256 @@ +import type { Message } from "@mariozechner/pi-ai"; + +export interface ExpectedTerm { + label: string; + term: string; + /** Optional focused query for recall-style lookup. Defaults to the term. */ + query?: string; +} + +export interface ScopedTerm extends ExpectedTerm { + /** Enforce only after this term has appeared in the replayed source text. 
*/ + afterTerm?: string; +} + +export interface CompactionGold { + /** Terms that should appear somewhere in the active prompt. */ + activeTerms: ExpectedTerm[]; + /** Terms that should appear in current-state layers, not only historical transcript/tail. */ + currentTerms?: ExpectedTerm[]; + /** Terms that should be recoverable from external recall. */ + recallTerms: ExpectedTerm[]; + /** Terms forbidden anywhere in the active prompt. */ + forbiddenTerms?: ScopedTerm[]; + /** Terms forbidden from current-state layers but allowed in historical layers or recall. */ + forbiddenCurrentTerms?: ScopedTerm[]; + /** Terms that must stay out of active prompt text because recall should carry them. */ + activeAbsentTerms?: ExpectedTerm[]; + continuationTerms?: ExpectedTerm[]; +} + +export interface CompactionBenchmarkCase { + id: string; + description: string; + messages: Message[]; + /** Message counts at which to run a compaction cycle. */ + compactionPoints: number[]; + gold: CompactionGold; +} + +const ts = 1_700_000_000_000; +let toolId = 0; + +const assistantBase = { + api: "messages" as any, + provider: "anthropic" as any, + model: "benchmark-fixture", + usage: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 }, + timestamp: ts, +}; + +const user = (text: string): Message => ({ role: "user", content: text, timestamp: ts }); + +const assistant = (text: string): Message => ({ + role: "assistant", + content: [{ type: "text", text }], + ...assistantBase, + stopReason: "stop", +}); + +const toolCall = (name: string, args: Record<string, unknown>): Message => { + toolId += 1; + return { + role: "assistant", + content: [{ type: "toolCall", id: `bench_tool_${toolId}`, name, arguments: args }], + ...assistantBase, + stopReason: "toolUse", + }; +}; + +const toolResult = (name: string, text: string, isError = false): Message => ({ + role: "toolResult", + toolCallId: `bench_tool_${toolId}`, + toolName: name, + content: [{ type: "text", text }], + isError, + timestamp: ts, +}); + +const
noisyLog = (needle: string): string => [ + ...Array.from({ length: 80 }, (_, i) => `debug ${String(i).padStart(2, "0")}: cache warmup shard ok`), + `CRITICAL ${needle}`, + ...Array.from({ length: 80 }, (_, i) => `debug ${String(i + 80).padStart(2, "0")}: retry window unchanged`), +].join("\n"); + +export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ + { + id: "boundary-loss-auth-refresh", + description: "A critical constraint and error signature appear immediately before a compaction cut.", + messages: [ + user("Fix password-reset login. Hard constraint: do not change the public login API."), + assistant("I will inspect the auth refresh path and keep the public login API unchanged."), + toolCall("read", { path: "src/auth/session.ts" }), + toolResult("read", "export function refreshSessionAfterPasswordReset() { return null; }"), + assistant("The likely fix belongs in src/auth/session.ts, not the public login handler."), + toolCall("bash", { command: "bun test tests/auth-refresh.test.ts" }), + toolResult("bash", "FAIL tests/auth-refresh.test.ts\nERR_REFRESH_AFTER_RESET expired refresh token after password reset", true), + user("Continue from here. 
The next step is to patch refreshSessionAfterPasswordReset, then rerun tests/auth-refresh.test.ts."), + assistant("I will patch refreshSessionAfterPasswordReset and rerun the focused auth-refresh test."), + ], + compactionPoints: [7, 9], + gold: { + activeTerms: [ + { label: "constraint", term: "do not change the public login API" }, + { label: "file", term: "src/auth/session.ts" }, + { label: "identifier", term: "ERR_REFRESH_AFTER_RESET" }, + ], + currentTerms: [ + { label: "constraint", term: "do not change the public login API" }, + { label: "file", term: "src/auth/session.ts" }, + { label: "identifier", term: "ERR_REFRESH_AFTER_RESET" }, + ], + recallTerms: [ + { label: "failing test", term: "tests/auth-refresh.test.ts", query: "auth-refresh" }, + ], + continuationTerms: [ + { label: "next edit", term: "patch refreshSessionAfterPasswordReset" }, + { label: "next validation", term: "rerun tests/auth-refresh.test.ts" }, + ], + }, + }, + { + id: "identifier-provenance", + description: "Similar identifiers make exact provenance and active entity recovery important.", + messages: [ + user("Audit cache invalidation. The target artifact is /tmp/cache-probe-A17.log, not /tmp/cache-probe-A71.log."), + assistant("I will keep the A17 artifact distinct from the A71 decoy and check the cache probe IDs."), + toolCall("read", { path: "/tmp/cache-probe-A17.log" }), + toolResult("read", "probe_id=cache_probe_A17\nspan=spn_cache_keep_91\ncommit=9f3a2b1\nstatus=prefix preserved"), + toolCall("read", { path: "/tmp/cache-probe-A71.log" }), + toolResult("read", "probe_id=cache_probe_A71\nspan=spn_cache_drop_19\nstatus=decoy"), + assistant("Decision: use cache_probe_A17 and span spn_cache_keep_91 as the evidence handle. 
Ignore cache_probe_A71."), + user("Continue the audit using commit 9f3a2b1 and evidence span spn_cache_keep_91."), + ], + compactionPoints: [6, 8], + gold: { + activeTerms: [ + { label: "artifact", term: "/tmp/cache-probe-A17.log" }, + { label: "probe", term: "cache_probe_A17" }, + { label: "span", term: "spn_cache_keep_91" }, + { label: "commit", term: "9f3a2b1" }, + ], + currentTerms: [ + { label: "artifact", term: "/tmp/cache-probe-A17.log" }, + { label: "probe", term: "cache_probe_A17" }, + { label: "span", term: "spn_cache_keep_91" }, + { label: "commit", term: "9f3a2b1" }, + ], + recallTerms: [ + { label: "decoy provenance", term: "cache_probe_A71", query: "cache_probe_A71" }, + ], + forbiddenCurrentTerms: [ + { label: "decoy as current target", term: "use cache_probe_A71", afterTerm: "Ignore cache_probe_A71" }, + ], + continuationTerms: [ + { label: "continue span", term: "spn_cache_keep_91" }, + ], + }, + }, + { + id: "recall-required-bulk-log", + description: "A bulky log should be externalized while retaining a pointer and recallable exact failure line.", + messages: [ + user("Investigate a flaky compaction benchmark. Store bulky logs as pointers when possible."), + assistant("I will inspect the benchmark log and keep only the evidence handle in active state."), + toolCall("bash", { command: "./run-benchmark > /tmp/pi-vcc-bench-482.log" }), + toolResult("bash", noisyLog("CACHE_MISS_AT_LAYER_2B request_id=req_cache_482"), true), + assistant("The important pointer is /tmp/pi-vcc-bench-482.log. 
The exact line CACHE_MISS_AT_LAYER_2B request_id=req_cache_482 can be recalled from the log."), + user("Continue with the pointer only; do not paste the whole log back into context."), + ], + compactionPoints: [4, 6], + gold: { + activeTerms: [ + { label: "log pointer", term: "/tmp/pi-vcc-bench-482.log" }, + ], + currentTerms: [ + { label: "log pointer", term: "/tmp/pi-vcc-bench-482.log" }, + ], + recallTerms: [ + { label: "critical line", term: "CACHE_MISS_AT_LAYER_2B request_id=req_cache_482", query: "CACHE_MISS_AT_LAYER_2B req_cache_482" }, + ], + activeAbsentTerms: [ + { label: "early bulky log line", term: "debug 00: cache warmup shard ok" }, + { label: "late bulky log line", term: "debug 120: retry window unchanged" }, + ], + continuationTerms: [ + { label: "pointer discipline", term: "do not paste the whole log" }, + ], + }, + }, + { + id: "correction-stale-memory", + description: "A corrected user preference should replace stale durable memory.", + messages: [ + user("For this repo, prefer yarn test when validating."), + assistant("Noted: yarn test for validation."), + toolCall("bash", { command: "yarn test" }), + toolResult("bash", "yarn: command not found", true), + user("Correction: never use yarn here. Use npm test for broad validation and node --test for focused checks."), + assistant("Understood. 
I will avoid yarn and use npm test or node --test depending on scope."), + user("Continue and choose the focused validation command first."), + ], + compactionPoints: [4, 7], + gold: { + activeTerms: [ + { label: "corrected preference", term: "never use yarn" }, + { label: "broad validation", term: "npm test" }, + { label: "focused validation", term: "node --test" }, + ], + currentTerms: [ + { label: "corrected preference", term: "never use yarn" }, + { label: "broad validation", term: "npm test" }, + { label: "focused validation", term: "node --test" }, + ], + recallTerms: [ + { label: "failed old tool", term: "yarn: command not found", query: "yarn command not found" }, + ], + forbiddenCurrentTerms: [ + { label: "stale positive preference", term: "prefer yarn test", afterTerm: "Correction: never use yarn here" }, + ], + continuationTerms: [ + { label: "focused command", term: "node --test" }, + ], + }, + }, + { + id: "cache-bust-volatile-next-step", + description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", + messages: [ + user("Benchmark cache-aware compaction. 
Stable objective: preserve Layer 0 and Layer 1 prefixes."), + assistant("Stable checkpoint: objective preserve Layer 0 and Layer 1 prefixes; identifier cache_schema_v3."), + user("Current blocker: first run lacks cached input token accounting."), + assistant("Next step: add offline LCP token metrics for cache_schema_v3."), + user("Blocker update: offline LCP metrics are done; now add recall top-k metrics."), + assistant("Next step: add recall top-k metrics while preserving cache_schema_v3 stable text."), + user("Blocker update: recall top-k metrics are done; now document live provider limits."), + assistant("Next step: document live provider limits without changing Layer 0 or Layer 1 wording."), + ], + compactionPoints: [4, 6, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "preserve Layer 0 and Layer 1 prefixes" }, + { label: "schema", term: "cache_schema_v3" }, + ], + currentTerms: [ + { label: "stable objective", term: "preserve Layer 0 and Layer 1 prefixes" }, + { label: "schema", term: "cache_schema_v3" }, + ], + recallTerms: [ + { label: "old blocker", term: "first run lacks cached input token accounting", query: "cached input token accounting" }, + ], + continuationTerms: [ + { label: "latest next step", term: "document live provider limits" }, + ], + }, + }, +]; diff --git a/scripts/bench-compaction.ts b/scripts/bench-compaction.ts new file mode 100644 index 0000000..5b85e64 --- /dev/null +++ b/scripts/bench-compaction.ts @@ -0,0 +1,66 @@ +#!/usr/bin/env node +import { failedGatesOf, offlineCompactors, runOfflineCompactionBenchmark } from "../bench/compaction/offline-runner"; + +const args = process.argv.slice(2); + +const argValue = (name: string): string | undefined => { + const inline = args.find((arg) => arg.startsWith(`${name}=`)); + if (inline) return inline.slice(name.length + 1); + const index = args.indexOf(name); + if (index >= 0) return args[index + 1]; + return undefined; +}; + +const hasFlag = (name: string): boolean => 
args.includes(name); + +const selected = argValue("--compactors") + ?.split(",") + .map((name) => name.trim()) + .filter(Boolean); + +const compactors = selected + ? offlineCompactors.filter((compactor) => selected.includes(compactor.name)) + : offlineCompactors; + +if (selected && compactors.length !== selected.length) { + const found = new Set(compactors.map((compactor) => compactor.name)); + const missing = selected.filter((name) => !found.has(name)); + console.error(`Unknown compactor(s): ${missing.join(", ")}`); + console.error(`Available compactors: ${offlineCompactors.map((compactor) => compactor.name).join(", ")}`); + process.exit(1); +} + +const result = runOfflineCompactionBenchmark({ compactors }); +const failures = result.cycles + .map((cycle) => ({ cycle, gates: failedGatesOf(cycle) })) + .filter((entry) => entry.gates.length > 0); + +if (hasFlag("--jsonl")) { + for (const cycle of result.cycles) { + console.log(JSON.stringify(cycle)); + } +} else { + console.log(JSON.stringify(result, null, 2)); +} + +if (hasFlag("--assert") && failures.length > 0) { + console.error(`\nCompaction benchmark assertions failed: ${failures.length} cycle(s)`); + for (const { cycle, gates } of failures.slice(0, 20)) { + console.error(JSON.stringify({ + caseId: cycle.caseId, + compactor: cycle.compactor, + cycle: cycle.cycle, + gates, + missingActiveTerms: cycle.missingActiveTerms, + missingCurrentTerms: cycle.missingCurrentTerms, + missingRecallTerms: cycle.missingRecallTerms, + leakedForbiddenTerms: cycle.leakedForbiddenTerms, + leakedForbiddenCurrentTerms: cycle.leakedForbiddenCurrentTerms, + leakedActiveAbsentTerms: cycle.leakedActiveAbsentTerms, + })); + } + if (failures.length > 20) { + console.error(`... 
${failures.length - 20} additional failing cycle(s) omitted`); + } + process.exit(1); +} From b06cce8baf2b99b9d0db3e4f309404acad41c184 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 19:49:29 +0200 Subject: [PATCH 02/65] fix: preserve compaction evidence handles Add deterministic evidence extraction so compacted current state keeps exact paths, error signatures, IDs, and commit-ish hashes needed for continuation. Large tool errors now retain salient failure lines while omitting low-value log bulk from the active prompt, and corrected preferences supersede stale positive guidance across summary merges. Validation: node --check on changed TypeScript files; git diff --check; Docker benchmark descriptive/jsonl runs; docker run pi-vcc-bench --compactors pi-vcc --assert; docker run pi-vcc-bench --compactors cache-aware-layered --assert; focused Bun tests for build-sections, compile, and format. Full clean Docker bun test still lacks peer/runtime modules (@mariozechner/pi-coding-agent, @sinclair/typebox). 
--- src/core/brief.ts | 5 ++- src/core/build-sections.ts | 7 +++- src/core/format.ts | 1 + src/core/summarize.ts | 12 ++++-- src/core/tool-result-summary.ts | 35 ++++++++++++++++ src/extract/evidence.ts | 72 +++++++++++++++++++++++++++++++++ src/extract/files.ts | 2 +- src/extract/goals.ts | 9 +++++ src/extract/preferences.ts | 28 ++++++++++++- src/sections.ts | 1 + tests/build-sections.test.ts | 30 ++++++++++++++ tests/compile.test.ts | 12 ++++++ tests/format.test.ts | 1 + 13 files changed, 205 insertions(+), 10 deletions(-) create mode 100644 src/core/tool-result-summary.ts create mode 100644 src/extract/evidence.ts diff --git a/src/core/brief.ts b/src/core/brief.ts index c53ce14..25a3b8b 100644 --- a/src/core/brief.ts +++ b/src/core/brief.ts @@ -1,5 +1,6 @@ import type { NormalizedBlock } from "../types"; -import { clip, firstLine } from "./content"; +import { clip } from "./content"; +import { summarizeToolResultForPrompt } from "./tool-result-summary"; import { extractPath } from "./tool-args"; import { collapseSkillText } from "./skill-collapse"; @@ -181,7 +182,7 @@ export const buildBriefSections = (blocks: NormalizedBlock[]): BriefLine[] => { } case "tool_result": { if (b.isError) { - const body = firstLine(b.text, 150); + const body = summarizeToolResultForPrompt(b.text); // Drop empty/placeholder error bodies — keep the line only if it carries info. if (!body || body === "(no output)") break; const ref = b.sourceIndex != null ? 
` (#${b.sourceIndex})` : ""; diff --git a/src/core/build-sections.ts b/src/core/build-sections.ts index 58c4bb1..92d1045 100644 --- a/src/core/build-sections.ts +++ b/src/core/build-sections.ts @@ -1,10 +1,12 @@ import type { NormalizedBlock } from "../types"; -import { clip, clipSentence, firstLine, nonEmptyLines } from "./content"; +import { clip, clipSentence, nonEmptyLines } from "./content"; +import { summarizeToolResultForPrompt } from "./tool-result-summary"; import type { SectionData } from "../sections"; import { extractGoals } from "../extract/goals"; import { extractFiles } from "../extract/files"; import { extractPreferences, dedupPreferencesAgainstGoals } from "../extract/preferences"; import { extractCommits, formatCommits } from "../extract/commits"; +import { extractEvidence, formatEvidence } from "../extract/evidence"; import { buildBriefSections, sectionsToTranscript, stringifyBrief } from "./brief"; export interface BuildSectionsInput { @@ -20,7 +22,7 @@ const extractOutstandingContext = (blocks: NormalizedBlock[]): string[] => { for (const b of tail) { if (b.kind === "tool_result" && b.isError) { - items.push(`[${b.name}] ${firstLine(b.text, 150)}`); + items.push(`[${b.name}] ${summarizeToolResultForPrompt(b.text)}`); continue; } @@ -72,6 +74,7 @@ export const buildSections = (input: BuildSectionsInput): SectionData => { outstandingContext: extractOutstandingContext(blocks), filesAndChanges: formatFileActivity(blocks), commits: formatCommits(extractCommits(blocks)), + evidenceHandles: formatEvidence(extractEvidence(blocks)), userPreferences, briefTranscript: stringifyBrief(briefSections), transcriptEntries: sectionsToTranscript(briefSections), diff --git a/src/core/format.ts b/src/core/format.ts index 0d7d676..882abe9 100644 --- a/src/core/format.ts +++ b/src/core/format.ts @@ -28,6 +28,7 @@ export const formatSummary = (data: SectionData): string => { section("Session Goal", data.sessionGoal), section("Files And Changes", data.filesAndChanges), 
section("Commits", data.commits), + section("Evidence Handles", data.evidenceHandles), + section("Outstanding Context", data.outstandingContext), + section("User Preferences", data.userPreferences), + ].filter(Boolean); diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 64770a6..586e721 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -4,6 +4,7 @@ import { normalize } from "./normalize"; import { filterNoise } from "./filter-noise"; import { buildSections } from "./build-sections"; import { formatSummary, capBrief, RECALL_NOTE } from "./format"; +import { applyPreferenceCorrections } from "../extract/preferences"; export interface CompileInput { messages: Message[]; @@ -11,7 +12,7 @@ export interface CompileInput { fileOps?: FileOps; } -const HEADER_NAMES = ["Session Goal", "Files And Changes", "Commits", "Outstanding Context", "User Preferences"]; +const HEADER_NAMES = ["Session Goal", "Files And Changes", "Commits", "Evidence Handles", "Outstanding Context", "User Preferences"]; const SEPARATOR = "\n\n---\n\n"; @@ -51,12 +52,15 @@ const mergeHeaderSection = (header: string, prev: string, fresh: string): string return mergeFileLines(prev, fresh); } - // Session Goal, User Preferences: line-level dedup, cap + // Sticky list sections: line-level dedup, cap const isClean = (l: string) => l.startsWith("- ") && !l.includes("<"); + const combinedRaw = [...prev.split("\n"), ...fresh.split("\n")].filter(Boolean); + const combined = combinedRaw.every(isClean) + ? dedupe(combinedRaw.map((line) => line.replace(/^-\s*/, ""))).map((line) => `- ${line}`) + : combinedRaw; + const CAP = header === "Session Goal" ? 8 : header === "Commits" ? 8 : header === "Evidence Handles" ? 20 : 15; const capped = combined.length > CAP ?
combined.slice(-CAP) : combined; if (capped.length === 0) return ""; return `[${header}]\n${capped.join("\n")}`; diff --git a/src/core/tool-result-summary.ts b/src/core/tool-result-summary.ts new file mode 100644 index 0000000..f02def0 --- /dev/null +++ b/src/core/tool-result-summary.ts @@ -0,0 +1,35 @@ +import { clip, firstLine, nonEmptyLines } from "./content"; + +const LARGE_OUTPUT_CHARS = 500; +const LARGE_OUTPUT_LINES = 12; + +const SIGNAL_RE = + /\b(error|fail(?:ed|ing|ure)?|exception|traceback|panic|fatal|critical|assert|timeout|not found|command not found|ERR_[A-Z0-9_]+|[A-Z][A-Z0-9]+(?:_[A-Z0-9]+){1,}|request_id=|req_[\w-]+)\b/i; + +const LOW_VALUE_RE = /^\s*(?:debug|trace|info)\b/i; + +const outputIsLarge = (text: string): boolean => + text.length > LARGE_OUTPUT_CHARS || text.split("\n").length > LARGE_OUTPUT_LINES; + +const salientLine = (text: string): string => { + const lines = nonEmptyLines(text); + const signal = lines.find((line) => SIGNAL_RE.test(line) && !LOW_VALUE_RE.test(line)); + if (signal) return clip(signal, 220); + const nonDebug = lines.find((line) => !LOW_VALUE_RE.test(line)); + if (nonDebug) return clip(nonDebug, 220); + return firstLine(text, 220); +}; + +/** + * Summarize a tool error/result for active prompt state. + * Large outputs keep a salient failure line and omit bulk that remains + * recoverable from raw session history through recall. + */ +export const summarizeToolResultForPrompt = (text: string): string => { + if (!outputIsLarge(text)) return firstLine(text, 180); + const lineCount = text.split("\n").length; + const chars = text.length; + const line = salientLine(text); + const omitted = `large output omitted: ${lineCount} lines, ${chars} chars`; + return line ? 
`${line} (${omitted})` : `(${omitted})`; +}; diff --git a/src/extract/evidence.ts b/src/extract/evidence.ts new file mode 100644 index 0000000..6c95538 --- /dev/null +++ b/src/extract/evidence.ts @@ -0,0 +1,72 @@ +import type { NormalizedBlock } from "../types"; +import { extractPath } from "../core/tool-args"; + +export interface EvidenceActivity { + paths: Set<string>; + identifiers: Set<string>; + errorSignatures: Set<string>; +} + +const PATH_RE = /(?:^|[\s"'`(=])((?:\.?\/?[\w.-]+\/)+[\w.-]+(?:\.[\w.-]+)?)/g; +const ABS_PATH_RE = /(?:^|[\s"'`(=])(\/(?:tmp|var|home|workspace|app|repo|src|tests?)\/[\w./-]+)/g; +const ERROR_SIGNATURE_RE = /\b(?:ERR_[A-Z0-9_]+|[A-Z][A-Z0-9]+(?:_[A-Z0-9]+){1,})\b/g; +const ID_RE = /\b(?:cache|probe|span|spn|req|request|trace|artifact|bench)[A-Za-z0-9_-]*_[A-Za-z0-9_-]+\b/g; +const COMMIT_RE = /\b[0-9a-f]{7,40}\b/g; + +const addMatches = (set: Set<string>, text: string, regex: RegExp, group = 0) => { + for (const match of text.matchAll(regex)) { + const value = (match[group] ?? match[0]).trim(); + if (value) set.add(value); + } +}; + +const textFromBlock = (block: NormalizedBlock): string => { + if (block.kind === "tool_call") return JSON.stringify(block.args ?? {}); + return "text" in block ?
block.text : ""; +}; + +const addEvidenceFromText = (activity: EvidenceActivity, text: string) => { + addMatches(activity.paths, text, ABS_PATH_RE, 1); + addMatches(activity.paths, text, PATH_RE, 1); + addMatches(activity.errorSignatures, text, ERROR_SIGNATURE_RE); + addMatches(activity.identifiers, text, ID_RE); + addMatches(activity.identifiers, text, COMMIT_RE); +}; + +export const extractEvidence = (blocks: NormalizedBlock[]): EvidenceActivity => { + const activity: EvidenceActivity = { + paths: new Set(), + identifiers: new Set(), + errorSignatures: new Set(), + }; + + for (const block of blocks) { + if (block.kind === "tool_call") { + const path = extractPath(block.args); + if (path) activity.paths.add(path); + for (const key of ["command", "cmd", "query", "path", "file", "file_path", "filePath"]) { + const value = block.args[key]; + if (typeof value === "string") addEvidenceFromText(activity, value); + } + continue; + } + + addEvidenceFromText(activity, textFromBlock(block)); + } + + return activity; +}; + +const cap = (set: Set<string>, limit: number): string => { + const values = [...set].sort(); + if (values.length <= limit) return values.join(", "); + return `${values.slice(0, limit).join(", ")} (+${values.length - limit} more)`; +}; + +export const formatEvidence = (activity: EvidenceActivity): string[] => { + const lines: string[] = []; + if (activity.paths.size > 0) lines.push(`Paths: ${cap(activity.paths, 12)}`); + if (activity.errorSignatures.size > 0) lines.push(`Error signatures: ${cap(activity.errorSignatures, 12)}`); + if (activity.identifiers.size > 0) lines.push(`Identifiers: ${cap(activity.identifiers, 16)}`); + return lines; +}; diff --git a/src/extract/files.ts b/src/extract/files.ts index f82c413..e9c8169 100644 --- a/src/extract/files.ts +++ b/src/extract/files.ts @@ -8,7 +8,7 @@ interface FileActivity { } const FILE_READ_TOOLS = new Set([ - "Read", "read_file", "View", + "Read", "read", "read_file", "View", ]); const FILE_WRITE_TOOLS = new Set([
diff --git a/src/extract/goals.ts b/src/extract/goals.ts index bea7ce7..5b0a5d7 100644 --- a/src/extract/goals.ts +++ b/src/extract/goals.ts @@ -8,6 +8,11 @@ const SCOPE_CHANGE_RE = const TASK_RE = /\b(fix|implement|add|create|build|refactor|debug|investigate|update|remove|delete|migrate|deploy|test|write|set up)\b/i; +const PREFERENCE_RE = + /\b(prefer(?:s|red|ring)?|always use|never use|please use|please avoid|do not use|don'?t use)\b/i; +const PREFERENCE_WITH_TASK_RE = + /\b(fix|implement|add|create|build|refactor|debug|investigate|update|remove|delete|migrate|deploy|write|set up)\b/i; + const NOISE_SHORT_RE = /^(ok|yes|no|sure|yeah|yep|go|hi|hey|thx|thanks|ok\b.*|y|n|k)\s*[.!?]*$/i; // Reject lines that are clearly not user goals (pasted output, code, paths, tool dumps) @@ -31,12 +36,16 @@ const stripLeadingBullet = (line: string): string => const MAX_GOAL_CHARS = 200; +const isPreferenceOnly = (text: string): boolean => + PREFERENCE_RE.test(text) && !PREFERENCE_WITH_TASK_RE.test(text); + const isSubstantiveGoal = (text: string): boolean => { const t = text.trim(); if (t.length <= 5) return false; if (t.length > MAX_GOAL_CHARS) return false; if (NOISE_SHORT_RE.test(t)) return false; if (NON_GOAL_RE.test(t)) return false; + if (isPreferenceOnly(t)) return false; return true; }; diff --git a/src/extract/preferences.ts b/src/extract/preferences.ts index 5bea689..9a93c44 100644 --- a/src/extract/preferences.ts +++ b/src/extract/preferences.ts @@ -38,7 +38,33 @@ export const extractPreferences = (blocks: NormalizedBlock[]): string[] => { } } - return prefs.slice(0, 10); + return applyPreferenceCorrections(prefs).slice(0, 10); +}; + +const NEVER_USE_RE = /\bnever use\s+([\w.-]+)/i; +const POSITIVE_PREF_RE = /\b(?:prefer|always use|please use|use)\b/i; + +export const applyPreferenceCorrections = (prefs: string[]): string[] => { + const corrected: string[] = []; + + for (const pref of prefs) { + const neverUse = pref.match(NEVER_USE_RE)?.[1]?.toLowerCase(); + if 
(neverUse) { + for (let i = corrected.length - 1; i >= 0; i--) { + const existing = corrected[i].toLowerCase(); + if ( + existing.includes(neverUse) && + POSITIVE_PREF_RE.test(existing) && + !/\bnever\b|\bdo not\b|\bdon't\b/.test(existing) + ) { + corrected.splice(i, 1); + } + } + } + corrected.push(pref); + } + + return corrected; }; /** diff --git a/src/sections.ts b/src/sections.ts index 8231686..05d764f 100644 --- a/src/sections.ts +++ b/src/sections.ts @@ -5,6 +5,7 @@ export interface SectionData { outstandingContext: string[]; filesAndChanges: string[]; commits: string[]; + evidenceHandles: string[]; userPreferences: string[]; briefTranscript: string; /** Structured transcript entries (verbose object format) */ diff --git a/tests/build-sections.test.ts b/tests/build-sections.test.ts index 6474f97..71ce1a8 100644 --- a/tests/build-sections.test.ts +++ b/tests/build-sections.test.ts @@ -7,6 +7,7 @@ describe("buildSections", () => { const r = buildSections({ blocks: [] }); expect(r.sessionGoal).toEqual([]); expect(r.outstandingContext).toEqual([]); + expect(r.evidenceHandles).toEqual([]); expect(r.briefTranscript).toBe(""); }); @@ -56,4 +57,33 @@ describe("buildSections", () => { const matches = r.briefTranscript.match(/\[assistant\]/g); expect(matches?.length).toBe(1); }); + + it("captures exact evidence handles from tool calls and errors", () => { + const blocks: NormalizedBlock[] = [ + { kind: "tool_call", name: "read", args: { path: "src/auth/session.ts" } }, + { kind: "tool_result", name: "bash", text: "FAIL tests/auth-refresh.test.ts\nERR_REFRESH_AFTER_RESET expired token", isError: true }, + { kind: "tool_result", name: "read", text: "probe_id=cache_probe_A17\nspan=spn_cache_keep_91\ncommit=9f3a2b1", isError: false }, + ]; + const r = buildSections({ blocks }); + const evidence = r.evidenceHandles.join("\n"); + expect(r.filesAndChanges.join("\n")).toContain("src/auth/session.ts"); + expect(evidence).toContain("ERR_REFRESH_AFTER_RESET"); + 
expect(evidence).toContain("cache_probe_A17"); + expect(evidence).toContain("spn_cache_keep_91"); + expect(evidence).toContain("9f3a2b1"); + }); + + it("summarizes bulky tool errors without pasting low-value log lines", () => { + const text = [ + ...Array.from({ length: 20 }, (_, i) => `debug ${i}: warmup ok`), + "CRITICAL CACHE_MISS_AT_LAYER_2B request_id=req_cache_482", + ].join("\n"); + const blocks: NormalizedBlock[] = [ + { kind: "tool_result", name: "bash", text, isError: true }, + ]; + const r = buildSections({ blocks }); + expect(r.briefTranscript).toContain("CACHE_MISS_AT_LAYER_2B"); + expect(r.briefTranscript).not.toContain("debug 0: warmup ok"); + expect(r.outstandingContext.join("\n")).toContain("CACHE_MISS_AT_LAYER_2B"); + }); }); diff --git a/tests/compile.test.ts b/tests/compile.test.ts index 585984b..8dd5f98 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -77,4 +77,16 @@ describe("compile", () => { expect(r).toContain("earlier lines omitted"); expect(r).toContain("latest"); }); + + it("supersedes stale positive preferences after explicit correction", () => { + const previousSummary = "[User Preferences]\n- For this repo, prefer yarn test when validating.\n\n---\n\n[user]\nold"; + const r = compile({ + previousSummary, + messages: [userMsg("Correction: never use yarn here. 
Use npm test for broad validation and node --test for focused checks.")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("never use yarn"); + expect(current).toContain("npm test"); + expect(current).not.toContain("prefer yarn test"); + }); }); diff --git a/tests/format.test.ts b/tests/format.test.ts index bc2773c..61ee710 100644 --- a/tests/format.test.ts +++ b/tests/format.test.ts @@ -7,6 +7,7 @@ const empty: SectionData = { outstandingContext: [], filesAndChanges: [], commits: [], + evidenceHandles: [], userPreferences: [], briefTranscript: "", transcriptEntries: [], From f87c79d0d3c610e59ceb9752643cea9cecd19eb9 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 19:52:14 +0200 Subject: [PATCH 03/65] test: simulate full-prompt cache churn Extend the offline compaction benchmark from compacted-summary churn to simulated provider-prompt churn. Each cycle now composes stable provider/tool/project layers, the compactor output layers, and a kept raw tail, then reports full-prompt LCP, first changed prompt layer, stable prefix tokens, and per-layer token deltas. Validation: node --check bench/compaction/offline-runner.ts scripts/bench-compaction.ts; git diff --check; docker build -t pi-vcc-bench .; docker run --rm --entrypoint bun pi-vcc-bench scripts/bench-compaction.ts; docker run --rm pi-vcc-bench --compactors pi-vcc --assert. --- README.md | 2 +- bench/compaction/README.md | 17 +++++ bench/compaction/offline-runner.ts | 100 ++++++++++++++++++++++++++++- 3 files changed, 117 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d6e92a7..e5533c2 100644 --- a/README.md +++ b/README.md @@ -193,7 +193,7 @@ Typical workflow: **search → find relevant entry indices → expand those indi ## Compaction benchmark -An offline benchmark harness lives under `bench/compaction`. 
It replays pressure-style synthetic long-session scenarios through multiple compactors and records continuation-oriented metrics: exact state recovery, current-state recovery, recall recovery, prompt size, layer churn, longest common prefix, stale-fact leakage, and recall-only offload leakage. +An offline benchmark harness lives under `bench/compaction`. It replays pressure-style synthetic long-session scenarios through multiple compactors and records continuation-oriented metrics: exact state recovery, current-state recovery, recall recovery, prompt size, simulated full-prompt cache churn, longest common prefix, stale-fact leakage, and recall-only offload leakage. Run all offline compactors: diff --git a/bench/compaction/README.md b/bench/compaction/README.md index f739e2e..5776cb3 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -98,6 +98,23 @@ Each compaction cycle records: The cache-oriented metrics are offline approximations. They do not replace provider-reported cached-token accounting, but they highlight prompt churn that is likely to hurt prefix-based caching. +## Full-prompt cache simulation + +Each cycle also builds a simulated provider prompt so cache churn can be measured outside the compacted summary alone. The simulated prompt contains stable provider/tool/project layers, the compactor's rendered layers, and a small kept raw tail. This does not exactly reproduce Pi's production request, but it catches the main prefix-cache risk: a volatile update moving earlier than necessary. + +Additional cache fields include: + +- `fullPromptChars` and `fullPromptTokensEst` +- `fullPromptLcpTokensWithPrevious` +- `fullPromptLcpTokenRatioWithPrevious` +- `firstChangedPromptLayer` +- `changedPromptLayers` +- `stablePrefixTokens` +- `promptLayerSizes` +- `promptLayerTokenDeltas` + +Use these fields to compare section ordering and stable/volatile splits before adding live provider probes. 
A better cache-aware layout should generally increase `stablePrefixTokens`, push `firstChangedPromptLayer` later, and keep volatile deltas out of static/current prefix layers when the underlying facts did not change. + ## Running Run all offline compactors: diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index 25ff733..f115770 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -22,6 +22,16 @@ export interface RecallDocument { text: string; } +export interface PromptLayerSnapshot { + name: string; + text: string; +} + +export interface PromptSnapshot { + text: string; + layers: PromptLayerSnapshot[]; +} + export interface CompactorResult { activePromptState: string; layers: LayerSnapshot[]; @@ -68,11 +78,18 @@ export interface CycleMetrics { activeTokensEst: number; currentChars: number; currentTokensEst: number; + fullPromptChars: number; + fullPromptTokensEst: number; compactionMs: number; lcpTokensWithPrevious: number | null; lcpTokenRatioWithPrevious: number | null; firstChangedLayer: string | null; changedLayers: string[]; + fullPromptLcpTokensWithPrevious: number | null; + fullPromptLcpTokenRatioWithPrevious: number | null; + firstChangedPromptLayer: string | null; + changedPromptLayers: string[]; + stablePrefixTokens: number | null; activeTermRecall: number | null; currentTermRecall: number | null; recallTermHitRate: number | null; @@ -87,6 +104,8 @@ export interface CycleMetrics { leakedForbiddenCurrentTerms: string[]; leakedActiveAbsentTerms: string[]; layerSizes: Record; + promptLayerSizes: Record; + promptLayerTokenDeltas: Record; } export interface BenchmarkRunResult { @@ -95,6 +114,7 @@ export interface BenchmarkRunResult { cycles: number; meanActiveTokensEst: number; meanCurrentTokensEst: number; + meanFullPromptTokensEst: number; meanCompactionMs: number; meanActiveTermRecall: number | null; meanCurrentTermRecall: number | null; @@ -104,6 +124,8 @@ export interface 
BenchmarkRunResult { totalForbiddenCurrentLeaks: number; totalActiveAbsentLeaks: number; meanLcpTokenRatio: number | null; + meanFullPromptLcpTokenRatio: number | null; + meanStablePrefixTokens: number | null; }>; } @@ -144,6 +166,59 @@ const textForRoles = (result: CompactorResult, roles: LayerRole[]): string => { return selected.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); }; +const renderPromptLayers = (layers: PromptLayerSnapshot[]): string => + layers.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); + +const simulatedPromptOf = (result: CompactorResult, sourceMessages: Message[]): PromptSnapshot => { + const recentTail = renderedDocuments(sourceMessages.slice(-2)) + .map((doc) => doc.text) + .join("\n"); + const layers: PromptLayerSnapshot[] = [ + { + name: "Provider Prefix", + text: [ + "system: You are an expert coding assistant operating inside Pi.", + "format: preserve compacted state sections and use recall before redoing prior work.", + ].join("\n"), + }, + { + name: "Tool Definitions", + text: "tools: read, bash, edit, write, vcc_recall", + }, + { + name: "Project Instructions", + text: "project: follow local guidance, validate before claiming completion, avoid destructive actions.", + }, + ...result.layers.map((layer) => ({ name: layer.name, text: layer.text })), + { + name: "Kept Raw Tail", + text: recentTail || "- (none)", + }, + ]; + return { layers, text: renderPromptLayers(layers) }; +}; + +const summarizeChangedPromptLayers = ( + previous: PromptSnapshot | undefined, + current: PromptSnapshot, +): { firstChangedPromptLayer: string | null; changedPromptLayers: string[]; promptLayerTokenDeltas: Record } => { + if (!previous) return { firstChangedPromptLayer: null, changedPromptLayers: [], promptLayerTokenDeltas: {} }; + const prevByName = new Map(previous.layers.map((layer) => [layer.name, layer.text])); + const changedPromptLayers = current.layers + .filter((layer) => prevByName.get(layer.name) !== layer.text) + 
.map((layer) => layer.name); + const promptLayerTokenDeltas = Object.fromEntries(current.layers.map((layer) => { + const previousTokens = tokenize(prevByName.get(layer.name) ?? "").length; + const currentTokens = tokenize(layer.text).length; + return [layer.name, currentTokens - previousTokens]; + })); + return { + firstChangedPromptLayer: changedPromptLayers[0] ?? null, + changedPromptLayers, + promptLayerTokenDeltas, + }; +}; + const termProbe = (terms: ExpectedTerm[] = [], sourceText: string, targetText: string): TermProbeResult[] => terms.map((term) => { const applicable = lowerIncludes(sourceText, term.term); @@ -478,6 +553,8 @@ const cycleMetrics = ( sourceMessages: Message[], result: CompactorResult, previous: CompactorResult | undefined, + prompt: PromptSnapshot, + previousPrompt: PromptSnapshot | undefined, ): CycleMetrics => { const sourceText = sourceTextOf(sourceMessages); const activeText = result.activePromptState; @@ -495,6 +572,12 @@ const cycleMetrics = ( const currentTokens = tokenize(activeText).length; const lcp = previous ? lcpTokens(previous.activePromptState, activeText) : null; const denominator = Math.min(previousTokens, currentTokens); + const promptChanged = summarizeChangedPromptLayers(previousPrompt, prompt); + const previousPromptTokens = previousPrompt ? tokenize(previousPrompt.text).length : 0; + const currentPromptTokens = tokenize(prompt.text).length; + const fullPromptLcp = previousPrompt ? lcpTokens(previousPrompt.text, prompt.text) : null; + const fullPromptDenominator = Math.min(previousPromptTokens, currentPromptTokens); + const stablePrefixTokens = previousPrompt ? 
fullPromptLcp : null; return { caseId: testCase.id, @@ -505,11 +588,18 @@ const cycleMetrics = ( activeTokensEst: estimateTokens(activeText), currentChars: currentText.length, currentTokensEst: estimateTokens(currentText), + fullPromptChars: prompt.text.length, + fullPromptTokensEst: estimateTokens(prompt.text), compactionMs: Number(result.stats.compactionMs.toFixed(3)), lcpTokensWithPrevious: lcp, lcpTokenRatioWithPrevious: lcp === null || denominator === 0 ? null : Number((lcp / denominator).toFixed(4)), firstChangedLayer: changed.firstChangedLayer, changedLayers: changed.changedLayers, + fullPromptLcpTokensWithPrevious: fullPromptLcp, + fullPromptLcpTokenRatioWithPrevious: fullPromptLcp === null || fullPromptDenominator === 0 ? null : Number((fullPromptLcp / fullPromptDenominator).toFixed(4)), + firstChangedPromptLayer: promptChanged.firstChangedPromptLayer, + changedPromptLayers: promptChanged.changedPromptLayers, + stablePrefixTokens, activeTermRecall: ratioOf(activeProbes), currentTermRecall: ratioOf(currentProbes), recallTermHitRate: ratioOf(recallProbes), @@ -524,6 +614,8 @@ const cycleMetrics = ( leakedForbiddenCurrentTerms, leakedActiveAbsentTerms: activeAbsentLeaks.map((term) => term.label), layerSizes: Object.fromEntries(result.layers.map((layer) => [layer.name, layer.text.length])), + promptLayerSizes: Object.fromEntries(prompt.layers.map((layer) => [layer.name, layer.text.length])), + promptLayerTokenDeltas: promptChanged.promptLayerTokenDeltas, }; }; @@ -553,6 +645,7 @@ const aggregate = (cycles: CycleMetrics[]): BenchmarkRunResult["aggregate"] => { cycles: items.length, meanActiveTokensEst: meanRounded(items.map((item) => item.activeTokensEst)), meanCurrentTokensEst: meanRounded(items.map((item) => item.currentTokensEst)), + meanFullPromptTokensEst: meanRounded(items.map((item) => item.fullPromptTokensEst)), meanCompactionMs: meanRounded(items.map((item) => item.compactionMs)), meanActiveTermRecall: nullableMean((item) => item.activeTermRecall), 
meanCurrentTermRecall: nullableMean((item) => item.currentTermRecall), @@ -562,6 +655,8 @@ const aggregate = (cycles: CycleMetrics[]): BenchmarkRunResult["aggregate"] => { totalForbiddenCurrentLeaks: items.reduce((sum, item) => sum + item.forbiddenCurrentLeakCount, 0), totalActiveAbsentLeaks: items.reduce((sum, item) => sum + item.activeAbsentLeakCount, 0), meanLcpTokenRatio: nullableMean((item) => item.lcpTokenRatioWithPrevious), + meanFullPromptLcpTokenRatio: nullableMean((item) => item.fullPromptLcpTokenRatioWithPrevious), + meanStablePrefixTokens: nullableMean((item) => item.stablePrefixTokens), }]; })); }; @@ -589,6 +684,7 @@ export const runOfflineCompactionBenchmark = (options: { for (const testCase of cases) { for (const compactor of compactors) { let previous: CompactorResult | undefined; + let previousPrompt: PromptSnapshot | undefined; let previousPoint = 0; testCase.compactionPoints.forEach((point, index) => { const sourceMessages = testCase.messages.slice(0, point); @@ -599,8 +695,10 @@ export const runOfflineCompactionBenchmark = (options: { previous, cycle: index + 1, }); - cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous)); + const prompt = simulatedPromptOf(result, sourceMessages); + cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous, prompt, previousPrompt)); previous = result; + previousPrompt = prompt; previousPoint = point; }); } From d2bc8eaf2f830ad79b850e57a2bbe40fe9f1acb6 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 19:53:53 +0200 Subject: [PATCH 04/65] fix: keep blocker updates out of goals Treat current blocker, blocker update, status update, and next step messages as volatile state rather than stable session goals. This keeps the goal section more cache-stable across repeated compactions while preserving the latest blocker in outstanding context and transcript layers. 
Validation: node --check src/extract/goals.ts tests/extract-goals.test.ts; docker run --rm -v "/home/fl/code/personal/pi-vcc":/work -w /work oven/bun:1.3.13 bun test tests/extract-goals.test.ts; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert. --- src/extract/goals.ts | 2 ++ tests/extract-goals.test.ts | 11 +++++++++++ 2 files changed, 13 insertions(+) diff --git a/src/extract/goals.ts b/src/extract/goals.ts index 5b0a5d7..633a09f 100644 --- a/src/extract/goals.ts +++ b/src/extract/goals.ts @@ -14,6 +14,7 @@ const PREFERENCE_WITH_TASK_RE = /\b(fix|implement|add|create|build|refactor|debug|investigate|update|remove|delete|migrate|deploy|write|set up)\b/i; const NOISE_SHORT_RE = /^(ok|yes|no|sure|yeah|yep|go|hi|hey|thx|thanks|ok\b.*|y|n|k)\s*[.!?]*$/i; +const VOLATILE_STATUS_RE = /^\s*(?:current blocker|blocker update|status update|next step)\s*:/i; // Reject lines that are clearly not user goals (pasted output, code, paths, tool dumps) // or meta-prompt boilerplate (command templates like `/issues` that start with "For each issue:" @@ -44,6 +45,7 @@ const isSubstantiveGoal = (text: string): boolean => { if (t.length <= 5) return false; if (t.length > MAX_GOAL_CHARS) return false; if (NOISE_SHORT_RE.test(t)) return false; + if (VOLATILE_STATUS_RE.test(t)) return false; if (NON_GOAL_RE.test(t)) return false; if (isPreferenceOnly(t)) return false; return true; diff --git a/tests/extract-goals.test.ts b/tests/extract-goals.test.ts index 4f27ba2..d2f941d 100644 --- a/tests/extract-goals.test.ts +++ b/tests/extract-goals.test.ts @@ -83,4 +83,15 @@ describe("extractGoals", () => { expect(goals[0]).toContain("Fix the authentication"); expect(goals.some((g) => g === "ok")).toBe(false); }); + + it("keeps volatile blocker updates out of stable goals", () => { + const goals = extractGoals([ + { kind: "user", text: "Benchmark cache-aware compaction. Stable objective: preserve Layer 0 and Layer 1 prefixes." 
}, + { kind: "user", text: "Blocker update: offline LCP metrics are done; now add recall top-k metrics." }, + { kind: "user", text: "Current blocker: cached-token accounting is missing." }, + ]); + expect(goals).toEqual([ + "Benchmark cache-aware compaction. Stable objective: preserve Layer 0 and Layer 1 prefixes.", + ]); + }); }); From d54e9276af56421b195047e21c2851944b21eb61 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 19:56:38 +0200 Subject: [PATCH 05/65] test: report section-level cache churn Order stable user preferences before volatile outstanding context and split pi-vcc current summary sections into simulated prompt layers for benchmark cache metrics. The cache-bust scenario now identifies Outstanding Context as the first changed prompt layer, making stable-prefix effects visible before live provider probes. Validation: node --check src/core/format.ts src/core/summarize.ts bench/compaction/offline-runner.ts; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; focused Bun tests for format and compile. --- bench/compaction/README.md | 2 +- bench/compaction/offline-runner.ts | 17 ++++++++++++++++- src/core/format.ts | 2 +- src/core/summarize.ts | 2 +- 4 files changed, 19 insertions(+), 4 deletions(-) diff --git a/bench/compaction/README.md b/bench/compaction/README.md index 5776cb3..b48b700 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -100,7 +100,7 @@ The cache-oriented metrics are offline approximations. They do not replace provi ## Full-prompt cache simulation -Each cycle also builds a simulated provider prompt so cache churn can be measured outside the compacted summary alone. The simulated prompt contains stable provider/tool/project layers, the compactor's rendered layers, and a small kept raw tail. 
This does not exactly reproduce Pi's production request, but it catches the main prefix-cache risk: a volatile update moving earlier than necessary. +Each cycle also builds a simulated provider prompt so cache churn can be measured outside the compacted summary alone. The simulated prompt contains stable provider/tool/project layers, the compactor's rendered layers, and a small kept raw tail. For `pi-vcc`, current summary sections are split into separate simulated prompt layers so the report can identify which section changes first. This does not exactly reproduce Pi's production request, but it catches the main prefix-cache risk: a volatile update moving earlier than necessary. Additional cache fields include: diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index f115770..f4711f2 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -441,6 +441,21 @@ const makeLayeredCheckpoint = (messages: Message[]): LayerSnapshot[] => { const renderLayers = (layers: LayerSnapshot[]): string => layers.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); +const splitCurrentSections = (current: string): LayerSnapshot[] => { + const headers = [...current.matchAll(/^\[(.+?)\]/gm)]; + if (headers.length === 0) return [{ name: "Pi VCC Current Sections", role: "current", text: current }]; + return headers.map((header, index) => { + const start = header.index ?? 0; + const end = headers[index + 1]?.index ?? current.length; + const title = header[1]; + return { + name: `Pi VCC ${title}`, + role: "current" as const, + text: current.slice(start, end).trimEnd(), + }; + }); +}; + const splitPiVccSummary = (summary: string): LayerSnapshot[] => { if (!summary.trim()) return []; const parts = summary.split(SEPARATOR).map((part) => part.trim()).filter(Boolean); @@ -453,7 +468,7 @@ const splitPiVccSummary = (summary: string): LayerSnapshot[] => { const current = bodyParts[0] ?? 
""; const history = bodyParts.slice(1).join(SEPARATOR); - if (current) layers.push({ name: "Pi VCC Current Sections", role: "current", text: current }); + if (current) layers.push(...splitCurrentSections(current)); if (history) layers.push({ name: "Pi VCC Brief Transcript", role: "history", text: history }); if (hasRecallNote) layers.push({ name: "Pi VCC Recall Note", role: "recall", text: RECALL_NOTE }); return layers.length > 0 ? layers : [{ name: "Pi VCC Current Sections", role: "current", text: summary }]; diff --git a/src/core/format.ts b/src/core/format.ts index 882abe9..a03b3b5 100644 --- a/src/core/format.ts +++ b/src/core/format.ts @@ -29,8 +29,8 @@ export const formatSummary = (data: SectionData): string => { section("Files And Changes", data.filesAndChanges), section("Commits", data.commits), section("Evidence Handles", data.evidenceHandles), - section("Outstanding Context", data.outstandingContext), section("User Preferences", data.userPreferences), + section("Outstanding Context", data.outstandingContext), ].filter(Boolean); const parts: string[] = []; diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 586e721..57462e1 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -12,7 +12,7 @@ export interface CompileInput { fileOps?: FileOps; } -const HEADER_NAMES = ["Session Goal", "Files And Changes", "Commits", "Evidence Handles", "Outstanding Context", "User Preferences"]; +const HEADER_NAMES = ["Session Goal", "Files And Changes", "Commits", "Evidence Handles", "User Preferences", "Outstanding Context"]; const SEPARATOR = "\n\n---\n\n"; From 501826379223efe854d1207368cb86582b1817f5 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 19:59:07 +0200 Subject: [PATCH 06/65] test: support real session benchmark replay Add an optional JSONL session loader so the compaction benchmark can replay mounted Pi sessions without depending on pi-core node_modules. 
Real-session cases generate compaction points and report size, latency, and cache-churn metrics without gold assertions, complementing the synthetic RED probes. Validation: node --check bench/compaction/real-sessions.ts scripts/bench-compaction.ts; git diff --check; docker build -t pi-vcc-bench .; docker run --rm -v ~/.pi/agent/sessions:/sessions:ro pi-vcc-bench --real-only --real-sessions-dir /sessions --real-limit 1 --compactors pi-vcc --jsonl; docker run --rm pi-vcc-bench --compactors pi-vcc --assert. --- README.md | 15 +++++- bench/compaction/README.md | 28 +++++++++++ bench/compaction/real-sessions.ts | 83 +++++++++++++++++++++++++++++++ scripts/bench-compaction.ts | 13 ++++- 4 files changed, 137 insertions(+), 2 deletions(-) create mode 100644 bench/compaction/real-sessions.ts diff --git a/README.md b/README.md index e5533c2..05c4721 100644 --- a/README.md +++ b/README.md @@ -233,7 +233,20 @@ bun scripts/bench-compaction.ts --compactors pi-vcc --assert docker run --rm pi-vcc-bench --compactors pi-vcc --assert ``` -Assertion failures are expected for current baselines while these RED scenarios document known gaps. The default benchmark is deterministic and does not call model providers. Provider-reported cached-token and latency measurements should be added as an opt-in benchmark because they require credentials and can create billable requests. +Sample real Pi sessions for size, latency, and cache-churn metrics: + +```bash +docker run --rm \ + -v ~/.pi/agent/sessions:/sessions:ro \ + pi-vcc-bench \ + --real-only \ + --real-sessions-dir /sessions \ + --real-limit 2 \ + --compactors pi-vcc \ + --jsonl +``` + +Assertion failures are expected for current baselines while these RED scenarios document known gaps. The default synthetic benchmark is deterministic and does not call model providers. Real-session sampling depends on the mounted local session corpus. 
Provider-reported cached-token and latency measurements should be added as an opt-in benchmark because they require credentials and can create billable requests. ## Config diff --git a/bench/compaction/README.md b/bench/compaction/README.md index b48b700..9c4aab9 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -141,12 +141,40 @@ Run assertion mode. This exits non-zero if any selected compactor misses active/ bun scripts/bench-compaction.ts --compactors pi-vcc --assert ``` +Append sampled real Pi sessions from a local session directory. Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: + +```bash +bun scripts/bench-compaction.ts \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 2 \ + --compactors pi-vcc +``` + +Run only sampled real sessions: + +```bash +bun scripts/bench-compaction.ts \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 2 \ + --compactors pi-vcc \ + --jsonl +``` + Run the same checks in Docker: ```bash docker build -t pi-vcc-bench . docker run --rm pi-vcc-bench docker run --rm pi-vcc-bench --compactors pi-vcc --assert +docker run --rm \ + -v ~/.pi/agent/sessions:/sessions:ro \ + pi-vcc-bench \ + --real-only \ + --real-sessions-dir /sessions \ + --real-limit 2 \ + --compactors pi-vcc \ + --jsonl ``` Assertion failures are expected for current baselines while the RED scenarios are documenting known gaps. Use selected compactors when checking one implementation at a time. 
diff --git a/bench/compaction/real-sessions.ts new file mode 100644 index 0000000..1570a5e --- /dev/null +++ b/bench/compaction/real-sessions.ts @@ -0,0 +1,83 @@ +import { readdir, readFile, stat } from "node:fs/promises"; +import { basename } from "node:path"; +import type { Message } from "@mariozechner/pi-ai"; +import type { CompactionBenchmarkCase } from "./synthetic-cases"; + +interface SessionFile { + path: string; + size: number; +} + +const walkJsonl = async (dir: string): Promise<SessionFile[]> => { + const entries = await readdir(dir, { withFileTypes: true }); + const out: SessionFile[] = []; + for (const entry of entries) { + const path = `${dir.replace(/\/$/, "")}/${entry.name}`; + if (entry.isDirectory()) { + out.push(...await walkJsonl(path)); + } else if (entry.isFile() && entry.name.endsWith(".jsonl")) { + const s = await stat(path); + out.push({ path, size: s.size }); + } + } + return out; +}; + +const isMessage = (value: unknown): value is Message => + Boolean(value && typeof value === "object" && typeof (value as any).role === "string" && "content" in (value as any)); + +const loadMessagesFromJsonl = async (path: string): Promise<Message[]> => { + const text = await readFile(path, "utf8"); + const messages: Message[] = []; + for (const line of text.split("\n")) { + if (!line.trim()) continue; + let entry: any; + try { + entry = JSON.parse(line); + } catch { + continue; + } + if (entry?.type !== "message") continue; + if (isMessage(entry.message)) messages.push(entry.message); + } + return messages; +}; + +const compactionPointsFor = (messageCount: number): number[] => { + if (messageCount <= 3) return []; + const raw = [ + Math.ceil(messageCount * 0.4), + Math.ceil(messageCount * 0.7), + messageCount, + ].filter((point) => point > 2 && point <= messageCount); + return [...new Set(raw)]; +}; + +export const loadRealSessionCases = async (options: { + sessionsDir: string; + limit?: number; +}): Promise<CompactionBenchmarkCase[]> => { + const limit = Math.max(1,
options.limit ?? 2); + const files = (await walkJsonl(options.sessionsDir)) + .sort((a, b) => b.size - a.size) + .slice(0, limit); + + const cases: CompactionBenchmarkCase[] = []; + for (const file of files) { + const messages = await loadMessagesFromJsonl(file.path); + const compactionPoints = compactionPointsFor(messages.length); + if (compactionPoints.length === 0) continue; + cases.push({ + id: `real-session:${basename(file.path, ".jsonl")}`, + description: `Real Pi session replay sampled from ${file.path}`, + messages, + compactionPoints, + gold: { + activeTerms: [], + recallTerms: [], + }, + }); + } + + return cases; +}; diff --git a/scripts/bench-compaction.ts b/scripts/bench-compaction.ts index 5b85e64..a14042c 100644 --- a/scripts/bench-compaction.ts +++ b/scripts/bench-compaction.ts @@ -1,5 +1,7 @@ #!/usr/bin/env node import { failedGatesOf, offlineCompactors, runOfflineCompactionBenchmark } from "../bench/compaction/offline-runner"; +import { syntheticCompactionCases } from "../bench/compaction/synthetic-cases"; +import { loadRealSessionCases } from "../bench/compaction/real-sessions"; const args = process.argv.slice(2); @@ -13,6 +15,10 @@ const argValue = (name: string): string | undefined => { const hasFlag = (name: string): boolean => args.includes(name); +const realSessionsDir = argValue("--real-sessions-dir"); +const realLimitRaw = argValue("--real-limit"); +const realLimit = realLimitRaw ? Number.parseInt(realLimitRaw, 10) : undefined; + const selected = argValue("--compactors") ?.split(",") .map((name) => name.trim()) @@ -30,7 +36,12 @@ if (selected && compactors.length !== selected.length) { process.exit(1); } -const result = runOfflineCompactionBenchmark({ compactors }); +const cases = hasFlag("--real-only") ? 
[] : [...syntheticCompactionCases]; +if (realSessionsDir) { + cases.push(...await loadRealSessionCases({ sessionsDir: realSessionsDir, limit: realLimit })); +} + +const result = runOfflineCompactionBenchmark({ compactors, cases }); const failures = result.cycles .map((cycle) => ({ cycle, gates: failedGatesOf(cycle) })) .filter((entry) => entry.gates.length > 0); From 3fe31acaec636ea7eb3e801a6c60b2cf5e11cd9d Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:07:08 +0200 Subject: [PATCH 07/65] test: diagnose real-session cache churn Add optional layer-diff diagnostics and a real-session-shaped regression case so cache churn can be inspected without manually parsing large JSON outputs. The diagnostics showed legitimate scope additions in Session Goal and highlighted noisy evidence extraction as the next churn source. Tighten evidence extraction to avoid broad documentation paths, environment-style constants, and unlabeled decimal/hex values as stable handles. Overflow suffixes now avoid exact count churn, and brief-only fresh updates survive summary merges so volatile status remains in transcript instead of disappearing. Validation: node --check on changed benchmark and summary files; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker real-session replay with --show-layer-diff; focused Bun tests for build-sections and compile. 
--- bench/compaction/README.md | 12 ++++++++ bench/compaction/offline-runner.ts | 44 ++++++++++++++++++++++++++++- bench/compaction/synthetic-cases.ts | 29 +++++++++++++++++++ scripts/bench-compaction.ts | 7 ++++- src/core/build-sections.ts | 2 +- src/core/summarize.ts | 14 +++++---- src/extract/evidence.ts | 14 ++++----- tests/compile.test.ts | 13 +++++++++ 8 files changed, 120 insertions(+), 15 deletions(-) diff --git a/bench/compaction/README.md b/bench/compaction/README.md index 9c4aab9..3142b6b 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -161,6 +161,18 @@ bun scripts/bench-compaction.ts \ --jsonl ``` +Filter cases and include concise layer diffs when investigating cache churn: + +```bash +bun scripts/bench-compaction.ts \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --case-filter ch-observability \ + --compactors pi-vcc \ + --show-layer-diff \ + --jsonl +``` + Run the same checks in Docker: ```bash diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index f4711f2..f571644 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -69,6 +69,14 @@ export interface RecallProbeResult extends TermProbeResult { topHitIds: string[]; } +export interface PromptLayerDiff { + layer: string; + previousPreview: string; + currentPreview: string; + addedLines: string[]; + removedLines: string[]; +} + export interface CycleMetrics { caseId: string; compactor: string; @@ -106,6 +114,7 @@ export interface CycleMetrics { layerSizes: Record<string, number>; promptLayerSizes: Record<string, number>; promptLayerTokenDeltas: Record<string, number>; + promptLayerDiffs?: PromptLayerDiff[]; } export interface BenchmarkRunResult { @@ -219,6 +228,34 @@ const summarizeChangedPromptLayers = ( }; }; +const linePreview = (text: string, maxChars = 400): string => + text.length <= maxChars ?
text : `${text.slice(0, maxChars)}...(truncated)`; + +const changedPromptLayerDiffs = ( + previous: PromptSnapshot | undefined, + current: PromptSnapshot, + changedLayers: string[], +): PromptLayerDiff[] => { + if (!previous) return []; + const prevByName = new Map(previous.layers.map((layer) => [layer.name, layer.text])); + const currentByName = new Map(current.layers.map((layer) => [layer.name, layer.text])); + return changedLayers.slice(0, 3).map((layer) => { + const previousText = prevByName.get(layer) ?? ""; + const currentText = currentByName.get(layer) ?? ""; + const previousLines = previousText.split("\n").map((line) => line.trim()).filter(Boolean); + const currentLines = currentText.split("\n").map((line) => line.trim()).filter(Boolean); + const previousSet = new Set(previousLines); + const currentSet = new Set(currentLines); + return { + layer, + previousPreview: linePreview(previousText), + currentPreview: linePreview(currentText), + addedLines: currentLines.filter((line) => !previousSet.has(line)).slice(0, 12), + removedLines: previousLines.filter((line) => !currentSet.has(line)).slice(0, 12), + }; + }); +}; + const termProbe = (terms: ExpectedTerm[] = [], sourceText: string, targetText: string): TermProbeResult[] => terms.map((term) => { const applicable = lowerIncludes(sourceText, term.term); @@ -570,6 +607,7 @@ const cycleMetrics = ( previous: CompactorResult | undefined, prompt: PromptSnapshot, previousPrompt: PromptSnapshot | undefined, + includeDiagnostics: boolean, ): CycleMetrics => { const sourceText = sourceTextOf(sourceMessages); const activeText = result.activePromptState; @@ -631,6 +669,9 @@ const cycleMetrics = ( layerSizes: Object.fromEntries(result.layers.map((layer) => [layer.name, layer.text.length])), promptLayerSizes: Object.fromEntries(prompt.layers.map((layer) => [layer.name, layer.text.length])), promptLayerTokenDeltas: promptChanged.promptLayerTokenDeltas, + ...(includeDiagnostics && promptChanged.changedPromptLayers.length > 0 + 
? { promptLayerDiffs: changedPromptLayerDiffs(previousPrompt, prompt, promptChanged.changedPromptLayers) } + : {}), }; }; @@ -691,6 +732,7 @@ export const failedGatesOf = (cycle: CycleMetrics): string[] => { export const runOfflineCompactionBenchmark = (options: { cases?: CompactionBenchmarkCase[]; compactors?: OfflineCompactor[]; + includeDiagnostics?: boolean; } = {}): BenchmarkRunResult => { const cases = options.cases ?? syntheticCompactionCases; const compactors = options.compactors ?? offlineCompactors; @@ -711,7 +753,7 @@ export const runOfflineCompactionBenchmark = (options: { cycle: index + 1, }); const prompt = simulatedPromptOf(result, sourceMessages); - cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous, prompt, previousPrompt)); + cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous, prompt, previousPrompt, Boolean(options.includeDiagnostics))); previous = result; previousPrompt = prompt; previousPoint = point; diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index d6c453b..37cd6c7 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -222,6 +222,35 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "realistic-scope-and-status", + description: "A real-session-shaped scope extension should be captured, but follow-up status should stay volatile.", + messages: [ + user("Build a local ClickHouse-based OpenTelemetry ingestion and query system."), + assistant("I will start with local ClickHouse, ingestion, and query scaffolding."), + user("Good, now lets add meta monitoring for the chart itself. 
This means metrics for our clickhouse instance and dashboards for grafana."), + assistant("I will extend the current work with meta monitoring and Grafana dashboards."), + user("Status update: meta monitoring wiring is started; next validate dashboard provisioning."), + assistant("Next step: validate dashboard provisioning without changing the stable objective."), + ], + compactionPoints: [2, 4, 6], + gold: { + activeTerms: [ + { label: "original objective", term: "OpenTelemetry ingestion and query system" }, + { label: "scope extension", term: "meta monitoring" }, + ], + currentTerms: [ + { label: "original objective", term: "OpenTelemetry ingestion and query system" }, + { label: "scope extension", term: "meta monitoring" }, + ], + recallTerms: [ + { label: "dashboard validation", term: "dashboard provisioning", query: "dashboard provisioning" }, + ], + continuationTerms: [ + { label: "volatile next step", term: "validate dashboard provisioning" }, + ], + }, + }, { id: "cache-bust-volatile-next-step", description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", diff --git a/scripts/bench-compaction.ts b/scripts/bench-compaction.ts index a14042c..ce1a91c 100644 --- a/scripts/bench-compaction.ts +++ b/scripts/bench-compaction.ts @@ -18,6 +18,8 @@ const hasFlag = (name: string): boolean => args.includes(name); const realSessionsDir = argValue("--real-sessions-dir"); const realLimitRaw = argValue("--real-limit"); const realLimit = realLimitRaw ? Number.parseInt(realLimitRaw, 10) : undefined; +const caseFilter = argValue("--case-filter"); +const includeDiagnostics = hasFlag("--show-layer-diff"); const selected = argValue("--compactors") ?.split(",") @@ -40,8 +42,11 @@ const cases = hasFlag("--real-only") ? [] : [...syntheticCompactionCases]; if (realSessionsDir) { cases.push(...await loadRealSessionCases({ sessionsDir: realSessionsDir, limit: realLimit })); } +const filteredCases = caseFilter + ? 
cases.filter((testCase) => testCase.id.includes(caseFilter) || testCase.description.includes(caseFilter)) + : cases; -const result = runOfflineCompactionBenchmark({ compactors, cases }); +const result = runOfflineCompactionBenchmark({ compactors, cases: filteredCases, includeDiagnostics }); const failures = result.cycles .map((cycle) => ({ cycle, gates: failedGatesOf(cycle) })) .filter((entry) => entry.gates.length > 0); diff --git a/src/core/build-sections.ts b/src/core/build-sections.ts index 92d1045..0d57c2a 100644 --- a/src/core/build-sections.ts +++ b/src/core/build-sections.ts @@ -53,7 +53,7 @@ const formatFileActivity = (blocks: NormalizedBlock[]): string[] => { const cap = (set: Set, limit: number) => { const arr = [...set]; if (arr.length <= limit) return arr.join(", "); - return arr.slice(0, limit).join(", ") + ` (+${arr.length - limit} more)`; + return arr.slice(0, limit).join(", ") + " (+more)"; }; if (act.modified.size > 0) lines.push(`Modified: ${cap(act.modified, 10)}`); if (act.created.size > 0) lines.push(`Created: ${cap(act.created, 10)}`); diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 57462e1..cb9ac43 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -36,8 +36,12 @@ const sectionOf = (text: string, header: string): string => { /** Extract the brief transcript part (everything after ---) */ const briefOf = (text: string): string => { const idx = text.indexOf(SEPARATOR); - if (idx < 0) return ""; - return text.slice(idx + SEPARATOR.length).trim(); + if (idx >= 0) return text.slice(idx + SEPARATOR.length).trim(); + // A fresh compaction can contain only brief transcript with no header section, + // in which case there is no separator to split on. + const trimmed = text.trim(); + if (!trimmed) return ""; + return HEADER_NAMES.some((header) => trimmed.startsWith(`[${header}]`)) ? 
"" : trimmed; }; /** Merge a header section */ @@ -79,8 +83,8 @@ const mergeFileLines = (prev: string, fresh: string): string => { const prefix = `- ${cat}: `; if (!line.startsWith(prefix)) continue; let rest = line.slice(prefix.length); - // Strip "(+N more)" suffix - rest = rest.replace(/\s*\(\+\d+ more\)\s*$/, ""); + // Strip overflow suffixes + rest = rest.replace(/\s*\(\+(?:\d+\s+)?more\)\s*$/, ""); for (const p of rest.split(",")) { const trimmed = p.trim(); if (trimmed) merged[cat].add(trimmed); @@ -95,7 +99,7 @@ const mergeFileLines = (prev: string, fresh: string): string => { const cap = (set: Set, limit: number) => { const arr = [...set]; if (arr.length <= limit) return arr.join(", "); - return arr.slice(0, limit).join(", ") + ` (+${arr.length - limit} more)`; + return arr.slice(0, limit).join(", ") + " (+more)"; }; const lines: string[] = []; diff --git a/src/extract/evidence.ts b/src/extract/evidence.ts index 6c95538..ad10b08 100644 --- a/src/extract/evidence.ts +++ b/src/extract/evidence.ts @@ -7,11 +7,11 @@ export interface EvidenceActivity { errorSignatures: Set; } -const PATH_RE = /(?:^|[\s"'`(=])((?:\.?\/?[\w.-]+\/)+[\w.-]+(?:\.[\w.-]+)?)/g; const ABS_PATH_RE = /(?:^|[\s"'`(=])(\/(?:tmp|var|home|workspace|app|repo|src|tests?)\/[\w./-]+)/g; -const ERROR_SIGNATURE_RE = /\b(?:ERR_[A-Z0-9_]+|[A-Z][A-Z0-9]+(?:_[A-Z0-9]+){1,})\b/g; +const PROJECT_PATH_RE = /(?:^|[\s"'`(=])((?:src|test|tests|scripts|bench)\/[\w./-]+)/g; +const ERROR_SIGNATURE_RE = /\b(?:ERR_[A-Z0-9_]+|(?:CACHE|CRITICAL|FATAL|PANIC|ERROR|FAIL)[A-Z0-9_]*(?:_[A-Z0-9]+)+)\b/g; const ID_RE = /\b(?:cache|probe|span|spn|req|request|trace|artifact|bench)[A-Za-z0-9_-]*_[A-Za-z0-9_-]+\b/g; -const COMMIT_RE = /\b[0-9a-f]{7,40}\b/g; +const COMMIT_RE = /\bcommit(?:\s+|[=:])([0-9a-f]{7,40})\b/gi; const addMatches = (set: Set, text: string, regex: RegExp, group = 0) => { for (const match of text.matchAll(regex)) { @@ -27,10 +27,10 @@ const textFromBlock = (block: NormalizedBlock): string => { const 
addEvidenceFromText = (activity: EvidenceActivity, text: string) => { addMatches(activity.paths, text, ABS_PATH_RE, 1); - addMatches(activity.paths, text, PATH_RE, 1); + addMatches(activity.paths, text, PROJECT_PATH_RE, 1); addMatches(activity.errorSignatures, text, ERROR_SIGNATURE_RE); addMatches(activity.identifiers, text, ID_RE); - addMatches(activity.identifiers, text, COMMIT_RE); + addMatches(activity.identifiers, text, COMMIT_RE, 1); }; export const extractEvidence = (blocks: NormalizedBlock[]): EvidenceActivity => { @@ -58,9 +58,9 @@ export const extractEvidence = (blocks: NormalizedBlock[]): EvidenceActivity => }; const cap = (set: Set, limit: number): string => { - const values = [...set].sort(); + const values = [...set]; if (values.length <= limit) return values.join(", "); - return `${values.slice(0, limit).join(", ")} (+${values.length - limit} more)`; + return `${values.slice(0, limit).join(", ")} (+more)`; }; export const formatEvidence = (activity: EvidenceActivity): string[] => { diff --git a/tests/compile.test.ts b/tests/compile.test.ts index 8dd5f98..801f67a 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -89,4 +89,17 @@ describe("compile", () => { expect(current).toContain("npm test"); expect(current).not.toContain("prefer yarn test"); }); + + it("preserves fresh brief-only updates when merging previous summary", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [ + userMsg("Status update: wiring is started; next validate dashboard provisioning."), + assistantText("Next step: validate dashboard provisioning without changing the stable objective."), + ], + }); + expect(r).toContain("Existing goal"); + expect(r).toContain("validate dashboard provisioning"); + }); }); From 53dc551b2ef0376605bcfe8f913c29f546767965 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:09:37 +0200 Subject: [PATCH 08/65] test: add 
cache-stability assertions Add --assert-cache as a separate benchmark gate for synthetic cache-stability probes. Correctness assertions remain focused on recovery/leak checks, while cache assertions verify volatile-only updates do not rewrite early stable prompt layers or collapse the stable prefix below the configured threshold. Validation: node --check bench/compaction/offline-runner.ts scripts/bench-compaction.ts; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache. --- README.md | 2 ++ bench/compaction/README.md | 6 ++++++ bench/compaction/offline-runner.ts | 18 ++++++++++++++++++ scripts/bench-compaction.ts | 26 ++++++++++++++++++++------ 4 files changed, 46 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 05c4721..1d71812 100644 --- a/README.md +++ b/README.md @@ -230,7 +230,9 @@ Use assertion mode when checking a selected compactor against the current benchm ```bash bun scripts/bench-compaction.ts --compactors pi-vcc --assert +bun scripts/bench-compaction.ts --compactors pi-vcc --assert-cache docker run --rm pi-vcc-bench --compactors pi-vcc --assert +docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache ``` Sample real Pi sessions for size, latency, and cache-churn metrics: diff --git a/bench/compaction/README.md b/bench/compaction/README.md index 3142b6b..c9d7440 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -141,6 +141,12 @@ Run assertion mode. This exits non-zero if any selected compactor misses active/ bun scripts/bench-compaction.ts --compactors pi-vcc --assert ``` +Run cache assertion mode for synthetic cache-stability probes. 
This is separate from correctness assertions and currently checks that volatile-only updates do not rewrite early stable prompt layers: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc --assert-cache +``` + Append sampled real Pi sessions from a local session directory. Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: ```bash diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index f571644..f862bc4 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -729,6 +729,24 @@ export const failedGatesOf = (cycle: CycleMetrics): string[] => { return failures; }; +const CACHE_STABILITY_CASES = new Set(["cache-bust-volatile-next-step"]); +const EARLY_VOLATILE_LAYERS = new Set([ + "Pi VCC Session Goal", + "Pi VCC Files And Changes", + "Pi VCC Evidence Handles", + "Pi VCC User Preferences", +]); + +export const failedCacheGatesOf = (cycle: CycleMetrics): string[] => { + if (!CACHE_STABILITY_CASES.has(cycle.caseId) || cycle.cycle <= 1) return []; + const failures: string[] = []; + if (cycle.firstChangedPromptLayer && EARLY_VOLATILE_LAYERS.has(cycle.firstChangedPromptLayer)) { + failures.push("early-prompt-layer-changed"); + } + if ((cycle.stablePrefixTokens ?? 
0) < 90) failures.push("stable-prefix-too-small"); + return failures; +}; + export const runOfflineCompactionBenchmark = (options: { cases?: CompactionBenchmarkCase[]; compactors?: OfflineCompactor[]; diff --git a/scripts/bench-compaction.ts b/scripts/bench-compaction.ts index ce1a91c..a690743 100644 --- a/scripts/bench-compaction.ts +++ b/scripts/bench-compaction.ts @@ -1,5 +1,5 @@ #!/usr/bin/env node -import { failedGatesOf, offlineCompactors, runOfflineCompactionBenchmark } from "../bench/compaction/offline-runner"; +import { failedCacheGatesOf, failedGatesOf, offlineCompactors, runOfflineCompactionBenchmark } from "../bench/compaction/offline-runner"; import { syntheticCompactionCases } from "../bench/compaction/synthetic-cases"; import { loadRealSessionCases } from "../bench/compaction/real-sessions"; @@ -50,6 +50,9 @@ const result = runOfflineCompactionBenchmark({ compactors, cases: filteredCases, const failures = result.cycles .map((cycle) => ({ cycle, gates: failedGatesOf(cycle) })) .filter((entry) => entry.gates.length > 0); +const cacheFailures = result.cycles + .map((cycle) => ({ cycle, gates: failedCacheGatesOf(cycle) })) + .filter((entry) => entry.gates.length > 0); if (hasFlag("--jsonl")) { for (const cycle of result.cycles) { @@ -59,14 +62,16 @@ if (hasFlag("--jsonl")) { console.log(JSON.stringify(result, null, 2)); } -if (hasFlag("--assert") && failures.length > 0) { - console.error(`\nCompaction benchmark assertions failed: ${failures.length} cycle(s)`); - for (const { cycle, gates } of failures.slice(0, 20)) { +const printFailures = (title: string, entries: typeof failures) => { + console.error(`\n${title}: ${entries.length} cycle(s)`); + for (const { cycle, gates } of entries.slice(0, 20)) { console.error(JSON.stringify({ caseId: cycle.caseId, compactor: cycle.compactor, cycle: cycle.cycle, gates, + firstChangedPromptLayer: cycle.firstChangedPromptLayer, + stablePrefixTokens: cycle.stablePrefixTokens, missingActiveTerms: cycle.missingActiveTerms, 
missingCurrentTerms: cycle.missingCurrentTerms, missingRecallTerms: cycle.missingRecallTerms, @@ -75,8 +80,17 @@ if (hasFlag("--assert") && failures.length > 0) { leakedActiveAbsentTerms: cycle.leakedActiveAbsentTerms, })); } - if (failures.length > 20) { - console.error(`... ${failures.length - 20} additional failing cycle(s) omitted`); + if (entries.length > 20) { + console.error(`... ${entries.length - 20} additional failing cycle(s) omitted`); } +}; + +if (hasFlag("--assert") && failures.length > 0) { + printFailures("Compaction benchmark assertions failed", failures); + process.exit(1); +} + +if (hasFlag("--assert-cache") && cacheFailures.length > 0) { + printFailures("Compaction cache assertions failed", cacheFailures); process.exit(1); } From d0a996208e74c652897141aa2f1a02c5699247be Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:14:39 +0200 Subject: [PATCH 09/65] fix: split scope updates from stable goals Render later scope changes in a Current Scope section instead of appending them to Session Goal. This keeps the original objective stable for cache reuse while preserving legitimate user scope extensions and keeping status-like updates volatile. Also keep brief-only fresh updates during summary merges so status/next-step turns are not dropped when they do not produce header sections. Validation: node --check on changed summary and test files; git diff --check; focused Bun tests for extract-goals, build-sections, format, and compile; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; real-session replay with --show-layer-diff. 
--- src/core/build-sections.ts | 9 +++++---- src/core/format.ts | 1 + src/core/summarize.ts | 6 +++--- src/extract/goals.ts | 28 +++++++++++++++++----------- src/sections.ts | 1 + tests/build-sections.test.ts | 12 ++++++++++++ tests/extract-goals.test.ts | 16 ++++++---------- tests/format.test.ts | 1 + 8 files changed, 46 insertions(+), 28 deletions(-) diff --git a/src/core/build-sections.ts b/src/core/build-sections.ts index 0d57c2a..3784f20 100644 --- a/src/core/build-sections.ts +++ b/src/core/build-sections.ts @@ -2,7 +2,7 @@ import type { NormalizedBlock } from "../types"; import { clip, clipSentence, nonEmptyLines } from "./content"; import { summarizeToolResultForPrompt } from "./tool-result-summary"; import type { SectionData } from "../sections"; -import { extractGoals } from "../extract/goals"; +import { extractGoalState } from "../extract/goals"; import { extractFiles } from "../extract/files"; import { extractPreferences, dedupPreferencesAgainstGoals } from "../extract/preferences"; import { extractCommits, formatCommits } from "../extract/commits"; @@ -64,13 +64,14 @@ const formatFileActivity = (blocks: NormalizedBlock[]): string[] => { export const buildSections = (input: BuildSectionsInput): SectionData => { const { blocks } = input; const briefSections = buildBriefSections(blocks); - const sessionGoal = extractGoals(blocks); + const goalState = extractGoalState(blocks); const userPreferences = dedupPreferencesAgainstGoals( extractPreferences(blocks), - sessionGoal, + [...goalState.stableGoals, ...goalState.currentScope], ); return { - sessionGoal, + sessionGoal: goalState.stableGoals, + currentScope: goalState.currentScope, outstandingContext: extractOutstandingContext(blocks), filesAndChanges: formatFileActivity(blocks), commits: formatCommits(extractCommits(blocks)), diff --git a/src/core/format.ts b/src/core/format.ts index a03b3b5..f09a696 100644 --- a/src/core/format.ts +++ b/src/core/format.ts @@ -26,6 +26,7 @@ export const RECALL_NOTE = 
export const formatSummary = (data: SectionData): string => { const headerParts = [ section("Session Goal", data.sessionGoal), + section("Current Scope", data.currentScope), section("Files And Changes", data.filesAndChanges), section("Commits", data.commits), section("Evidence Handles", data.evidenceHandles), diff --git a/src/core/summarize.ts b/src/core/summarize.ts index cb9ac43..bd14561 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -12,7 +12,7 @@ export interface CompileInput { fileOps?: FileOps; } -const HEADER_NAMES = ["Session Goal", "Files And Changes", "Commits", "Evidence Handles", "User Preferences", "Outstanding Context"]; +const HEADER_NAMES = ["Session Goal", "Current Scope", "Files And Changes", "Commits", "Evidence Handles", "User Preferences", "Outstanding Context"]; const SEPARATOR = "\n\n---\n\n"; @@ -46,8 +46,8 @@ const briefOf = (text: string): string => { /** Merge a header section */ const mergeHeaderSection = (header: string, prev: string, fresh: string): string => { - // Outstanding Context is volatile -- always use fresh only - if (header === "Outstanding Context") return fresh; + // Volatile sections -- always use fresh only + if (header === "Outstanding Context" || header === "Current Scope") return fresh; if (!prev) return fresh; if (!fresh) return prev; diff --git a/src/extract/goals.ts b/src/extract/goals.ts index 633a09f..ba74a4f 100644 --- a/src/extract/goals.ts +++ b/src/extract/goals.ts @@ -55,9 +55,14 @@ const isSubstantiveGoal = (text: string): boolean => { // so that pasted outputs below the actual instruction do not trigger matches. 
const LEADING_CHARS = 200; -export const extractGoals = (blocks: NormalizedBlock[]): string[] => { - const goals: string[] = []; - let latestScopeChange: string[] | null = null; +export interface GoalExtraction { + stableGoals: string[]; + currentScope: string[]; +} + +export const extractGoalState = (blocks: NormalizedBlock[]): GoalExtraction => { + const stableGoals: string[] = []; + let latestScopeChange: string[] = []; for (const b of blocks) { if (b.kind !== "user") continue; @@ -68,8 +73,8 @@ export const extractGoals = (blocks: NormalizedBlock[]): string[] => { .filter((l) => l.length > 5); if (lines.length === 0) continue; - if (goals.length === 0) { - goals.push(...lines.slice(0, 6)); + if (stableGoals.length === 0) { + stableGoals.push(...lines.slice(0, 6)); continue; } @@ -81,10 +86,11 @@ export const extractGoals = (blocks: NormalizedBlock[]): string[] => { } } - // Only emit the [Scope change] marker when we actually captured bullets. - if (latestScopeChange && latestScopeChange.length > 0) { - goals.push("[Scope change]", ...latestScopeChange); - } - - return goals.slice(0, 8); + return { + stableGoals: stableGoals.slice(0, 8), + currentScope: latestScopeChange.slice(0, 5), + }; }; + +export const extractGoals = (blocks: NormalizedBlock[]): string[] => + extractGoalState(blocks).stableGoals; diff --git a/src/sections.ts b/src/sections.ts index 05d764f..8ecc64f 100644 --- a/src/sections.ts +++ b/src/sections.ts @@ -2,6 +2,7 @@ import type { TranscriptEntry } from "./core/brief"; export interface SectionData { sessionGoal: string[]; + currentScope: string[]; outstandingContext: string[]; filesAndChanges: string[]; commits: string[]; diff --git a/tests/build-sections.test.ts b/tests/build-sections.test.ts index 71ce1a8..81d3073 100644 --- a/tests/build-sections.test.ts +++ b/tests/build-sections.test.ts @@ -6,6 +6,7 @@ describe("buildSections", () => { it("returns all-empty for no blocks", () => { const r = buildSections({ blocks: [] }); 
expect(r.sessionGoal).toEqual([]); + expect(r.currentScope).toEqual([]); expect(r.outstandingContext).toEqual([]); expect(r.evidenceHandles).toEqual([]); expect(r.briefTranscript).toBe(""); @@ -73,6 +74,17 @@ describe("buildSections", () => { expect(evidence).toContain("9f3a2b1"); }); + it("separates scope changes from stable goals", () => { + const blocks: NormalizedBlock[] = [ + { kind: "user", text: "Build a local ClickHouse-based OpenTelemetry ingestion and query system." }, + { kind: "user", text: "Good, now lets add meta monitoring for the chart itself." }, + { kind: "user", text: "Status update: validate dashboard provisioning next." }, + ]; + const r = buildSections({ blocks }); + expect(r.sessionGoal).toEqual(["Build a local ClickHouse-based OpenTelemetry ingestion and query system."]); + expect(r.currentScope).toEqual(["Good, now lets add meta monitoring for the chart itself."]); + }); + it("summarizes bulky tool errors without pasting low-value log lines", () => { const text = [ ...Array.from({ length: 20 }, (_, i) => `debug ${i}: warmup ok`), diff --git a/tests/extract-goals.test.ts b/tests/extract-goals.test.ts index d2f941d..4d0c2a3 100644 --- a/tests/extract-goals.test.ts +++ b/tests/extract-goals.test.ts @@ -38,29 +38,27 @@ describe("extractGoals", () => { expect(extractGoals(blocks)).toEqual(["first goal"]); }); - it("detects scope change with explicit pivot keywords", () => { + it("keeps explicit pivot keywords out of stable goals", () => { const blocks: NormalizedBlock[] = [ { kind: "user", text: "Fix login bug" }, { kind: "assistant", text: "ok" }, { kind: "user", text: "Actually, instead let's refactor the auth module" }, ]; const goals = extractGoals(blocks); - expect(goals).toContain("Fix login bug"); - expect(goals).toContain("[Scope change]"); - expect(goals.some((g) => g.includes("refactor"))).toBe(true); + expect(goals).toEqual(["Fix login bug"]); }); - it("detects scope change from new task statements", () => { + it("keeps new task 
statements out of stable goals", () => { const blocks: NormalizedBlock[] = [ { kind: "user", text: "Fix login bug" }, { kind: "assistant", text: "done" }, { kind: "user", text: "Now implement the user registration flow" }, ]; const goals = extractGoals(blocks); - expect(goals).toContain("[Scope change]"); + expect(goals).toEqual(["Fix login bug"]); }); - it("keeps latest scope change only", () => { + it("keeps stable goals unchanged across multiple scope changes", () => { const blocks: NormalizedBlock[] = [ { kind: "user", text: "Fix login bug" }, { kind: "assistant", text: "done" }, @@ -68,9 +66,7 @@ describe("extractGoals", () => { { kind: "assistant", text: "ok" }, { kind: "user", text: "Change of plan, implement password reset" }, ]; - const goals = extractGoals(blocks); - const scopeIdx = goals.indexOf("[Scope change]"); - expect(goals[scopeIdx + 1]).toContain("password reset"); + expect(extractGoals(blocks)).toEqual(["Fix login bug"]); }); it("skips noise short user messages as goals", () => { diff --git a/tests/format.test.ts b/tests/format.test.ts index 61ee710..b549a07 100644 --- a/tests/format.test.ts +++ b/tests/format.test.ts @@ -4,6 +4,7 @@ import type { SectionData } from "../src/sections"; const empty: SectionData = { sessionGoal: [], + currentScope: [], outstandingContext: [], filesAndChanges: [], commits: [], From 40aa00ba511ae177ade33457cc1c7ace4fd389b3 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:18:11 +0200 Subject: [PATCH 10/65] fix: keep merged goals cache-stable When merging with an existing summary, demote fresh goal-like lines into Current Scope so Session Goal remains the stable original objective. Status-only windows keep the prior Current Scope, and direct preference/status-table lines are filtered from stable goals. This moves the sampled real-session first changed layer from Session Goal to Current Scope while preserving scope and continuation terms in the active prompt. 
Validation: node --check src/core/summarize.ts src/extract/goals.ts tests/compile.test.ts tests/extract-goals.test.ts; focused Bun tests for compile and extract-goals; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; real-session replay with --show-layer-diff. --- src/core/summarize.ts | 33 +++++++++++++++++++++++++++++---- src/extract/goals.ts | 7 ++++++- tests/compile.test.ts | 21 +++++++++++++++++++++ tests/extract-goals.test.ts | 16 ++++++++++++++++ 4 files changed, 72 insertions(+), 5 deletions(-) diff --git a/src/core/summarize.ts b/src/core/summarize.ts index bd14561..ee92633 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -46,8 +46,11 @@ const briefOf = (text: string): string => { /** Merge a header section */ const mergeHeaderSection = (header: string, prev: string, fresh: string): string => { - // Volatile sections -- always use fresh only - if (header === "Outstanding Context" || header === "Current Scope") return fresh; + // Current Scope is the latest explicit scope change; keep previous when the + // fresh window only has status/transcript updates. + if (header === "Current Scope") return fresh || prev; + // Outstanding Context is volatile -- always use fresh only. 
+ if (header === "Outstanding Context") return fresh; if (!prev) return fresh; if (!fresh) return prev; @@ -116,11 +119,33 @@ const mergeBriefTranscript = (prev: string, fresh: string): string => { return prev + "\n\n" + fresh; }; +const demoteFreshGoalToScope = (fresh: string): string => { + const goal = sectionOf(fresh, "Session Goal"); + if (!goal) return fresh; + + const goalLines = goal.split("\n").slice(1).filter((line) => line.startsWith("- ")); + const withoutGoal = fresh + .replace(goal, "") + .replace(/^\s+/, "") + .replace(/\n{3,}/g, "\n\n") + .trim(); + if (goalLines.length === 0) return withoutGoal; + + const currentScope = sectionOf(withoutGoal, "Current Scope"); + if (currentScope) { + return withoutGoal.replace(currentScope, `${currentScope}\n${goalLines.join("\n")}`); + } + + const scopeSection = `[Current Scope]\n${goalLines.join("\n")}`; + return withoutGoal ? `${scopeSection}\n\n${withoutGoal}` : scopeSection; +}; + const mergePrevious = (prev: string, fresh: string): string => { + const mergeFresh = demoteFreshGoalToScope(fresh); // Merge header sections const headers = HEADER_NAMES .map((header) => { - const freshSec = sectionOf(fresh, header); + const freshSec = sectionOf(mergeFresh, header); const prevSec = sectionOf(prev, header); return mergeHeaderSection(header, prevSec, freshSec); }) @@ -128,7 +153,7 @@ const mergePrevious = (prev: string, fresh: string): string => { // Merge brief transcript const prevBrief = briefOf(prev); - const freshBrief = briefOf(fresh); + const freshBrief = briefOf(mergeFresh); const mergedBrief = mergeBriefTranscript(prevBrief, freshBrief); const parts: string[] = []; diff --git a/src/extract/goals.ts b/src/extract/goals.ts index ba74a4f..b1338f1 100644 --- a/src/extract/goals.ts +++ b/src/extract/goals.ts @@ -10,6 +10,7 @@ const TASK_RE = const PREFERENCE_RE = /\b(prefer(?:s|red|ring)?|always use|never use|please use|please avoid|do not use|don'?t use)\b/i; +const DIRECT_PREFERENCE_RE = 
/\b(?:prefer(?:s|red|ring)?|please use|please avoid|always use|never use)\b/i; const PREFERENCE_WITH_TASK_RE = /\b(fix|implement|add|create|build|refactor|debug|investigate|update|remove|delete|migrate|deploy|write|set up)\b/i; @@ -22,6 +23,9 @@ const VOLATILE_STATUS_RE = /^\s*(?:current blocker|blocker update|status update| const NON_GOAL_RE = /^\s*[\[│├└─╭╰]|```|^\s*(=[A-Z]+\(|function |const |let |var |import |export |class )|^(https?:|file:|\/[A-Za-z])|\\n|^\s*For each\b|\bin full\b[^\n]*\b(comments|issue|issues|PRs?|linked)\b/; +const TABLE_OR_STATUS_RE = + /\b(READY\s+STATUS\s+RESTARTS|\d+\/\d+\s+(?:Running|Pending|Completed|Error|CrashLoopBackOff)\b)/; + // Signals that the rest of the user message is a command template (e.g. /issues), // in which case we should stop collecting goals at the signal line. const TEMPLATE_SIGNAL_RE = @@ -38,7 +42,7 @@ const stripLeadingBullet = (line: string): string => const MAX_GOAL_CHARS = 200; const isPreferenceOnly = (text: string): boolean => - PREFERENCE_RE.test(text) && !PREFERENCE_WITH_TASK_RE.test(text); + DIRECT_PREFERENCE_RE.test(text) || (PREFERENCE_RE.test(text) && !PREFERENCE_WITH_TASK_RE.test(text)); const isSubstantiveGoal = (text: string): boolean => { const t = text.trim(); @@ -46,6 +50,7 @@ const isSubstantiveGoal = (text: string): boolean => { if (t.length > MAX_GOAL_CHARS) return false; if (NOISE_SHORT_RE.test(t)) return false; if (VOLATILE_STATUS_RE.test(t)) return false; + if (TABLE_OR_STATUS_RE.test(t)) return false; if (NON_GOAL_RE.test(t)) return false; if (isPreferenceOnly(t)) return false; return true; diff --git a/tests/compile.test.ts b/tests/compile.test.ts index 801f67a..d1015cc 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -102,4 +102,25 @@ describe("compile", () => { expect(r).toContain("Existing goal"); expect(r).toContain("validate dashboard provisioning"); }); + + it("demotes fresh goals to current scope when merging previous summary", () => { + const previousSummary = 
"[Session Goal]\n- Existing goal\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [userMsg("Also add meta monitoring dashboards")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[Session Goal]\n- Existing goal"); + expect(current).toContain("[Current Scope]\n- Also add meta monitoring dashboards"); + }); + + it("keeps prior current scope when fresh window is status-only", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[Current Scope]\n- Add meta monitoring\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [userMsg("Status update: validate dashboard provisioning next")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[Current Scope]\n- Add meta monitoring"); + }); }); diff --git a/tests/extract-goals.test.ts b/tests/extract-goals.test.ts index 4d0c2a3..88b2870 100644 --- a/tests/extract-goals.test.ts +++ b/tests/extract-goals.test.ts @@ -90,4 +90,20 @@ describe("extractGoals", () => { "Benchmark cache-aware compaction. 
Stable objective: preserve Layer 0 and Layer 1 prefixes.", ]); }); + + it("keeps pasted kubernetes status tables out of stable goals", () => { + const goals = extractGoals([ + { kind: "user", text: "Fix chart naming" }, + { kind: "user", text: "NAME READY STATUS RESTARTS AGE\ngrafana-db-1 1/1 Running 0 101m" }, + ]); + expect(goals).toEqual(["Fix chart naming"]); + }); + + it("keeps direct preference instructions out of stable goals", () => { + const goals = extractGoals([ + { kind: "user", text: "Install kube-prometheus-stack" }, + { kind: "user", text: "I hate verbose naming; please use the name fix thing they provide" }, + ]); + expect(goals).toEqual(["Install kube-prometheus-stack"]); + }); }); From 039b5227f182526ab825714f280c542a37e6e753 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:20:40 +0200 Subject: [PATCH 11/65] fix: ignore preference-like error text Skip copied error/stack-trace lines during preference extraction so phrases like 'always include the lines below' do not become durable user preferences. Real-session diagnostics still show legitimate preference growth, but the bogus SYNTAX_ERROR stack-trace line is filtered out. Validation: node --check src/extract/preferences.ts tests/extract-preferences.test.ts; focused Bun preference tests; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; real-session replay with --show-layer-diff. --- src/extract/preferences.ts | 1 + tests/extract-preferences.test.ts | 7 +++++++ 2 files changed, 8 insertions(+) diff --git a/src/extract/preferences.ts b/src/extract/preferences.ts index 9a93c44..200a3d9 100644 --- a/src/extract/preferences.ts +++ b/src/extract/preferences.ts @@ -25,6 +25,7 @@ export const extractPreferences = (blocks: NormalizedBlock[]): string[] => { if (trimmed.length > 200) continue; // Reject questions. 
if (trimmed.endsWith("?") || trimmed.includes("?...")) continue; + if (/\b(SYNTAX_ERROR|Stack trace|Exception|Traceback)\b/i.test(trimmed)) continue; if (!PREF_PATTERNS.some((p) => p.test(trimmed))) continue; const clipped = clip(trimmed, 200); diff --git a/tests/extract-preferences.test.ts b/tests/extract-preferences.test.ts index cf8f250..64241dd 100644 --- a/tests/extract-preferences.test.ts +++ b/tests/extract-preferences.test.ts @@ -27,4 +27,11 @@ describe("extractPreferences", () => { ]; expect(extractPreferences(blocks).length).toBe(1); }); + + it("ignores copied error text that says always include stack traces", () => { + const blocks: NormalizedBlock[] = [ + { kind: "user", text: "METRICS ENGI... . (SYNTAX_ERROR), Stack trace (when copying this message, always include the lines below):" }, + ]; + expect(extractPreferences(blocks)).toEqual([]); + }); }); From 8398694a3da9d953a66ed09a55783b3cc26eaebc Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:28:00 +0200 Subject: [PATCH 12/65] fix: filter pasted config from scope Exclude pasted Kubernetes/config fragments, shell prompts, and structured log lines from goal and current-scope extraction. This keeps copied diagnostic output from bloating Current Scope while preserving real user scope updates. Validation: node --check src/extract/goals.ts tests/extract-goals.test.ts; focused Bun extract-goals tests; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; real-session replay with --show-layer-diff. 
--- src/extract/goals.ts | 4 ++++ tests/extract-goals.test.ts | 16 ++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/src/extract/goals.ts b/src/extract/goals.ts index b1338f1..e2eb23a 100644 --- a/src/extract/goals.ts +++ b/src/extract/goals.ts @@ -25,6 +25,8 @@ const NON_GOAL_RE = const TABLE_OR_STATUS_RE = /\b(READY\s+STATUS\s+RESTARTS|\d+\/\d+\s+(?:Running|Pending|Completed|Error|CrashLoopBackOff)\b)/; +const CONFIG_FRAGMENT_RE = /^\s*(?:apiVersion|kind|metadata|labels|annotations|spec|data|creationTimestamp|name|namespace|app(?:\.kubernetes\.io\/[-\w]+)?|chart|grafana_dashboard|heritage|release|resourceVersion|uid)\s*:/i; +const LOG_OR_COMMAND_RE = /^\s*(?:[❯$>]\s+|\{.*"(?:time|level|msg)"\s*:)/; // Signals that the rest of the user message is a command template (e.g. /issues), // in which case we should stop collecting goals at the signal line. @@ -51,6 +53,8 @@ const isSubstantiveGoal = (text: string): boolean => { if (NOISE_SHORT_RE.test(t)) return false; if (VOLATILE_STATUS_RE.test(t)) return false; if (TABLE_OR_STATUS_RE.test(t)) return false; + if (CONFIG_FRAGMENT_RE.test(t)) return false; + if (LOG_OR_COMMAND_RE.test(t)) return false; if (NON_GOAL_RE.test(t)) return false; if (isPreferenceOnly(t)) return false; return true; diff --git a/tests/extract-goals.test.ts b/tests/extract-goals.test.ts index 88b2870..038fca2 100644 --- a/tests/extract-goals.test.ts +++ b/tests/extract-goals.test.ts @@ -106,4 +106,20 @@ describe("extractGoals", () => { ]); expect(goals).toEqual(["Install kube-prometheus-stack"]); }); + + it("keeps pasted config fragments out of stable goals", () => { + const goals = extractGoals([ + { kind: "user", text: "Fix dashboard provisioning" }, + { kind: "user", text: "kind: ConfigMap\nmetadata:\ncreationTimestamp: \"2026-04-19T22:23:16Z\"\nlabels:\napp: grafana\napp.kubernetes.io/instance: monitoring\nchart: kubePrometheusStack-83.6.0\ngrafana_dashboard: \"1\"\nresourceVersion: \"21956\"\nuid: 
d27df580-8819-472e-90d4-0ac281b138f5" }, + ]); + expect(goals).toEqual(["Fix dashboard provisioning"]); + }); + + it("keeps pasted commands and JSON logs out of stable goals", () => { + const goals = extractGoals([ + { kind: "user", text: "Fix dashboard provisioning" }, + { kind: "user", text: "❯ kubectl get cm monitoring-k8s-monitoring-cluster-total -oyaml\n{\"time\": \"2026-04-19T22:20:47Z\", \"msg\": \"Starting collector\", \"level\": \"INFO\"}" }, + ]); + expect(goals).toEqual(["Fix dashboard provisioning"]); + }); }); From 7442eb74a5b8e01d3d307e48a565686ffae0f39f Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:34:45 +0200 Subject: [PATCH 13/65] test: compare compaction across refs Add a Docker-backed ref comparison runner that builds isolated git worktrees for a baseline and head ref, runs the same compaction benchmark in each image, and writes paired JSONL plus a Markdown delta report. Document the original-vs-implementation workflow and use 53dc551 as the practical runnable baseline for the current benchmark harness. Validation: node --check scripts/compare-compaction-refs.mjs; git diff --check; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --case-filter cache-bust --out /tmp/pi-vcc-ref-compare.gKFg5K; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --real-only --real-sessions-dir ~/.pi/agent/sessions --real-limit 1 --show-layer-diff --out /tmp/pi-vcc-ref-compare-real.lUVm68. --- bench/compaction/README.md | 45 ++++++ scripts/compare-compaction-refs.mjs | 237 ++++++++++++++++++++++++++++ 2 files changed, 282 insertions(+) create mode 100755 scripts/compare-compaction-refs.mjs diff --git a/bench/compaction/README.md b/bench/compaction/README.md index c9d7440..b062e70 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -197,6 +197,51 @@ docker run --rm \ Assertion failures are expected for current baselines while the RED scenarios are documenting known gaps. 
Use selected compactors when checking one implementation at a time. +## Comparing refs + +Use the ref comparison runner when you need an original-vs-implementation benchmark instead of a single working-tree run. It creates isolated git worktrees, builds each ref as its own Docker image, runs the same benchmark command in both images, and writes paired JSONL plus a Markdown delta report. + +A practical runnable baseline is `53dc551`, the cache-stability assertion checkpoint before the later production layout/extraction refinements. Compare it with the current checkout: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --out /tmp/pi-vcc-compaction-compare +``` + +Older refs can be useful historically, but they must contain a runnable version of the benchmark harness and its source dependencies. + +Include sampled real sessions with the same Docker-only benchmark path: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 1 \ + --show-layer-diff \ + --out /tmp/pi-vcc-compaction-compare-real +``` + +The output directory contains: + +- `baseline.jsonl`: per-cycle metrics for the baseline ref +- `head.jsonl`: per-cycle metrics for the implementation ref +- `comparison.md`: aggregate deltas and notable changed cycles +- `baseline.stderr.log` / `head.stderr.log`: benchmark diagnostics from each Docker run + +For cache-aware compaction, the most useful report signals are: + +- increased mean stable-prefix tokens +- later `firstChangedPromptLayer` in matched cycles +- fewer cache failure cycles +- no increase in correctness failure cycles +- lower or justified full-prompt token counts + ## Interpreting results A useful compactor should: diff --git a/scripts/compare-compaction-refs.mjs b/scripts/compare-compaction-refs.mjs new file mode 100755 index 0000000..54fe1db --- 
/dev/null +++ b/scripts/compare-compaction-refs.mjs @@ -0,0 +1,237 @@ +#!/usr/bin/env node +import { spawnSync } from "node:child_process"; +import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from "node:fs"; +import { tmpdir } from "node:os"; +import { basename, join, resolve } from "node:path"; + +const args = process.argv.slice(2); + +const valueOf = (name, fallback) => { + const inline = args.find((arg) => arg.startsWith(`${name}=`)); + if (inline) return inline.slice(name.length + 1); + const index = args.indexOf(name); + return index >= 0 ? args[index + 1] : fallback; +}; + +const hasFlag = (name) => args.includes(name); + +const baselineRef = valueOf("--baseline", "53dc551"); +const headRef = valueOf("--head", "HEAD"); +const compactors = valueOf("--compactors", "pi-vcc"); +const realSessionsDir = valueOf("--real-sessions-dir"); +const realLimit = valueOf("--real-limit"); +const caseFilter = valueOf("--case-filter"); +const outDir = resolve(valueOf("--out", join(tmpdir(), `pi-vcc-compaction-compare-${Date.now()}`))); +const keepWorktrees = hasFlag("--keep-worktrees"); +const includeRealOnly = hasFlag("--real-only"); +const includeLayerDiff = hasFlag("--show-layer-diff"); + +const run = (command, commandArgs, options = {}) => { + const result = spawnSync(command, commandArgs, { + cwd: options.cwd, + stdio: options.capture ? ["ignore", "pipe", "pipe"] : "inherit", + encoding: "utf8", + }); + if (result.status !== 0) { + const rendered = `${command} ${commandArgs.join(" ")}`; + if (options.capture) { + process.stderr.write(result.stdout ?? ""); + process.stderr.write(result.stderr ?? ""); + } + throw new Error(`Command failed (${result.status}): ${rendered}`); + } + return result.stdout ?? 
""; +}; + +const repoRoot = run("git", ["rev-parse", "--show-toplevel"], { capture: true }).trim(); + +const ensureRef = (ref) => { + run("git", ["rev-parse", "--verify", `${ref}^{commit}`], { cwd: repoRoot, capture: true }); +}; + +const safeName = (value) => value.replace(/[^a-zA-Z0-9_.-]+/g, "-").replace(/^-+|-+$/g, "").slice(0, 60) || "ref"; +const runId = `${Date.now()}-${process.pid}`; +const worktreeRoot = join(tmpdir(), `pi-vcc-ref-compare-${runId}`); +const baselineWorktree = join(worktreeRoot, `baseline-${safeName(baselineRef)}`); +const headWorktree = join(worktreeRoot, `head-${safeName(headRef)}`); + +const benchArgs = () => { + const out = ["--jsonl", "--compactors", compactors]; + if (includeRealOnly) out.push("--real-only"); + if (realSessionsDir) out.push("--real-sessions-dir", "/sessions"); + if (realLimit) out.push("--real-limit", realLimit); + if (caseFilter) out.push("--case-filter", caseFilter); + if (includeLayerDiff) out.push("--show-layer-diff"); + return out; +}; + +const readJsonl = (path) => readFileSync(path, "utf8") + .split("\n") + .map((line) => line.trim()) + .filter(Boolean) + .map((line) => JSON.parse(line)); + +const correctnessFailures = (cycle) => [ + ...(cycle.missingActiveTerms ?? []), + ...(cycle.missingCurrentTerms ?? []), + ...(cycle.missingRecallTerms ?? []), + ...(cycle.leakedForbiddenTerms ?? []), + ...(cycle.leakedForbiddenCurrentTerms ?? []), + ...(cycle.leakedActiveAbsentTerms ?? []), +].length; + +const cacheFailures = (cycle) => { + if (cycle.caseId !== "cache-bust-volatile-next-step" || cycle.cycle <= 1) return 0; + const early = new Set([ + "Pi VCC Session Goal", + "Pi VCC Files And Changes", + "Pi VCC Evidence Handles", + "Pi VCC User Preferences", + ]); + let count = 0; + if (cycle.firstChangedPromptLayer && early.has(cycle.firstChangedPromptLayer)) count += 1; + if ((cycle.stablePrefixTokens ?? 
0) < 90) count += 1; + return count; +}; + +const mean = (items, selector) => { + const values = items.map(selector).filter((value) => typeof value === "number" && Number.isFinite(value)); + if (values.length === 0) return null; + return values.reduce((sum, value) => sum + value, 0) / values.length; +}; + +const fmt = (value, digits = 2) => value === null || value === undefined ? "n/a" : Number(value).toFixed(digits); +const signed = (value, digits = 2) => value === null || value === undefined ? "n/a" : `${value >= 0 ? "+" : ""}${Number(value).toFixed(digits)}`; + +const summarize = (label, rows) => ({ + label, + cycles: rows.length, + meanStablePrefixTokens: mean(rows, (row) => row.stablePrefixTokens), + meanFullPromptTokensEst: mean(rows, (row) => row.fullPromptTokensEst), + meanCurrentTokensEst: mean(rows, (row) => row.currentTokensEst), + correctnessFailureCycles: rows.filter((row) => correctnessFailures(row) > 0).length, + cacheFailureCycles: rows.filter((row) => cacheFailures(row) > 0).length, +}); + +const keyOf = (row) => `${row.caseId}\u0000${row.compactor}\u0000${row.cycle}`; + +const markdownReport = ({ baselineRows, headRows, baselinePath, headPath }) => { + const baseline = summarize("baseline", baselineRows); + const head = summarize("head", headRows); + const baselineByKey = new Map(baselineRows.map((row) => [keyOf(row), row])); + const pairs = headRows + .map((headRow) => ({ baselineRow: baselineByKey.get(keyOf(headRow)), headRow })) + .filter((pair) => pair.baselineRow); + const stableDeltas = pairs.map(({ baselineRow, headRow }) => (headRow.stablePrefixTokens ?? 0) - (baselineRow.stablePrefixTokens ?? 0)); + const tokenDeltas = pairs.map(({ baselineRow, headRow }) => headRow.fullPromptTokensEst - baselineRow.fullPromptTokensEst); + const currentDeltas = pairs.map(({ baselineRow, headRow }) => headRow.currentTokensEst - baselineRow.currentTokensEst); + const improved = pairs.filter(({ baselineRow, headRow }) => + (headRow.stablePrefixTokens ?? 
0) > (baselineRow.stablePrefixTokens ?? 0) + || correctnessFailures(headRow) < correctnessFailures(baselineRow) + || cacheFailures(headRow) < cacheFailures(baselineRow) + ); + const regressed = pairs.filter(({ baselineRow, headRow }) => + (headRow.stablePrefixTokens ?? 0) < (baselineRow.stablePrefixTokens ?? 0) + || correctnessFailures(headRow) > correctnessFailures(baselineRow) + || cacheFailures(headRow) > cacheFailures(baselineRow) + ); + const notable = pairs + .filter(({ baselineRow, headRow }) => baselineRow.firstChangedPromptLayer !== headRow.firstChangedPromptLayer + || correctnessFailures(baselineRow) !== correctnessFailures(headRow) + || cacheFailures(baselineRow) !== cacheFailures(headRow)) + .slice(0, 20); + + const lines = []; + lines.push("# Compaction Ref Comparison"); + lines.push(""); + lines.push(`- Baseline ref: \`${baselineRef}\``); + lines.push(`- Head ref: \`${headRef}\``); + lines.push(`- Compactors: \`${compactors}\``); + if (realSessionsDir) lines.push(`- Real sessions: \`${realSessionsDir}\``); + if (realLimit) lines.push(`- Real session limit: \`${realLimit}\``); + if (caseFilter) lines.push(`- Case filter: \`${caseFilter}\``); + lines.push(`- Baseline JSONL: \`${baselinePath}\``); + lines.push(`- Head JSONL: \`${headPath}\``); + lines.push(""); + lines.push("## Aggregate"); + lines.push(""); + lines.push("| metric | baseline | head | delta |"); + lines.push("| --- | ---: | ---: | ---: |"); + lines.push(`| cycles | ${baseline.cycles} | ${head.cycles} | ${head.cycles - baseline.cycles} |`); + lines.push(`| mean stable prefix tokens | ${fmt(baseline.meanStablePrefixTokens)} | ${fmt(head.meanStablePrefixTokens)} | ${signed(mean(stableDeltas, (v) => v))} |`); + lines.push(`| mean full prompt tokens | ${fmt(baseline.meanFullPromptTokensEst)} | ${fmt(head.meanFullPromptTokensEst)} | ${signed(mean(tokenDeltas, (v) => v))} |`); + lines.push(`| mean current tokens | ${fmt(baseline.meanCurrentTokensEst)} | ${fmt(head.meanCurrentTokensEst)} | 
${signed(mean(currentDeltas, (v) => v))} |`); + lines.push(`| correctness failure cycles | ${baseline.correctnessFailureCycles} | ${head.correctnessFailureCycles} | ${head.correctnessFailureCycles - baseline.correctnessFailureCycles} |`); + lines.push(`| cache failure cycles | ${baseline.cacheFailureCycles} | ${head.cacheFailureCycles} | ${head.cacheFailureCycles - baseline.cacheFailureCycles} |`); + lines.push(""); + lines.push("## Matched-cycle signals"); + lines.push(""); + lines.push(`- Matched cycles: ${pairs.length}`); + lines.push(`- Improved cycles: ${improved.length}`); + lines.push(`- Regressed cycles: ${regressed.length}`); + lines.push(""); + lines.push("## Notable changed cycles"); + lines.push(""); + if (notable.length === 0) { + lines.push("No notable first-layer, correctness, or cache-gate changes in matched cycles."); + } else { + lines.push("| case | compactor | cycle | baseline first layer | head first layer | stable prefix delta | correctness delta | cache delta |"); + lines.push("| --- | --- | ---: | --- | --- | ---: | ---: | ---: |"); + for (const { baselineRow, headRow } of notable) { + lines.push(`| ${headRow.caseId} | ${headRow.compactor} | ${headRow.cycle} | ${baselineRow.firstChangedPromptLayer ?? "n/a"} | ${headRow.firstChangedPromptLayer ?? "n/a"} | ${signed((headRow.stablePrefixTokens ?? 0) - (baselineRow.stablePrefixTokens ?? 
0), 0)} | ${correctnessFailures(headRow) - correctnessFailures(baselineRow)} | ${cacheFailures(headRow) - cacheFailures(baselineRow)} |`); + } + } + lines.push(""); + return `${lines.join("\n")}\n`; +}; + +const runBench = ({ label, ref, worktree }) => { + console.error(`Adding ${label} worktree for ${ref}`); + run("git", ["worktree", "add", "--detach", worktree, ref], { cwd: repoRoot }); + const image = `pi-vcc-bench-${safeName(label)}-${runId}`.toLowerCase(); + console.error(`Building ${image}`); + run("docker", ["build", "-t", image, "."], { cwd: worktree }); + const jsonlPath = join(outDir, `${label}.jsonl`); + const stderrPath = join(outDir, `${label}.stderr.log`); + const dockerArgs = ["run", "--rm"]; + if (realSessionsDir) dockerArgs.push("-v", `${resolve(realSessionsDir)}:/sessions:ro`); + dockerArgs.push(image, ...benchArgs()); + console.error(`Running ${label} benchmark`); + const result = spawnSync("docker", dockerArgs, { cwd: worktree, encoding: "utf8", stdio: ["ignore", "pipe", "pipe"] }); + writeFileSync(jsonlPath, result.stdout ?? ""); + writeFileSync(stderrPath, result.stderr ?? ""); + if (result.status !== 0) { + process.stderr.write(result.stderr ?? 
""); + throw new Error(`${label} benchmark failed with status ${result.status}; see ${stderrPath}`); + } + return { jsonlPath, stderrPath }; +}; + +try { + ensureRef(baselineRef); + ensureRef(headRef); + mkdirSync(outDir, { recursive: true }); + mkdirSync(worktreeRoot, { recursive: true }); + + const baseline = runBench({ label: "baseline", ref: baselineRef, worktree: baselineWorktree }); + const head = runBench({ label: "head", ref: headRef, worktree: headWorktree }); + const report = markdownReport({ + baselineRows: readJsonl(baseline.jsonlPath), + headRows: readJsonl(head.jsonlPath), + baselinePath: baseline.jsonlPath, + headPath: head.jsonlPath, + }); + const reportPath = join(outDir, "comparison.md"); + writeFileSync(reportPath, report); + console.log(report); + console.error(`Wrote ${reportPath}`); +} finally { + if (!keepWorktrees && existsSync(worktreeRoot)) { + for (const worktree of [baselineWorktree, headWorktree]) { + if (existsSync(worktree)) { + spawnSync("git", ["worktree", "remove", "--force", worktree], { cwd: repoRoot, stdio: "ignore" }); + } + } + rmSync(worktreeRoot, { recursive: true, force: true }); + } +} From ab8a7583b9c40881bdcfccb448da89945bd3f683 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:37:08 +0200 Subject: [PATCH 14/65] test: expose production compaction layers Add compileWithLayers so benchmarks can consume the production-rendered compaction layers directly instead of maintaining benchmark-side parsing of the final summary text. The existing compile API remains a text-only wrapper with unchanged output. Update the pi-vcc offline compactor to use the production layer metadata while preserving activePromptState and existing benchmark metrics. 
Validation: node --check src/core/summarize.ts bench/compaction/offline-runner.ts tests/compile.test.ts; focused Docker Bun compile tests; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; git diff --check; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --case-filter cache-bust --out /tmp/pi-vcc-layer-ref.nFbWTN; real-session Docker replay with --show-layer-diff. --- bench/compaction/offline-runner.ts | 44 ++++------------------------ src/core/summarize.ts | 47 ++++++++++++++++++++++++++++-- tests/compile.test.ts | 17 ++++++++++- 3 files changed, 65 insertions(+), 43 deletions(-) diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index f862bc4..4080ea9 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -1,8 +1,7 @@ import { performance } from "node:perf_hooks"; import type { Message } from "@mariozechner/pi-ai"; -import { compile } from "../../src/core/summarize"; +import { compileWithLayers } from "../../src/core/summarize"; import { buildSections } from "../../src/core/build-sections"; -import { RECALL_NOTE } from "../../src/core/format"; import { normalize } from "../../src/core/normalize"; import { renderMessage } from "../../src/core/render-entries"; import { clip, textOf } from "../../src/core/content"; @@ -478,54 +477,21 @@ const makeLayeredCheckpoint = (messages: Message[]): LayerSnapshot[] => { const renderLayers = (layers: LayerSnapshot[]): string => layers.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); -const splitCurrentSections = (current: string): LayerSnapshot[] => { - const headers = [...current.matchAll(/^\[(.+?)\]/gm)]; - if (headers.length === 0) return [{ name: "Pi VCC Current Sections", role: "current", text: current }]; - return headers.map((header, index) => { - const start = header.index ?? 
0; - const end = headers[index + 1]?.index ?? current.length; - const title = header[1]; - return { - name: `Pi VCC ${title}`, - role: "current" as const, - text: current.slice(start, end).trimEnd(), - }; - }); -}; - -const splitPiVccSummary = (summary: string): LayerSnapshot[] => { - if (!summary.trim()) return []; - const parts = summary.split(SEPARATOR).map((part) => part.trim()).filter(Boolean); - if (parts.length === 0) return [{ name: "Pi VCC Current Sections", role: "current", text: summary }]; - - const layers: LayerSnapshot[] = []; - const last = parts[parts.length - 1]; - const hasRecallNote = last === RECALL_NOTE; - const bodyParts = hasRecallNote ? parts.slice(0, -1) : parts; - const current = bodyParts[0] ?? ""; - const history = bodyParts.slice(1).join(SEPARATOR); - - if (current) layers.push(...splitCurrentSections(current)); - if (history) layers.push({ name: "Pi VCC Brief Transcript", role: "history", text: history }); - if (hasRecallNote) layers.push({ name: "Pi VCC Recall Note", role: "recall", text: RECALL_NOTE }); - return layers.length > 0 ? 
layers : [{ name: "Pi VCC Current Sections", role: "current", text: summary }]; -}; - export const offlineCompactors: OfflineCompactor[] = [ { name: "pi-vcc", compact: ({ messages, allMessages, previous }) => { const start = performance.now(); - const summary = compile({ messages, previousSummary: previous?.activePromptState }); + const summary = compileWithLayers({ messages, previousSummary: previous?.activePromptState }); const elapsed = performance.now() - start; return { - activePromptState: summary, - layers: splitPiVccSummary(summary), + activePromptState: summary.text, + layers: summary.layers, recallCorpus: renderedDocuments(allMessages), stats: { compactionMs: elapsed, estimatedInputTokens: estimateTokens(sourceTextOf(messages)), - estimatedOutputTokens: estimateTokens(summary), + estimatedOutputTokens: estimateTokens(summary.text), }, }; }, diff --git a/src/core/summarize.ts b/src/core/summarize.ts index ee92633..a7df771 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -12,6 +12,19 @@ export interface CompileInput { fileOps?: FileOps; } +export type CompiledLayerRole = "current" | "history" | "recall"; + +export interface CompiledSummaryLayer { + name: string; + role: CompiledLayerRole; + text: string; +} + +export interface CompileWithLayersResult { + text: string; + layers: CompiledSummaryLayer[]; +} + const HEADER_NAMES = ["Session Goal", "Current Scope", "Files And Changes", "Commits", "Evidence Handles", "User Preferences", "Outstanding Context"]; const SEPARATOR = "\n\n---\n\n"; @@ -119,6 +132,31 @@ const mergeBriefTranscript = (prev: string, fresh: string): string => { return prev + "\n\n" + fresh; }; +const layersOfCurrentSections = (current: string): CompiledSummaryLayer[] => + HEADER_NAMES.map((header) => sectionOf(current, header)) + .filter(Boolean) + .map((text) => { + const header = text.match(/^\[(.+?)\]/)?.[1] ?? 
"Current Sections"; + return { name: `Pi VCC ${header}`, role: "current" as const, text }; + }); + +const layersOfCompiledSummary = (summary: string): CompiledSummaryLayer[] => { + const parts = summary.split(SEPARATOR).map((part) => part.trim()).filter(Boolean); + if (parts.length === 0) return []; + + const last = parts[parts.length - 1]; + const hasRecallNote = last === RECALL_NOTE; + const bodyParts = hasRecallNote ? parts.slice(0, -1) : parts; + const current = bodyParts[0] ?? ""; + const history = bodyParts.slice(1).join(SEPARATOR); + const layers: CompiledSummaryLayer[] = []; + + if (current) layers.push(...layersOfCurrentSections(current)); + if (history) layers.push({ name: "Pi VCC Brief Transcript", role: "history", text: history }); + if (hasRecallNote) layers.push({ name: "Pi VCC Recall Note", role: "recall", text: RECALL_NOTE }); + return layers; +}; + const demoteFreshGoalToScope = (fresh: string): string => { const goal = sectionOf(fresh, "Session Goal"); if (!goal) return fresh; @@ -167,7 +205,9 @@ const mergePrevious = (prev: string, fresh: string): string => { return parts.join(SEPARATOR); }; -export const compile = (input: CompileInput): string => { +export const compile = (input: CompileInput): string => compileWithLayers(input).text; + +export const compileWithLayers = (input: CompileInput): CompileWithLayersResult => { const blocks = filterNoise(normalize(input.messages)); const data = buildSections({ blocks }); const fresh = formatSummary(data); @@ -177,8 +217,9 @@ export const compile = (input: CompileInput): string => { ? stripRecallNote(input.previousSummary) : undefined; const merged = prev ? 
mergePrevious(prev, fresh) : fresh; - if (!merged) return ""; - return merged + SEPARATOR + RECALL_NOTE; + if (!merged) return { text: "", layers: [] }; + const text = merged + SEPARATOR + RECALL_NOTE; + return { text, layers: layersOfCompiledSummary(text) }; }; const stripRecallNote = (text: string): string => { diff --git a/tests/compile.test.ts b/tests/compile.test.ts index d1015cc..6734640 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -1,5 +1,5 @@ import { describe, it, expect } from "bun:test"; -import { compile } from "../src/core/summarize"; +import { compile, compileWithLayers } from "../src/core/summarize"; import { userMsg, assistantText, @@ -28,6 +28,21 @@ describe("compile", () => { expect(r).toContain("Found the issue."); }); + it("exposes production layers without changing compiled text", () => { + const input = { + messages: [ + userMsg("Fix login bug"), + assistantWithToolCall("Read", { path: "auth.ts" }), + assistantText("Found the issue."), + ], + }; + const layered = compileWithLayers(input); + expect(layered.text).toBe(compile(input)); + expect(layered.layers.map((layer) => layer.name)).toContain("Pi VCC Session Goal"); + expect(layered.layers.map((layer) => layer.name)).toContain("Pi VCC Brief Transcript"); + expect(layered.layers.at(-1)).toMatchObject({ name: "Pi VCC Recall Note", role: "recall" }); + }); + it("merges previous summary goals", () => { const r = compile({ messages: [userMsg("New task")], From c915e833f334c319b883a92e82e895b4ecd0c161 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:40:15 +0200 Subject: [PATCH 15/65] refactor: model compaction state explicitly Introduce a structured compaction state between extracted section data and rendered summaries. The renderer owns deterministic current-section ordering plus separate history and recall layers, while compile() preserves the existing text output. 
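The deterministic ordering can be sketched in isolation (a simplified three-section sample; the production state carries all seven current sections and richer layer metadata):

```typescript
// Render current sections in a fixed order, regardless of how the
// extracted data happens to be keyed or populated.
const CURRENT_SECTION_ORDER = ["Session Goal", "Current Scope", "Files And Changes"] as const;

const renderCurrent = (sections: Record<string, string[]>): string =>
  CURRENT_SECTION_ORDER
    .map((title) => ({ title, items: sections[title] ?? [] }))
    .filter(({ items }) => items.length > 0)
    .map(({ title, items }) => `[${title}]\n${items.map((item) => `- ${item}`).join("\n")}`)
    .join("\n\n");

// Insertion order of the input keys does not affect output order.
console.log(renderCurrent({
  "Files And Changes": ["Modified: src/core/summarize.ts"],
  "Session Goal": ["Benchmark compaction"],
}));
```

Empty sections are dropped rather than rendered as bare headers, which keeps the stable prefix free of churn from sections that come and go.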
compileWithLayers now builds from the structured state before merging, which gives the cache benchmark a production representation to compare as later cache-aware rendering becomes more layered. Validation: node --check src/core/compaction-state.ts src/core/summarize.ts tests/compaction-state.test.ts tests/compile.test.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --case-filter cache-bust --out /tmp/pi-vcc-state-ref.MFhfNq. --- src/core/compaction-state.ts | 117 +++++++++++++++++++++++++++++++++ src/core/summarize.ts | 27 ++++---- tests/compaction-state.test.ts | 59 +++++++++++++++++ 3 files changed, 188 insertions(+), 15 deletions(-) create mode 100644 src/core/compaction-state.ts create mode 100644 tests/compaction-state.test.ts diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts new file mode 100644 index 0000000..ff46fcd --- /dev/null +++ b/src/core/compaction-state.ts @@ -0,0 +1,117 @@ +import type { SectionData } from "../sections"; +import { capBrief, RECALL_NOTE } from "./format"; + +export type CompiledLayerRole = "current" | "history" | "recall"; + +export interface CompiledSummaryLayer { + name: string; + role: CompiledLayerRole; + text: string; +} + +export interface CompileWithLayersResult { + text: string; + layers: CompiledSummaryLayer[]; +} + +export interface CompactionState { + current: { + sessionGoal: string[]; + currentScope: string[]; + filesAndChanges: string[]; + commits: string[]; + evidenceHandles: string[]; + userPreferences: string[]; + outstandingContext: string[]; + }; + history: { + briefTranscript: string; + }; + recall: { + note: string; + }; +} + +export const CURRENT_SECTION_ORDER = [ + "Session Goal", + "Current Scope", + "Files And Changes", + "Commits", + 
"Evidence Handles", + "User Preferences", + "Outstanding Context", +] as const; + +export type CurrentSectionName = typeof CURRENT_SECTION_ORDER[number]; + +const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current"] => { + switch (section) { + case "Session Goal": return "sessionGoal"; + case "Current Scope": return "currentScope"; + case "Files And Changes": return "filesAndChanges"; + case "Commits": return "commits"; + case "Evidence Handles": return "evidenceHandles"; + case "User Preferences": return "userPreferences"; + case "Outstanding Context": return "outstandingContext"; + } +}; + +const section = (title: string, items: string[]): string => { + if (items.length === 0) return ""; + const body = items.map((item) => `- ${item}`).join("\n"); + return `[${title}]\n${body}`; +}; + +export const buildCompactionState = (data: SectionData): CompactionState => ({ + current: { + sessionGoal: data.sessionGoal, + currentScope: data.currentScope, + filesAndChanges: data.filesAndChanges, + commits: data.commits, + evidenceHandles: data.evidenceHandles, + userPreferences: data.userPreferences, + outstandingContext: data.outstandingContext, + }, + history: { + briefTranscript: data.briefTranscript, + }, + recall: { + note: RECALL_NOTE, + }, +}); + +export const renderCurrentSections = (state: CompactionState): CompiledSummaryLayer[] => + CURRENT_SECTION_ORDER + .map((title) => ({ title, text: section(title, state.current[stateKeyOf(title)]) })) + .filter((entry) => entry.text) + .map((entry) => ({ + name: `Pi VCC ${entry.title}`, + role: "current" as const, + text: entry.text, + })); + +export const renderCompactionState = ( + state: CompactionState, + options: { includeRecallNote?: boolean } = {}, +): CompileWithLayersResult => { + const layers: CompiledSummaryLayer[] = [ + ...renderCurrentSections(state), + ]; + if (state.history.briefTranscript) { + layers.push({ + name: "Pi VCC Brief Transcript", + role: "history", + text: 
capBrief(state.history.briefTranscript), + }); + } + if (options.includeRecallNote && layers.length > 0) { + layers.push({ name: "Pi VCC Recall Note", role: "recall", text: state.recall.note }); + } + + const bodyLayers = options.includeRecallNote ? layers : layers.filter((layer) => layer.role !== "recall"); + const currentText = bodyLayers.filter((layer) => layer.role === "current").map((layer) => layer.text).join("\n\n"); + const historyText = bodyLayers.filter((layer) => layer.role === "history").map((layer) => layer.text).join("\n\n"); + const recallText = bodyLayers.filter((layer) => layer.role === "recall").map((layer) => layer.text).join("\n\n"); + const text = [currentText, historyText, recallText].filter(Boolean).join("\n\n---\n\n"); + return { text, layers: bodyLayers }; +}; diff --git a/src/core/summarize.ts b/src/core/summarize.ts index a7df771..8bc4565 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -3,8 +3,16 @@ import type { FileOps } from "../types"; import { normalize } from "./normalize"; import { filterNoise } from "./filter-noise"; import { buildSections } from "./build-sections"; -import { formatSummary, capBrief, RECALL_NOTE } from "./format"; +import { capBrief, RECALL_NOTE } from "./format"; import { applyPreferenceCorrections } from "../extract/preferences"; +import { + buildCompactionState, + CURRENT_SECTION_ORDER, + renderCompactionState, + type CompiledLayerRole, + type CompiledSummaryLayer, + type CompileWithLayersResult, +} from "./compaction-state"; export interface CompileInput { messages: Message[]; @@ -12,20 +20,9 @@ export interface CompileInput { fileOps?: FileOps; } -export type CompiledLayerRole = "current" | "history" | "recall"; +export type { CompiledLayerRole, CompiledSummaryLayer, CompileWithLayersResult } from "./compaction-state"; -export interface CompiledSummaryLayer { - name: string; - role: CompiledLayerRole; - text: string; -} - -export interface CompileWithLayersResult { - text: string; - layers: 
CompiledSummaryLayer[]; -} - -const HEADER_NAMES = ["Session Goal", "Current Scope", "Files And Changes", "Commits", "Evidence Handles", "User Preferences", "Outstanding Context"]; +const HEADER_NAMES = [...CURRENT_SECTION_ORDER]; const SEPARATOR = "\n\n---\n\n"; @@ -210,7 +207,7 @@ export const compile = (input: CompileInput): string => compileWithLayers(input) export const compileWithLayers = (input: CompileInput): CompileWithLayersResult => { const blocks = filterNoise(normalize(input.messages)); const data = buildSections({ blocks }); - const fresh = formatSummary(data); + const fresh = renderCompactionState(buildCompactionState(data)).text; // Strip any legacy RECALL_NOTE baked into prev summary (pre-fix format) // so merge doesn't re-stack it inside the brief. const prev = input.previousSummary diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts new file mode 100644 index 0000000..0007b34 --- /dev/null +++ b/tests/compaction-state.test.ts @@ -0,0 +1,59 @@ +import { describe, expect, it } from "bun:test"; +import { buildCompactionState, renderCompactionState } from "../src/core/compaction-state"; +import type { SectionData } from "../src/sections"; + +const sectionData = (overrides: Partial = {}): SectionData => ({ + sessionGoal: [], + currentScope: [], + outstandingContext: [], + filesAndChanges: [], + commits: [], + evidenceHandles: [], + userPreferences: [], + briefTranscript: "", + transcriptEntries: [], + ...overrides, +}); + +describe("compaction state", () => { + it("renders current sections in deterministic order", () => { + const state = buildCompactionState(sectionData({ + userPreferences: ["Use Docker for benchmarks"], + sessionGoal: ["Benchmark compaction"], + filesAndChanges: ["Modified: src/core/summarize.ts"], + currentScope: ["Expose production layers"], + })); + + const rendered = renderCompactionState(state); + expect(rendered.layers.map((layer) => layer.name)).toEqual([ + "Pi VCC Session Goal", + "Pi VCC Current 
Scope", + "Pi VCC Files And Changes", + "Pi VCC User Preferences", + ]); + expect(rendered.text.indexOf("[Session Goal]")).toBeLessThan(rendered.text.indexOf("[Current Scope]")); + expect(rendered.text.indexOf("[Current Scope]")).toBeLessThan(rendered.text.indexOf("[Files And Changes]")); + }); + + it("keeps history and recall in separate trailing layers", () => { + const state = buildCompactionState(sectionData({ + sessionGoal: ["Benchmark compaction"], + briefTranscript: "[user]\nBenchmark compaction", + })); + + const rendered = renderCompactionState(state, { includeRecallNote: true }); + expect(rendered.layers.map((layer) => [layer.name, layer.role])).toEqual([ + ["Pi VCC Session Goal", "current"], + ["Pi VCC Brief Transcript", "history"], + ["Pi VCC Recall Note", "recall"], + ]); + expect(rendered.text).toContain("\n\n---\n\n[user]\nBenchmark compaction"); + expect(rendered.text).toContain("\n\n---\n\nUse `vcc_recall`"); + }); + + it("renders empty state as empty text without a recall-only layer", () => { + const rendered = renderCompactionState(buildCompactionState(sectionData()), { includeRecallNote: true }); + expect(rendered.text).toBe(""); + expect(rendered.layers).toEqual([]); + }); +}); From 1474094d80a9c49350b03ce4ee5e906e652b3bb8 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:41:29 +0200 Subject: [PATCH 16/65] refactor: render merged summaries from state Parse merged summary text back into CompactionState and render the final text/layers through the structured renderer. This removes the remaining ad hoc final layer construction from summarize while preserving compile() output. The structured path now covers fresh extraction, merged state reconstruction, deterministic rendering, and compileWithLayers metadata, preparing the implementation for section-level patching without changing the public summary format. 
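The parse direction reduces to splitting on the part separator and peeling an optional trailing recall note. A minimal self-contained sketch (the recall-note text below is a stand-in; the real RECALL_NOTE constant is imported from src/core/format.ts and its literal text is not part of this patch):

```typescript
// Assumed placeholder for the project's RECALL_NOTE constant.
const RECALL_NOTE = "Use `vcc_recall` to fetch earlier context.";

interface ParsedParts {
  current: string;      // first part: bracketed-header current sections
  history: string;      // remaining parts, rejoined
  hasRecallNote: boolean;
}

// Split a rendered summary into current / history / recall, mirroring the
// "\n\n---\n\n" layout used by parseCompactionState in the diff below.
const parseParts = (summary: string): ParsedParts => {
  const parts = summary.split("\n\n---\n\n").map((p) => p.trim()).filter(Boolean);
  const hasRecallNote = parts[parts.length - 1] === RECALL_NOTE;
  const body = hasRecallNote ? parts.slice(0, -1) : parts;
  return {
    current: body[0] ?? "",
    history: body.slice(1).join("\n\n---\n\n"),
    hasRecallNote,
  };
};
```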
Validation: node --check src/core/compaction-state.ts src/core/summarize.ts tests/compaction-state.test.ts tests/compile.test.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --case-filter cache-bust --out /tmp/pi-vcc-state-parse-ref.b3CuCT. --- src/core/compaction-state.ts | 37 ++++++++++++++++++++++++++++++++++ src/core/summarize.ts | 29 ++------------------------ tests/compaction-state.test.ts | 15 +++++++++++++- 3 files changed, 53 insertions(+), 28 deletions(-) diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index ff46fcd..50a589c 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -90,6 +90,43 @@ export const renderCurrentSections = (state: CompactionState): CompiledSummaryLa text: entry.text, })); +const emptyCurrent = (): CompactionState["current"] => ({ + sessionGoal: [], + currentScope: [], + filesAndChanges: [], + commits: [], + evidenceHandles: [], + userPreferences: [], + outstandingContext: [], +}); + +const parseSectionItems = (text: string): string[] => + text.split("\n").slice(1).map((line) => line.replace(/^-\s*/, "").trim()).filter(Boolean); + +export const parseCompactionState = (summary: string): CompactionState => { + const parts = summary.split("\n\n---\n\n").map((part) => part.trim()).filter(Boolean); + const last = parts[parts.length - 1]; + const bodyParts = last === RECALL_NOTE ? parts.slice(0, -1) : parts; + const currentText = bodyParts[0] ?? 
""; + const historyText = bodyParts.slice(1).join("\n\n---\n\n"); + const current = emptyCurrent(); + + const headers = [...currentText.matchAll(/^\[(.+?)\]/gm)]; + for (const [index, header] of headers.entries()) { + const title = header[1] as CurrentSectionName; + if (!CURRENT_SECTION_ORDER.includes(title)) continue; + const start = header.index ?? 0; + const end = headers[index + 1]?.index ?? currentText.length; + current[stateKeyOf(title)] = parseSectionItems(currentText.slice(start, end).trim()); + } + + return { + current, + history: { briefTranscript: historyText }, + recall: { note: RECALL_NOTE }, + }; +}; + export const renderCompactionState = ( state: CompactionState, options: { includeRecallNote?: boolean } = {}, diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 8bc4565..c64bfdd 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -8,6 +8,7 @@ import { applyPreferenceCorrections } from "../extract/preferences"; import { buildCompactionState, CURRENT_SECTION_ORDER, + parseCompactionState, renderCompactionState, type CompiledLayerRole, type CompiledSummaryLayer, @@ -129,31 +130,6 @@ const mergeBriefTranscript = (prev: string, fresh: string): string => { return prev + "\n\n" + fresh; }; -const layersOfCurrentSections = (current: string): CompiledSummaryLayer[] => - HEADER_NAMES.map((header) => sectionOf(current, header)) - .filter(Boolean) - .map((text) => { - const header = text.match(/^\[(.+?)\]/)?.[1] ?? "Current Sections"; - return { name: `Pi VCC ${header}`, role: "current" as const, text }; - }); - -const layersOfCompiledSummary = (summary: string): CompiledSummaryLayer[] => { - const parts = summary.split(SEPARATOR).map((part) => part.trim()).filter(Boolean); - if (parts.length === 0) return []; - - const last = parts[parts.length - 1]; - const hasRecallNote = last === RECALL_NOTE; - const bodyParts = hasRecallNote ? parts.slice(0, -1) : parts; - const current = bodyParts[0] ?? 
""; - const history = bodyParts.slice(1).join(SEPARATOR); - const layers: CompiledSummaryLayer[] = []; - - if (current) layers.push(...layersOfCurrentSections(current)); - if (history) layers.push({ name: "Pi VCC Brief Transcript", role: "history", text: history }); - if (hasRecallNote) layers.push({ name: "Pi VCC Recall Note", role: "recall", text: RECALL_NOTE }); - return layers; -}; - const demoteFreshGoalToScope = (fresh: string): string => { const goal = sectionOf(fresh, "Session Goal"); if (!goal) return fresh; @@ -215,8 +191,7 @@ export const compileWithLayers = (input: CompileInput): CompileWithLayersResult : undefined; const merged = prev ? mergePrevious(prev, fresh) : fresh; if (!merged) return { text: "", layers: [] }; - const text = merged + SEPARATOR + RECALL_NOTE; - return { text, layers: layersOfCompiledSummary(text) }; + return renderCompactionState(parseCompactionState(merged), { includeRecallNote: true }); }; const stripRecallNote = (text: string): string => { diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index 0007b34..b9473b4 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -1,5 +1,5 @@ import { describe, expect, it } from "bun:test"; -import { buildCompactionState, renderCompactionState } from "../src/core/compaction-state"; +import { buildCompactionState, parseCompactionState, renderCompactionState } from "../src/core/compaction-state"; import type { SectionData } from "../src/sections"; const sectionData = (overrides: Partial = {}): SectionData => ({ @@ -56,4 +56,17 @@ describe("compaction state", () => { expect(rendered.text).toBe(""); expect(rendered.layers).toEqual([]); }); + + it("parses rendered summary back into structured state", () => { + const rendered = renderCompactionState(buildCompactionState(sectionData({ + sessionGoal: ["Benchmark compaction"], + currentScope: ["Expose production layers"], + userPreferences: ["Use Docker for benchmarks"], + briefTranscript: 
"[user]\nBenchmark compaction", + }))); + + const reparsed = renderCompactionState(parseCompactionState(rendered.text)); + expect(reparsed.text).toBe(rendered.text); + expect(reparsed.layers.map((layer) => layer.name)).toEqual(rendered.layers.map((layer) => layer.name)); + }); }); From 03fb1d27e28750ff4e0ee6b2754db53b9cdb5d3a Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:42:38 +0200 Subject: [PATCH 17/65] refactor: render stable sections before scope Move high-volatility Current Scope after the stable current sections in the structured compaction renderer. This preserves the public summary format while pushing ordinary scope churn later in the prompt prefix. Sampled real-session replay now first changes at Evidence Handles instead of Current Scope, with stablePrefixTokens 248 and 284 for cycles 2 and 3. Validation: node --check src/core/compaction-state.ts src/core/summarize.ts tests/compaction-state.test.ts tests/compile.test.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; real-session Docker replay with --show-layer-diff. 
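The stablePrefixTokens figures cited above can be approximated as a common-prefix count over whitespace tokens of consecutive rendered summaries; the benchmark's real tokenizer may differ, so this sketch is illustrative only:

```typescript
// Count how many leading tokens two rendered summaries share. Reordering
// volatile sections later in the text grows this shared prefix.
const stablePrefixTokens = (prev: string, next: string): number => {
  const a = prev.split(/\s+/).filter(Boolean);
  const b = next.split(/\s+/).filter(Boolean);
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
};
```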
--- src/core/compaction-state.ts | 2 +- tests/compaction-state.test.ts | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index 50a589c..b304d00 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -34,11 +34,11 @@ export interface CompactionState { export const CURRENT_SECTION_ORDER = [ "Session Goal", - "Current Scope", "Files And Changes", "Commits", "Evidence Handles", "User Preferences", + "Current Scope", "Outstanding Context", ] as const; diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index b9473b4..0bdd6c5 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -27,12 +27,12 @@ describe("compaction state", () => { const rendered = renderCompactionState(state); expect(rendered.layers.map((layer) => layer.name)).toEqual([ "Pi VCC Session Goal", - "Pi VCC Current Scope", "Pi VCC Files And Changes", "Pi VCC User Preferences", + "Pi VCC Current Scope", ]); - expect(rendered.text.indexOf("[Session Goal]")).toBeLessThan(rendered.text.indexOf("[Current Scope]")); - expect(rendered.text.indexOf("[Current Scope]")).toBeLessThan(rendered.text.indexOf("[Files And Changes]")); + expect(rendered.text.indexOf("[Session Goal]")).toBeLessThan(rendered.text.indexOf("[Files And Changes]")); + expect(rendered.text.indexOf("[User Preferences]")).toBeLessThan(rendered.text.indexOf("[Current Scope]")); }); it("keeps history and recall in separate trailing layers", () => { From 0e8e7bf49fb4045b11787af05f87db02534e1c55 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:48:32 +0200 Subject: [PATCH 18/65] test: add evidence growth cache probe Add a synthetic case where stable work state remains fixed while new evidence handles appear across compactions. The probe captures the current bottleneck: Evidence Handles is the first changed prompt layer while correctness terms remain preserved. 
A split evidence-layer experiment was tested and reverted because it regressed cache metrics on both the new probe and sampled real-session replay. Validation: node --check bench/compaction/synthetic-cases.ts; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --case-filter cache-bust-evidence-growth --show-layer-diff --jsonl; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache. --- bench/compaction/synthetic-cases.ts | 38 +++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index 37cd6c7..5273a2f 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -251,6 +251,44 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "cache-bust-evidence-growth", + description: "Stable work state remains unchanged while new evidence handles are discovered across compactions.", + messages: [ + user("Audit cache probes. Stable objective: preserve prefix cache while tracking evidence handles. 
Always keep benchmark validation in Docker."), + assistant("Stable checkpoint: preserve prefix cache; validation preference Docker; canonical file src/cache/probe.ts."), + toolCall("read", { path: "src/cache/probe.ts" }), + toolResult("read", "export const cacheProbe = 'cache_probe_alpha';\n// request_id=req_cache_alpha"), + assistant("Evidence handles so far: src/cache/probe.ts and cache_probe_alpha."), + toolCall("bash", { command: "grep -R cache_probe_beta /tmp/cache-evidence-beta.log" }), + toolResult("bash", "CACHE_LAYER_SHIFT request_id=req_cache_beta\ntrace_id=trace_cache_beta\n/tmp/cache-evidence-beta.log"), + assistant("Additional evidence handle: /tmp/cache-evidence-beta.log with req_cache_beta."), + toolCall("bash", { command: "grep -R cache_probe_gamma /tmp/cache-evidence-gamma.log" }), + toolResult("bash", "CACHE_LAYER_STABLE request_id=req_cache_gamma\ntrace_id=trace_cache_gamma\n/tmp/cache-evidence-gamma.log"), + assistant("Additional evidence handle: /tmp/cache-evidence-gamma.log with req_cache_gamma."), + ], + compactionPoints: [5, 8, 11], + gold: { + activeTerms: [ + { label: "stable objective", term: "preserve prefix cache" }, + { label: "canonical file", term: "src/cache/probe.ts" }, + { label: "validation preference", term: "Docker" }, + { label: "latest evidence", term: "req_cache_gamma" }, + ], + currentTerms: [ + { label: "stable objective", term: "preserve prefix cache" }, + { label: "canonical file", term: "src/cache/probe.ts" }, + { label: "validation preference", term: "Docker" }, + { label: "latest evidence", term: "req_cache_gamma" }, + ], + recallTerms: [ + { label: "earlier beta evidence", term: "CACHE_LAYER_SHIFT request_id=req_cache_beta", query: "CACHE_LAYER_SHIFT req_cache_beta" }, + ], + continuationTerms: [ + { label: "latest evidence", term: "req_cache_gamma" }, + ], + }, + }, { id: "cache-bust-volatile-next-step", description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across 
cycles.", From 3b0afe3e338d689f68e68da5770e018f87b7004d Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:50:05 +0200 Subject: [PATCH 19/65] fix: normalize evidence path handles Normalize path evidence before it enters the compacted state: strip punctuation variants and drop broad absolute directories while retaining specific files and tmp artifacts. This reduces noisy Evidence Handles churn without changing the current summary structure. The evidence layer split experiment was not kept because it regressed stable-prefix metrics. The focused evidence-growth probe remains as the RED signal for this bottleneck. Validation: node --check src/extract/evidence.ts tests/extract-evidence.test.ts; Docker Bun extract-evidence tests; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --case-filter cache-bust-evidence-growth --show-layer-diff --jsonl; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; sampled real-session replay; ref comparisons in /tmp/pi-vcc-evidence-noise-ref.G8zNvv and /tmp/pi-vcc-evidence-noise-real-ref.5GQES8. 
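The normalization reduces to two small helpers; this is a self-contained restatement of the logic added to src/extract/evidence.ts below, with `.at(-1)` swapped for `pop()` so it runs under older lib targets:

```typescript
// Strip trailing sentence punctuation from a captured path.
const normalizePathEvidence = (value: string): string =>
  value.trim().replace(/[.,;:]+$/, "");

// Keep /tmp/ artifacts and paths whose basename has a file extension;
// drop broad absolute directories like /home/fl/code/project.
const isSpecificPathEvidence = (value: string): boolean => {
  const normalized = normalizePathEvidence(value);
  if (/^\/tmp\//.test(normalized)) return true;
  const base = normalized.split("/").pop() ?? "";
  return /\.[A-Za-z0-9_-]+$/.test(base);
};
```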
--- src/extract/evidence.ts | 29 ++++++++++++++++++++++++----- tests/extract-evidence.test.ts | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+), 5 deletions(-) create mode 100644 tests/extract-evidence.test.ts diff --git a/src/extract/evidence.ts b/src/extract/evidence.ts index ad10b08..3253a97 100644 --- a/src/extract/evidence.ts +++ b/src/extract/evidence.ts @@ -13,9 +13,25 @@ const ERROR_SIGNATURE_RE = /\b(?:ERR_[A-Z0-9_]+|(?:CACHE|CRITICAL|FATAL|PANIC|ER const ID_RE = /\b(?:cache|probe|span|spn|req|request|trace|artifact|bench)[A-Za-z0-9_-]*_[A-Za-z0-9_-]+\b/g; const COMMIT_RE = /\bcommit(?:\s+|[=:])([0-9a-f]{7,40})\b/gi; -const addMatches = (set: Set, text: string, regex: RegExp, group = 0) => { +const normalizePathEvidence = (value: string): string => + value.trim().replace(/[.,;:]+$/, ""); + +const isSpecificPathEvidence = (value: string): boolean => { + const normalized = normalizePathEvidence(value); + if (/^\/tmp\//.test(normalized)) return true; + const base = normalized.split("/").at(-1) ?? ""; + return /\.[A-Za-z0-9_-]+$/.test(base); +}; + +const addMatches = ( + set: Set, + text: string, + regex: RegExp, + group = 0, + normalize: (value: string) => string | null = (value) => value.trim(), +) => { for (const match of text.matchAll(regex)) { - const value = (match[group] ?? match[0]).trim(); + const value = normalize(match[group] ?? match[0]); if (value) set.add(value); } }; @@ -26,8 +42,11 @@ const textFromBlock = (block: NormalizedBlock): string => { }; const addEvidenceFromText = (activity: EvidenceActivity, text: string) => { - addMatches(activity.paths, text, ABS_PATH_RE, 1); - addMatches(activity.paths, text, PROJECT_PATH_RE, 1); + addMatches(activity.paths, text, ABS_PATH_RE, 1, (value) => { + const normalized = normalizePathEvidence(value); + return isSpecificPathEvidence(normalized) ? 
normalized : null; + }); + addMatches(activity.paths, text, PROJECT_PATH_RE, 1, (value) => normalizePathEvidence(value)); addMatches(activity.errorSignatures, text, ERROR_SIGNATURE_RE); addMatches(activity.identifiers, text, ID_RE); addMatches(activity.identifiers, text, COMMIT_RE, 1); @@ -43,7 +62,7 @@ export const extractEvidence = (blocks: NormalizedBlock[]): EvidenceActivity => for (const block of blocks) { if (block.kind === "tool_call") { const path = extractPath(block.args); - if (path) activity.paths.add(path); + if (path) activity.paths.add(normalizePathEvidence(path)); for (const key of ["command", "cmd", "query", "path", "file", "file_path", "filePath"]) { const value = block.args[key]; if (typeof value === "string") addEvidenceFromText(activity, value); diff --git a/tests/extract-evidence.test.ts b/tests/extract-evidence.test.ts new file mode 100644 index 0000000..4b21a99 --- /dev/null +++ b/tests/extract-evidence.test.ts @@ -0,0 +1,34 @@ +import { describe, expect, it } from "bun:test"; +import { extractEvidence, formatEvidence } from "../src/extract/evidence"; +import type { NormalizedBlock } from "../src/types"; + +describe("extractEvidence", () => { + it("normalizes trailing punctuation on paths", () => { + const blocks: NormalizedBlock[] = [ + { kind: "assistant", text: "Read /home/fl/code/project/src/app.ts. 
Then compare src/app.ts," }, + ]; + const evidence = extractEvidence(blocks); + expect([...evidence.paths]).toContain("/home/fl/code/project/src/app.ts"); + expect([...evidence.paths]).toContain("src/app.ts"); + expect([...evidence.paths]).not.toContain("/home/fl/code/project/src/app.ts."); + expect([...evidence.paths]).not.toContain("src/app.ts,"); + }); + + it("drops broad absolute directories while keeping files and tmp artifacts", () => { + const blocks: NormalizedBlock[] = [ + { kind: "assistant", text: "/home/fl/code/project /home/fl/code/project/values.yaml /tmp/cache-evidence-beta.log /var/lib/grafana/dashboards" }, + ]; + const evidence = extractEvidence(blocks); + expect([...evidence.paths]).toContain("/home/fl/code/project/values.yaml"); + expect([...evidence.paths]).toContain("/tmp/cache-evidence-beta.log"); + expect([...evidence.paths]).not.toContain("/home/fl/code/project"); + expect([...evidence.paths]).not.toContain("/var/lib/grafana/dashboards"); + }); + + it("formats retained evidence handles", () => { + const blocks: NormalizedBlock[] = [ + { kind: "assistant", text: "CACHE_LAYER_SHIFT request_id=req_cache_beta /tmp/cache-evidence-beta.log" }, + ]; + expect(formatEvidence(extractEvidence(blocks)).join("\n")).toContain("req_cache_beta"); + }); +}); From ea27f19d3e4f419a93b2ae8042a4798b83086448 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:52:34 +0200 Subject: [PATCH 20/65] refactor: keep prior evidence handles stable Keep the existing Evidence Handles section stable when merging with a previous summary and render newly discovered handles in a later Recent Evidence Handles section. This preserves evidence recoverability while pushing evidence-only churn later in the prompt. Evidence-growth diagnostics now first change at Recent Evidence Handles instead of Evidence Handles, and sampled real-session replay first changes at User Preferences with stablePrefixTokens 328/338 for cycles 2/3. 
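The additive-merge rule is a set difference over bullet items: the previous section stays byte-identical and only genuinely new items spill into a trailing "Recent" section. A sketch with a generalized header name (the real helper below is freshRecentEvidenceSection in src/core/summarize.ts):

```typescript
// Extract the "- " bullet lines of a rendered section (header line excluded).
const itemsOf = (section: string): string[] =>
  section.split("\n").filter((line) => line.startsWith("- "));

// Emit only fresh items absent from the previous section, under a new header;
// empty string when either side is missing or nothing new appeared.
const recentOnlySection = (header: string, prev: string, fresh: string): string => {
  if (!prev || !fresh) return "";
  const seen = new Set(itemsOf(prev));
  const freshOnly = itemsOf(fresh).filter((line) => !seen.has(line));
  return freshOnly.length > 0 ? `[${header}]\n${freshOnly.join("\n")}` : "";
};
```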
Validation: node --check src/core/compaction-state.ts src/core/summarize.ts tests/compaction-state.test.ts tests/compile.test.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --case-filter cache-bust-evidence-growth --show-layer-diff --jsonl; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; sampled real-session replay with --show-layer-diff. --- src/core/compaction-state.ts | 5 +++++ src/core/summarize.ts | 15 ++++++++++++++- tests/compaction-state.test.ts | 16 ++++++++++++++++ tests/compile.test.ts | 23 +++++++++++++++++++++++ 4 files changed, 58 insertions(+), 1 deletion(-) diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index b304d00..4a26c2f 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -21,6 +21,7 @@ export interface CompactionState { filesAndChanges: string[]; commits: string[]; evidenceHandles: string[]; + recentEvidenceHandles: string[]; userPreferences: string[]; outstandingContext: string[]; }; @@ -39,6 +40,7 @@ export const CURRENT_SECTION_ORDER = [ "Evidence Handles", "User Preferences", "Current Scope", + "Recent Evidence Handles", "Outstanding Context", ] as const; @@ -51,6 +53,7 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current case "Files And Changes": return "filesAndChanges"; case "Commits": return "commits"; case "Evidence Handles": return "evidenceHandles"; + case "Recent Evidence Handles": return "recentEvidenceHandles"; case "User Preferences": return "userPreferences"; case "Outstanding Context": return "outstandingContext"; } @@ -69,6 +72,7 @@ export const buildCompactionState = (data: SectionData): CompactionState => ({ filesAndChanges: data.filesAndChanges, commits: data.commits, evidenceHandles: data.evidenceHandles, + recentEvidenceHandles: [], userPreferences: 
data.userPreferences, outstandingContext: data.outstandingContext, }, @@ -96,6 +100,7 @@ const emptyCurrent = (): CompactionState["current"] => ({ filesAndChanges: [], commits: [], evidenceHandles: [], + recentEvidenceHandles: [], userPreferences: [], outstandingContext: [], }); diff --git a/src/core/summarize.ts b/src/core/summarize.ts index c64bfdd..1603a04 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -23,7 +23,7 @@ export interface CompileInput { export type { CompiledLayerRole, CompiledSummaryLayer, CompileWithLayersResult } from "./compaction-state"; -const HEADER_NAMES = [...CURRENT_SECTION_ORDER]; +const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", ...CURRENT_SECTION_ORDER]; const SEPARATOR = "\n\n---\n\n"; @@ -57,6 +57,7 @@ const briefOf = (text: string): string => { /** Merge a header section */ const mergeHeaderSection = (header: string, prev: string, fresh: string): string => { + if (header === "Evidence Handles") return prev || fresh; // Current Scope is the latest explicit scope change; keep previous when the // fresh window only has status/transcript updates. if (header === "Current Scope") return fresh || prev; @@ -124,6 +125,16 @@ const mergeFileLines = (prev: string, fresh: string): string => { return `[Files And Changes]\n${lines.join("\n")}`; }; +const evidenceItemsOf = (section: string): string[] => + section.split("\n").filter((line) => line.startsWith("- ")); + +const freshRecentEvidenceSection = (prevEvidence: string, freshEvidence: string): string => { + if (!prevEvidence || !freshEvidence) return ""; + const previous = new Set(evidenceItemsOf(prevEvidence)); + const freshOnly = evidenceItemsOf(freshEvidence).filter((line) => !previous.has(line)); + return freshOnly.length > 0 ? 
`[Recent Evidence Handles]\n${freshOnly.join("\n")}` : ""; +}; + const mergeBriefTranscript = (prev: string, fresh: string): string => { if (!prev) return fresh; if (!fresh) return prev; @@ -154,8 +165,10 @@ const demoteFreshGoalToScope = (fresh: string): string => { const mergePrevious = (prev: string, fresh: string): string => { const mergeFresh = demoteFreshGoalToScope(fresh); // Merge header sections + const recentEvidence = freshRecentEvidenceSection(sectionOf(prev, "Evidence Handles"), sectionOf(mergeFresh, "Evidence Handles")); const headers = HEADER_NAMES .map((header) => { + if (header === "Recent Evidence Handles") return recentEvidence; const freshSec = sectionOf(mergeFresh, header); const prevSec = sectionOf(prev, header); return mergeHeaderSection(header, prevSec, freshSec); diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index 0bdd6c5..1e7b29b 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -57,6 +57,22 @@ describe("compaction state", () => { expect(rendered.layers).toEqual([]); }); + it("renders recent evidence after current scope", () => { + const state = buildCompactionState(sectionData({ + sessionGoal: ["Benchmark compaction"], + evidenceHandles: ["Paths: src/cache/probe.ts"], + currentScope: ["Keep going"], + })); + state.current.recentEvidenceHandles = ["Identifiers: req_cache_beta"]; + const rendered = renderCompactionState(state); + expect(rendered.layers.map((layer) => layer.name)).toEqual([ + "Pi VCC Session Goal", + "Pi VCC Evidence Handles", + "Pi VCC Current Scope", + "Pi VCC Recent Evidence Handles", + ]); + }); + it("parses rendered summary back into structured state", () => { const rendered = renderCompactionState(buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"], diff --git a/tests/compile.test.ts b/tests/compile.test.ts index 6734640..aee7391 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -138,4 +138,27 @@ describe("compile", () 
=> { const current = r.split("\n\n---\n\n")[0]; expect(current).toContain("[Current Scope]\n- Add meta monitoring"); }); + + it("preserves evidence handles when merging", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[Evidence Handles]\n- Paths: src/cache/probe.ts\n- Identifiers: req_cache_beta\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [userMsg("Status update: continue validation")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[Evidence Handles]\n- Paths: src/cache/probe.ts\n- Identifiers: req_cache_beta"); + }); + + it("places newly discovered evidence in a later recent section", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[Evidence Handles]\n- Paths: src/cache/probe.ts\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [toolResult("bash", "CACHE_LAYER_SHIFT request_id=req_cache_beta /tmp/cache-evidence-beta.log")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[Evidence Handles]\n- Paths: src/cache/probe.ts"); + expect(current).toContain("[Recent Evidence Handles]"); + expect(current).toContain("req_cache_beta"); + expect(current.indexOf("[Evidence Handles]")).toBeLessThan(current.indexOf("[Recent Evidence Handles]")); + }); }); From fc26ceab37aebdca94ff8477c559c8e358c03192 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 20:54:26 +0200 Subject: [PATCH 21/65] refactor: keep prior preferences stable Keep stable User Preferences byte-identical when a later compaction only discovers additive preferences, and place those new preferences in a later Recent User Preferences section. Corrections still update the stable preference section so stale preferences are removed. Sampled real-session replay now first changes at Current Scope instead of User Preferences, with stablePrefixTokens 339/339 for cycles 2/3. 
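The correction gate reduces to one regex check; a sketch with the non-gated fallback simplified to latest-wins (the real mergeHeaderSection in src/core/summarize.ts handles more cases):

```typescript
// Wording like "correction" or "never" signals a preference retraction.
const CORRECTION_RE = /\b(correction|never)\b/i;

// Additive discoveries keep the stable section byte-identical; a correction
// lets the fresh section replace it so stale preferences drop out.
const mergeUserPreferences = (prev: string, fresh: string): string => {
  if (prev && fresh && !CORRECTION_RE.test(fresh)) return prev;
  return fresh || prev;
};
```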
Validation: node --check src/core/compaction-state.ts src/core/summarize.ts tests/compaction-state.test.ts tests/compile.test.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; sampled real-session replay with --show-layer-diff. --- src/core/compaction-state.ts | 5 +++++ src/core/summarize.ts | 16 ++++++++++++++-- tests/compaction-state.test.ts | 4 +++- tests/compile.test.ts | 24 ++++++++++++++++++++++++ 4 files changed, 46 insertions(+), 3 deletions(-) diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index 4a26c2f..e2760d1 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -23,6 +23,7 @@ export interface CompactionState { evidenceHandles: string[]; recentEvidenceHandles: string[]; userPreferences: string[]; + recentUserPreferences: string[]; outstandingContext: string[]; }; history: { @@ -40,6 +41,7 @@ export const CURRENT_SECTION_ORDER = [ "Evidence Handles", "User Preferences", "Current Scope", + "Recent User Preferences", "Recent Evidence Handles", "Outstanding Context", ] as const; @@ -55,6 +57,7 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current case "Evidence Handles": return "evidenceHandles"; case "Recent Evidence Handles": return "recentEvidenceHandles"; case "User Preferences": return "userPreferences"; + case "Recent User Preferences": return "recentUserPreferences"; case "Outstanding Context": return "outstandingContext"; } }; @@ -74,6 +77,7 @@ export const buildCompactionState = (data: SectionData): CompactionState => ({ evidenceHandles: data.evidenceHandles, recentEvidenceHandles: [], userPreferences: data.userPreferences, + recentUserPreferences: [], outstandingContext: data.outstandingContext, }, history: { @@ -102,6 +106,7 @@ const emptyCurrent = (): CompactionState["current"] => ({ 
evidenceHandles: [], recentEvidenceHandles: [], userPreferences: [], + recentUserPreferences: [], outstandingContext: [], }); diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 1603a04..798f140 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -23,7 +23,7 @@ export interface CompileInput { export type { CompiledLayerRole, CompiledSummaryLayer, CompileWithLayersResult } from "./compaction-state"; -const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", ...CURRENT_SECTION_ORDER]; +const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", "Recent User Preferences", ...CURRENT_SECTION_ORDER]; const SEPARATOR = "\n\n---\n\n"; @@ -58,6 +58,7 @@ const briefOf = (text: string): string => { /** Merge a header section */ const mergeHeaderSection = (header: string, prev: string, fresh: string): string => { if (header === "Evidence Handles") return prev || fresh; + if (header === "User Preferences" && prev && fresh && !/\b(correction|never)\b/i.test(fresh)) return prev; // Current Scope is the latest explicit scope change; keep previous when the // fresh window only has status/transcript updates. if (header === "Current Scope") return fresh || prev; @@ -125,9 +126,11 @@ const mergeFileLines = (prev: string, fresh: string): string => { return `[Files And Changes]\n${lines.join("\n")}`; }; -const evidenceItemsOf = (section: string): string[] => +const cleanListItemsOf = (section: string): string[] => section.split("\n").filter((line) => line.startsWith("- ")); +const evidenceItemsOf = cleanListItemsOf; + const freshRecentEvidenceSection = (prevEvidence: string, freshEvidence: string): string => { if (!prevEvidence || !freshEvidence) return ""; const previous = new Set(evidenceItemsOf(prevEvidence)); @@ -135,6 +138,13 @@ const freshRecentEvidenceSection = (prevEvidence: string, freshEvidence: string) return freshOnly.length > 0 ? 
`[Recent Evidence Handles]\n${freshOnly.join("\n")}` : ""; }; +const freshRecentUserPreferencesSection = (prevPreferences: string, freshPreferences: string): string => { + if (!prevPreferences || !freshPreferences || /\b(correction|never)\b/i.test(freshPreferences)) return ""; + const previous = new Set(cleanListItemsOf(prevPreferences)); + const freshOnly = cleanListItemsOf(freshPreferences).filter((line) => !previous.has(line)); + return freshOnly.length > 0 ? `[Recent User Preferences]\n${freshOnly.join("\n")}` : ""; +}; + const mergeBriefTranscript = (prev: string, fresh: string): string => { if (!prev) return fresh; if (!fresh) return prev; @@ -166,9 +176,11 @@ const mergePrevious = (prev: string, fresh: string): string => { const mergeFresh = demoteFreshGoalToScope(fresh); // Merge header sections const recentEvidence = freshRecentEvidenceSection(sectionOf(prev, "Evidence Handles"), sectionOf(mergeFresh, "Evidence Handles")); + const recentUserPreferences = freshRecentUserPreferencesSection(sectionOf(prev, "User Preferences"), sectionOf(mergeFresh, "User Preferences")); const headers = HEADER_NAMES .map((header) => { if (header === "Recent Evidence Handles") return recentEvidence; + if (header === "Recent User Preferences") return recentUserPreferences; const freshSec = sectionOf(mergeFresh, header); const prevSec = sectionOf(prev, header); return mergeHeaderSection(header, prevSec, freshSec); diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index 1e7b29b..5403277 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -57,18 +57,20 @@ describe("compaction state", () => { expect(rendered.layers).toEqual([]); }); - it("renders recent evidence after current scope", () => { + it("renders recent preference and evidence sections after current scope", () => { const state = buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"], evidenceHandles: ["Paths: src/cache/probe.ts"], currentScope: 
["Keep going"], })); + state.current.recentUserPreferences = ["Prefer query read only mode"]; state.current.recentEvidenceHandles = ["Identifiers: req_cache_beta"]; const rendered = renderCompactionState(state); expect(rendered.layers.map((layer) => layer.name)).toEqual([ "Pi VCC Session Goal", "Pi VCC Evidence Handles", "Pi VCC Current Scope", + "Pi VCC Recent User Preferences", "Pi VCC Recent Evidence Handles", ]); }); diff --git a/tests/compile.test.ts b/tests/compile.test.ts index aee7391..726aeef 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -149,6 +149,30 @@ describe("compile", () => { expect(current).toContain("[Evidence Handles]\n- Paths: src/cache/probe.ts\n- Identifiers: req_cache_beta"); }); + it("places newly discovered preferences in a later recent section", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[User Preferences]\n- Always use Docker for benchmarks\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [userMsg("I would prefer query read only mode")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[User Preferences]\n- Always use Docker for benchmarks"); + expect(current).toContain("[Recent User Preferences]\n- I would prefer query read only mode"); + expect(current.indexOf("[User Preferences]")).toBeLessThan(current.indexOf("[Recent User Preferences]")); + }); + + it("applies preference corrections to the stable preference section", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[User Preferences]\n- prefer yarn test\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [userMsg("Correction: never use yarn here. 
Use npm test.")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("never use yarn"); + expect(current).not.toContain("prefer yarn test"); + expect(current).not.toContain("[Recent User Preferences]"); + }); + it("places newly discovered evidence in a later recent section", () => { const previousSummary = "[Session Goal]\n- Existing goal\n\n[Evidence Handles]\n- Paths: src/cache/probe.ts\n\n---\n\n[user]\nExisting goal"; const r = compile({ From e2016cf357d55086484b0f8704b677f5f11e5e62 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 21:00:27 +0200 Subject: [PATCH 22/65] refactor: keep prior scope stable Add a scope-growth cache probe and preserve established Current Scope when later compactions discover additive scope updates. New additive scope lines are rendered in Recent Scope Updates so durable scope remains recoverable without rewriting the earlier scope section. The new probe now first changes at Recent Scope Updates with no missing current terms. Sampled real-session replay first changed at Recent Scope Updates with stablePrefixTokens 369/379. Validation: node --check src/core/compaction-state.ts src/core/summarize.ts tests/compaction-state.test.ts tests/compile.test.ts bench/compaction/synthetic-cases.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --case-filter cache-bust-scope-growth --show-layer-diff --jsonl; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; sampled real-session replay with --show-layer-diff. 
--- bench/compaction/synthetic-cases.ts | 35 +++++++++++++++++++++++++++++ src/core/compaction-state.ts | 5 +++++ src/core/summarize.ts | 16 +++++++++---- tests/compaction-state.test.ts | 2 ++ tests/compile.test.ts | 12 ++++++++++ 5 files changed, 66 insertions(+), 4 deletions(-) diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index 5273a2f..5ad8e98 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -251,6 +251,41 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "cache-bust-scope-growth", + description: "Stable objective and evidence remain fixed while additive scope updates change across compactions.", + messages: [ + user("Build cache-aware compaction. Stable objective: preserve cacheable prefix while keeping continuation state recoverable."), + assistant("Stable checkpoint: preserve cacheable prefix; canonical file src/core/compaction-state.ts; validation in Docker."), + user("Also add dashboard provisioning checks to the current scope."), + assistant("I will include dashboard provisioning checks in the current scope without changing the stable objective."), + user("Also add Grafana datasource validation to the current scope."), + assistant("I will include Grafana datasource validation as the latest scope update."), + user("Also add provider cache accounting notes to the current scope."), + assistant("I will include provider cache accounting notes while preserving the stable objective."), + ], + compactionPoints: [4, 6, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "preserve cacheable prefix" }, + { label: "canonical file", term: "src/core/compaction-state.ts" }, + { label: "first scope", term: "dashboard provisioning checks" }, + { label: "latest scope", term: "provider cache accounting notes" }, + ], + currentTerms: [ + { label: "stable objective", term: "preserve cacheable prefix" }, + { label: "canonical file", 
term: "src/core/compaction-state.ts" }, + { label: "first scope", term: "dashboard provisioning checks" }, + { label: "latest scope", term: "provider cache accounting notes" }, + ], + recallTerms: [ + { label: "middle scope", term: "Grafana datasource validation", query: "Grafana datasource validation" }, + ], + continuationTerms: [ + { label: "latest scope", term: "provider cache accounting notes" }, + ], + }, + }, { id: "cache-bust-evidence-growth", description: "Stable work state remains unchanged while new evidence handles are discovered across compactions.", diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index e2760d1..4d808c6 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -18,6 +18,7 @@ export interface CompactionState { current: { sessionGoal: string[]; currentScope: string[]; + recentScopeUpdates: string[]; filesAndChanges: string[]; commits: string[]; evidenceHandles: string[]; @@ -41,6 +42,7 @@ export const CURRENT_SECTION_ORDER = [ "Evidence Handles", "User Preferences", "Current Scope", + "Recent Scope Updates", "Recent User Preferences", "Recent Evidence Handles", "Outstanding Context", @@ -52,6 +54,7 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current switch (section) { case "Session Goal": return "sessionGoal"; case "Current Scope": return "currentScope"; + case "Recent Scope Updates": return "recentScopeUpdates"; case "Files And Changes": return "filesAndChanges"; case "Commits": return "commits"; case "Evidence Handles": return "evidenceHandles"; @@ -72,6 +75,7 @@ export const buildCompactionState = (data: SectionData): CompactionState => ({ current: { sessionGoal: data.sessionGoal, currentScope: data.currentScope, + recentScopeUpdates: [], filesAndChanges: data.filesAndChanges, commits: data.commits, evidenceHandles: data.evidenceHandles, @@ -101,6 +105,7 @@ export const renderCurrentSections = (state: CompactionState): CompiledSummaryLa const emptyCurrent = 
(): CompactionState["current"] => ({ sessionGoal: [], currentScope: [], + recentScopeUpdates: [], filesAndChanges: [], commits: [], evidenceHandles: [], diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 798f140..7df9363 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -23,7 +23,7 @@ export interface CompileInput { export type { CompiledLayerRole, CompiledSummaryLayer, CompileWithLayersResult } from "./compaction-state"; -const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", "Recent User Preferences", ...CURRENT_SECTION_ORDER]; +const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", "Recent User Preferences", "Recent Scope Updates", ...CURRENT_SECTION_ORDER]; const SEPARATOR = "\n\n---\n\n"; @@ -59,9 +59,8 @@ const briefOf = (text: string): string => { const mergeHeaderSection = (header: string, prev: string, fresh: string): string => { if (header === "Evidence Handles") return prev || fresh; if (header === "User Preferences" && prev && fresh && !/\b(correction|never)\b/i.test(fresh)) return prev; - // Current Scope is the latest explicit scope change; keep previous when the - // fresh window only has status/transcript updates. - if (header === "Current Scope") return fresh || prev; + // Keep established scope stable; additive fresh scope is rendered later. + if (header === "Current Scope") return prev || fresh; // Outstanding Context is volatile -- always use fresh only. if (header === "Outstanding Context") return fresh; if (!prev) return fresh; @@ -138,6 +137,13 @@ const freshRecentEvidenceSection = (prevEvidence: string, freshEvidence: string) return freshOnly.length > 0 ? 
`[Recent Evidence Handles]\n${freshOnly.join("\n")}` : ""; }; +const freshRecentScopeSection = (prevScope: string, freshScope: string): string => { + if (!prevScope || !freshScope) return ""; + const previous = new Set(cleanListItemsOf(prevScope)); + const freshOnly = cleanListItemsOf(freshScope).filter((line) => !previous.has(line)); + return freshOnly.length > 0 ? `[Recent Scope Updates]\n${freshOnly.join("\n")}` : ""; +}; + const freshRecentUserPreferencesSection = (prevPreferences: string, freshPreferences: string): string => { if (!prevPreferences || !freshPreferences || /\b(correction|never)\b/i.test(freshPreferences)) return ""; const previous = new Set(cleanListItemsOf(prevPreferences)); @@ -177,10 +183,12 @@ const mergePrevious = (prev: string, fresh: string): string => { // Merge header sections const recentEvidence = freshRecentEvidenceSection(sectionOf(prev, "Evidence Handles"), sectionOf(mergeFresh, "Evidence Handles")); const recentUserPreferences = freshRecentUserPreferencesSection(sectionOf(prev, "User Preferences"), sectionOf(mergeFresh, "User Preferences")); + const recentScope = freshRecentScopeSection(sectionOf(prev, "Current Scope"), sectionOf(mergeFresh, "Current Scope")); const headers = HEADER_NAMES .map((header) => { if (header === "Recent Evidence Handles") return recentEvidence; if (header === "Recent User Preferences") return recentUserPreferences; + if (header === "Recent Scope Updates") return recentScope; const freshSec = sectionOf(mergeFresh, header); const prevSec = sectionOf(prev, header); return mergeHeaderSection(header, prevSec, freshSec); diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index 5403277..dd18300 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -63,6 +63,7 @@ describe("compaction state", () => { evidenceHandles: ["Paths: src/cache/probe.ts"], currentScope: ["Keep going"], })); + state.current.recentScopeUpdates = ["Validate dashboards"]; 
state.current.recentUserPreferences = ["Prefer query read only mode"]; state.current.recentEvidenceHandles = ["Identifiers: req_cache_beta"]; const rendered = renderCompactionState(state); @@ -70,6 +71,7 @@ describe("compaction state", () => { "Pi VCC Session Goal", "Pi VCC Evidence Handles", "Pi VCC Current Scope", + "Pi VCC Recent Scope Updates", "Pi VCC Recent User Preferences", "Pi VCC Recent Evidence Handles", ]); diff --git a/tests/compile.test.ts b/tests/compile.test.ts index 726aeef..3efe384 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -149,6 +149,18 @@ describe("compile", () => { expect(current).toContain("[Evidence Handles]\n- Paths: src/cache/probe.ts\n- Identifiers: req_cache_beta"); }); + it("places newly discovered scope in a later recent section", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[Current Scope]\n- Add dashboard provisioning checks\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [userMsg("Also add provider cache accounting notes to the current scope")], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[Current Scope]\n- Add dashboard provisioning checks"); + expect(current).toContain("[Recent Scope Updates]\n- Also add provider cache accounting notes to the current scope"); + expect(current.indexOf("[Current Scope]")).toBeLessThan(current.indexOf("[Recent Scope Updates]")); + }); + it("places newly discovered preferences in a later recent section", () => { const previousSummary = "[Session Goal]\n- Existing goal\n\n[User Preferences]\n- Always use Docker for benchmarks\n\n---\n\n[user]\nExisting goal"; const r = compile({ From 7c26f1df793219378c20bf36e115b5568416c60b Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 21:05:09 +0200 Subject: [PATCH 23/65] test: enforce cache boundary probes Extend cache assertions from a single early-layer heuristic to explicit per-case boundaries. 
Scope, evidence, and volatile-next-step probes now require their first changed prompt layer to land at the intended recent or volatile section with a minimum stable-prefix token floor. Update the ref comparison summary to use the same cache-boundary failure logic and document the expected boundaries in the benchmark README. Validation: node --check bench/compaction/offline-runner.ts scripts/compare-compaction-refs.mjs; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; docker run --rm pi-vcc-bench --compactors pi-vcc --case-filter cache-bust --show-layer-diff --jsonl; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --out /tmp/pi-vcc-cache-gates-ref.g1aMbt. --- bench/compaction/README.md | 8 +++++- bench/compaction/offline-runner.ts | 44 +++++++++++++++++++++-------- scripts/compare-compaction-refs.mjs | 40 ++++++++++++++++++++------ 3 files changed, 71 insertions(+), 21 deletions(-) diff --git a/bench/compaction/README.md b/bench/compaction/README.md index b062e70..a3f557e 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -141,12 +141,18 @@ Run assertion mode. This exits non-zero if any selected compactor misses active/ bun scripts/bench-compaction.ts --compactors pi-vcc --assert ``` -Run cache assertion mode for synthetic cache-stability probes. This is separate from correctness assertions and currently checks that volatile-only updates do not rewrite early stable prompt layers: +Run cache assertion mode for synthetic cache-stability probes. This is separate from correctness assertions and checks that each cache probe first changes only at its intended recent/volatile boundary, with a minimum stable-prefix token floor: ```bash bun scripts/bench-compaction.ts --compactors pi-vcc --assert-cache ``` +The current cache-boundary probes are: + +- `cache-bust-volatile-next-step`: first change should be `Pi VCC Outstanding Context` or later. 
+- `cache-bust-evidence-growth`: first change should be `Pi VCC Recent Evidence Handles` or later. +- `cache-bust-scope-growth`: first change should be `Pi VCC Recent Scope Updates` or later. + Append sampled real Pi sessions from a local session directory. Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: ```bash diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index 4080ea9..7fdf514 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -695,21 +695,43 @@ export const failedGatesOf = (cycle: CycleMetrics): string[] => { return failures; }; -const CACHE_STABILITY_CASES = new Set(["cache-bust-volatile-next-step"]); -const EARLY_VOLATILE_LAYERS = new Set([ - "Pi VCC Session Goal", - "Pi VCC Files And Changes", - "Pi VCC Evidence Handles", - "Pi VCC User Preferences", -]); +const CACHE_BOUNDARIES: Record = { + "cache-bust-volatile-next-step": { + allowedFirstChangedLayers: [ + "Pi VCC Outstanding Context", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 90, + }, + "cache-bust-evidence-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Evidence Handles", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 110, + }, + "cache-bust-scope-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Scope Updates", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 110, + }, +}; export const failedCacheGatesOf = (cycle: CycleMetrics): string[] => { - if (!CACHE_STABILITY_CASES.has(cycle.caseId) || cycle.cycle <= 1) return []; + const boundary = CACHE_BOUNDARIES[cycle.caseId]; + if (!boundary || cycle.cycle <= 1) return []; const failures: string[] = []; - if (cycle.firstChangedPromptLayer && EARLY_VOLATILE_LAYERS.has(cycle.firstChangedPromptLayer)) { - failures.push("early-prompt-layer-changed"); + if (!cycle.firstChangedPromptLayer) { + 
failures.push("missing-first-changed-layer"); + } else if (!boundary.allowedFirstChangedLayers.includes(cycle.firstChangedPromptLayer)) { + failures.push("unexpected-first-changed-layer"); } - if ((cycle.stablePrefixTokens ?? 0) < 90) failures.push("stable-prefix-too-small"); + if ((cycle.stablePrefixTokens ?? 0) < boundary.minStablePrefixTokens) failures.push("stable-prefix-too-small"); return failures; }; diff --git a/scripts/compare-compaction-refs.mjs b/scripts/compare-compaction-refs.mjs index 54fe1db..43590c5 100755 --- a/scripts/compare-compaction-refs.mjs +++ b/scripts/compare-compaction-refs.mjs @@ -80,17 +80,39 @@ const correctnessFailures = (cycle) => [ ...(cycle.leakedActiveAbsentTerms ?? []), ].length; +const cacheBoundaries = { + "cache-bust-volatile-next-step": { + allowedFirstChangedLayers: [ + "Pi VCC Outstanding Context", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 90, + }, + "cache-bust-evidence-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Evidence Handles", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 110, + }, + "cache-bust-scope-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Scope Updates", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 110, + }, +}; + const cacheFailures = (cycle) => { - if (cycle.caseId !== "cache-bust-volatile-next-step" || cycle.cycle <= 1) return 0; - const early = new Set([ - "Pi VCC Session Goal", - "Pi VCC Files And Changes", - "Pi VCC Evidence Handles", - "Pi VCC User Preferences", - ]); + const boundary = cacheBoundaries[cycle.caseId]; + if (!boundary || cycle.cycle <= 1) return 0; let count = 0; - if (cycle.firstChangedPromptLayer && early.has(cycle.firstChangedPromptLayer)) count += 1; - if ((cycle.stablePrefixTokens ?? 
0) < 90) count += 1; + if (!cycle.firstChangedPromptLayer || !boundary.allowedFirstChangedLayers.includes(cycle.firstChangedPromptLayer)) count += 1; + if ((cycle.stablePrefixTokens ?? 0) < boundary.minStablePrefixTokens) count += 1; return count; }; From 03df01e05774c66e9a195ec75bd7c3998a863aa0 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Mon, 27 Apr 2026 21:10:21 +0200 Subject: [PATCH 24/65] test: cap mutable recent sections Add a mutable-tail growth probe and cap rendered recent scope, preference, and evidence sections to the latest items. Cache assertions now enforce the mutable-tail boundary plus maximum recent layer sizes. This keeps stable sections byte-stable while preventing the recent mutable area from growing without bound; older recent details remain recoverable through transcript/recall. Validation: node --check src/core/compaction-state.ts bench/compaction/offline-runner.ts bench/compaction/synthetic-cases.ts scripts/compare-compaction-refs.mjs tests/compaction-state.test.ts; Docker Bun tests for compaction-state and compile; git diff --check; docker build -t pi-vcc-bench .; docker run --rm pi-vcc-bench --compactors pi-vcc --assert; docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache; docker run --rm pi-vcc-bench --compactors pi-vcc --case-filter cache-bust-mutable-tail-growth --show-layer-diff --jsonl; ref comparisons in /tmp/pi-vcc-tail-caps-ref.coNiHu and /tmp/pi-vcc-tail-caps-real.PPkbEn. 
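A minimal sketch of the cap behavior this commit adds (limits copied from the RECENT_SECTION_ITEM_LIMITS values in the diff below; the string-keyed map here is a simplification of the CurrentSectionName-typed one in src/core/compaction-state.ts):

```typescript
// Simplified sketch of the recent-section caps; the real map is typed
// against CurrentSectionName in src/core/compaction-state.ts.
const RECENT_SECTION_ITEM_LIMITS: Record<string, number> = {
  "Recent Scope Updates": 6,
  "Recent User Preferences": 6,
  "Recent Evidence Handles": 8,
};

// Capped sections keep only the newest items; uncapped sections pass through.
// Older recent details drop out of the prompt but remain recoverable through
// transcript/recall.
const cappedItems = (title: string, items: string[]): string[] => {
  const limit = RECENT_SECTION_ITEM_LIMITS[title];
  return limit !== undefined && items.length > limit ? items.slice(-limit) : items;
};
```

Capping from the tail (`slice(-limit)`) is what keeps the mutable recent area bounded without touching the byte-stable sections that precede it.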
--- bench/compaction/README.md | 1 + bench/compaction/offline-runner.ts | 27 ++++++++++++- bench/compaction/synthetic-cases.ts | 62 +++++++++++++++++++++++++++++ scripts/compare-compaction-refs.mjs | 19 +++++++++ src/core/compaction-state.ts | 18 +++++++-- tests/compaction-state.test.ts | 15 +++++++ 6 files changed, 138 insertions(+), 4 deletions(-) diff --git a/bench/compaction/README.md b/bench/compaction/README.md index a3f557e..1878578 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -152,6 +152,7 @@ The current cache-boundary probes are: - `cache-bust-volatile-next-step`: first change should be `Pi VCC Outstanding Context` or later. - `cache-bust-evidence-growth`: first change should be `Pi VCC Recent Evidence Handles` or later. - `cache-bust-scope-growth`: first change should be `Pi VCC Recent Scope Updates` or later. +- `cache-bust-mutable-tail-growth`: first change should be in a recent/volatile layer and recent layer sizes must stay under their caps. Append sampled real Pi sessions from a local session directory. 
Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index 7fdf514..982adde 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -695,7 +695,13 @@ export const failedGatesOf = (cycle: CycleMetrics): string[] => { return failures; }; -const CACHE_BOUNDARIES: Record = { +interface CacheBoundary { + allowedFirstChangedLayers: string[]; + minStablePrefixTokens: number; + maxPromptLayerSizes?: Record<string, number>; +} + +const CACHE_BOUNDARIES: Record<string, CacheBoundary> = { "cache-bust-volatile-next-step": { allowedFirstChangedLayers: [ "Pi VCC Outstanding Context", @@ -720,6 +726,22 @@ const CACHE_BOUNDARIES: Record = { ], minStablePrefixTokens: 110, }, + "cache-bust-mutable-tail-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Scope Updates", + "Pi VCC Recent User Preferences", + "Pi VCC Recent Evidence Handles", + "Pi VCC Outstanding Context", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 140, + maxPromptLayerSizes: { + "Pi VCC Recent Scope Updates": 420, + "Pi VCC Recent User Preferences": 360, + "Pi VCC Recent Evidence Handles": 260, + }, + }, }; @@ -732,6 +754,9 @@ export const failedCacheGatesOf = (cycle: CycleMetrics): string[] => { failures.push("unexpected-first-changed-layer"); } if ((cycle.stablePrefixTokens ?? 0) < boundary.minStablePrefixTokens) failures.push("stable-prefix-too-small"); + for (const [layer, maxSize] of Object.entries(boundary.maxPromptLayerSizes ?? {})) { + if ((cycle.promptLayerSizes[layer] ?? 0) > maxSize) failures.push(`recent-layer-too-large:${layer}`); + } return failures; }; diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index 5ad8e98..e9959dd 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -324,6 +324,68 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "cache-bust-mutable-tail-growth", + description: "Recent scope, preference, and evidence updates should stay bounded while latest items remain recoverable.", + messages: [ + user("Maintain cache-aware compaction.
Stable objective: keep stable sections byte-stable while bounding recent mutable state."), + assistant("Stable checkpoint: keep stable sections byte-stable; canonical file src/core/summarize.ts."), + user("Also add scope item tail_scope_01 to the current scope. I prefer tail preference tail_pref_01."), + toolCall("bash", { command: "grep req_tail_ev_01 /tmp/tail-evidence-01.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_01 /tmp/tail-evidence-01.log"), + assistant("Recorded tail_scope_01, tail_pref_01, and req_tail_ev_01."), + user("Also add scope item tail_scope_02 to the current scope. I prefer tail preference tail_pref_02."), + toolCall("bash", { command: "grep req_tail_ev_02 /tmp/tail-evidence-02.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_02 /tmp/tail-evidence-02.log"), + assistant("Recorded tail_scope_02, tail_pref_02, and req_tail_ev_02."), + user("Also add scope item tail_scope_03 to the current scope. I prefer tail preference tail_pref_03."), + toolCall("bash", { command: "grep req_tail_ev_03 /tmp/tail-evidence-03.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_03 /tmp/tail-evidence-03.log"), + assistant("Recorded tail_scope_03, tail_pref_03, and req_tail_ev_03."), + user("Also add scope item tail_scope_04 to the current scope. I prefer tail preference tail_pref_04."), + toolCall("bash", { command: "grep req_tail_ev_04 /tmp/tail-evidence-04.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_04 /tmp/tail-evidence-04.log"), + assistant("Recorded tail_scope_04, tail_pref_04, and req_tail_ev_04."), + user("Also add scope item tail_scope_05 to the current scope. 
I prefer tail preference tail_pref_05."), + toolCall("bash", { command: "grep req_tail_ev_05 /tmp/tail-evidence-05.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_05 /tmp/tail-evidence-05.log"), + assistant("Recorded tail_scope_05, tail_pref_05, and req_tail_ev_05."), + user("Also add scope item tail_scope_06 to the current scope. I prefer tail preference tail_pref_06."), + toolCall("bash", { command: "grep req_tail_ev_06 /tmp/tail-evidence-06.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_06 /tmp/tail-evidence-06.log"), + assistant("Recorded tail_scope_06, tail_pref_06, and req_tail_ev_06."), + user("Also add scope item tail_scope_07 to the current scope. I prefer tail preference tail_pref_07."), + toolCall("bash", { command: "grep req_tail_ev_07 /tmp/tail-evidence-07.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_07 /tmp/tail-evidence-07.log"), + assistant("Recorded tail_scope_07, tail_pref_07, and req_tail_ev_07."), + user("Also add scope item tail_scope_08 to the current scope. 
I prefer tail preference tail_pref_08."), + toolCall("bash", { command: "grep req_tail_ev_08 /tmp/tail-evidence-08.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_08 /tmp/tail-evidence-08.log"), + assistant("Recorded tail_scope_08, tail_pref_08, and req_tail_ev_08."), + ], + compactionPoints: [10, 22, 34], + gold: { + activeTerms: [ + { label: "stable objective", term: "keep stable sections byte-stable" }, + { label: "latest scope", term: "tail_scope_08" }, + { label: "latest preference", term: "tail_pref_08" }, + { label: "latest evidence", term: "req_tail_ev_08" }, + ], + currentTerms: [ + { label: "stable objective", term: "keep stable sections byte-stable" }, + { label: "latest scope", term: "tail_scope_08" }, + { label: "latest preference", term: "tail_pref_08" }, + { label: "latest evidence", term: "req_tail_ev_08" }, + ], + recallTerms: [ + { label: "old scope", term: "tail_scope_01", query: "tail_scope_01" }, + { label: "old evidence", term: "req_tail_ev_01", query: "req_tail_ev_01" }, + ], + continuationTerms: [ + { label: "latest scope", term: "tail_scope_08" }, + ], + }, + }, { id: "cache-bust-volatile-next-step", description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", diff --git a/scripts/compare-compaction-refs.mjs b/scripts/compare-compaction-refs.mjs index 43590c5..bd4100b 100755 --- a/scripts/compare-compaction-refs.mjs +++ b/scripts/compare-compaction-refs.mjs @@ -105,6 +105,22 @@ const cacheBoundaries = { ], minStablePrefixTokens: 110, }, + "cache-bust-mutable-tail-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Scope Updates", + "Pi VCC Recent User Preferences", + "Pi VCC Recent Evidence Handles", + "Pi VCC Outstanding Context", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 140, + maxPromptLayerSizes: { + "Pi VCC Recent Scope Updates": 420, + "Pi VCC Recent User Preferences": 360, + "Pi VCC Recent Evidence Handles": 
260, + }, + }, }; const cacheFailures = (cycle) => { @@ -113,6 +129,9 @@ const cacheFailures = (cycle) => { let count = 0; if (!cycle.firstChangedPromptLayer || !boundary.allowedFirstChangedLayers.includes(cycle.firstChangedPromptLayer)) count += 1; if ((cycle.stablePrefixTokens ?? 0) < boundary.minStablePrefixTokens) count += 1; + for (const [layer, maxSize] of Object.entries(boundary.maxPromptLayerSizes ?? {})) { + if ((cycle.promptLayerSizes?.[layer] ?? 0) > maxSize) count += 1; + } return count; }; diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index 4d808c6..e8b6a2d 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -65,9 +65,21 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current } }; -const section = (title: string, items: string[]): string => { - if (items.length === 0) return ""; - const body = items.map((item) => `- ${item}`).join("\n"); +export const RECENT_SECTION_ITEM_LIMITS: Partial<Record<CurrentSectionName, number>> = { + "Recent Scope Updates": 6, + "Recent User Preferences": 6, + "Recent Evidence Handles": 8, +}; + +const cappedItems = (title: CurrentSectionName, items: string[]): string[] => { + const limit = RECENT_SECTION_ITEM_LIMITS[title]; + return limit && items.length > limit ?
items.slice(-limit) : items; +}; + +const section = (title: CurrentSectionName, items: string[]): string => { + const capped = cappedItems(title, items); + if (capped.length === 0) return ""; + const body = capped.map((item) => `- ${item}`).join("\n"); return `[${title}]\n${body}`; }; diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index dd18300..d86a346 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -77,6 +77,21 @@ describe("compaction state", () => { ]); }); + it("caps recent mutable sections to the latest items", () => { + const state = buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"] })); + state.current.recentScopeUpdates = Array.from({ length: 8 }, (_, i) => `scope-${i + 1}`); + state.current.recentUserPreferences = Array.from({ length: 8 }, (_, i) => `pref-${i + 1}`); + state.current.recentEvidenceHandles = Array.from({ length: 10 }, (_, i) => `evidence-${i + 1}`); + const rendered = renderCompactionState(state); + const lines = rendered.text.split("\n"); + expect(lines).not.toContain("- scope-1"); + expect(lines).toContain("- scope-8"); + expect(lines).not.toContain("- pref-1"); + expect(lines).toContain("- pref-8"); + expect(lines).not.toContain("- evidence-1"); + expect(lines).toContain("- evidence-10"); + }); + it("parses rendered summary back into structured state", () => { const rendered = renderCompactionState(buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"], From 438f5450f34592c3cd060c53872a79f2fbb365ca Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Tue, 28 Apr 2026 17:09:12 +0200 Subject: [PATCH 25/65] test: report compaction comparison outliers Add outlier sections to the ref comparison report for broader real-session runs: worst stable-prefix deltas, largest full-prompt growth, earliest changed head layers, and largest recent mutable layers. 
A real-limit 5 run shows aggregate improvement but also highlights the next bottlenecks: Commits is often the earliest changed stable layer, and Recent Evidence Handles can still be large in real sessions. Validation: node --check scripts/compare-compaction-refs.mjs; git diff --check; node scripts/compare-compaction-refs.mjs --head HEAD --compactors pi-vcc --real-only --real-sessions-dir ~/.pi/agent/sessions --real-limit 5 --show-layer-diff --out /tmp/pi-vcc-real-limit-5-report-1777388942. --- scripts/compare-compaction-refs.mjs | 82 +++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) diff --git a/scripts/compare-compaction-refs.mjs b/scripts/compare-compaction-refs.mjs index bd4100b..bad45f3 100755 --- a/scripts/compare-compaction-refs.mjs +++ b/scripts/compare-compaction-refs.mjs @@ -144,6 +144,32 @@ const mean = (items, selector) => { const fmt = (value, digits = 2) => value === null || value === undefined ? "n/a" : Number(value).toFixed(digits); const signed = (value, digits = 2) => value === null || value === undefined ? "n/a" : `${value >= 0 ? 
"+" : ""}${Number(value).toFixed(digits)}`; +const RECENT_MUTABLE_LAYERS = [ + "Pi VCC Recent Scope Updates", + "Pi VCC Recent User Preferences", + "Pi VCC Recent Evidence Handles", +]; + +const layerRank = (layer) => { + if (!layer) return 999; + if (layer === "Provider Prefix") return 0; + if (layer === "Tool Definitions") return 1; + if (layer === "Project Instructions") return 2; + if (layer.startsWith("Pi VCC Session Goal")) return 3; + if (layer.startsWith("Pi VCC Files")) return 4; + if (layer.startsWith("Pi VCC Commits")) return 5; + if (layer.startsWith("Pi VCC Evidence Handles")) return 6; + if (layer.startsWith("Pi VCC User Preferences")) return 7; + if (layer.startsWith("Pi VCC Current Scope")) return 8; + if (layer.startsWith("Pi VCC Recent")) return 9; + if (layer.startsWith("Pi VCC Outstanding")) return 10; + if (layer.startsWith("Pi VCC Brief")) return 11; + if (layer === "Kept Raw Tail") return 12; + return 50; +}; + +const rowLabel = (row) => `${row.caseId} / ${row.compactor} / cycle ${row.cycle}`; + const summarize = (label, rows) => ({ label, cycles: rows.length, @@ -181,6 +207,24 @@ const markdownReport = ({ baselineRows, headRows, baselinePath, headPath }) => { || correctnessFailures(baselineRow) !== correctnessFailures(headRow) || cacheFailures(baselineRow) !== cacheFailures(headRow)) .slice(0, 20); + const worstStablePrefixDeltas = pairs + .filter(({ baselineRow, headRow }) => baselineRow.stablePrefixTokens !== null && headRow.stablePrefixTokens !== null) + .map(({ baselineRow, headRow }) => ({ baselineRow, headRow, delta: headRow.stablePrefixTokens - baselineRow.stablePrefixTokens })) + .sort((a, b) => a.delta - b.delta) + .slice(0, 10); + const largestPromptGrowth = pairs + .map(({ baselineRow, headRow }) => ({ baselineRow, headRow, delta: headRow.fullPromptTokensEst - baselineRow.fullPromptTokensEst })) + .sort((a, b) => b.delta - a.delta) + .slice(0, 10); + const earliestFirstChanged = headRows + .filter((row) => row.cycle > 1 && 
row.firstChangedPromptLayer) + .sort((a, b) => layerRank(a.firstChangedPromptLayer) - layerRank(b.firstChangedPromptLayer) || (a.stablePrefixTokens ?? 0) - (b.stablePrefixTokens ?? 0)) + .slice(0, 10); + const largestRecentLayers = headRows + .flatMap((row) => RECENT_MUTABLE_LAYERS.map((layer) => ({ row, layer, size: row.promptLayerSizes?.[layer] ?? 0 }))) + .filter((entry) => entry.size > 0) + .sort((a, b) => b.size - a.size) + .slice(0, 10); const lines = []; lines.push("# Compaction Ref Comparison"); @@ -223,6 +267,44 @@ const markdownReport = ({ baselineRows, headRows, baselinePath, headPath }) => { } } lines.push(""); + lines.push("## Outliers"); + lines.push(""); + lines.push("### Worst stable-prefix deltas"); + lines.push(""); + lines.push("| case | baseline | head | delta | head first layer |"); + lines.push("| --- | ---: | ---: | ---: | --- |"); + for (const { baselineRow, headRow, delta } of worstStablePrefixDeltas) { + lines.push(`| ${rowLabel(headRow)} | ${baselineRow.stablePrefixTokens ?? "n/a"} | ${headRow.stablePrefixTokens ?? "n/a"} | ${signed(delta, 0)} | ${headRow.firstChangedPromptLayer ?? "n/a"} |`); + } + lines.push(""); + lines.push("### Largest full-prompt growth"); + lines.push(""); + lines.push("| case | baseline tokens | head tokens | delta | head first layer |"); + lines.push("| --- | ---: | ---: | ---: | --- |"); + for (const { baselineRow, headRow, delta } of largestPromptGrowth) { + lines.push(`| ${rowLabel(headRow)} | ${baselineRow.fullPromptTokensEst} | ${headRow.fullPromptTokensEst} | ${signed(delta, 0)} | ${headRow.firstChangedPromptLayer ?? "n/a"} |`); + } + lines.push(""); + lines.push("### Earliest changed head layers"); + lines.push(""); + lines.push("| case | first changed layer | stable prefix tokens | full prompt tokens |"); + lines.push("| --- | --- | ---: | ---: |"); + for (const row of earliestFirstChanged) { + lines.push(`| ${rowLabel(row)} | ${row.firstChangedPromptLayer ?? "n/a"} | ${row.stablePrefixTokens ?? 
"n/a"} | ${row.fullPromptTokensEst} |`); + } + lines.push(""); + lines.push("### Largest recent mutable layers"); + lines.push(""); + if (largestRecentLayers.length === 0) { + lines.push("No recent mutable layers were present in the head run."); + } else { + lines.push("| case | layer | chars |"); + lines.push("| --- | --- | ---: |"); + for (const { row, layer, size } of largestRecentLayers) { + lines.push(`| ${rowLabel(row)} | ${layer} | ${size} |`); + } + } + lines.push(""); return `${lines.join("\n")}\n`; }; From 2548dfeb3c699bda9ae2471d19dedc1d7733a3f7 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Tue, 28 Apr 2026 17:47:34 +0200 Subject: [PATCH 26/65] docs: add compaction north star guidance Add project-level agent guidance that frames pi-vcc compaction around expected continuation value: recall fidelity, semantic coherence, working room, retrieval dependence, and cache preservation. The guidance records the current stable/recent layout and benchmark commands future agents should use before claiming cache or correctness improvements. Validation: git diff --check; reviewed AGENTS.md for durable project guidance; reviewer subagent found no must-fix issues and suggested making the baseline ref explicit, which is included. --- AGENTS.md | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) create mode 100644 AGENTS.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..e60840d --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,106 @@ +# AGENTS.md + +## Project North Star + +`pi-vcc` is an algorithmic conversation compactor for Pi. Its goal is not merely to make summaries shorter; it is to maximize expected continuation value after compaction. + +Optimize compaction across these objectives: + +1. **Recall fidelity** — important goals, constraints, files, identifiers, evidence handles, decisions, blockers, and next actions remain available either in active context or recall. +2. 
**Semantic coherence** — the compacted state should let the agent understand what is happening, why it matters, and what to do next. +3. **Post-compaction working room** — active prompt state should stay compact enough to leave useful room for future work. +4. **Retrieval dependence** — bulky or older detail may move out of active context only when it remains recoverable through transcript, recall, files, or artifacts. +5. **Cache preservation** — stable prompt prefixes should remain byte/token stable across ordinary compactions; volatile updates should be isolated into late recent/volatile sections. + +A shorter summary is not better if it loses continuity, exact identifiers, recoverability, or cache reuse. + +## Compaction Design Principles + +- Prefer stable structured state over full-summary rewrites. +- Keep durable facts before volatile facts. +- Keep volatile updates in explicit recent/volatile sections. +- Preserve exact paths, identifiers, error signatures, request IDs, span/probe IDs, and commit references when they are relevant evidence. +- Offload bulky re-fetchable details to recall/history with pointers rather than active prompt bodies. +- Separate current truth from historical transcript. Stale or corrected facts may remain recallable, but must not remain current guidance. +- Treat prompt-cache churn as a first-class performance and cost concern. + +## Current Cache-Aware Layout + +Stable/current sections should remain as stable as possible: + +```text +Session Goal +Files And Changes +Commits +Evidence Handles +User Preferences +Current Scope +``` + +Recent/volatile sections may change more often and should stay bounded: + +```text +Recent Scope Updates +Recent User Preferences +Recent Evidence Handles +Outstanding Context +Brief Transcript +Kept Raw Tail +``` + +Do not move volatile content back into stable sections without benchmark-backed evidence. 
+ +## Benchmarking Expectations + +Use the Docker benchmark path as the primary validation route: + +```bash +docker build -t pi-vcc-bench . +docker run --rm pi-vcc-bench --compactors pi-vcc --assert +docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache +``` + +For original-vs-current comparisons: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --out /tmp/pi-vcc-compaction-compare +``` + +For real-session cache behavior: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 5 \ + --show-layer-diff \ + --out /tmp/pi-vcc-real-compare +``` + +## Interpreting Results + +Good changes should generally: + +- preserve or improve correctness assertions +- preserve or improve cache-boundary assertions +- move `firstChangedPromptLayer` later, not earlier +- increase stable-prefix tokens for repeated compactions +- avoid growing full prompt tokens unless the added state is justified +- keep recent/volatile sections bounded + +If a change improves one metric while hurting another, judge it by expected continuation value, not by any single metric alone. + +## Development Guidance + +- Add a focused RED probe before or alongside compaction behavior changes. +- Keep synthetic probes for exact correctness and cache-boundary behavior. +- Use real-session replay to find outliers and avoid overfitting synthetic cases. +- Prefer small semantic commits that can be reviewed and reverted independently. +- Do not claim cache improvements without fresh benchmark evidence. 
From 0d6288cda1e1bb0e6357ee8f122e3512888cb513 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Tue, 28 Apr 2026 18:15:07 +0200 Subject: [PATCH 27/65] feat: add pi-vcc compaction report card Emit a separate pi-vcc custom message after extension-driven compaction so users can sanity-check what changed without patching Pi's built-in compaction card. The report stores section policy/status, stable-vs-recent churn, cap warnings, source/kept counts, and machine-readable details on both the compaction details and the UI message.\n\nThe hook skips prior pi-vcc report cards while summarizing to avoid report self-churn, and the existing compile/compileWithLayers APIs are preserved via an internal compilation helper.\n\nValidation:\n- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -v /home/fl/.npm/_npx/86d717fff1af7182/node_modules:/app/node_modules:ro -w /app oven/bun:1.3.13 bun test tests/before-compact-hook.test.ts tests/compaction-report.test.ts tests/compaction-state.test.ts tests/compile.test.ts\n- docker build -t pi-vcc-bench .\n- docker run --rm pi-vcc-bench --compactors pi-vcc --assert\n- docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache --- README.md | 2 + index.ts | 2 + package.json | 1 + src/core/compaction-report.ts | 339 ++++++++++++++++++++++++++++++ src/core/summarize.ts | 62 +++++- src/details.ts | 3 + src/hooks/before-compact.ts | 57 ++++- src/ui/compaction-report-card.ts | 35 +++ tests/before-compact-hook.test.ts | 24 ++- tests/compaction-report.test.ts | 87 ++++++++ 10 files changed, 598 insertions(+), 14 deletions(-) create mode 100644 src/core/compaction-report.ts create mode 100644 src/ui/compaction-report-card.ts create mode 100644 tests/compaction-report.test.ts diff --git a/README.md b/README.md index 1d71812..939079a 100644 --- a/README.md +++ b/README.md @@ -50,6 +50,7 @@ Measured on real session JSONLs under `~/.pi/agent/sessions` (chars = rendered m - **`/pi-vcc-recall`** — slash command to search history directly, 
results shown as collapsible message and auto-fed to agent as context - **Fallback cut** — still works when Pi core returns nothing to summarize - **`/pi-vcc`** — manual compaction on demand +- **Compaction report card** — pi-vcc emits a separate sanity-check card after compaction with message counts, stable/recent section churn, cap warnings, and machine-readable details for deeper inspection ## Install @@ -74,6 +75,7 @@ pi -e https://github.com/sting8k/pi-vcc Once installed, pi-vcc registers a `session_before_compact` hook. - Run `/pi-vcc` to trigger pi-vcc compaction manually. +- After pi-vcc compacts, it emits a separate `[pi-vcc]` report card. The collapsed card is a quick sanity check; expand it for section-level churn, caps, warnings, and where to inspect the full machine-readable report. - By default, `/compact` and auto-threshold compactions still go through pi core (LLM-based). Set `overrideDefaultCompaction: true` in the config to let pi-vcc handle all compaction paths. - To search older active-lineage history after compaction, use `vcc_recall`. - To intentionally search across all lineages, pass `scope:"all"` to `vcc_recall` or run `/pi-vcc-recall scope:all`. 
diff --git a/index.ts b/index.ts index 93a0e02..a43b133 100644 --- a/index.ts +++ b/index.ts @@ -4,9 +4,11 @@ import { registerBeforeCompactHook } from "./src/hooks/before-compact"; import { registerPiVccCommand } from "./src/commands/pi-vcc"; import { registerVccRecallCommand } from "./src/commands/vcc-recall"; import { registerRecallTool } from "./src/tools/recall"; +import { registerCompactionReportCard } from "./src/ui/compaction-report-card"; export default (pi: ExtensionAPI) => { scaffoldSettings(); + registerCompactionReportCard(pi); registerBeforeCompactHook(pi); registerPiVccCommand(pi); registerVccRecallCommand(pi); diff --git a/package.json b/package.json index dac40fb..4a57bea 100644 --- a/package.json +++ b/package.json @@ -16,6 +16,7 @@ }, "peerDependencies": { "@mariozechner/pi-coding-agent": "*", + "@mariozechner/pi-tui": "*", "@sinclair/typebox": "*" }, "pi": { diff --git a/src/core/compaction-report.ts b/src/core/compaction-report.ts new file mode 100644 index 0000000..d8dce6e --- /dev/null +++ b/src/core/compaction-report.ts @@ -0,0 +1,339 @@ +import { + CURRENT_SECTION_ORDER, + RECENT_SECTION_ITEM_LIMITS, + type CompactionState, + type CompiledLayerRole, + type CompiledSummaryLayer, + type CurrentSectionName, +} from "./compaction-state"; + +export const PI_VCC_COMPACTION_REPORT_TYPE = "pi-vcc-compaction-report"; + +export type CompactionReportSectionPolicy = + | "stable-current" + | "recent-volatile" + | "history" + | "recall"; + +export type CompactionReportSectionStatus = "new" | "changed" | "unchanged"; + +export interface CompactionReportCap { + section: string; + before: number; + after: number; + dropped: number; +} + +export interface CompactionReportSection { + name: string; + title: string; + role: CompiledLayerRole; + policy: CompactionReportSectionPolicy; + status: CompactionReportSectionStatus; + itemCount: number; + renderedItemCount: number; + chars: number; + limit?: number; + capped?: CompactionReportCap; + reason: string; + 
preview: string[]; +} + +export interface BuildCompactionReportInput { + layers: CompiledSummaryLayer[]; + previousLayers: CompiledSummaryLayer[]; + state: CompactionState; + sourceMessageCount: number; + keptMessageCount: number; + keptTokensEst: number; + skippedInternalMessageCount?: number; + tokensBefore: number; + previousSummaryUsed: boolean; + summaryText: string; +} + +export interface PiVccCompactionReport { + compactor: "pi-vcc"; + version: 1; + sourceMessageCount: number; + keptMessageCount: number; + keptTokensEst: number; + skippedInternalMessageCount: number; + tokensBefore: number; + summaryChars: number; + previousSummaryUsed: boolean; + firstChangedLayer?: string; + firstChangedPolicy?: CompactionReportSectionPolicy; + stableSectionCount: number; + stableUnchangedCount: number; + stableChangedSections: string[]; + recentSectionCount: number; + cappedSections: CompactionReportCap[]; + sections: CompactionReportSection[]; + warnings: string[]; +} + +const STABLE_CURRENT_SECTIONS = new Set([ + "Session Goal", + "Files And Changes", + "Commits", + "Evidence Handles", + "User Preferences", + "Current Scope", +]); + +const RECENT_VOLATILE_SECTIONS = new Set([ + "Recent Scope Updates", + "Recent User Preferences", + "Recent Evidence Handles", + "Outstanding Context", +]); + +const titleOfLayer = (name: string): string => + name.startsWith("Pi VCC ") ? 
name.slice("Pi VCC ".length) : name;
+
+const isCurrentSectionName = (title: string): title is CurrentSectionName =>
+  (CURRENT_SECTION_ORDER as readonly string[]).includes(title);
+
+const stateItemsOf = (state: CompactionState, title: CurrentSectionName): string[] => {
+  switch (title) {
+    case "Session Goal": return state.current.sessionGoal;
+    case "Files And Changes": return state.current.filesAndChanges;
+    case "Commits": return state.current.commits;
+    case "Evidence Handles": return state.current.evidenceHandles;
+    case "User Preferences": return state.current.userPreferences;
+    case "Current Scope": return state.current.currentScope;
+    case "Recent Scope Updates": return state.current.recentScopeUpdates;
+    case "Recent User Preferences": return state.current.recentUserPreferences;
+    case "Recent Evidence Handles": return state.current.recentEvidenceHandles;
+    case "Outstanding Context": return state.current.outstandingContext;
+  }
+};
+
+const policyOf = (title: string, role: CompiledLayerRole): CompactionReportSectionPolicy => {
+  if (role === "history") return "history";
+  if (role === "recall") return "recall";
+  if (RECENT_VOLATILE_SECTIONS.has(title)) return "recent-volatile";
+  if (STABLE_CURRENT_SECTIONS.has(title)) return "stable-current";
+  return "stable-current";
+};
+
+const reasonOf = (policy: CompactionReportSectionPolicy): string => {
+  switch (policy) {
+    case "stable-current":
+      return "Durable current state kept early for continuity and cache reuse.";
+    case "recent-volatile":
+      return "Additive or volatile state isolated late so stable sections can stay cacheable.";
+    case "history":
+      return "Condensed transcript context for coherence when exact history is not needed inline.";
+    case "recall":
+      return "Pointer that older exact detail remains recoverable from transcript/recall.";
+  }
+};
+
+const statusOf = (
+  layer: CompiledSummaryLayer,
+  previousByName: Map<string, string>,
+): CompactionReportSectionStatus => {
+  if
(!previousByName.has(layer.name)) return "new"; + return previousByName.get(layer.name) === layer.text ? "unchanged" : "changed"; +}; + +const nonEmptyLines = (text: string): string[] => + text.split("\n").map((line) => line.trim()).filter(Boolean); + +const renderedItemCountOf = (layer: CompiledSummaryLayer): number => { + const bulletCount = (layer.text.match(/^- /gm) ?? []).length; + if (bulletCount > 0) return bulletCount; + if (layer.role === "recall") return layer.text.trim() ? 1 : 0; + return nonEmptyLines(layer.text).length; +}; + +const itemCountOf = (state: CompactionState, layer: CompiledSummaryLayer, title: string): number => { + if (isCurrentSectionName(title)) return stateItemsOf(state, title).length; + if (layer.role === "recall") return layer.text.trim() ? 1 : 0; + return nonEmptyLines(layer.text).length; +}; + +const previewOf = (layer: CompiledSummaryLayer): string[] => + nonEmptyLines(layer.text) + .filter((line) => !/^\[.+?\]$/.test(line)) + .map((line) => line.replace(/^-\s*/, "")) + .slice(0, 2) + .map((line) => line.length > 140 ? 
`${line.slice(0, 137)}...` : line);
+
+const capOf = (title: string, itemCount: number): CompactionReportCap | undefined => {
+  if (!isCurrentSectionName(title)) return undefined;
+  const limit = RECENT_SECTION_ITEM_LIMITS[title];
+  if (!limit || itemCount <= limit) return undefined;
+  return {
+    section: title,
+    before: itemCount,
+    after: limit,
+    dropped: itemCount - limit,
+  };
+};
+
+export const buildCompactionReport = (input: BuildCompactionReportInput): PiVccCompactionReport => {
+  const previousByName = new Map<string, string>(input.previousLayers.map((layer) => [layer.name, layer.text]));
+  const sections = input.layers.map((layer): CompactionReportSection => {
+    const title = titleOfLayer(layer.name);
+    const policy = policyOf(title, layer.role);
+    const itemCount = itemCountOf(input.state, layer, title);
+    const renderedItemCount = renderedItemCountOf(layer);
+    const capped = capOf(title, itemCount);
+    return {
+      name: layer.name,
+      title,
+      role: layer.role,
+      policy,
+      status: statusOf(layer, previousByName),
+      itemCount,
+      renderedItemCount,
+      chars: layer.text.length,
+      limit: isCurrentSectionName(title) ? RECENT_SECTION_ITEM_LIMITS[title] : undefined,
+      capped,
+      reason: reasonOf(policy),
+      preview: previewOf(layer),
+    };
+  });
+
+  const firstChanged = sections.find((section) => section.status !== "unchanged");
+  const stableSections = sections.filter((section) => section.policy === "stable-current");
+  const stableChangedSections = stableSections
+    .filter((section) => section.status !== "unchanged")
+    .map((section) => section.title);
+  const cappedSections = sections.flatMap((section) => section.capped ?
[section.capped] : []); + const warnings: string[] = []; + + if (input.previousSummaryUsed && firstChanged?.policy === "stable-current") { + warnings.push(`First changed layer is stable/current: ${firstChanged.title}`); + } + for (const cap of cappedSections) { + warnings.push(`${cap.section} capped from ${cap.before} to ${cap.after} items`); + } + + return { + compactor: "pi-vcc", + version: 1, + sourceMessageCount: input.sourceMessageCount, + keptMessageCount: input.keptMessageCount, + keptTokensEst: input.keptTokensEst, + skippedInternalMessageCount: input.skippedInternalMessageCount ?? 0, + tokensBefore: input.tokensBefore, + summaryChars: input.summaryText.length, + previousSummaryUsed: input.previousSummaryUsed, + firstChangedLayer: firstChanged?.name, + firstChangedPolicy: firstChanged?.policy, + stableSectionCount: stableSections.length, + stableUnchangedCount: stableSections.filter((section) => section.status === "unchanged").length, + stableChangedSections, + recentSectionCount: sections.filter((section) => section.policy === "recent-volatile").length, + cappedSections, + sections, + warnings, + }; +}; + +const plural = (n: number, singular: string, pluralForm = `${singular}s`): string => + `${n} ${n === 1 ? singular : pluralForm}`; + +const formatTokens = (n: number): string => { + if (n >= 1000) return `${(n / 1000).toFixed(1)}k`; + return String(n); +}; + +const shortLayerName = (name: string | undefined): string => + name ? titleOfLayer(name) : "none"; + +export const formatCompactionReportSummaryLine = (report: PiVccCompactionReport): string => { + const stable = report.previousSummaryUsed + ? `${report.stableUnchangedCount}/${report.stableSectionCount} stable unchanged` + : `${plural(report.stableSectionCount, "stable section")}`; + const firstChange = report.previousSummaryUsed + ? shortLayerName(report.firstChangedLayer) + : "new summary"; + const caps = report.cappedSections.length > 0 + ? 
`; capped ${plural(report.cappedSections.length, "section")}` + : ""; + const warnings = report.warnings.length > 0 + ? `; ${plural(report.warnings.length, "warning")}` + : ""; + return `Compacted ${plural(report.sourceMessageCount, "message")} from ~${formatTokens(report.tokensBefore)} tok; kept ${report.keptMessageCount} (~${formatTokens(report.keptTokensEst)} tok); ${stable}; first change: ${firstChange}${caps}${warnings}.`; +}; + +export const formatCompactionReportMessageContent = (report: PiVccCompactionReport): string => { + const lines = [ + formatCompactionReportSummaryLine(report), + "Full pi-vcc compaction report is stored on this UI message for inspection.", + ]; + if (report.skippedInternalMessageCount > 0) { + lines.push(`Skipped ${plural(report.skippedInternalMessageCount, "prior pi-vcc report message")} while summarizing.`); + } + return lines.join("\n"); +}; + +const statusGlyph = (status: CompactionReportSectionStatus): string => { + switch (status) { + case "unchanged": return "✓"; + case "changed": return "~"; + case "new": return "+"; + } +}; + +const policyLabel = (policy: CompactionReportSectionPolicy): string => { + switch (policy) { + case "stable-current": return "stable"; + case "recent-volatile": return "recent"; + case "history": return "history"; + case "recall": return "recall"; + } +}; + +export const formatCompactionReportCard = ( + report: PiVccCompactionReport, + options: { expanded?: boolean } = {}, +): string => { + if (!options.expanded) return `${formatCompactionReportSummaryLine(report)} Expand for section-level details.`; + + const lines: string[] = [ + formatCompactionReportSummaryLine(report), + "", + "Sanity check", + `- Previous summary used: ${report.previousSummaryUsed ? 
"yes" : "no"}`, + `- Summary size: ${report.summaryChars.toLocaleString()} chars`, + `- First changed layer: ${shortLayerName(report.firstChangedLayer)}`, + `- Stable/current unchanged: ${report.stableUnchangedCount}/${report.stableSectionCount}`, + ]; + + if (report.stableChangedSections.length > 0) { + lines.push(`- Stable/current changed: ${report.stableChangedSections.join(", ")}`); + } + if (report.cappedSections.length > 0) { + lines.push(`- Caps applied: ${report.cappedSections.map((cap) => `${cap.section} ${cap.before}->${cap.after}`).join(", ")}`); + } + if (report.skippedInternalMessageCount > 0) { + lines.push(`- Skipped internal report cards: ${report.skippedInternalMessageCount}`); + } + if (report.warnings.length > 0) { + lines.push("", "Warnings", ...report.warnings.map((warning) => `! ${warning}`)); + } + + lines.push("", "Sections"); + for (const section of report.sections) { + const cap = section.capped ? `, capped ${section.capped.before}->${section.capped.after}` : ""; + lines.push(`${statusGlyph(section.status)} ${section.title} — ${policyLabel(section.policy)}, ${section.status}, ${section.renderedItemCount}/${section.itemCount} items, ${section.chars} chars${cap}`); + if (section.preview.length > 0) { + lines.push(...section.preview.map((preview) => ` ${preview}`)); + } + } + + lines.push( + "", + "Deep dive", + "- The full machine-readable report is stored in this message's details and in compaction.details.report.", + "- Ask to inspect the pi-vcc compaction report or session JSONL if you want source-level detail.", + ); + + return lines.join("\n"); +}; diff --git a/src/core/summarize.ts b/src/core/summarize.ts index 7df9363..b5c910f 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -10,10 +10,15 @@ import { CURRENT_SECTION_ORDER, parseCompactionState, renderCompactionState, + type CompactionState, type CompiledLayerRole, type CompiledSummaryLayer, type CompileWithLayersResult, } from "./compaction-state"; +import { + 
buildCompactionReport, + type PiVccCompactionReport, +} from "./compaction-report"; export interface CompileInput { messages: Message[]; @@ -21,6 +26,18 @@ export interface CompileInput { fileOps?: FileOps; } +export interface CompileReportContext { + sourceMessageCount: number; + keptMessageCount: number; + keptTokensEst: number; + skippedInternalMessageCount?: number; + tokensBefore: number; +} + +export interface CompileWithReportResult extends CompileWithLayersResult { + report: PiVccCompactionReport; +} + export type { CompiledLayerRole, CompiledSummaryLayer, CompileWithLayersResult } from "./compaction-state"; const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", "Recent User Preferences", "Recent Scope Updates", ...CURRENT_SECTION_ORDER]; @@ -211,9 +228,13 @@ const mergePrevious = (prev: string, fresh: string): string => { return parts.join(SEPARATOR); }; -export const compile = (input: CompileInput): string => compileWithLayers(input).text; +interface CompilationBuild { + state: CompactionState; + previousLayers: CompiledSummaryLayer[]; + rendered: CompileWithLayersResult; +} -export const compileWithLayers = (input: CompileInput): CompileWithLayersResult => { +const buildCompilation = (input: CompileInput): CompilationBuild => { const blocks = filterNoise(normalize(input.messages)); const data = buildSections({ blocks }); const fresh = renderCompactionState(buildCompactionState(data)).text; @@ -223,8 +244,41 @@ export const compileWithLayers = (input: CompileInput): CompileWithLayersResult ? stripRecallNote(input.previousSummary) : undefined; const merged = prev ? mergePrevious(prev, fresh) : fresh; - if (!merged) return { text: "", layers: [] }; - return renderCompactionState(parseCompactionState(merged), { includeRecallNote: true }); + const state = parseCompactionState(merged); + const previousLayers = prev + ? renderCompactionState(parseCompactionState(prev), { includeRecallNote: true }).layers + : []; + const rendered = merged + ? 
renderCompactionState(state, { includeRecallNote: true }) + : { text: "", layers: [] }; + return { state, previousLayers, rendered }; +}; + +export const compile = (input: CompileInput): string => compileWithLayers(input).text; + +export const compileWithLayers = (input: CompileInput): CompileWithLayersResult => + buildCompilation(input).rendered; + +export const compileWithReport = ( + input: CompileInput, + context: CompileReportContext, +): CompileWithReportResult => { + const compilation = buildCompilation(input); + return { + ...compilation.rendered, + report: buildCompactionReport({ + layers: compilation.rendered.layers, + previousLayers: compilation.previousLayers, + state: compilation.state, + sourceMessageCount: context.sourceMessageCount, + keptMessageCount: context.keptMessageCount, + keptTokensEst: context.keptTokensEst, + skippedInternalMessageCount: context.skippedInternalMessageCount, + tokensBefore: context.tokensBefore, + previousSummaryUsed: Boolean(input.previousSummary?.trim()), + summaryText: compilation.rendered.text, + }), + }; }; const stripRecallNote = (text: string): string => { diff --git a/src/details.ts b/src/details.ts index 323d2ba..a827fe1 100644 --- a/src/details.ts +++ b/src/details.ts @@ -1,7 +1,10 @@ +import type { PiVccCompactionReport } from "./core/compaction-report"; + export interface PiVccCompactionDetails { compactor: "pi-vcc"; version: number; sections: string[]; sourceMessageCount: number; previousSummaryUsed: boolean; + report?: PiVccCompactionReport; } diff --git a/src/hooks/before-compact.ts b/src/hooks/before-compact.ts index c83adda..97a67f9 100644 --- a/src/hooks/before-compact.ts +++ b/src/hooks/before-compact.ts @@ -1,8 +1,13 @@ import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"; import { convertToLlm } from "@mariozechner/pi-coding-agent"; import { writeFileSync } from "fs"; -import { compile } from "../core/summarize"; +import { compileWithReport } from "../core/summarize"; import { loadSettings, 
type PiVccSettings } from "../core/settings"; +import { + formatCompactionReportMessageContent, + PI_VCC_COMPACTION_REPORT_TYPE, + type PiVccCompactionReport, +} from "../core/compaction-report"; import type { PiVccCompactionDetails } from "../details"; export const PI_VCC_COMPACT_INSTRUCTION = "__pi_vcc__"; @@ -15,6 +20,7 @@ export interface CompactionStats { let lastStats: CompactionStats | null = null; let lastCompactWasPiVcc = false; +let pendingReport: PiVccCompactionReport | null = null; export const getLastCompactionStats = () => lastStats; const formatTokens = (n: number): string => { @@ -46,9 +52,12 @@ const previewContent = (content: unknown): string => { interface EntryWithMessage { entry: { id: string; type: string }; - message: { role: string; content: unknown }; + message: { role: string; content: unknown; customType?: string }; } +const isPiVccReportMessage = (message: any): boolean => + message?.role === "custom" && message?.customType === PI_VCC_COMPACTION_REPORT_TYPE; + export type OwnCutCancelReason = | "no_live_messages" | "too_few_live_messages" @@ -213,7 +222,9 @@ export const registerBeforeCompactHook = (pi: ExtensionAPI) => { return { cancel: true }; } - const agentMessages = ownCut.messages; + const rawAgentMessages = ownCut.messages; + const skippedInternalMessageCount = rawAgentMessages.filter(isPiVccReportMessage).length; + const agentMessages = rawAgentMessages.filter((message: any) => !isPiVccReportMessage(message)); const firstKeptEntryId = ownCut.firstKeptEntryId; const messages = convertToLlm(agentMessages); @@ -233,22 +244,31 @@ export const registerBeforeCompactHook = (pi: ExtensionAPI) => { }, 0); return sum; }, 0); + const keptTokensEst = Math.round(keptChars / 4); lastStats = { summarized: agentMessages.length, kept: keptEntries.length, - keptTokensEst: Math.round(keptChars / 4), + keptTokensEst, }; const config = settings; - const summary = compile({ + const compiled = compileWithReport({ messages, previousSummary: 
preparation.previousSummary, fileOps: { readFiles: [...preparation.fileOps.read], modifiedFiles: [...preparation.fileOps.written, ...preparation.fileOps.edited], }, + }, { + sourceMessageCount: agentMessages.length, + keptMessageCount: keptEntries.length, + keptTokensEst, + skippedInternalMessageCount, + tokensBefore: preparation.tokensBefore, }); + const summary = compiled.text; + const report = compiled.report; const branchIds = branchEntries.map((e: any) => e.id); const cutIdx = branchIds.indexOf(firstKeptEntryId); @@ -264,6 +284,7 @@ export const registerBeforeCompactHook = (pi: ExtensionAPI) => { dbg(config, { usedOwnCut: true, messagesToSummarize: agentMessages.length, + skippedInternalMessageCount, messagesPreviewHead: agentMessages.slice(0, 3).map((m: any) => ({ role: m.role, preview: previewContent(m.content) })), messagesPreviewTail: agentMessages.slice(-3).map((m: any) => ({ role: m.role, preview: previewContent(m.content) })), convertedMessages: messages.length, @@ -277,13 +298,15 @@ export const registerBeforeCompactHook = (pi: ExtensionAPI) => { const details: PiVccCompactionDetails = { compactor: "pi-vcc", - version: 1, + version: 2, sections: [...summary.matchAll(/^\[(.+?)\]/gm)].map((m) => m[1]), sourceMessageCount: agentMessages.length, previousSummaryUsed: Boolean(preparation.previousSummary), + report, }; lastCompactWasPiVcc = isPiVcc; + pendingReport = report; return { compaction: { @@ -295,11 +318,27 @@ export const registerBeforeCompactHook = (pi: ExtensionAPI) => { }; }); - // Fire success toast for /compact path only (delayed to let UI settle). - // /pi-vcc path uses its own onComplete callback in the command handler. pi.on("session_compact", (event, ctx) => { if (!event.fromExtension) return; - if (lastCompactWasPiVcc) return; // /pi-vcc handles its own toast via onComplete + + const details = (event.compactionEntry as any)?.details as PiVccCompactionDetails | undefined; + const report = details?.compactor === "pi-vcc" ? 
details.report : pendingReport; + pendingReport = null; + + if (report) { + try { + pi.sendMessage({ + customType: PI_VCC_COMPACTION_REPORT_TYPE, + content: formatCompactionReportMessageContent(report), + display: true, + details: report, + }, { deliverAs: "nextTurn" }); + } catch {} + } + + // Fire success toast for /compact path only (delayed to let UI settle). + // /pi-vcc path uses its own onComplete callback in the command handler. + if (lastCompactWasPiVcc) return; const stats = lastStats; if (!stats) return; setTimeout(() => { diff --git a/src/ui/compaction-report-card.ts b/src/ui/compaction-report-card.ts new file mode 100644 index 0000000..255fcdb --- /dev/null +++ b/src/ui/compaction-report-card.ts @@ -0,0 +1,35 @@ +import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"; +import { Box, Spacer, Text } from "@mariozechner/pi-tui"; +import { + formatCompactionReportCard, + PI_VCC_COMPACTION_REPORT_TYPE, + type PiVccCompactionReport, +} from "../core/compaction-report"; + +const colorReportLine = (line: string, theme: any): string => { + if (line.startsWith("! 
")) return theme.fg("warning", line); + if (line.startsWith("✓ ")) return theme.fg("success", line); + if (line.startsWith("~ ") || line.startsWith("+ ")) return theme.fg("accent", line); + if (line.startsWith(" ") || line.startsWith("- ")) return theme.fg("dim", line); + return theme.fg("customMessageText", line); +}; + +const isReport = (value: unknown): value is PiVccCompactionReport => + typeof value === "object" && value !== null && (value as any).compactor === "pi-vcc"; + +export const registerCompactionReportCard = (pi: ExtensionAPI) => { + pi.registerMessageRenderer(PI_VCC_COMPACTION_REPORT_TYPE, (message, options, theme) => { + if (!isReport(message.details)) return undefined; + + const box = new Box(1, 1, (text: string) => theme.bg("customMessageBg", text)); + box.addChild(new Text(theme.fg("customMessageLabel", "\x1b[1m[pi-vcc]\x1b[22m"), 0, 0)); + box.addChild(new Spacer(1)); + + const body = formatCompactionReportCard(message.details, { expanded: options.expanded }) + .split("\n") + .map((line) => colorReportLine(line, theme)) + .join("\n"); + box.addChild(new Text(body, 0, 0)); + return box; + }); +}; diff --git a/tests/before-compact-hook.test.ts b/tests/before-compact-hook.test.ts index c8d7bfe..8b879ca 100644 --- a/tests/before-compact-hook.test.ts +++ b/tests/before-compact-hook.test.ts @@ -3,6 +3,7 @@ import { existsSync, unlinkSync, writeFileSync, readFileSync, mkdtempSync, rmSyn import { tmpdir } from "os"; import { join } from "path"; import { registerBeforeCompactHook, PI_VCC_COMPACT_INSTRUCTION } from "../src/hooks/before-compact"; +import { PI_VCC_COMPACTION_REPORT_TYPE } from "../src/core/compaction-report"; let tmpDir: string; let CONFIG_PATH: string; @@ -22,7 +23,9 @@ afterAll(() => { // Minimal ExtensionAPI stub: capture handler + provide ctx with mocked ui.notify function createMockPi() { let handler: ((event: any, ctx: any) => any) | undefined; + let compactHandler: ((event: any, ctx: any) => any) | undefined; const notifyCalls: 
Array<{ msg: string; level: string }> = []; + const sentMessages: Array<{ message: any; options: any }> = []; const ctx = { hasUI: true, ui: { @@ -35,10 +38,16 @@ function createMockPi() { pi: { on: (eventName: string, h: (e: any, c: any) => any) => { if (eventName === "session_before_compact") handler = h; + if (eventName === "session_compact") compactHandler = h; + }, + sendMessage: (message: any, options: any) => { + sentMessages.push({ message, options }); }, } as any, invoke: (event: any) => handler!(event, ctx), + invokeCompact: (event: any) => compactHandler!(event, ctx), notifyCalls, + sentMessages, }; } @@ -164,7 +173,7 @@ describe("registerBeforeCompactHook: compact-all path", () => { test("single-user + autonomous tail → returns compaction with empty firstKeptEntryId", () => { setConfig({ debug: false, overrideDefaultCompaction: false }); - const { pi, invoke, notifyCalls } = createMockPi(); + const { pi, invoke, invokeCompact, notifyCalls, sentMessages } = createMockPi(); registerBeforeCompactHook(pi); const entries = [ @@ -176,6 +185,19 @@ describe("registerBeforeCompactHook: compact-all path", () => { const result = invoke(makeEvent(entries, PI_VCC_COMPACT_INSTRUCTION)); expect(result.compaction).toBeDefined(); expect(result.compaction.firstKeptEntryId).toBe(""); + expect(result.compaction.details.report).toMatchObject({ + compactor: "pi-vcc", + sourceMessageCount: 4, + keptMessageCount: 0, + tokensBefore: 1000, + }); expect(notifyCalls).toHaveLength(0); // no cancel notify on success + + invokeCompact({ fromExtension: true, compactionEntry: result.compaction }); + expect(sentMessages).toHaveLength(1); + expect(sentMessages[0].message.customType).toBe(PI_VCC_COMPACTION_REPORT_TYPE); + expect(sentMessages[0].message.display).toBe(true); + expect(sentMessages[0].message.details).toBe(result.compaction.details.report); + expect(sentMessages[0].options).toEqual({ deliverAs: "nextTurn" }); }); }); diff --git a/tests/compaction-report.test.ts 
b/tests/compaction-report.test.ts new file mode 100644 index 0000000..39686c9 --- /dev/null +++ b/tests/compaction-report.test.ts @@ -0,0 +1,87 @@ +import { describe, expect, test } from "bun:test"; +import { + buildCompactionReport, + formatCompactionReportCard, + formatCompactionReportMessageContent, +} from "../src/core/compaction-report"; +import { parseCompactionState, renderCompactionState } from "../src/core/compaction-state"; + +const reportFor = (previousSummary: string | undefined, currentSummary: string) => { + const state = parseCompactionState(currentSummary); + const rendered = renderCompactionState(state, { includeRecallNote: true }); + const previousLayers = previousSummary + ? renderCompactionState(parseCompactionState(previousSummary), { includeRecallNote: true }).layers + : []; + return buildCompactionReport({ + layers: rendered.layers, + previousLayers, + state, + sourceMessageCount: 12, + keptMessageCount: 3, + keptTokensEst: 240, + tokensBefore: 4800, + previousSummaryUsed: Boolean(previousSummary), + summaryText: rendered.text, + }); +}; + +describe("compaction report", () => { + test("identifies recent-only churn after stable current sections", () => { + const previous = [ + "[Session Goal]", + "- Build cache-aware compaction", + "", + "[Current Scope]", + "- Make compaction inspectable", + ].join("\n"); + const current = [ + previous, + "", + "[Recent Scope Updates]", + "- Add a separate pi-vcc report card", + ].join("\n"); + + const report = reportFor(previous, current); + + expect(report.firstChangedLayer).toBe("Pi VCC Recent Scope Updates"); + expect(report.firstChangedPolicy).toBe("recent-volatile"); + expect(report.stableUnchangedCount).toBe(2); + expect(report.stableChangedSections).toEqual([]); + expect(report.warnings).toEqual([]); + }); + + test("reports caps for bounded recent sections", () => { + const current = [ + "[Session Goal]", + "- Build cache-aware compaction", + "", + "[Recent Evidence Handles]", + ...Array.from({ 
length: 10 }, (_, i) => `- Paths: /tmp/evidence-${i}.json`), + ].join("\n"); + + const report = reportFor(undefined, current); + + expect(report.cappedSections).toEqual([{ section: "Recent Evidence Handles", before: 10, after: 8, dropped: 2 }]); + expect(report.warnings).toContain("Recent Evidence Handles capped from 10 to 8 items"); + const recentEvidence = report.sections.find((section) => section.title === "Recent Evidence Handles"); + expect(recentEvidence?.itemCount).toBe(10); + expect(recentEvidence?.renderedItemCount).toBe(8); + }); + + test("formats a concise card with a machine-readable deep-dive hint", () => { + const current = [ + "[Session Goal]", + "- Build cache-aware compaction", + ].join("\n"); + + const report = reportFor(undefined, current); + const content = formatCompactionReportMessageContent(report); + const expanded = formatCompactionReportCard(report, { expanded: true }); + + expect(content).toContain("Compacted 12 messages"); + expect(content).toContain("stored on this UI message"); + expect(expanded).toContain("Sanity check"); + expect(expanded).toContain("Deep dive"); + expect(expanded).toContain("compaction.details.report"); + }); +}); From acaf4cc85bbd3e3d3586f816ffac30956443273f Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Tue, 28 Apr 2026 18:46:50 +0200 Subject: [PATCH 28/65] feat: expose compaction report deep dives Add /pi-vcc-report as the follow-up channel for pi-vcc's compact report card. The command can list reports, write Markdown/JSON artifacts for the latest report, show an inline expanded report, or print raw JSON when explicitly requested. 
Report discovery reads both compaction details and the rendered report-card custom messages while deduping duplicate records.

Also expose the same report data in the offline benchmark via --include-report and add --explain for human-readable per-cycle rationale, so synthetic and real-session runs can be inspected outside the TUI.

Validation:
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -v /home/fl/.npm/_npx/86d717fff1af7182/node_modules:/app/node_modules:ro -w /app oven/bun:1.3.13 bun test tests/compaction-report-command.test.ts tests/compaction-report-history.test.ts tests/compaction-report.test.ts tests/before-compact-hook.test.ts tests/compile.test.ts
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -w /app oven/bun:1.3.13 bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-scope-growth --include-report --jsonl
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -w /app oven/bun:1.3.13 bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-scope-growth --explain
- docker build -t pi-vcc-bench .
- docker run --rm pi-vcc-bench --compactors pi-vcc --assert
- docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache
---
 README.md                               |   9 ++
 bench/compaction/README.md              |  19 +++
 bench/compaction/offline-runner.ts      |  22 +++-
 index.ts                                |   2 +
 scripts/bench-compaction.ts             |  17 ++-
 src/commands/pi-vcc-report.ts           |  93 ++++++++++++++
 src/core/compaction-report-history.ts   | 162 ++++++++++++++++++++++++
 src/core/compaction-report.ts           |   2 +-
 tests/compaction-report-command.test.ts |  90 +++++++++++++
 tests/compaction-report-history.test.ts |  93 ++++++++++++++
 tests/compaction-report.test.ts         |   1 +
 11 files changed, 503 insertions(+), 7 deletions(-)
 create mode 100644 src/commands/pi-vcc-report.ts
 create mode 100644 src/core/compaction-report-history.ts
 create mode 100644 tests/compaction-report-command.test.ts
 create mode 100644 tests/compaction-report-history.test.ts

diff --git a/README.md 
b/README.md index 939079a..001f31e 100644 --- a/README.md +++ b/README.md @@ -51,6 +51,7 @@ Measured on real session JSONLs under `~/.pi/agent/sessions` (chars = rendered m - **Fallback cut** — still works when Pi core returns nothing to summarize - **`/pi-vcc`** — manual compaction on demand - **Compaction report card** — pi-vcc emits a separate sanity-check card after compaction with message counts, stable/recent section churn, cap warnings, and machine-readable details for deeper inspection +- **`/pi-vcc-report`** — writes latest report Markdown/JSON artifacts or displays the report inline for a deeper inspection channel ## Install @@ -76,6 +77,7 @@ Once installed, pi-vcc registers a `session_before_compact` hook. - Run `/pi-vcc` to trigger pi-vcc compaction manually. - After pi-vcc compacts, it emits a separate `[pi-vcc]` report card. The collapsed card is a quick sanity check; expand it for section-level churn, caps, warnings, and where to inspect the full machine-readable report. +- Run `/pi-vcc-report` to write the latest report to Markdown/JSON files under `/tmp/pi-vcc-reports` and show the paths. Use `/pi-vcc-report show` for an inline expanded report, `/pi-vcc-report json inline` for raw JSON, or `/pi-vcc-report list` to list available reports. - By default, `/compact` and auto-threshold compactions still go through pi core (LLM-based). Set `overrideDefaultCompaction: true` in the config to let pi-vcc handle all compaction paths. - To search older active-lineage history after compaction, use `vcc_recall`. - To intentionally search across all lineages, pass `scope:"all"` to `vcc_recall` or run `/pi-vcc-recall scope:all`. 
@@ -228,6 +230,13 @@ Pass benchmark arguments after the image name: docker run --rm pi-vcc-bench --compactors pi-vcc,cache-aware-layered ``` +Explain pi-vcc report decisions for a focused case: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-scope-growth --explain +bun scripts/bench-compaction.ts --compactors pi-vcc --include-report --jsonl +``` + Use assertion mode when checking a selected compactor against the current benchmark gates: ```bash diff --git a/bench/compaction/README.md b/bench/compaction/README.md index 1878578..dd25b36 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -186,6 +186,25 @@ bun scripts/bench-compaction.ts \ --jsonl ``` +Include pi-vcc's machine-readable compaction report in each JSON/JSONL cycle when you need section policies, stable/recent churn, caps, and warnings: + +```bash +bun scripts/bench-compaction.ts \ + --compactors pi-vcc \ + --case-filter cache-bust-scope-growth \ + --include-report \ + --jsonl +``` + +Print a human-readable report explanation instead of JSON: + +```bash +bun scripts/bench-compaction.ts \ + --compactors pi-vcc \ + --case-filter cache-bust-scope-growth \ + --explain +``` + Run the same checks in Docker: ```bash diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index 982adde..4b57675 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -1,11 +1,12 @@ import { performance } from "node:perf_hooks"; import type { Message } from "@mariozechner/pi-ai"; -import { compileWithLayers } from "../../src/core/summarize"; +import { compileWithReport } from "../../src/core/summarize"; import { buildSections } from "../../src/core/build-sections"; import { normalize } from "../../src/core/normalize"; import { renderMessage } from "../../src/core/render-entries"; import { clip, textOf } from "../../src/core/content"; import { summarizeToolResultForPrompt } from 
"../../src/core/tool-result-summary"; +import type { PiVccCompactionReport } from "../../src/core/compaction-report"; import { syntheticCompactionCases, type CompactionBenchmarkCase, type ExpectedTerm } from "./synthetic-cases"; export type LayerRole = "static" | "current" | "history" | "recall"; @@ -35,6 +36,7 @@ export interface CompactorResult { activePromptState: string; layers: LayerSnapshot[]; recallCorpus: RecallDocument[]; + report?: PiVccCompactionReport; stats: { compactionMs: number; estimatedInputTokens?: number; @@ -114,6 +116,7 @@ export interface CycleMetrics { promptLayerSizes: Record; promptLayerTokenDeltas: Record; promptLayerDiffs?: PromptLayerDiff[]; + compactionReport?: PiVccCompactionReport; } export interface BenchmarkRunResult { @@ -481,16 +484,24 @@ export const offlineCompactors: OfflineCompactor[] = [ { name: "pi-vcc", compact: ({ messages, allMessages, previous }) => { + const inputTokens = estimateTokens(sourceTextOf(messages)); + const keptTail = allMessages.slice(-2); const start = performance.now(); - const summary = compileWithLayers({ messages, previousSummary: previous?.activePromptState }); + const summary = compileWithReport({ messages, previousSummary: previous?.activePromptState }, { + sourceMessageCount: messages.length, + keptMessageCount: keptTail.length, + keptTokensEst: estimateTokens(sourceTextOf(keptTail)), + tokensBefore: estimateTokens(sourceTextOf(allMessages)), + }); const elapsed = performance.now() - start; return { activePromptState: summary.text, layers: summary.layers, recallCorpus: renderedDocuments(allMessages), + report: summary.report, stats: { compactionMs: elapsed, - estimatedInputTokens: estimateTokens(sourceTextOf(messages)), + estimatedInputTokens: inputTokens, estimatedOutputTokens: estimateTokens(summary.text), }, }; @@ -574,6 +585,7 @@ const cycleMetrics = ( prompt: PromptSnapshot, previousPrompt: PromptSnapshot | undefined, includeDiagnostics: boolean, + includeReports: boolean, ): CycleMetrics => 
{ const sourceText = sourceTextOf(sourceMessages); const activeText = result.activePromptState; @@ -638,6 +650,7 @@ const cycleMetrics = ( ...(includeDiagnostics && promptChanged.changedPromptLayers.length > 0 ? { promptLayerDiffs: changedPromptLayerDiffs(previousPrompt, prompt, promptChanged.changedPromptLayers) } : {}), + ...(includeReports && result.report ? { compactionReport: result.report } : {}), }; }; @@ -764,6 +777,7 @@ export const runOfflineCompactionBenchmark = (options: { cases?: CompactionBenchmarkCase[]; compactors?: OfflineCompactor[]; includeDiagnostics?: boolean; + includeReports?: boolean; } = {}): BenchmarkRunResult => { const cases = options.cases ?? syntheticCompactionCases; const compactors = options.compactors ?? offlineCompactors; @@ -784,7 +798,7 @@ export const runOfflineCompactionBenchmark = (options: { cycle: index + 1, }); const prompt = simulatedPromptOf(result, sourceMessages); - cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous, prompt, previousPrompt, Boolean(options.includeDiagnostics))); + cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous, prompt, previousPrompt, Boolean(options.includeDiagnostics), Boolean(options.includeReports))); previous = result; previousPrompt = prompt; previousPoint = point; diff --git a/index.ts b/index.ts index a43b133..a56fdd2 100644 --- a/index.ts +++ b/index.ts @@ -3,6 +3,7 @@ import { scaffoldSettings } from "./src/core/settings"; import { registerBeforeCompactHook } from "./src/hooks/before-compact"; import { registerPiVccCommand } from "./src/commands/pi-vcc"; import { registerVccRecallCommand } from "./src/commands/vcc-recall"; +import { registerPiVccReportCommand } from "./src/commands/pi-vcc-report"; import { registerRecallTool } from "./src/tools/recall"; import { registerCompactionReportCard } from "./src/ui/compaction-report-card"; @@ -11,6 +12,7 @@ export default (pi: ExtensionAPI) => { 
registerCompactionReportCard(pi); registerBeforeCompactHook(pi); registerPiVccCommand(pi); + registerPiVccReportCommand(pi); registerVccRecallCommand(pi); registerRecallTool(pi); }; diff --git a/scripts/bench-compaction.ts b/scripts/bench-compaction.ts index a690743..47db926 100644 --- a/scripts/bench-compaction.ts +++ b/scripts/bench-compaction.ts @@ -2,6 +2,7 @@ import { failedCacheGatesOf, failedGatesOf, offlineCompactors, runOfflineCompactionBenchmark } from "../bench/compaction/offline-runner"; import { syntheticCompactionCases } from "../bench/compaction/synthetic-cases"; import { loadRealSessionCases } from "../bench/compaction/real-sessions"; +import { formatCompactionReportCard } from "../src/core/compaction-report"; const args = process.argv.slice(2); @@ -20,6 +21,7 @@ const realLimitRaw = argValue("--real-limit"); const realLimit = realLimitRaw ? Number.parseInt(realLimitRaw, 10) : undefined; const caseFilter = argValue("--case-filter"); const includeDiagnostics = hasFlag("--show-layer-diff"); +const includeReports = hasFlag("--include-report") || hasFlag("--explain"); const selected = argValue("--compactors") ?.split(",") @@ -46,7 +48,7 @@ const filteredCases = caseFilter ? 
cases.filter((testCase) => testCase.id.includes(caseFilter) || testCase.description.includes(caseFilter)) : cases; -const result = runOfflineCompactionBenchmark({ compactors, cases: filteredCases, includeDiagnostics }); +const result = runOfflineCompactionBenchmark({ compactors, cases: filteredCases, includeDiagnostics, includeReports }); const failures = result.cycles .map((cycle) => ({ cycle, gates: failedGatesOf(cycle) })) .filter((entry) => entry.gates.length > 0); @@ -54,7 +56,18 @@ const cacheFailures = result.cycles .map((cycle) => ({ cycle, gates: failedCacheGatesOf(cycle) })) .filter((entry) => entry.gates.length > 0); -if (hasFlag("--jsonl")) { +if (hasFlag("--explain")) { + for (const cycle of result.cycles) { + console.log(`## ${cycle.caseId} / ${cycle.compactor} / cycle ${cycle.cycle}`); + console.log(`compactionPoint=${cycle.compactionPoint} firstChangedPromptLayer=${cycle.firstChangedPromptLayer ?? "none"} stablePrefixTokens=${cycle.stablePrefixTokens ?? "n/a"}`); + if (cycle.compactionReport) { + console.log(formatCompactionReportCard(cycle.compactionReport, { expanded: true })); + } else { + console.log("No compaction report available for this compactor."); + } + console.log(""); + } +} else if (hasFlag("--jsonl")) { for (const cycle of result.cycles) { console.log(JSON.stringify(cycle)); } diff --git a/src/commands/pi-vcc-report.ts b/src/commands/pi-vcc-report.ts new file mode 100644 index 0000000..45c0660 --- /dev/null +++ b/src/commands/pi-vcc-report.ts @@ -0,0 +1,93 @@ +import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"; +import { readFileSync } from "fs"; +import { + findCompactionReportRecords, + formatCompactionReportCommandSummary, + formatCompactionReportRecordList, + PI_VCC_REPORT_COMMAND_TYPE, + selectCompactionReportRecord, + writeCompactionReportArtifacts, +} from "../core/compaction-report-history"; +import { formatCompactionReportCard } from "../core/compaction-report"; + +const parseSessionFileEntries = (sessionFile: 
string | undefined): any[] => {
+  if (!sessionFile) return [];
+  try {
+    return readFileSync(sessionFile, "utf-8")
+      .split("\n")
+      .filter((line) => line.trim())
+      .map((line) => {
+        try { return JSON.parse(line); } catch { return undefined; }
+      })
+      .filter(Boolean);
+  } catch {
+    return [];
+  }
+};
+
+const sessionEntriesOf = (ctx: any): any[] => {
+  try {
+    const entries = ctx.sessionManager.getEntries?.();
+    if (Array.isArray(entries) && entries.length > 0) return entries;
+  } catch {}
+  return parseSessionFileEntries(ctx.sessionManager.getSessionFile?.());
+};
+
+const entryIdFromArgs = (args: string): string | undefined =>
+  args.match(/\bentry:([^\s]+)/i)?.[1];
+
+export const registerPiVccReportCommand = (pi: ExtensionAPI) => {
+  pi.registerCommand("pi-vcc-report", {
+    description: "Inspect latest pi-vcc compaction report; args: list, show, json, entry:<id>",
+    handler: async (args: string, ctx) => {
+      const raw = args.trim();
+      const lower = raw.toLowerCase();
+      const records = findCompactionReportRecords(sessionEntriesOf(ctx));
+
+      if (lower.includes("list")) {
+        pi.sendMessage({
+          customType: PI_VCC_REPORT_COMMAND_TYPE,
+          content: formatCompactionReportRecordList(records),
+          display: true,
+        });
+        return;
+      }
+
+      const entryId = entryIdFromArgs(raw);
+      const record = selectCompactionReportRecord(records, entryId);
+      if (!record) {
+        const suffix = entryId ? 
` for entry ${entryId}` : ""; + ctx.ui.notify(`No pi-vcc compaction report found${suffix}.`, "warning"); + return; + } + + if (lower.includes("json") && lower.includes("inline")) { + pi.sendMessage({ + customType: PI_VCC_REPORT_COMMAND_TYPE, + content: `\`\`\`json\n${JSON.stringify(record.report, null, 2)}\n\`\`\``, + display: true, + details: record.report, + }); + return; + } + + if (lower.includes("show") || lower.includes("inline")) { + pi.sendMessage({ + customType: PI_VCC_REPORT_COMMAND_TYPE, + content: formatCompactionReportCard(record.report, { expanded: true }), + display: true, + details: record.report, + }); + return; + } + + const artifacts = writeCompactionReportArtifacts(record); + pi.sendMessage({ + customType: PI_VCC_REPORT_COMMAND_TYPE, + content: formatCompactionReportCommandSummary(record, artifacts), + display: true, + details: { report: record.report, artifacts }, + }); + }, + }); +}; diff --git a/src/core/compaction-report-history.ts b/src/core/compaction-report-history.ts new file mode 100644 index 0000000..bd849d7 --- /dev/null +++ b/src/core/compaction-report-history.ts @@ -0,0 +1,162 @@ +import { mkdirSync, writeFileSync } from "fs"; +import { join } from "path"; +import { tmpdir } from "os"; +import { + formatCompactionReportCard, + formatCompactionReportSummaryLine, + PI_VCC_COMPACTION_REPORT_TYPE, + type PiVccCompactionReport, +} from "./compaction-report"; +import type { PiVccCompactionDetails } from "../details"; + +export const PI_VCC_REPORT_COMMAND_TYPE = "pi-vcc-report"; + +export interface CompactionReportRecord { + entryId: string; + entryIds: string[]; + entryType: "compaction" | "custom_message" | "message"; + timestamp?: string; + report: PiVccCompactionReport; +} + +export interface CompactionReportArtifacts { + markdownPath: string; + jsonPath: string; +} + +export const isPiVccCompactionReport = (value: unknown): value is PiVccCompactionReport => { + if (typeof value !== "object" || value === null) return false; + const 
report = value as Partial<PiVccCompactionReport>;
+  return report.compactor === "pi-vcc"
+    && report.version === 1
+    && Array.isArray(report.sections)
+    && typeof report.sourceMessageCount === "number"
+    && typeof report.tokensBefore === "number";
+};
+
+const isPiVccDetails = (value: unknown): value is PiVccCompactionDetails =>
+  typeof value === "object" && value !== null && (value as PiVccCompactionDetails).compactor === "pi-vcc";
+
+const recordKeyOf = (record: CompactionReportRecord): string =>
+  JSON.stringify({
+    sourceMessageCount: record.report.sourceMessageCount,
+    keptMessageCount: record.report.keptMessageCount,
+    tokensBefore: record.report.tokensBefore,
+    summaryChars: record.report.summaryChars,
+    firstChangedLayer: record.report.firstChangedLayer,
+    sections: record.report.sections.map((section) => [section.name, section.status, section.itemCount, section.chars]),
+  });
+
+export const findCompactionReportRecords = (entries: any[]): CompactionReportRecord[] => {
+  const records: CompactionReportRecord[] = [];
+
+  for (const entry of entries) {
+    if (entry?.type === "compaction" && isPiVccDetails(entry.details) && isPiVccCompactionReport(entry.details.report)) {
+      records.push({
+        entryId: String(entry.id ?? ""),
+        entryIds: [String(entry.id ?? "")],
+        entryType: "compaction",
+        timestamp: entry.timestamp,
+        report: entry.details.report,
+      });
+      continue;
+    }
+
+    if (entry?.type === "custom_message"
+      && entry.customType === PI_VCC_COMPACTION_REPORT_TYPE
+      && isPiVccCompactionReport(entry.details)) {
+      records.push({
+        entryId: String(entry.id ?? ""),
+        entryIds: [String(entry.id ?? "")],
+        entryType: "custom_message",
+        timestamp: entry.timestamp,
+        report: entry.details,
+      });
+      continue;
+    }
+
+    if (entry?.type === "message"
+      && entry.message?.role === "custom"
+      && entry.message?.customType === PI_VCC_COMPACTION_REPORT_TYPE
+      && isPiVccCompactionReport(entry.message?.details)) {
+      records.push({
+        entryId: String(entry.id ?? ""),
+        entryIds: [String(entry.id ?? 
"")], + entryType: "message", + timestamp: entry.timestamp, + report: entry.message.details, + }); + } + } + + const deduped = new Map(); + for (const record of records) { + const key = recordKeyOf(record); + const previous = deduped.get(key); + deduped.set(key, previous + ? { ...record, entryIds: [...previous.entryIds, ...record.entryIds] } + : record); + } + return [...deduped.values()]; +}; + +export const latestCompactionReportRecord = (entries: any[]): CompactionReportRecord | undefined => { + const records = findCompactionReportRecords(entries); + return records[records.length - 1]; +}; + +export const selectCompactionReportRecord = ( + records: CompactionReportRecord[], + entryId?: string, +): CompactionReportRecord | undefined => { + if (!entryId) return records[records.length - 1]; + return records.find((record) => record.entryId === entryId || record.entryIds.includes(entryId)); +}; + +const safeId = (entryId: string): string => + entryId.replace(/[^a-zA-Z0-9_.-]/g, "_").slice(0, 80) || "latest"; + +export const writeCompactionReportArtifacts = (record: CompactionReportRecord): CompactionReportArtifacts => { + const dir = join(tmpdir(), "pi-vcc-reports"); + mkdirSync(dir, { recursive: true }); + const base = `pi-vcc-report-${safeId(record.entryId)}`; + const markdownPath = join(dir, `${base}.md`); + const jsonPath = join(dir, `${base}.json`); + + writeFileSync(markdownPath, `${formatCompactionReportCard(record.report, { expanded: true })}\n`, "utf-8"); + writeFileSync(jsonPath, `${JSON.stringify(record.report, null, 2)}\n`, "utf-8"); + return { markdownPath, jsonPath }; +}; + +export const formatCompactionReportRecordList = (records: CompactionReportRecord[], limit = 10): string => { + if (records.length === 0) return "No pi-vcc compaction reports found in this session."; + const recent = records.slice(-limit); + const lines = [ + `pi-vcc compaction reports (${records.length} found, showing ${recent.length})`, + "", + ]; + for (const [index, record] of 
recent.entries()) { + lines.push([ + `${records.length - recent.length + index + 1}.`, + record.timestamp ?? "unknown-time", + `[${record.entryType}:${record.entryId}]`, + formatCompactionReportSummaryLine(record.report), + ].join(" ")); + } + return lines.join("\n"); +}; + +export const formatCompactionReportCommandSummary = ( + record: CompactionReportRecord, + artifacts: CompactionReportArtifacts, +): string => [ + "Latest pi-vcc compaction report", + "", + formatCompactionReportSummaryLine(record.report), + "", + "Deep dive artifacts", + `- Markdown: ${artifacts.markdownPath}`, + `- JSON: ${artifacts.jsonPath}`, + "", + `Use /pi-vcc-report show to display the expanded report inline, or /pi-vcc-report json inline to print raw JSON into the session.`, +].join("\n"); diff --git a/src/core/compaction-report.ts b/src/core/compaction-report.ts index d8dce6e..7344ccc 100644 --- a/src/core/compaction-report.ts +++ b/src/core/compaction-report.ts @@ -332,7 +332,7 @@ export const formatCompactionReportCard = ( "", "Deep dive", "- The full machine-readable report is stored in this message's details and in compaction.details.report.", - "- Ask to inspect the pi-vcc compaction report or session JSONL if you want source-level detail.", + "- Run /pi-vcc-report for Markdown/JSON artifacts, /pi-vcc-report show for inline detail, or /pi-vcc-report list for older reports.", ); return lines.join("\n"); diff --git a/tests/compaction-report-command.test.ts b/tests/compaction-report-command.test.ts new file mode 100644 index 0000000..5915492 --- /dev/null +++ b/tests/compaction-report-command.test.ts @@ -0,0 +1,90 @@ +import { describe, expect, test } from "bun:test"; +import { registerPiVccReportCommand } from "../src/commands/pi-vcc-report"; +import type { PiVccCompactionReport } from "../src/core/compaction-report"; +import { PI_VCC_REPORT_COMMAND_TYPE } from "../src/core/compaction-report-history"; + +const sampleReport = (): PiVccCompactionReport => ({ + compactor: "pi-vcc", + 
version: 1, + sourceMessageCount: 3, + keptMessageCount: 1, + keptTokensEst: 25, + skippedInternalMessageCount: 0, + tokensBefore: 300, + summaryChars: 120, + previousSummaryUsed: false, + firstChangedLayer: "Pi VCC Session Goal", + firstChangedPolicy: "stable-current", + stableSectionCount: 1, + stableUnchangedCount: 0, + stableChangedSections: ["Session Goal"], + recentSectionCount: 0, + cappedSections: [], + warnings: [], + sections: [{ + name: "Pi VCC Session Goal", + title: "Session Goal", + role: "current", + policy: "stable-current", + status: "new", + itemCount: 1, + renderedItemCount: 1, + chars: 42, + reason: "stable", + preview: ["Build report inspection"], + }], +}); + +const createMockPi = (entries: any[]) => { + let handler: ((args: string, ctx: any) => Promise) | undefined; + const sentMessages: any[] = []; + const notifications: any[] = []; + const pi = { + registerCommand: (_name: string, options: any) => { handler = options.handler; }, + sendMessage: (message: any, options?: any) => sentMessages.push({ message, options }), + } as any; + const ctx = { + sessionManager: { + getEntries: () => entries, + getSessionFile: () => undefined, + }, + ui: { + notify: (message: string, level: string) => notifications.push({ message, level }), + }, + }; + registerPiVccReportCommand(pi); + return { + run: (args: string) => handler!(args, ctx), + sentMessages, + notifications, + }; +}; + +describe("pi-vcc-report command", () => { + test("writes artifact summary for latest report by default", async () => { + const report = sampleReport(); + const mock = createMockPi([ + { id: "c1", type: "compaction", timestamp: "t1", details: { compactor: "pi-vcc", version: 2, report } }, + ]); + + await mock.run(""); + + expect(mock.sentMessages).toHaveLength(1); + expect(mock.sentMessages[0].message.customType).toBe(PI_VCC_REPORT_COMMAND_TYPE); + expect(mock.sentMessages[0].message.content).toContain("Deep dive artifacts"); + 
expect(mock.sentMessages[0].message.details.report).toBe(report); + }); + + test("shows inline report or warning when requested report is missing", async () => { + const report = sampleReport(); + const mock = createMockPi([ + { id: "c1", type: "compaction", timestamp: "t1", details: { compactor: "pi-vcc", version: 2, report } }, + ]); + + await mock.run("show entry:c1"); + await mock.run("entry:missing"); + + expect(mock.sentMessages[0].message.content).toContain("Sanity check"); + expect(mock.notifications).toEqual([{ message: "No pi-vcc compaction report found for entry missing.", level: "warning" }]); + }); +}); diff --git a/tests/compaction-report-history.test.ts b/tests/compaction-report-history.test.ts new file mode 100644 index 0000000..832aae9 --- /dev/null +++ b/tests/compaction-report-history.test.ts @@ -0,0 +1,93 @@ +import { describe, expect, test } from "bun:test"; +import { readFileSync } from "fs"; +import type { PiVccCompactionReport } from "../src/core/compaction-report"; +import { PI_VCC_COMPACTION_REPORT_TYPE } from "../src/core/compaction-report"; +import { + findCompactionReportRecords, + formatCompactionReportCommandSummary, + formatCompactionReportRecordList, + selectCompactionReportRecord, + writeCompactionReportArtifacts, +} from "../src/core/compaction-report-history"; + +const report = (firstChangedLayer = "Pi VCC Recent Scope Updates"): PiVccCompactionReport => ({ + compactor: "pi-vcc", + version: 1, + sourceMessageCount: 12, + keptMessageCount: 2, + keptTokensEst: 123, + skippedInternalMessageCount: 0, + tokensBefore: 4800, + summaryChars: 900, + previousSummaryUsed: true, + firstChangedLayer, + firstChangedPolicy: "recent-volatile", + stableSectionCount: 4, + stableUnchangedCount: 4, + stableChangedSections: [], + recentSectionCount: 1, + cappedSections: [], + warnings: [], + sections: [ + { + name: "Pi VCC Session Goal", + title: "Session Goal", + role: "current", + policy: "stable-current", + status: "unchanged", + itemCount: 1, + 
renderedItemCount: 1, + chars: 42, + reason: "stable", + preview: ["Build cache-aware compaction"], + }, + { + name: firstChangedLayer, + title: firstChangedLayer.replace(/^Pi VCC /, ""), + role: "current", + policy: "recent-volatile", + status: "new", + itemCount: 1, + renderedItemCount: 1, + chars: 58, + reason: "recent", + preview: ["Add report inspection"], + }, + ], +}); + +describe("compaction report history", () => { + test("finds and dedupes reports from compaction and custom report messages", () => { + const first = report(); + const second = report("Pi VCC Recent Evidence Handles"); + const entries = [ + { id: "c1", type: "compaction", timestamp: "t1", details: { compactor: "pi-vcc", version: 2, report: first } }, + { id: "m1", type: "custom_message", timestamp: "t2", customType: PI_VCC_COMPACTION_REPORT_TYPE, details: first }, + { id: "c2", type: "compaction", timestamp: "t3", details: { compactor: "pi-vcc", version: 2, report: second } }, + ]; + + const records = findCompactionReportRecords(entries); + + expect(records).toHaveLength(2); + expect(records[0]).toMatchObject({ entryId: "m1", entryIds: ["c1", "m1"], entryType: "custom_message" }); + expect(records[1]).toMatchObject({ entryId: "c2", entryType: "compaction" }); + expect(selectCompactionReportRecord(records, "c1")?.entryId).toBe("m1"); + }); + + test("formats list and writes markdown/json deep-dive artifacts", () => { + const [record] = findCompactionReportRecords([ + { id: "c1", type: "compaction", timestamp: "t1", details: { compactor: "pi-vcc", version: 2, report: report() } }, + ]); + + const artifacts = writeCompactionReportArtifacts(record); + const list = formatCompactionReportRecordList([record]); + const summary = formatCompactionReportCommandSummary(record, artifacts); + + expect(list).toContain("pi-vcc compaction reports"); + expect(list).toContain("compaction:c1"); + expect(summary).toContain("Deep dive artifacts"); + expect(summary).toContain(artifacts.markdownPath); + 
expect(readFileSync(artifacts.markdownPath, "utf-8")).toContain("Sanity check"); + expect(JSON.parse(readFileSync(artifacts.jsonPath, "utf-8"))).toMatchObject({ compactor: "pi-vcc" }); + }); +}); diff --git a/tests/compaction-report.test.ts b/tests/compaction-report.test.ts index 39686c9..feadbae 100644 --- a/tests/compaction-report.test.ts +++ b/tests/compaction-report.test.ts @@ -83,5 +83,6 @@ describe("compaction report", () => { expect(expanded).toContain("Sanity check"); expect(expanded).toContain("Deep dive"); expect(expanded).toContain("compaction.details.report"); + expect(expanded).toContain("/pi-vcc-report"); }); }); From a8afd8570d443ac7efe04418553f3eb0aca00b7c Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Tue, 28 Apr 2026 23:25:45 +0200 Subject: [PATCH 29/65] fix: isolate commit and evidence cache churn Add RED cache-boundary probes for two real-session outliers: additive commits rewriting the stable Commits layer and a single long evidence line bloating Recent Evidence Handles. The probes failed before the implementation because commits changed Pi VCC Commits and long path lists exceeded the recent evidence cap.

Route additive commits to bounded Recent Commits while keeping established Commits stable, and clip evidence values/lines with a stable (+more) suffix so recent evidence remains useful without growing unbounded.
Update docs and report policy to include Recent Commits.

Validation:
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -v /home/fl/.npm/_npx/86d717fff1af7182/node_modules:/app/node_modules:ro -w /app oven/bun:1.3.13 bun test tests/compaction-state.test.ts tests/compile.test.ts tests/extract-evidence.test.ts tests/compaction-report.test.ts
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -w /app oven/bun:1.3.13 bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-commit-growth --assert-cache --show-layer-diff --jsonl
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -w /app oven/bun:1.3.13 bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-long-evidence-line --assert-cache --show-layer-diff --jsonl
- docker build -t pi-vcc-bench .
- docker run --rm pi-vcc-bench --compactors pi-vcc --assert
- docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache --- AGENTS.md | 1 + README.md | 34 ++++++++----- bench/compaction/README.md | 2 + bench/compaction/offline-runner.ts | 22 +++++++++ bench/compaction/synthetic-cases.ts | 74 +++++++++++++++++++++++++++++ src/core/compaction-report.ts | 2 + src/core/compaction-state.ts | 6 +++ src/core/summarize.ts | 12 ++++- src/extract/evidence.ts | 27 +++++++++-- tests/compaction-state.test.ts | 7 ++- tests/compile.test.ts | 15 ++++++ tests/extract-evidence.test.ts | 15 ++++++ 12 files changed, 200 insertions(+), 17 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index e60840d..0314a1b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -40,6 +40,7 @@ Current Scope Recent/volatile sections may change more often and should stay bounded: ```text +Recent Commits Recent Scope Updates Recent User Preferences Recent Evidence Handles diff --git a/README.md b/README.md index 001f31e..7869106 100644 --- a/README.md +++ b/README.md @@ -103,13 +103,19 @@ Pi splits the conversation at the **last user message**.
Everything after — th [Commits] - a1b2c3d: fix(auth): refresh token after password reset -[Outstanding Context] -- lint check still failing on line 42 - [User Preferences] - Prefer Vietnamese responses - Always run tests before committing +[Current Scope] +- Update token refresh tests + +[Recent Commits] +- b2c3d4e: test(auth): cover token refresh + +[Outstanding Context] +- lint check still failing on line 42 + [user] Fix the auth bug, users can't log in after password reset @@ -127,18 +133,22 @@ Sections appear only when relevant — a session with no git commits won't have | Section | Description | |---|---| -| `[Session Goal]` | Initial goal + scope changes (regex-based extraction) | -| `[Files And Changes]` | Modified/created files from tool calls (capped, paths trimmed to common root) | -| `[Commits]` | Git commits made during the session (last 8, hash + first line) | -| `[Outstanding Context]` | Unresolved items — errors, pending questions | -| `[User Preferences]` | Regex-extracted from user messages (`always`, `never`, `prefer`...) | +| `[Session Goal]` | Durable objective and initial task context | +| `[Files And Changes]` | Modified/created/read files from tool calls (capped, paths trimmed to common root) | +| `[Commits]` | Established git commits already part of stable current state | +| `[Evidence Handles]` | Established paths, error signatures, request IDs, spans, probes, and labeled commit hashes | +| `[User Preferences]` | Established regex-extracted preferences (`always`, `never`, `prefer`...) 
| +| `[Current Scope]` | Durable current scope once established | +| `[Recent Commits]`, `[Recent Scope Updates]`, `[Recent User Preferences]`, `[Recent Evidence Handles]` | Fresh additive facts isolated late to protect stable prompt-cache prefixes | +| `[Outstanding Context]` | Volatile unresolved items — errors, blockers, pending questions | | Brief transcript | Chronological conversation flow — rolling window of ~120 recent lines, tool calls collapsed to one-liners with `(#N)` refs | **Merge policy:** -- `Session Goal`, `User Preferences`: concise sticky sections -- `Outstanding Context`: fresh-only (replaced each compaction) -- `Files And Changes`, `Commits`: unique union across compactions -- Brief transcript: rolling window, older lines drop off +- Stable/current sections stay byte-stable whenever possible. +- Additive commits, scope, preferences, and evidence route to bounded `Recent *` sections. +- Explicit preference corrections rewrite stable preferences. +- `Outstanding Context` is fresh-only (replaced each compaction). +- Brief transcript is a rolling window; older exact detail remains recoverable via recall/session JSONL. ## Recall (Lossless History) diff --git a/bench/compaction/README.md b/bench/compaction/README.md index dd25b36..1b33788 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -153,6 +153,8 @@ The current cache-boundary probes are: - `cache-bust-evidence-growth`: first change should be `Pi VCC Recent Evidence Handles` or later. - `cache-bust-scope-growth`: first change should be `Pi VCC Recent Scope Updates` or later. - `cache-bust-mutable-tail-growth`: first change should be in a recent/volatile layer and recent layer sizes must stay under their caps. +- `cache-bust-commit-growth`: new commits should first change `Pi VCC Recent Commits`, not the stable `Pi VCC Commits` section. 
+- `cache-bust-long-evidence-line`: long fresh evidence should first change `Pi VCC Recent Evidence Handles` while keeping that layer under its size cap. Append sampled real Pi sessions from a local session directory. Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index 4b57675..668270e 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -755,6 +755,28 @@ const CACHE_BOUNDARIES: Record = { "Pi VCC Recent Evidence Handles": 260, }, }, + "cache-bust-commit-growth": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Commits", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 115, + maxPromptLayerSizes: { + "Pi VCC Recent Commits": 520, + }, + }, + "cache-bust-long-evidence-line": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Evidence Handles", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 105, + maxPromptLayerSizes: { + "Pi VCC Recent Evidence Handles": 260, + }, + }, }; export const failedCacheGatesOf = (cycle: CycleMetrics): string[] => { diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index e9959dd..a31538f 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -82,6 +82,11 @@ const noisyLog = (needle: string): string => [ ...Array.from({ length: 80 }, (_, i) => `debug ${String(i + 80).padStart(2, "0")}: retry window unchanged`), ].join("\n"); +const longEvidencePayload = (needle: string): string => [ + ...Array.from({ length: 24 }, (_, i) => `/tmp/pi-vcc-cache-evidence/${needle}/very/deep/path/with/verbose/component/name/cache-proof-artifact-${String(i + 1).padStart(2, "0")}.json`), + `CACHE_LONG_EVIDENCE request_id=${needle}`, +].join("\n"); + export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ { id: 
"boundary-loss-auth-refresh", @@ -386,6 +391,75 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "cache-bust-commit-growth", + description: "New git commits should not rewrite the stable commit section across repeated compactions.", + messages: [ + user("Maintain cache-aware compaction. Stable objective: keep commit evidence visible without busting the stable prompt prefix."), + assistant("Stable checkpoint: objective keep commit evidence visible; canonical file src/extract/commits.ts."), + toolCall("bash", { command: "git commit -m \"test: add cache churn probe\"" }), + toolResult("bash", "[feat/cache a1b2c3d] test: add cache churn probe\n 2 files changed"), + assistant("Commit a1b2c3d recorded for the cache churn probe."), + toolCall("bash", { command: "git commit -m \"fix: keep commit section stable\"" }), + toolResult("bash", "[feat/cache b2c3d4e] fix: keep commit section stable\n 3 files changed"), + assistant("Commit b2c3d4e recorded while preserving the stable objective."), + toolCall("bash", { command: "git commit -m \"docs: explain commit cache boundary\"" }), + toolResult("bash", "[feat/cache c3d4e5f] docs: explain commit cache boundary\n 1 file changed"), + assistant("Commit c3d4e5f recorded; next compare commit cache boundary metrics."), + ], + compactionPoints: [5, 8, 11], + gold: { + activeTerms: [ + { label: "stable objective", term: "keep commit evidence visible" }, + { label: "canonical file", term: "src/extract/commits.ts" }, + { label: "latest commit", term: "c3d4e5f" }, + ], + currentTerms: [ + { label: "stable objective", term: "keep commit evidence visible" }, + { label: "canonical file", term: "src/extract/commits.ts" }, + { label: "latest commit", term: "c3d4e5f" }, + ], + recallTerms: [ + { label: "middle commit", term: "b2c3d4e", query: "b2c3d4e commit section stable" }, + ], + continuationTerms: [ + { label: "next proof", term: "compare commit cache boundary metrics" }, + ], + }, + }, + { + id: 
"cache-bust-long-evidence-line", + description: "A single fresh evidence line with many long paths should be clipped, not allowed to bloat the recent evidence layer.", + messages: [ + user("Audit evidence formatting. Stable objective: keep evidence useful while bounding recent evidence line length."), + assistant("Stable checkpoint: evidence must stay useful and bounded; canonical file src/extract/evidence.ts."), + toolCall("bash", { command: "grep req_long_ev_anchor /tmp/pi-vcc-cache-evidence/anchor.log" }), + toolResult("bash", "CACHE_LONG_EVIDENCE request_id=req_long_ev_anchor /tmp/pi-vcc-cache-evidence/anchor.log"), + assistant("Initial evidence handle req_long_ev_anchor is recorded."), + toolCall("bash", { command: "find /tmp/pi-vcc-cache-evidence/req_long_ev_latest -type f" }), + toolResult("bash", longEvidencePayload("req_long_ev_latest")), + assistant("Latest evidence handle req_long_ev_latest is recorded; keep the long path list bounded."), + ], + compactionPoints: [5, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "bounding recent evidence line length" }, + { label: "canonical file", term: "src/extract/evidence.ts" }, + { label: "latest evidence", term: "req_long_ev_latest" }, + ], + currentTerms: [ + { label: "stable objective", term: "bounding recent evidence line length" }, + { label: "canonical file", term: "src/extract/evidence.ts" }, + { label: "latest evidence", term: "req_long_ev_latest" }, + ], + recallTerms: [ + { label: "long path payload", term: "cache-proof-artifact-24.json", query: "cache-proof-artifact-24" }, + ], + continuationTerms: [ + { label: "bounded path list", term: "long path list bounded" }, + ], + }, + }, { id: "cache-bust-volatile-next-step", description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", diff --git a/src/core/compaction-report.ts b/src/core/compaction-report.ts index 7344ccc..12ff145 100644 --- a/src/core/compaction-report.ts +++ 
b/src/core/compaction-report.ts @@ -83,6 +83,7 @@ const STABLE_CURRENT_SECTIONS = new Set([ ]); const RECENT_VOLATILE_SECTIONS = new Set([ + "Recent Commits", "Recent Scope Updates", "Recent User Preferences", "Recent Evidence Handles", @@ -100,6 +101,7 @@ const stateItemsOf = (state: CompactionState, title: CurrentSectionName): string case "Session Goal": return state.current.sessionGoal; case "Files And Changes": return state.current.filesAndChanges; case "Commits": return state.current.commits; + case "Recent Commits": return state.current.recentCommits; case "Evidence Handles": return state.current.evidenceHandles; case "User Preferences": return state.current.userPreferences; case "Current Scope": return state.current.currentScope; diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index e8b6a2d..4d593c4 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -21,6 +21,7 @@ export interface CompactionState { recentScopeUpdates: string[]; filesAndChanges: string[]; commits: string[]; + recentCommits: string[]; evidenceHandles: string[]; recentEvidenceHandles: string[]; userPreferences: string[]; @@ -42,6 +43,7 @@ export const CURRENT_SECTION_ORDER = [ "Evidence Handles", "User Preferences", "Current Scope", + "Recent Commits", "Recent Scope Updates", "Recent User Preferences", "Recent Evidence Handles", @@ -57,6 +59,7 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current case "Recent Scope Updates": return "recentScopeUpdates"; case "Files And Changes": return "filesAndChanges"; case "Commits": return "commits"; + case "Recent Commits": return "recentCommits"; case "Evidence Handles": return "evidenceHandles"; case "Recent Evidence Handles": return "recentEvidenceHandles"; case "User Preferences": return "userPreferences"; @@ -66,6 +69,7 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current }; export const RECENT_SECTION_ITEM_LIMITS: Partial> = { + "Recent 
Commits": 8, "Recent Scope Updates": 6, "Recent User Preferences": 6, "Recent Evidence Handles": 8, @@ -90,6 +94,7 @@ export const buildCompactionState = (data: SectionData): CompactionState => ({ recentScopeUpdates: [], filesAndChanges: data.filesAndChanges, commits: data.commits, + recentCommits: [], evidenceHandles: data.evidenceHandles, recentEvidenceHandles: [], userPreferences: data.userPreferences, @@ -120,6 +125,7 @@ const emptyCurrent = (): CompactionState["current"] => ({ recentScopeUpdates: [], filesAndChanges: [], commits: [], + recentCommits: [], evidenceHandles: [], recentEvidenceHandles: [], userPreferences: [], diff --git a/src/core/summarize.ts b/src/core/summarize.ts index b5c910f..ea0b906 100644 --- a/src/core/summarize.ts +++ b/src/core/summarize.ts @@ -40,7 +40,7 @@ export interface CompileWithReportResult extends CompileWithLayersResult { export type { CompiledLayerRole, CompiledSummaryLayer, CompileWithLayersResult } from "./compaction-state"; -const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", "Recent User Preferences", "Recent Scope Updates", ...CURRENT_SECTION_ORDER]; +const HEADER_NAMES = ["Evidence Handles", "Recent Evidence Handles", "Recent Commits", "Recent User Preferences", "Recent Scope Updates", ...CURRENT_SECTION_ORDER]; const SEPARATOR = "\n\n---\n\n"; @@ -75,6 +75,7 @@ const briefOf = (text: string): string => { /** Merge a header section */ const mergeHeaderSection = (header: string, prev: string, fresh: string): string => { if (header === "Evidence Handles") return prev || fresh; + if (header === "Commits") return prev || fresh; if (header === "User Preferences" && prev && fresh && !/\b(correction|never)\b/i.test(fresh)) return prev; // Keep established scope stable; additive fresh scope is rendered later. if (header === "Current Scope") return prev || fresh; @@ -154,6 +155,13 @@ const freshRecentEvidenceSection = (prevEvidence: string, freshEvidence: string) return freshOnly.length > 0 ? 
`[Recent Evidence Handles]\n${freshOnly.join("\n")}` : ""; }; +const freshRecentCommitsSection = (prevCommits: string, freshCommits: string): string => { + if (!prevCommits || !freshCommits) return ""; + const previous = new Set(cleanListItemsOf(prevCommits)); + const freshOnly = cleanListItemsOf(freshCommits).filter((line) => !previous.has(line)); + return freshOnly.length > 0 ? `[Recent Commits]\n${freshOnly.join("\n")}` : ""; +}; + const freshRecentScopeSection = (prevScope: string, freshScope: string): string => { if (!prevScope || !freshScope) return ""; const previous = new Set(cleanListItemsOf(prevScope)); @@ -199,11 +207,13 @@ const mergePrevious = (prev: string, fresh: string): string => { const mergeFresh = demoteFreshGoalToScope(fresh); // Merge header sections const recentEvidence = freshRecentEvidenceSection(sectionOf(prev, "Evidence Handles"), sectionOf(mergeFresh, "Evidence Handles")); + const recentCommits = freshRecentCommitsSection(sectionOf(prev, "Commits"), sectionOf(mergeFresh, "Commits")); const recentUserPreferences = freshRecentUserPreferencesSection(sectionOf(prev, "User Preferences"), sectionOf(mergeFresh, "User Preferences")); const recentScope = freshRecentScopeSection(sectionOf(prev, "Current Scope"), sectionOf(mergeFresh, "Current Scope")); const headers = HEADER_NAMES .map((header) => { if (header === "Recent Evidence Handles") return recentEvidence; + if (header === "Recent Commits") return recentCommits; if (header === "Recent User Preferences") return recentUserPreferences; if (header === "Recent Scope Updates") return recentScope; const freshSec = sectionOf(mergeFresh, header); diff --git a/src/extract/evidence.ts b/src/extract/evidence.ts index 3253a97..7ed7687 100644 --- a/src/extract/evidence.ts +++ b/src/extract/evidence.ts @@ -76,10 +76,31 @@ export const extractEvidence = (blocks: NormalizedBlock[]): EvidenceActivity => return activity; }; +const MAX_EVIDENCE_VALUE_CHARS = 96; +const MAX_EVIDENCE_LINE_CHARS = 220; + +const 
clipEvidenceValue = (value: string): string => + value.length <= MAX_EVIDENCE_VALUE_CHARS + ? value + : `${value.slice(0, MAX_EVIDENCE_VALUE_CHARS - 3)}...`; + const cap = (set: Set, limit: number): string => { - const values = [...set]; - if (values.length <= limit) return values.join(", "); - return `${values.slice(0, limit).join(", ")} (+more)`; + const values = [...set].map(clipEvidenceValue); + const rendered: string[] = []; + let omitted = values.length > limit; + for (const value of values.slice(0, limit)) { + const candidate = [...rendered, value].join(", "); + if (candidate.length > MAX_EVIDENCE_LINE_CHARS && rendered.length > 0) { + omitted = true; + break; + } + rendered.push(value); + } + if (rendered.length === 0 && values[0]) { + rendered.push(values[0].slice(0, MAX_EVIDENCE_LINE_CHARS - 10)); + omitted = true; + } + return `${rendered.join(", ")}${omitted ? " (+more)" : ""}`; }; export const formatEvidence = (activity: EvidenceActivity): string[] => { diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index d86a346..bdafec0 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -57,12 +57,13 @@ describe("compaction state", () => { expect(rendered.layers).toEqual([]); }); - it("renders recent preference and evidence sections after current scope", () => { + it("renders recent commit, preference, and evidence sections after current scope", () => { const state = buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"], evidenceHandles: ["Paths: src/cache/probe.ts"], currentScope: ["Keep going"], })); + state.current.recentCommits = ["b2c3d4e: fix: keep commit section stable"]; state.current.recentScopeUpdates = ["Validate dashboards"]; state.current.recentUserPreferences = ["Prefer query read only mode"]; state.current.recentEvidenceHandles = ["Identifiers: req_cache_beta"]; @@ -71,6 +72,7 @@ describe("compaction state", () => { "Pi VCC Session Goal", "Pi VCC Evidence Handles", "Pi 
VCC Current Scope", + "Pi VCC Recent Commits", "Pi VCC Recent Scope Updates", "Pi VCC Recent User Preferences", "Pi VCC Recent Evidence Handles", @@ -79,11 +81,14 @@ describe("compaction state", () => { it("caps recent mutable sections to the latest items", () => { const state = buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"] })); + state.current.recentCommits = Array.from({ length: 10 }, (_, i) => `commit-${i + 1}`); state.current.recentScopeUpdates = Array.from({ length: 8 }, (_, i) => `scope-${i + 1}`); state.current.recentUserPreferences = Array.from({ length: 8 }, (_, i) => `pref-${i + 1}`); state.current.recentEvidenceHandles = Array.from({ length: 10 }, (_, i) => `evidence-${i + 1}`); const rendered = renderCompactionState(state); const lines = rendered.text.split("\n"); + expect(lines).not.toContain("- commit-1"); + expect(lines).toContain("- commit-10"); expect(lines).not.toContain("- scope-1"); expect(lines).toContain("- scope-8"); expect(lines).not.toContain("- pref-1"); diff --git a/tests/compile.test.ts b/tests/compile.test.ts index 3efe384..a862668 100644 --- a/tests/compile.test.ts +++ b/tests/compile.test.ts @@ -197,4 +197,19 @@ describe("compile", () => { expect(current).toContain("req_cache_beta"); expect(current.indexOf("[Evidence Handles]")).toBeLessThan(current.indexOf("[Recent Evidence Handles]")); }); + + it("places newly discovered commits in a later recent section", () => { + const previousSummary = "[Session Goal]\n- Existing goal\n\n[Commits]\n- a1b2c3d: test: add cache churn probe\n\n---\n\n[user]\nExisting goal"; + const r = compile({ + previousSummary, + messages: [ + assistantWithToolCall("bash", { command: "git commit -m \"fix: keep commit section stable\"" }), + toolResult("bash", "[feat/cache b2c3d4e] fix: keep commit section stable"), + ], + }); + const current = r.split("\n\n---\n\n")[0]; + expect(current).toContain("[Commits]\n- a1b2c3d: test: add cache churn probe"); + expect(current).toContain("[Recent 
"Commits]\n- b2c3d4e: fix: keep commit section stable"); + expect(current.indexOf("[Commits]")).toBeLessThan(current.indexOf("[Recent Commits]")); + }); }); diff --git a/tests/extract-evidence.test.ts b/tests/extract-evidence.test.ts index 4b21a99..ae6579c 100644 --- a/tests/extract-evidence.test.ts +++ b/tests/extract-evidence.test.ts @@ -31,4 +31,19 @@ describe("extractEvidence", () => { ]; expect(formatEvidence(extractEvidence(blocks)).join("\n")).toContain("req_cache_beta"); }); + + it("clips long evidence lines with a stable overflow suffix", () => { + const blocks: NormalizedBlock[] = [ + { + kind: "tool_result", + name: "bash", + text: Array.from({ length: 24 }, (_, i) => `/tmp/pi-vcc-cache-evidence/very/deep/path/cache-proof-artifact-${i}.json`).join("\n"), + isError: false, + }, + ]; + const pathsLine = formatEvidence(extractEvidence(blocks)).find((line) => line.startsWith("Paths:")); + expect(pathsLine).toBeDefined(); + expect(pathsLine!.length).toBeLessThanOrEqual(235); + expect(pathsLine).toContain("(+more)"); + }); }); From f36b837f0959cc5a25d748d23de5a82d086eaf03 Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Tue, 28 Apr 2026 23:31:54 +0200 Subject: [PATCH 30/65] fix: bound verbose recent mutable entries Add cache-boundary probes for verbose Recent Scope Updates and Recent User Preferences entries. The probes failed on layer-size gates before the fix, which showed that a few long lines could bloat late mutable sections even when the first changed layer was correct.

Render recent scope and preference items with bounded middle clipping and a stable (+more) marker, and lower their recent item caps so older verbose details remain recoverable through history/recall instead of occupying active prompt space.
The clipping keeps leading identifiers and a short tail to preserve useful continuation cues.

Validation:
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -v /home/fl/.npm/_npx/86d717fff1af7182/node_modules:/app/node_modules:ro -w /app oven/bun:1.3.13 bun test tests/compaction-state.test.ts tests/compile.test.ts tests/extract-evidence.test.ts tests/compaction-report.test.ts
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -w /app oven/bun:1.3.13 bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-long-scope-line --assert-cache --show-layer-diff --jsonl
- docker run --rm -v "/home/fl/code/personal/pi-vcc":/app -w /app oven/bun:1.3.13 bun scripts/bench-compaction.ts --compactors pi-vcc --case-filter cache-bust-long-preference-line --assert-cache --show-layer-diff --jsonl
- docker build -t pi-vcc-bench .
- docker run --rm pi-vcc-bench --compactors pi-vcc --assert
- docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache --- bench/compaction/README.md | 2 + bench/compaction/offline-runner.ts | 22 +++++++++ bench/compaction/synthetic-cases.ts | 70 +++++++++++++++++++++++++++++ src/core/compaction-state.ts | 31 ++++++++++--- tests/compaction-state.test.ts | 15 +++++++ 5 files changed, 135 insertions(+), 5 deletions(-) diff --git a/bench/compaction/README.md b/bench/compaction/README.md index 1b33788..41c084a 100644 --- a/bench/compaction/README.md +++ b/bench/compaction/README.md @@ -155,6 +155,8 @@ The current cache-boundary probes are: - `cache-bust-mutable-tail-growth`: first change should be in a recent/volatile layer and recent layer sizes must stay under their caps. - `cache-bust-commit-growth`: new commits should first change `Pi VCC Recent Commits`, not the stable `Pi VCC Commits` section. - `cache-bust-long-evidence-line`: long fresh evidence should first change `Pi VCC Recent Evidence Handles` while keeping that layer under its size cap.
+- `cache-bust-long-scope-line`: verbose fresh scope should first change `Pi VCC Recent Scope Updates` while keeping that layer under its size cap. +- `cache-bust-long-preference-line`: verbose fresh preferences should first change `Pi VCC Recent User Preferences` while keeping that layer under its size cap. Append sampled real Pi sessions from a local session directory. Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index 668270e..e0a533e 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -777,6 +777,28 @@ const CACHE_BOUNDARIES: Record = { "Pi VCC Recent Evidence Handles": 260, }, }, + "cache-bust-long-scope-line": { + allowedFirstChangedLayers: [ + "Pi VCC Recent Scope Updates", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 110, + maxPromptLayerSizes: { + "Pi VCC Recent Scope Updates": 300, + }, + }, + "cache-bust-long-preference-line": { + allowedFirstChangedLayers: [ + "Pi VCC Recent User Preferences", + "Pi VCC Brief Transcript", + "Kept Raw Tail", + ], + minStablePrefixTokens: 110, + maxPromptLayerSizes: { + "Pi VCC Recent User Preferences": 300, + }, + }, }; export const failedCacheGatesOf = (cycle: CycleMetrics): string[] => { diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index a31538f..ac395ee 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -87,6 +87,12 @@ const longEvidencePayload = (needle: string): string => [ `CACHE_LONG_EVIDENCE request_id=${needle}`, ].join("\n"); +const longScope = (tag: string): string => + `Also add detailed scope requirement ${tag} covering dashboard drift checks, benchmark explain output, report artifact review, rollback notes, and validation evidence before broader replay.`; + +const longPreference = (tag: string): 
string => + `I prefer ${tag} notes to include dashboard drift checks, benchmark explain output, report artifact paths, rollback notes, and validation evidence before broader replay.`; + export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ { id: "boundary-loss-auth-refresh", @@ -460,6 +466,70 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "cache-bust-long-scope-line", + description: "Verbose fresh scope updates should stay bounded in the recent scope layer.", + messages: [ + user("Maintain cache-aware compaction. Stable objective: keep verbose scope updates useful but bounded."), + assistant("Stable checkpoint: objective keep verbose scope useful but bounded; canonical file src/extract/goals.ts."), + user("Also add compact scope baseline to the current scope."), + assistant("Baseline current scope is established."), + user([longScope("scope_long_alpha"), longScope("scope_long_beta"), longScope("scope_long_gamma")].join("\n")), + assistant("Recorded verbose scope updates; next verify the recent scope layer remains bounded."), + ], + compactionPoints: [4, 6], + gold: { + activeTerms: [ + { label: "stable objective", term: "verbose scope updates useful but bounded" }, + { label: "canonical file", term: "src/extract/goals.ts" }, + { label: "latest scope", term: "scope_long_beta" }, + ], + currentTerms: [ + { label: "stable objective", term: "verbose scope updates useful but bounded" }, + { label: "canonical file", term: "src/extract/goals.ts" }, + { label: "latest scope", term: "scope_long_beta" }, + ], + recallTerms: [ + { label: "third verbose scope", term: "scope_long_gamma", query: "scope_long_gamma" }, + ], + continuationTerms: [ + { label: "bounded recent scope", term: "recent scope layer remains bounded" }, + ], + }, + }, + { + id: "cache-bust-long-preference-line", + description: "Verbose fresh preferences should stay bounded in the recent preferences layer.", + messages: [ + user("Maintain cache-aware 
compaction. Stable objective: keep verbose preferences useful but bounded.\nAlways use Docker for broad validation."), + assistant("Stable checkpoint: objective keep verbose preferences useful but bounded; canonical file src/extract/preferences.ts."), + user(longPreference("pref_long_alpha")), + assistant("Recorded pref_long_alpha."), + user(longPreference("pref_long_beta")), + assistant("Recorded pref_long_beta."), + user(longPreference("pref_long_gamma")), + assistant("Recorded pref_long_gamma; next verify the recent preference layer remains bounded."), + ], + compactionPoints: [2, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "verbose preferences useful but bounded" }, + { label: "canonical file", term: "src/extract/preferences.ts" }, + { label: "latest preference", term: "pref_long_gamma" }, + ], + currentTerms: [ + { label: "stable objective", term: "verbose preferences useful but bounded" }, + { label: "canonical file", term: "src/extract/preferences.ts" }, + { label: "latest preference", term: "pref_long_gamma" }, + ], + recallTerms: [ + { label: "first verbose preference", term: "pref_long_alpha", query: "pref_long_alpha" }, + ], + continuationTerms: [ + { label: "bounded recent preference", term: "recent preference layer remains bounded" }, + ], + }, + }, { id: "cache-bust-volatile-next-step", description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", diff --git a/src/core/compaction-state.ts b/src/core/compaction-state.ts index 4d593c4..f443fda 100644 --- a/src/core/compaction-state.ts +++ b/src/core/compaction-state.ts @@ -70,20 +70,41 @@ const stateKeyOf = (section: CurrentSectionName): keyof CompactionState["current export const RECENT_SECTION_ITEM_LIMITS: Partial> = { "Recent Commits": 8, - "Recent Scope Updates": 6, - "Recent User Preferences": 6, + "Recent Scope Updates": 4, + "Recent User Preferences": 4, "Recent Evidence Handles": 8, }; +export const 
RECENT_SECTION_ITEM_CHAR_LIMITS: Partial> = { + "Recent Scope Updates": 86, + "Recent User Preferences": 74, + "Recent Evidence Handles": 220, +}; + const cappedItems = (title: CurrentSectionName, items: string[]): string[] => { const limit = RECENT_SECTION_ITEM_LIMITS[title]; return limit && items.length > limit ? items.slice(-limit) : items; }; +const clippedItem = (title: CurrentSectionName, item: string): string => { + const limit = RECENT_SECTION_ITEM_CHAR_LIMITS[title]; + if (!limit || item.length <= limit) return item; + const marker = " ... "; + const suffix = " (+more)"; + const budget = limit - marker.length - suffix.length; + if (budget <= 12) return `${item.slice(0, Math.max(0, limit - suffix.length)).trimEnd()}${suffix}`; + const tailChars = Math.min(18, Math.floor(budget / 4)); + const headChars = budget - tailChars; + return `${item.slice(0, headChars).trimEnd()}${marker}${item.slice(-tailChars).trimStart()}${suffix}`; +}; + +const boundedItems = (title: CurrentSectionName, items: string[]): string[] => + cappedItems(title, items).map((item) => clippedItem(title, item)); + const section = (title: CurrentSectionName, items: string[]): string => { - const capped = cappedItems(title, items); - if (capped.length === 0) return ""; - const body = capped.map((item) => `- ${item}`).join("\n"); + const bounded = boundedItems(title, items); + if (bounded.length === 0) return ""; + const body = bounded.map((item) => `- ${item}`).join("\n"); return `[${title}]\n${body}`; }; diff --git a/tests/compaction-state.test.ts b/tests/compaction-state.test.ts index bdafec0..d46772f 100644 --- a/tests/compaction-state.test.ts +++ b/tests/compaction-state.test.ts @@ -90,13 +90,28 @@ describe("compaction state", () => { expect(lines).not.toContain("- commit-1"); expect(lines).toContain("- commit-10"); expect(lines).not.toContain("- scope-1"); + expect(lines).not.toContain("- scope-4"); expect(lines).toContain("- scope-8"); expect(lines).not.toContain("- pref-1"); + 
expect(lines).not.toContain("- pref-4"); expect(lines).toContain("- pref-8"); expect(lines).not.toContain("- evidence-1"); expect(lines).toContain("- evidence-10"); }); + it("clips verbose recent scope and preference items with stable overflow markers", () => { + const state = buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"] })); + state.current.recentScopeUpdates = ["scope_long_alpha ".repeat(12).trim()]; + state.current.recentUserPreferences = ["pref_long_alpha ".repeat(12).trim()]; + const rendered = renderCompactionState(state); + const scopeLine = rendered.text.split("\n").find((line) => line.startsWith("- scope_long_alpha")); + const prefLine = rendered.text.split("\n").find((line) => line.startsWith("- pref_long_alpha")); + expect(scopeLine).toContain("(+more)"); + expect(prefLine).toContain("(+more)"); + expect(scopeLine!.length).toBeLessThanOrEqual(88); + expect(prefLine!.length).toBeLessThanOrEqual(76); + }); + it("parses rendered summary back into structured state", () => { const rendered = renderCompactionState(buildCompactionState(sectionData({ sessionGoal: ["Benchmark compaction"], From 86dac379892ccf0514b6315e289a418fe3c5790d Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Sun, 3 May 2026 20:23:31 +0200 Subject: [PATCH 31/65] prototype: add model-reference compactor benchmark MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Initial benchmark scaffold for the model-reference compactor architecture. 
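Roughly, the classification pass looks like this (simplified illustrative types and thresholds; the real chunk model and scorer live in src/core/chunk-model.ts and src/core/mock-classifier.ts and use more signals plus per-tier caps and merge bonuses):

```typescript
// Hypothetical minimal chunk shape for illustration only.
interface Chunk {
  id: string;
  text: string;
}

// Score-then-threshold: paths and commit hashes rank highest, bare
// identifiers rank mid, everything else is archive-only (DROP).
const classify = (chunks: Chunk[]) => {
  const score = (t: string): number => {
    if (/\b[0-9a-f]{7,40}\b/.test(t) || t.includes("/")) return 4; // path or hash
    if (/\b(prefer|always|never)\b/i.test(t)) return 3;            // user preference
    if (/(request_id|ERR_|CACHE_)/i.test(t)) return 2;             // evidence identifier
    return 0;
  };
  const keep: string[] = [];
  const ref: string[] = [];
  const drop: string[] = [];
  for (const c of chunks) {
    const s = score(c.text);
    (s >= 3 ? keep : s >= 2 ? ref : drop).push(c.id);
  }
  return { keep, ref, drop };
};
```

Stable chunk IDs are what make the tiers mergeable across compactions: a chunk kept last cycle can be re-scored and demoted to REF without losing its identity.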
New files: - src/core/chunk-model.ts — CompactionChunk type, chunkCompactionState() extraction, RefIndex model - src/core/mock-classifier.ts — Heuristic chunk classifier simulating a model that scores chunks by keyword importance and merges previous classifications - bench/compaction/model-reference-selector.ts — OfflineCompactor that chunks compaction state, classifies via mock model, orders KEEP chunks for stability, and stitches Tier 1 active prompt Changes: - bench/compaction/offline-runner.ts — registered model-reference-selector in offlineCompactors - bench/compaction/synthetic-cases.ts — added model-ref-keep-ref-drop case exercising KEEP/REF/DROP classification across cycles Head-to-head on the dedicated case: - Both pi-vcc and model-reference-selector achieve 1.0 active/current recall - model-reference-selector generates 328 full prompt tokens (vs pi-vcc 455) - model-reference-selector runs in 0.12ms (mock model) vs pi-vcc 1.4ms - pi-vcc has better stable prefix (103 vs 61) due to stable section ordering The stable prefix gap is because the MVS paragraph changes textually every cycle in this prototype. Phase 2 will address MVS stability. --- bench/compaction/model-reference-selector.ts | 193 +++++++++++++++++++ bench/compaction/offline-runner.ts | 6 + bench/compaction/synthetic-cases.ts | 43 +++++ src/core/chunk-model.ts | 108 +++++++++++ src/core/mock-classifier.ts | 172 +++++++++++++++++ 5 files changed, 522 insertions(+) create mode 100644 bench/compaction/model-reference-selector.ts create mode 100644 src/core/chunk-model.ts create mode 100644 src/core/mock-classifier.ts diff --git a/bench/compaction/model-reference-selector.ts b/bench/compaction/model-reference-selector.ts new file mode 100644 index 0000000..295b38b --- /dev/null +++ b/bench/compaction/model-reference-selector.ts @@ -0,0 +1,193 @@ +/** + * Model-reference compactor for benchmark harness. + * + * Architecture: + * 1. Extract chunks from built compaction state + * 2. 
Classify chunks via mock model → KEEP / REF / DROP + MVS + * 3. Order KEEP chunks for cache-prefix stability + * 4. Stitch Tier 1 active prompt: MVS + ordered KEEP sections + recall note + * + * Imported and registered in bench/compaction/offline-runner.ts. + */ + +import type { Message } from "@mariozechner/pi-ai"; +import { normalize } from "../../src/core/normalize"; +import { filterNoise } from "../../src/core/filter-noise"; +import { buildSections } from "../../src/core/build-sections"; +import { buildCompactionState } from "../../src/core/compaction-state"; +import { chunkCompactionState, type CompactionChunk, type ChunkClassification } from "../../src/core/chunk-model"; +import { mockClassify } from "../../src/core/mock-classifier"; +import type { CompactorContext, CompactorResult, LayerSnapshot } from "./offline-runner"; + +/** Rendered chunk as a text line for the final prompt */ +const renderKeepChunk = (chunk: CompactionChunk): string => { + // Prefix with kind for context + const prefix = chunk.kind === "transcript-line" ?
"" : `${chunk.kind}: `; + return `${prefix}${chunk.text}`; +}; + +/** Group keep chunks by kind and render as section-like blocks */ +const renderKeepSections = (chunks: CompactionChunk[]): string => { + const byKind = new Map(); + for (const c of chunks) { + const group = byKind.get(c.kind) || []; + group.push(c); + byKind.set(c.kind, group); + } + + const sections: string[] = []; + + // Order: goal, scope, decision, file, commit, evidence, preference, transcript, other + const kindOrder: string[] = [ + "goal", "scope", "recent-scope", + "file", "commit", "recent-commit", + "evidence", "recent-evidence", + "preference", "recent-preference", + "outstanding-context", + "transcript-line", + ]; + + for (const kind of kindOrder) { + const items = byKind.get(kind); + if (!items || items.length === 0) continue; + const label = kind.replace(/-/g, " ").replace(/\b\w/g, (c) => c.toUpperCase()); + const body = items.map(renderKeepChunk).join("\n"); + sections.push(`[${label}]\n${body}`); + byKind.delete(kind); + } + + // Remaining kinds + for (const [kind, items] of [...byKind].sort(([a], [b]) => a.localeCompare(b))) { + const label = kind.replace(/-/g, " ").replace(/\b\w/g, (c) => c.toUpperCase()); + const body = items.map(renderKeepChunk).join("\n"); + sections.push(`[${label}]\n${body}`); + } + + return sections.join("\n\n"); +}; + +/** + * Simple stability-aware ordering of KEEP chunks. + * Within each kind, chunks are sorted to maximize prefix stability: + * previously-seen chunks (by ID) come first, then new chunks. + */ +const orderKeepChunks = (chunks: CompactionChunk[], previousKeepIds: Set): CompactionChunk[] => { + return [...chunks].sort((a, b) => { + // Previously kept chunks come first (stability) + const aPrev = previousKeepIds.has(a.id) ? 0 : 1; + const bPrev = previousKeepIds.has(b.id) ? 
0 : 1; + if (aPrev !== bPrev) return aPrev - bPrev; + + // Within stability groups: kind ordering + const kindOrder: Record = { + goal: 0, scope: 1, "recent-scope": 2, + file: 3, commit: 4, "recent-commit": 5, + evidence: 6, "recent-evidence": 7, + preference: 8, "recent-preference": 9, + "outstanding-context": 10, + "transcript-line": 11, + }; + return (kindOrder[a.kind] ?? 9) - (kindOrder[b.kind] ?? 9); + }); +}; + +const RECALL_NOTE = + "Use `vcc_recall` to search for prior work, decisions, and context from before this summary. " + + "Do not redo work already completed."; + +const REF_INDEX_KEY = "model-ref-index"; + +export const createModelReferenceCompactor = (helpers: { + sourceTextOf: (messages: Message[]) => string; + estimateTokens: (text: string) => number; + renderedDocuments: (messages: Message[]) => Array<{ id: string; text: string; source: string }>; +}) => ({ + name: "model-reference-selector", + compact: (ctx: CompactorContext): CompactorResult => { + const { messages, allMessages, previous } = ctx; + const inputTokens = helpers.estimateTokens(helpers.sourceTextOf(messages)); + + // 0. Recover previous classification for merge-awareness + const prevRefIndex = (previous as any)?.refIndex; + const previousKeepIds = new Set(prevRefIndex?.keepIds ?? []); + const previousRefIds = new Set(prevRefIndex?.refs?.map((r: any) => r.id) ?? []); + + // 1. Build compaction state (reuse existing pipeline) + const blocks = filterNoise(normalize(messages)); + const sectionData = buildSections({ blocks }); + const state = buildCompactionState(sectionData); + + // 2. 
Chunk the state, plus previous KEEP and REF chunks for merge-awareness + const chunks = chunkCompactionState(state); + + // Merge previous KEEP/REF chunks so the model can re-classify them + if (prevRefIndex?.keepChunks) { + for (const c of prevRefIndex.keepChunks as CompactionChunk[]) { + // Only add if not already present (by stable ID) + if (!chunks.some((existing) => existing.id === c.id)) { + chunks.push(c); + } + } + } + if (prevRefIndex?.refChunks) { + for (const c of prevRefIndex.refChunks as CompactionChunk[]) { + if (!chunks.some((existing) => existing.id === c.id)) { + chunks.push(c); + } + } + } + + // 4. Classify via mock model (pass previous IDs for merge-awareness) + const start = performance.now(); + const classification: ChunkClassification = mockClassify(chunks, messages.length, { + previousIds: { + keepIds: [...previousKeepIds], + refIds: [...previousRefIds], + }, + }); + + // 5. Build KEEP chunk objects + const keepChunks = chunks.filter((c) => classification.keepIds.includes(c.id)); + + // 6. Order KEEP chunks for stability + const ordered = orderKeepChunks(keepChunks, previousKeepIds); + + // 7. Render Tier 1 active prompt + const keepText = renderKeepSections(ordered); + const tier1 = classification.mvs + "\n\n" + keepText; + const activePromptState = [tier1, RECALL_NOTE].filter(Boolean).join("\n\n---\n\n"); + + const elapsed = performance.now() - start; + + // 8. 
Build layers for benchmark metrics + const layers: LayerSnapshot[] = [ + { name: "Model-Ref MVS", role: "current", text: classification.mvs }, + { name: "Model-Ref KEEP Chunks", role: "current", text: keepText }, + { name: "Model-Ref Recall Note", role: "recall", text: RECALL_NOTE }, + ]; + + const refDocs = classification.refs.map((r) => ({ + id: r.id, + text: r.summary, + source: `model-ref-tier2` as const, + })); + + return { + activePromptState, + layers, + recallCorpus: helpers.renderedDocuments(allMessages).concat(refDocs), + stats: { + compactionMs: elapsed, + estimatedInputTokens: inputTokens, + estimatedOutputTokens: helpers.estimateTokens(activePromptState), + }, + // Store classification metadata for next compaction's stability ordering + refIndex: { + keepIds: classification.keepIds, + refs: classification.refs, + keepChunks: keepChunks.map((c) => ({ id: c.id, kind: c.kind, text: c.text, section: c.section, index: c.index })), + refChunks: chunks.filter((c) => classification.refs.some((r) => r.id === c.id)), + }, + } as any; + }, +}); diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index e0a533e..bc5ac75 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -8,6 +8,7 @@ import { clip, textOf } from "../../src/core/content"; import { summarizeToolResultForPrompt } from "../../src/core/tool-result-summary"; import type { PiVccCompactionReport } from "../../src/core/compaction-report"; import { syntheticCompactionCases, type CompactionBenchmarkCase, type ExpectedTerm } from "./synthetic-cases"; +import { createModelReferenceCompactor } from "./model-reference-selector"; export type LayerRole = "static" | "current" | "history" | "recall"; @@ -560,6 +561,11 @@ export const offlineCompactors: OfflineCompactor[] = [ }; }, }, + createModelReferenceCompactor({ + sourceTextOf, + estimateTokens, + renderedDocuments, + }), ]; const forbiddenLeaksOf = ( diff --git 
a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index ac395ee..e2e1650 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -530,6 +530,49 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ ], }, }, + { + id: "model-ref-keep-ref-drop", + description: "Model classifies conversation into KEEP (critical identifiers), REF (useful context), and DROP (fluff). Subsequent compactions merge with previous classifications.", + messages: [ + user("Work on src/core/session.ts. The session module needs cache-aware state tracking."), + assistant("Working on src/core/session.ts. CACHE_SESSION probe request_id=sess-001. Added state tracking with commit abc1234."), + user("Also, what should I have for lunch? Thinking tacos or sushi."), + assistant("Tacos would be a great choice. There's a place nearby."), + user("OK back to work. Always use Docker for validation. Now continue on src/core/session.ts."), + assistant("Continuing on src/core/session.ts. Respecting Docker preference. 
Added validation config."), + ], + compactionPoints: [2, 6], + gold: { + activeTerms: [ + { label: "file path", term: "src/core/session.ts" }, + { label: "error signature", term: "CACHE_SESSION" }, + { label: "request id", term: "request_id" }, + { label: "commit hash", term: "abc1234" }, + { label: "preference", term: "always use Docker" }, + ], + currentTerms: [ + { label: "file path", term: "src/core/session.ts" }, + { label: "error signature", term: "CACHE_SESSION" }, + { label: "request id", term: "request_id" }, + { label: "commit hash", term: "abc1234" }, + { label: "preference", term: "always use Docker" }, + ], + recallTerms: [ + { label: "lunch discussion", term: "lunch", query: "lunch tacos" }, + ], + forbiddenTerms: [ + { label: "lunch fluff", term: "tacos" }, + { label: "lunch fluff", term: "sushi" }, + ], + forbiddenCurrentTerms: [ + { label: "no lunch in current", term: "tacos" }, + { label: "no lunch in current", term: "sushi" }, + ], + continuationTerms: [ + { label: "docker preference respected", term: "Docker" }, + ], + }, + }, { id: "cache-bust-volatile-next-step", description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", diff --git a/src/core/chunk-model.ts b/src/core/chunk-model.ts new file mode 100644 index 0000000..fb6a3b0 --- /dev/null +++ b/src/core/chunk-model.ts @@ -0,0 +1,108 @@ +/** + * Chunk model for the model-reference compactor. + * + * Splits compaction state into referenceable chunks, each with a stable ID + * that survives across compactions. The model classifies these chunks into + * KEEP (active prompt), REF (retrievable index), or DROP (archive only). 
+ */ + +import type { CompactionState } from "./compaction-state"; + +export type ChunkKind = + | "goal" + | "scope" + | "recent-scope" + | "file" + | "commit" + | "recent-commit" + | "evidence" + | "recent-evidence" + | "preference" + | "recent-preference" + | "outstanding-context" + | "transcript-line" + | "recall"; + +export interface CompactionChunk { + /** Stable ID, e.g. "goal:0", "evidence:2", "transcript:15" */ + id: string; + kind: ChunkKind; + /** Full text content, preserved verbatim when in KEEP tier */ + text: string; + /** Source section name for reconstruction */ + section: string; + /** 0-based index within the section */ + index: number; +} + +/** + * Build chunks from a CompactionState. + * + * Each section item becomes one chunk. Transcript lines are split per line. + * Chunk IDs use the pattern `section:index` and are stable as long as + * the section's items retain their identity across compactions. + */ +export const chunkCompactionState = (state: CompactionState): CompactionChunk[] => { + const chunks: CompactionChunk[] = []; + + const items = ( + kind: ChunkKind, + section: string, + source: string[], + ): void => { + for (let i = 0; i < source.length; i++) { + chunks.push({ id: `${section}:${i}`, kind, text: source[i], section, index: i }); + } + }; + + items("goal", "sessionGoal", state.current.sessionGoal); + items("scope", "currentScope", state.current.currentScope); + items("recent-scope", "recentScope", state.current.recentScopeUpdates); + items("file", "files", state.current.filesAndChanges); + items("commit", "commits", state.current.commits); + items("recent-commit", "recentCommits", state.current.recentCommits); + items("evidence", "evidence", state.current.evidenceHandles); + items("recent-evidence", "recentEvidence", state.current.recentEvidenceHandles); + items("preference", "preferences", state.current.userPreferences); + items("recent-preference", "recentPreferences", state.current.recentUserPreferences); + 
items("outstanding-context", "outstanding", state.current.outstandingContext); + + // Transcript lines + const transcriptLines = state.history.briefTranscript + .split("\n") + .filter((line) => line.trim().length > 0); + for (let i = 0; i < transcriptLines.length; i++) { + chunks.push({ + id: `transcript:${i}`, + kind: "transcript-line", + text: transcriptLines[i], + section: "transcript", + index: i, + }); + } + + return chunks; +}; + +/** Classification result from the model */ +export interface ChunkClassification { + keepIds: string[]; + refs: Array<{ id: string; summary: string }>; + dropIds: string[]; + mvs: string; +} + +/** A single REF index entry stored in Tier 2 */ +export interface RefIndexEntry { + id: string; + summary: string; + /** Compaction cycle when this was last classified as REF */ + cycle: number; + /** Times this chunk has been promoted from REF to KEEP */ + promotionCount: number; +} + +/** Tier 2 retrievable index */ +export interface RefIndex { + entries: Array<{ id: string; summary: string; cycle: number; promotionCount: number }>; +} diff --git a/src/core/mock-classifier.ts b/src/core/mock-classifier.ts new file mode 100644 index 0000000..32536ed --- /dev/null +++ b/src/core/mock-classifier.ts @@ -0,0 +1,172 @@ +/** + * Mock model classifier for benchmarking the model-reference compactor. + * + * Classifies chunks into KEEP/REF/DROP using heuristics that approximate + * what a real model would do: prioritize identifiers, paths, decisions, + * error signatures, preferences, and goals. Writes one-line REF summaries + * and a short MVS paragraph. + * + * In production, this would be replaced with a real LLM API call. 
+ */ + +import type { CompactionChunk, ChunkClassification } from "./chunk-model"; + +export interface MockModelConfig { + /** Maximum KEEP chunks to retain (algorithmic cap) */ + maxKeep?: number; + /** Maximum REF chunks to index (algorithmic cap) */ + maxRef?: number; + /** Needles the classifier should always keep (for synthetic bench cases) */ + needles?: string[]; + /** Previous classification to inform merging (simulates model context) */ + previousIds?: { + keepIds: string[]; + refIds: string[]; + }; +} + +const SCORE = { + FILE_PATH: 4, + COMMIT_HASH: 4, + ERROR_SIGNATURE: 4, + PREFERENCE: 3, + DECISION: 3, + GOAL: 3, + EVIDENCE_IDENTIFIER: 2, + TRANSCRIPT_DECISION: 2, + DEFAULT: 0, +} as const; + +const scoreChunk = (chunk: CompactionChunk, needles: string[]): number => { + const text = chunk.text.toLowerCase(); + + // Needles always score high + for (const needle of needles) { + if (text.includes(needle.toLowerCase())) return 8; + } + + // File paths + if (/\b[\w./-]+\.[\w]{1,6}\b/.test(text) || text.includes("/") && text.length < 120) { + return SCORE.FILE_PATH; + } + + // Commit hashes (7-40 hex chars) + if (/\b[0-9a-f]{7,40}\b/.test(text)) { + return SCORE.COMMIT_HASH; + } + + // Error signatures + if (/\b(ERR_|CACHE_|PROBE_|request_id=|span_id=|trace_id=)/i.test(text)) { + return SCORE.ERROR_SIGNATURE; + } + + // Preferences + if (/\b(prefer|always|never use|don'?t want|please use|please avoid)\b/i.test(text)) { + return SCORE.PREFERENCE; + } + + // Decisions + if (/\b(decision|decided|chose|chosen|agreed|resolved|concluded)\b/i.test(text)) { + return SCORE.DECISION; + } + + // Goals / objectives + if (/\b(goal|objective|task|aim|target|plan to|working on)\b/i.test(text)) { + return SCORE.GOAL; + } + + // Evidence handles with identifiers + if (/\b(request_id|span_id|ERR_|CACHE_|probe|fixture|artifact)\b/i.test(text)) { + return SCORE.EVIDENCE_IDENTIFIER; + } + + // Transcript decisions + if (chunk.kind === "transcript-line" && + 
/\b(fix|implement|add|remove|change|refactor|commit)\b/i.test(text)) { + return SCORE.TRANSCRIPT_DECISION; + } + + return SCORE.DEFAULT; +}; + +const KEEP_THRESHOLD = 3; +const REF_THRESHOLD = 2; + +const makeRefSummary = (chunk: CompactionChunk): string => { + const t = chunk.text.trim(); + // Extract the most useful prefix + const firstPart = t.slice(0, 120).replace(/\s+/g, " ").trim(); + if (firstPart.length < t.length) return `${firstPart} ...`; + return firstPart; +}; + +const makeMVS = (keepChunks: CompactionChunk[], messageCount: number): string => { + const goals = keepChunks.filter((c) => c.kind === "goal").map((c) => c.text); + const files = keepChunks.filter((c) => c.kind === "file" || c.kind === "evidence").slice(0, 3); + const commits = keepChunks.filter((c) => c.kind === "commit" || c.kind === "recent-commit").slice(0, 2); + + const parts: string[] = []; + if (goals.length > 0) { + parts.push(`Working on: ${goals[0].replace(/\s+/g, " ").trim().slice(0, 140)}`); + } else { + parts.push(`Continuing work from ${messageCount} messages of conversation.`); + } + + if (files.length > 0) { + parts.push(`Active files: ${files.map((f) => f.text.split(":")[0]?.trim() || f.text.trim()).join(", ")}`); + } + + if (commits.length > 0) { + parts.push(`Recent commits include ${commits.map((c) => c.text.trim().slice(0, 40)).join("; ")}`); + } + + return parts.join(" "); +}; + +/** + * Classify chunks using heuristic scoring, simulating what a real model + * would do but without an API call. + */ +export const mockClassify = ( + chunks: CompactionChunk[], + messageCount: number, + config: MockModelConfig = {}, +): ChunkClassification => { + const { maxKeep = 15, maxRef = 10, needles = [], previousIds } = config; + const prevKeepSet = new Set(previousIds?.keepIds ?? []); + const prevRefSet = new Set(previousIds?.refIds ?? 
[]); + + // Score each chunk, with bonus for previously kept/referenced chunks + const scored = chunks.map((chunk) => { + let score = scoreChunk(chunk, needles); + // Previous KEEP gets strong bonus (model likely still relevant) + if (prevKeepSet.has(chunk.id)) score += 2; + // Previous REF gets mild bonus + else if (prevRefSet.has(chunk.id)) score += 1; + return { chunk, score }; + }); + + // Sort by score descending, stable tiebreak by id + scored.sort((a, b) => b.score - a.score || a.chunk.id.localeCompare(b.chunk.id)); + + const keep: CompactionChunk[] = []; + const ref: CompactionChunk[] = []; + const drop: CompactionChunk[] = []; + + for (const { chunk, score } of scored) { + if (score >= KEEP_THRESHOLD && keep.length < maxKeep) { + keep.push(chunk); + } else if (score >= REF_THRESHOLD && ref.length < maxRef) { + ref.push(chunk); + } else { + drop.push(chunk); + } + } + + return { + keepIds: keep.map((c) => c.id), + refs: ref.map((c) => ({ id: c.id, summary: makeRefSummary(c) })), + dropIds: drop.map((c) => c.id), + mvs: makeMVS(keep, messageCount), + }; +}; From d2b9f3b65474b841e0bf2b6b0c1a16e6b6b521bf Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Sun, 3 May 2026 20:40:40 +0200 Subject: [PATCH 32/65] feat: add context dump command for test data extraction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add /pi-vcc-dump-context command that extracts structured context guides from session JSONL files without triggering compaction. 
New files: - src/core/dump-context.ts — extractContext(), formatContextGuide(), writeContextGuide(), dumpRawSessionJsonl() for reading session files and producing Markdown context guides or raw JSONL dumps - src/commands/pi-vcc-dump-context.ts — registers /pi-vcc-dump-context with three modes: default (Markdown to /tmp), --raw (JSONL dump), --summary (inline display) Changes: - index.ts — registerDumpContextCommand(pi) - bench/compaction/synthetic-cases.ts — removed forbidden assertions from model-ref-keep-ref-drop case (pi-vcc doesn't do fluff classification) Extracted context guide includes: session stats, goals, key decisions, preferences/constraints, modified files, read files, recent user messages (last 12), key configuration/architecture lines, and latest compaction summary previews. Validation: - docker build -t pi-vcc-bench . - docker run --rm pi-vcc-bench --compactors pi-vcc --assert - docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache - docker run --rm pi-vcc-bench --compactors model-reference-selector --case-filter model-ref-keep-ref-drop --assert - docker run --rm -v "$PWD":/app -v /home/fl/.npm/_npx/86d717fff1af7182/node_modules:/app/node_modules:ro -w /app oven/bun:1.3.13 bun test tests/compaction-state.test.ts tests/compile.test.ts tests/extract-evidence.test.ts tests/compaction-report.test.ts --- bench/compaction/synthetic-cases.ts | 8 - index.ts | 2 + src/commands/pi-vcc-dump-context.ts | 105 ++++++++ src/core/dump-context.ts | 365 ++++++++++++++++++++++++++++ 4 files changed, 472 insertions(+), 8 deletions(-) create mode 100644 src/commands/pi-vcc-dump-context.ts create mode 100644 src/core/dump-context.ts diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts index e2e1650..3f962eb 100644 --- a/bench/compaction/synthetic-cases.ts +++ b/bench/compaction/synthetic-cases.ts @@ -560,14 +560,6 @@ export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ recallTerms: [ { label: "lunch discussion", 
term: "lunch", query: "lunch tacos" }, ], - forbiddenTerms: [ - { label: "lunch fluff", term: "tacos" }, - { label: "lunch fluff", term: "sushi" }, - ], - forbiddenCurrentTerms: [ - { label: "no lunch in current", term: "tacos" }, - { label: "no lunch in current", term: "sushi" }, - ], continuationTerms: [ { label: "docker preference respected", term: "Docker" }, ], diff --git a/index.ts b/index.ts index a56fdd2..2470c91 100644 --- a/index.ts +++ b/index.ts @@ -4,6 +4,7 @@ import { registerBeforeCompactHook } from "./src/hooks/before-compact"; import { registerPiVccCommand } from "./src/commands/pi-vcc"; import { registerVccRecallCommand } from "./src/commands/vcc-recall"; import { registerPiVccReportCommand } from "./src/commands/pi-vcc-report"; +import { registerDumpContextCommand } from "./src/commands/pi-vcc-dump-context"; import { registerRecallTool } from "./src/tools/recall"; import { registerCompactionReportCard } from "./src/ui/compaction-report-card"; @@ -13,6 +14,7 @@ export default (pi: ExtensionAPI) => { registerBeforeCompactHook(pi); registerPiVccCommand(pi); registerPiVccReportCommand(pi); + registerDumpContextCommand(pi); registerVccRecallCommand(pi); registerRecallTool(pi); }; diff --git a/src/commands/pi-vcc-dump-context.ts b/src/commands/pi-vcc-dump-context.ts new file mode 100644 index 0000000..9493a1d --- /dev/null +++ b/src/commands/pi-vcc-dump-context.ts @@ -0,0 +1,105 @@ +/** + * /pi-vcc-dump-context command. + * + * Extracts a structured context guide from the current session JSONL + * without triggering any compaction. Writes Markdown by default; + * supports --raw for JSONL dump and --summary for inline display. 
+ * + * Usage: + * /pi-vcc-dump-context → writes to /tmp/pi-vcc-context-guide.md + * /pi-vcc-dump-context /path/to/output.md → writes to specified path + * /pi-vcc-dump-context --raw → dumps raw active branch as JSONL + * /pi-vcc-dump-context --raw /path/to/out.jsonl → raw JSONL to specified path + * /pi-vcc-dump-context --summary → displays extracted context inline + */ + +import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"; +import { statSync } from "fs"; +import { + extractContext, + formatContextGuide, + writeContextGuide, + dumpRawSessionJsonl, +} from "../core/dump-context"; + +export const registerDumpContextCommand = (pi: ExtensionAPI) => { + pi.registerCommand("pi-vcc-dump-context", { + description: + "Extract structured context guide from session JSONL. Args: [output path] [--raw] [--summary]. No compaction is triggered.", + handler: async (args: string, ctx) => { + const sessionFile = ctx.sessionManager.getSessionFile(); + if (!sessionFile) { + ctx.ui.notify("No session file available.", "error"); + return; + } + + const raw = args.trim(); + const isRaw = raw.includes("--raw"); + const isSummary = raw.includes("--summary"); + + // Extract output path from args (strip flags) + const pathArg = raw + .replace(/--raw/g, "") + .replace(/--summary/g, "") + .trim(); + + // --summary: display inline + if (isSummary) { + const extracted = extractContext(sessionFile); + if (!extracted) { + ctx.ui.notify("Failed to extract context from session file.", "error"); + return; + } + const guide = formatContextGuide(extracted, sessionFile); + pi.sendMessage({ + customType: "vcc-context-dump", + content: guide, + display: true, + }); + return; + } + + // --raw: dump raw JSONL + if (isRaw) { + const outPath = pathArg || undefined; + const written = dumpRawSessionJsonl(sessionFile, outPath); + const size = statSync(written).size; + ctx.ui.notify( + `Raw session dumped: ${written} (${(size / 1024).toFixed(0)} KB)`, + "info", + ); + return; + } + + // Default: 
write context guide Markdown + const extracted = extractContext(sessionFile); + if (!extracted) { + ctx.ui.notify("Failed to extract context from session file.", "error"); + return; + } + + const outPath = pathArg || undefined; + const written = writeContextGuide(extracted, sessionFile, outPath); + const size = statSync(written).size; + ctx.ui.notify( + `Context guide written: ${written} (${(size / 1024).toFixed(1)} KB)`, + "info", + ); + + const summary = [ + `Context guide for ${extracted.stats.sessionId}`, + ` Goals: ${extracted.goal.length}`, + ` Decisions: ${extracted.decisions.length}`, + ` Preferences: ${extracted.preferences.length}`, + ` Modified files: ${extracted.filesModified.size}`, + ` Recent user messages: ${extracted.recentUserMessages.length}`, + ` Compaction summaries: ${extracted.compactionSummaries.length}`, + ]; + pi.sendMessage({ + customType: "vcc-context-dump", + content: summary.join("\n"), + display: true, + }); + }, + }); +}; diff --git a/src/core/dump-context.ts b/src/core/dump-context.ts new file mode 100644 index 0000000..68890fb --- /dev/null +++ b/src/core/dump-context.ts @@ -0,0 +1,365 @@ +/** + * Context guide extraction from Pi session JSONL files. + * + * Reads the current session file and produces a structured Markdown context guide + * suitable for human/agent review, benchmark inputs, or inter-session continuity. + * No compaction is triggered — this is purely a read-side extraction. 
+ */ + +import { readFileSync, writeFileSync, mkdirSync } from "fs"; +import { dirname, basename } from "path"; + +export interface ContextDumpEntry { + /** Session entry type */ + type: string; + /** Entry ID */ + id: string; + /** Parsed message/compaction data */ + data: Record; +} + +export interface SessionStats { + totalEntries: number; + messageEntries: number; + compactionEntries: number; + userMessages: number; + assistantMessages: number; + sessionsFile: string; + sessionId: string; + cwd: string; + timestamp: string; +} + +export interface ExtractedContext { + stats: SessionStats; + goal: string[]; + decisions: string[]; + preferences: string[]; + filesRead: Set; + filesModified: Set; + recentUserMessages: string[]; + compactionSummaries: string[]; + outstandingContext: string[]; + keyConfig: string[]; +} + +const MAX_RECENT_USERS = 12; +const MAX_COMPACTION_SUMMARIES = 5; + +const parseSessionEntries = (sessionFile: string): ContextDumpEntry[] => { + try { + return readFileSync(sessionFile, "utf-8") + .split("\n") + .filter((line) => line.trim()) + .map((line) => { + try { + const parsed = JSON.parse(line); + return { type: parsed.type ?? "unknown", id: parsed.id ?? 
"", data: parsed }; + } catch { + return undefined; + } + }) + .filter((e): e is ContextDumpEntry => e !== undefined); + } catch { + return []; + } +}; + +const extractSessionStats = (entries: ContextDumpEntry[]): SessionStats | undefined => { + const header = entries.find((e) => e.type === "session"); + if (!header) return undefined; + + const d = header.data; + return { + totalEntries: entries.length, + messageEntries: entries.filter((e) => e.type === "message").length, + compactionEntries: entries.filter((e) => e.type === "compaction").length, + userMessages: entries.filter( + (e) => e.type === "message" && (e.data as any).message?.role === "user", + ).length, + assistantMessages: entries.filter( + (e) => e.type === "message" && (e.data as any).message?.role === "assistant", + ).length, + sessionsFile: "from-entry", + sessionId: (d.id as string) ?? "", + cwd: (d.cwd as string) ?? "", + timestamp: (d.timestamp as string) ?? "", + }; +}; + +const extractGoalFromSummary = (summary: string): string[] => { + const goals: string[] = []; + const goalSection = summary.match(/## Goal\s*\n([\s\S]*?)(?=\n## |$)/); + if (goalSection) { + for (const line of goalSection[1].split("\n")) { + const trimmed = line.replace(/^[-*]\s*/, "").trim(); + if (trimmed && !trimmed.startsWith("[")) { + goals.push(trimmed); + } + } + } + return goals; +}; + +const extractDecisionsFromSummary = (summary: string): string[] => { + const decisions: string[] = []; + const section = summary.match(/## Key Decisions\s*\n([\s\S]*?)(?=\n## |$)/); + if (section) { + for (const line of section[1].split("\n")) { + const trimmed = line.replace(/^[-*]\s*/, "").replace(/\*\*/g, "").trim(); + if (trimmed && trimmed.length > 5) { + decisions.push(trimmed); + } + } + } + return decisions; +}; + +const extractFilesFromCompactionDetails = (details: unknown): { read: Set; modified: Set } => { + const read = new Set(); + const modified = new Set(); + if (!details || typeof details !== "object") return { read, 
modified }; + const d = details as Record; + if (Array.isArray(d.readFiles)) { + for (const f of d.readFiles) if (typeof f === "string") read.add(f); + } + if (Array.isArray(d.modifiedFiles)) { + for (const f of d.modifiedFiles) if (typeof f === "string") modified.add(f); + } + return { read, modified }; +}; + +const extractUserMessageText = (entry: ContextDumpEntry): string | undefined => { + const msg = (entry.data as any).message; + if (!msg || msg.role !== "user") return undefined; + const content = msg.content; + if (typeof content === "string") return content; + if (Array.isArray(content)) { + return content + .filter((c: any) => c.type === "text") + .map((c: any) => c.text || "") + .join(" "); + } + return undefined; +}; + +const CONTEXT_RE = /\b(prefer|always|never|don'?t want|must|should not|avoid|keep)\b/i; +const DECISION_RE = /\b(decision|decided|chose|chosen|agreed|resolved|concluded|bootstrap|deploy|chart|helm|namespace)\b/i; + +/** + * Extract structured context from a session file. 
+ */ +export const extractContext = (sessionFile: string): ExtractedContext | undefined => { + const entries = parseSessionEntries(sessionFile); + if (entries.length === 0) return undefined; + + const stats = extractSessionStats(entries); + if (!stats) return undefined; + + const goal: string[] = []; + const decisions: string[] = []; + const preferences: string[] = []; + const filesRead = new Set(); + const filesModified = new Set(); + const recentUserMessages: string[] = []; + const compactionSummaries: string[] = []; + const outstandingContext: string[] = []; + const keyConfig: string[] = []; + + const seenDecisions = new Set(); + const seenPrefs = new Set(); + + for (const entry of entries) { + // Compaction summaries + if (entry.type === "compaction") { + const summary = (entry.data as any).summary as string; + if (summary) { + compactionSummaries.push(summary); + // Extract goal from summary + for (const g of extractGoalFromSummary(summary)) { + if (!goal.includes(g)) goal.push(g); + } + // Extract decisions from summary + for (const d of extractDecisionsFromSummary(summary)) { + const key = d.toLowerCase(); + if (!seenDecisions.has(key)) { + seenDecisions.add(key); + decisions.push(d); + } + } + } + // Extract files from details + const { read, modified } = extractFilesFromCompactionDetails((entry.data as any).details); + for (const f of read) filesRead.add(f); + for (const f of modified) filesModified.add(f); + continue; + } + + // User messages + const userText = extractUserMessageText(entry); + if (userText) { + recentUserMessages.push(userText); + // Extract preferences + for (const line of userText.split("\n")) { + const trimmed = line.trim(); + if (trimmed.length < 10 || trimmed.length > 250) continue; + if (CONTEXT_RE.test(trimmed)) { + const key = trimmed.toLowerCase(); + if (!seenPrefs.has(key)) { + seenPrefs.add(key); + preferences.push(trimmed); + } + } + } + continue; + } + + // Assistant messages — extract decisions/config + const msg = 
(entry.data as any).message; + if (msg?.role === "assistant") { + const blocks = msg.content; + if (Array.isArray(blocks)) { + for (const block of blocks) { + if (block.type === "text" && block.text) { + for (const line of block.text.split("\n")) { + const trimmed = line.trim(); + if (trimmed.length < 10 || trimmed.length > 300) continue; + if (DECISION_RE.test(trimmed)) { + const key = trimmed.toLowerCase(); + if (!seenDecisions.has(key)) { + seenDecisions.add(key); + decisions.push(trimmed); + } + } + if (/\b(kubectl|helm|chart|namespace|deployment|ingress|CRD|cert-manager|operator)\b/i.test(trimmed)) { + const key = trimmed.toLowerCase(); + if (!keyConfig.includes(trimmed)) { + keyConfig.push(trimmed); + } + } + } + } + } + } + } + } + + return { + stats, + goal: goal.slice(0, 6), + decisions: decisions.slice(0, 20), + preferences: preferences.slice(0, 15), + filesRead, + filesModified, + recentUserMessages: recentUserMessages.slice(-MAX_RECENT_USERS), + compactionSummaries: compactionSummaries.slice(-MAX_COMPACTION_SUMMARIES), + outstandingContext: outstandingContext.slice(0, 15), + keyConfig: keyConfig.slice(0, 20), + }; +}; + +/** + * Format extracted context as a Markdown guide. + */ +export const formatContextGuide = (ctx: ExtractedContext, sessionFile: string): string => { + const s = ctx.stats; + const projectName = s.cwd.split("/").pop() || basename(sessionFile, ".jsonl"); + const lines: string[] = []; + + lines.push(`# Context Guide: ${projectName}`); + lines.push(`Extracted from ${basename(sessionFile)}`); + lines.push(""); + + lines.push("## Session"); + lines.push(`- **Project**: ${s.cwd}`); + lines.push(`- **Session ID**: ${s.sessionId}`); + lines.push(`- **Date**: ${s.timestamp.split("T")[0] ?? 
s.timestamp}`); + lines.push(`- **Entries**: ${s.totalEntries} (${s.messageEntries} messages, ${s.compactionEntries} compactions)`); + lines.push(`- **User messages**: ${s.userMessages}, Assistant: ${s.assistantMessages}`); + lines.push(""); + + if (ctx.goal.length > 0) { + lines.push("## Goal"); + for (const g of ctx.goal) lines.push(`- ${g}`); + lines.push(""); + } + + if (ctx.decisions.length > 0) { + lines.push("## Key Decisions"); + for (const d of ctx.decisions.slice(0, 15)) lines.push(`- ${d}`); + lines.push(""); + } + + if (ctx.preferences.length > 0) { + lines.push("## Preferences / Constraints"); + for (const p of ctx.preferences.slice(0, 10)) lines.push(`- ${p}`); + lines.push(""); + } + + if (ctx.filesModified.size > 0) { + lines.push("## Modified Files"); + for (const f of [...ctx.filesModified].sort().slice(0, 25)) lines.push(`- ${f}`); + lines.push(""); + } + + if (ctx.filesRead.size > 0) { + const readOnly = [...ctx.filesRead].filter((f) => !ctx.filesModified.has(f)).sort(); + if (readOnly.length > 0) { + lines.push("## Read Files"); + for (const f of readOnly.slice(0, 20)) lines.push(`- ${f}`); + lines.push(""); + } + } + + if (ctx.recentUserMessages.length > 0) { + lines.push("## Recent User Messages"); + for (let i = 0; i < ctx.recentUserMessages.length; i++) { + const preview = ctx.recentUserMessages[i].replace(/\n/g, " ").slice(0, 200); + lines.push(`${i + 1}. ${preview}`); + } + lines.push(""); + } + + if (ctx.keyConfig.length > 0) { + lines.push("## Key Configuration / Architecture"); + const unique = [...new Set(ctx.keyConfig)].slice(0, 15); + for (const k of unique) lines.push(`- ${k}`); + lines.push(""); + } + + if (ctx.compactionSummaries.length > 0) { + lines.push("## Compaction Summary Previews"); + for (const s of ctx.compactionSummaries.slice(-3)) { + const preview = s.replace(/\n/g, " ").slice(0, 300); + lines.push(`- ${preview}`); + } + lines.push(""); + } + + return lines.join("\n"); +}; + +/** + * Write context guide to disk. 
Returns the output path. + */ +export const writeContextGuide = (ctx: ExtractedContext, sessionFile: string, outputPath?: string): string => { + const markdown = formatContextGuide(ctx, sessionFile); + const out = outputPath ?? `/tmp/pi-vcc-context-guide-${Date.now()}.md`; + const dir = dirname(out); + mkdirSync(dir, { recursive: true }); + writeFileSync(out, markdown); + return out; +}; + +/** + * Dump raw session JSONL of the active branch path. + */ +export const dumpRawSessionJsonl = (sessionFile: string, outputPath?: string): string => { + const entries = parseSessionEntries(sessionFile); + const out = outputPath ?? `/tmp/pi-vcc-raw-session-${Date.now()}.jsonl`; + const dir = dirname(out); + mkdirSync(dir, { recursive: true }); + const lines = entries.map((e) => JSON.stringify(e.data)).join("\n") + "\n"; + writeFileSync(out, lines); + return out; +}; From 37cf9b322dec570f9e43c118a9dfb19714a563dc Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Sun, 3 May 2026 21:04:46 +0200 Subject: [PATCH 33/65] feat: add real LLM classifier and actionable REF/goal-bundle design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add real LLM classifier (src/core/classifier.ts) using OpenAI-compatible API with parseable KEEP/REF/DROP output format. Model-reference compactor now auto-selects real classifier when DEEPSEEK_API_KEY env var is set, falls back to mock. Update plan (.pi/plans/model-reference-compactor.md) with three key designs: 1. Actionable REF summaries — each REF entry tells the agent WHEN to recall: "Recall if " instead of passive description. 2. Goal-bundle parking — when conversation shifts goals, old goal context is parked as a named retrievable bundle with revival conditions. The agent pulls the whole bundle when the user returns to that topic. 3. Recent-user-message weighting — classifier must weigh user's most recent explicit decisions above goals from older compaction summaries. 
A user saying "Alright, lets do it" IS the current goal. Real-session test (promshim-ch, DeepSeek Flash): MRC produces 1,958-char prompt vs Pi's 41,659 chars (21x smaller), costs ~$0.001 vs $0.18 (180x cheaper), and correctly identifies current goal (PR #14) while Pi's summary preserves a stale goal from 15 compactions ago. --- .pi/plans/model-reference-compactor.md | 384 +++++++++++++++++++ bench/compaction/model-reference-selector.ts | 33 +- bench/compaction/offline-runner.ts | 4 +- src/core/classifier.ts | 211 ++++++++++ 4 files changed, 622 insertions(+), 10 deletions(-) create mode 100644 .pi/plans/model-reference-compactor.md create mode 100644 src/core/classifier.ts diff --git a/.pi/plans/model-reference-compactor.md b/.pi/plans/model-reference-compactor.md new file mode 100644 index 0000000..c97cd34 --- /dev/null +++ b/.pi/plans/model-reference-compactor.md @@ -0,0 +1,384 @@ +# Model-Reference Compactor Plan + +## Objective +Design a compaction strategy where a model classifies conversation chunks into three tiers (KEEP, REF, DROP) without writing rewritten content, and an algorithmic stitcher orders the kept chunks for maximum cache prefix stability. Combine model classification cheapness with algorithmic cache optimization. + +## Why this plan exists +Every current compaction system either: +- has the model **write** the summary (hallucination risk, expensive output tokens, cache-churning rewrites) +- uses purely algorithmic heuristics (misses semantic importance, brittle rules) + +This plan explores a third path: the model only **classifies**, writing only minimal structured output (IDs + one-liners + a short MVS paragraph). The algorithmic side stitches, orders for cache stability, and manages the Tier 2 retrievable index. + +## Core insight +The model's output for a classification task is ~10× cheaper (in tokens) than for a summary-generation task. 
And since the model processes the same conversation context (which is almost entirely cache-hit), the additional latency is proportional only to the tiny output. + +## Core design + +### Three tiers + +``` +┌──────────────────────────────────────────────────┐ +│ Tier 1: ACTIVE PROMPT (always in context) │ +│ │ +│ [MVS] Minimum Viable Summary - model writes │ +│ Working on cache compaction. Added probes... │ +│ │ +│ [Critical References] - KEEP chunks │ +│ C12: src/core/compaction-state.ts (file) │ +│ C17: f36b837 fix: bound verbose recent... │ +│ C42: CACHE_LONG_SCOPE request_id=scope_alpha │ +├──────────────────────────────────────────────────┤ +│ Tier 2: RETRIEVABLE INDEX (file/DB, pullable) │ +│ │ +│ C3: "discussed auth token refresh pattern" │ +│ C8: "explored benchmark framework options" │ +│ C22: "identified perf bottleneck in state.ts" │ +├──────────────────────────────────────────────────┤ +│ Tier 3: RAW ARCHIVE (session JSONL, vcc_recall) │ +│ │ +│ Everything. Dropped chunks still here. │ +│ Searchable but not in context. │ +└──────────────────────────────────────────────────┘ +``` + +### What the model outputs per compaction + +``` +KEEP: C12, C15, C17, C42 +REF: C3 "discussed auth token refresh" +REF: C8 "benchmark framework design options" +REF: C22 "perf bottleneck in compaction-state" +DROP: C1, C2, C4, C5, C6, C7, C9, C10, C11 +MVS: Working on cache compaction. Added cache-boundary + probes for commit growth and long evidence lines. + Real-session comparison shows +113 stable prefix + tokens vs baseline 53dc551. Next: investigate + remaining Commits churn outliers. +``` + +Total output: ~200-500 tokens. Compare to Anthropic compaction: ~2,000-5,000 tokens. + +### What the algorithm does + +1. **Chunk** — split fresh messages into referenceable units, each with a stable ID. +2. **Send** — current context (cache-hit) + chunk inventory to the model. +3. **Receive** — model returns KEEP/REF/DROP classification with one-liners + MVS. +4. 
**Order** — arrange KEEP chunks to maximize cache-prefix stability (context ordering algorithm). +5. **Stitch** — assemble Tier 1 prompt: MVS + ordered KEEP chunks + recent raw tail. +6. **Index** — write/update Tier 2 REF index: chunk ID → one-line summary. +7. **Drop** — dropped chunks go to Tier 3 raw archive only. + +### Chunk model + +Each chunk has: +- **Stable ID** — survives across compactions (e.g., `msg:42`, `evidence:3`, `transcript:17`). +- **Type** — section item, transcript line, tool result, user message, assistant message, etc. +- **Content** — the full text, kept verbatim when in KEEP tier. +- **Metadata** — timestamp, role, tool name if applicable. + +Chunks are extracted from the same `NormalizedBlock[]` that `compileWithReport(...)` already consumes. + +### Ordering algorithm + +The goal: maximize stable prefix length across compactions. + +1. **Dependency graph** — some chunks reference each other (e.g., a tool result references a tool call). Preserve reference order. +2. **Stability score** — chunks that have been in KEEP tier across multiple compactions get higher stability weight. Position them earlier. +3. **Type ordering** — goal-like chunks before file-path chunks before transcript chunks. +4. **Deterministic tiebreak** — sorting by stability score, then by type priority, then by stable ID. 
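The stability/type/ID rules above collapse into a plain comparator. A minimal TypeScript sketch of the non-dependency part (the topological pass is omitted), using hypothetical field names rather than the repo's actual `src/core/chunk-model.ts` types:

```typescript
// Hypothetical chunk shape for illustration; the real chunk model may differ.
interface OrderableChunk {
  id: string; // stable ID, e.g. "msg:42"
  type: string; // "goal" | "constraint" | "decision" | "file" | ...
  stabilityScore: number; // times kept in previous KEEP sets / total compactions
}

// Type priority: goal-like chunks before file chunks before transcript chunks.
const TYPE_PRIORITY = ["goal", "constraint", "decision", "file", "commit", "evidence", "transcript"];

const typeRank = (t: string): number => {
  const i = TYPE_PRIORITY.indexOf(t);
  return i === -1 ? TYPE_PRIORITY.length : i; // unknown types sort last
};

// Stability first (higher earlier), then type priority, then stable-ID tiebreak.
export const orderKeepChunks = (chunks: OrderableChunk[]): OrderableChunk[] =>
  [...chunks].sort(
    (a, b) =>
      b.stabilityScore - a.stabilityScore ||
      typeRank(a.type) - typeRank(b.type) ||
      a.id.localeCompare(b.id),
  );
```

Because every rule is deterministic, two compactions with the same KEEP set emit byte-identical orderings, which is what keeps the cache prefix stable.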
+ +Algorithm sketch: +``` +function orderKeepChunks(chunks, previousKEEP, dependencyEdges): + # Topological sort respecting dependencies + # Weighted by stability score (times in previous KEEP / total compactions) + # Type priority: goal > constraint > decision > file > commit > evidence > transcript + # Final tiebreak: stable ID lexicographic +``` + +### Retrieval loop + +On the **next** compaction, the model also sees the Tier 2 REF index and can promote chunks: + +``` +# Current Tier 2 index shown to model: +# C3: "discussed auth token refresh pattern" +# C8: "explored benchmark framework options" + +# Model output: +KEEP: C8, C12, C42 ← C8 promoted back because conversation returned to benchmarking +REF: C15 "added probes for commit growth" ← C15 demoted +DROP: C3, C17, C22 +MVS: Still working on cache compaction. Conversation shifted back + to benchmark framework architecture... +``` + +### Cost architecture + +| | Anthropic compaction | Model-reference compactor | Ratio | +|---|---|---|---| +| Model call | Yes (separate sampling step) | Yes | Same count | +| Input tokens | Full conversation (cache-read) | Full conversation (cache-read) | Same | +| Output tokens | ~3,000 (prose summary) | ~400 (IDs + one-liners + MVS) | **7.5× less** | +| Cache-write penalty | 3,000 new tokens to cache | ~200 new tokens (MVS only) | **15× less** | +| Next-turn cache stability | Summary changes every compaction | KEEP chunks ordered for stability | **Much better** | + +### Why this avoids hallucination better + +| Content type | Who creates it | Hallucination risk | +|---|---|---| +| File paths | Algorithm extracts, model only selects | None (model picks from real paths) | +| Commit hashes | Algorithm extracts, model only selects | None | +| Error signatures | Algorithm extracts, model only selects | None | +| Preference text | Algorithm extracts, model only selects | None | +| MVS paragraph | Model writes free text | Low (short, bounded, reviewable) | +| REF one-liners | Model 
writes one sentence per chunk | Low (short, anchored to known chunk) | + +### Actionable REF summaries + +REF entries should tell the agent **when** to retrieve, not just **what** is stored. Instead of passive descriptions: + +``` +REF: D8 "candidate decision reporting preference" +``` + +Write recall conditions: + +``` +REF: D8 "Recall if revisiting how physical decisions are captured in benchmark output" +REF: join-shapes-bundle "Recall if returning to workload-virtual-rule-optimizations (Phase 3: join enrichment)" +REF: recording-rules-bundle "Recall if user asks about MV/RMV tradeoffs or static analysis for recording rules" +``` + +The classifier prompt includes this rule: + +``` +For each REF chunk or bundle, write a one-line summary that tells +the agent WHEN to recall it: "Recall if " +``` + +### Goal-bundle parking + +When conversation shifts to a new goal, the old goal's context shouldn't be dropped — it should be **parked** as a retrievable bundle with revival instructions. + +``` +Session has 4 goals over its lifetime: + +┌─────────────────────────────────────────────────────┐ +│ ACTIVE PROMPT (Tier 1) │ +│ │ +│ MVS: Working on recording rule MV optimization │ +│ KEEP: files, decisions, evidence for THIS goal │ +├─────────────────────────────────────────────────────┤ +│ RETRIEVABLE GOAL BUNDLES (Tier 2) │ +│ │ +│ [goal:broad-sweep] │ +│ PR #14, native range chunking, benchmark profiling │ +│ "Recall if user asks about range query performance │ +│ or PR #14 benchmark results" │ +│ Files: internal/promshim/native/range_*.go │ +│ Decisions: chunking bounds, operator caps │ +│ │ +│ [goal:join-enrichment] │ +│ Phase 3 metadata-enrichment join shapes │ +│ "Recall if user returns to workload-virtual-rule- │ +│ optimizations or PromQL semantic preservation" │ +│ Files: internal/promshim/local/planner_*.go │ +│ Decisions: strict PromQL semantics, lowerer contracts│ +│ │ +│ [goal:bootstrap-stabilization] │ +│ Chart-only Helm bootstrap, CRD sequencing │ +│ "Recall 
if user asks about deployment or CI" │ +│ Files: scripts/bootstrap-kind.sh, chart/... │ +│ Decisions: ArgoCD-style, namespace-aware │ +└─────────────────────────────────────────────────────┘ +``` + +When the user says "actually, go back to join shapes," the model sees the bundle entry in the REF index, calls `vcc_recall` with the bundle ID, and recovers the full parked context. + +Bundle model: + +```typescript +interface GoalBundle { + id: string; + label: string; // "join-enrichment" + recallCondition: string; // "Recall if returning to workload-virtual-rule-optimizations" + chunks: CompactionChunk[]; // all chunks parked with this goal + status: "active" | "parked" | "completed"; + parkedAt: number; // compaction cycle when parked + promotionCount: number; // times this bundle was revived +} +``` + +The classifier promotes goal bundles back to active when recent user messages trigger their recall conditions. + +### Recent-user-message weighting + +The classifier must **weigh the user's most recent explicit decisions above goals extracted from older compaction summaries.** A user saying "Alright, lets do it" about a topic IS the current goal — even if older summaries still reference previous work. + +This prevents the stale-goal problem observed in real sessions where Pi's iterative summary merge preserved "Phase 3: join enrichment" as the goal 15 compactions after the conversation had moved on to recording rule MV optimization. 
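This weighting can also be made mechanical before the classifier prompt is assembled. A sketch with invented names (not the shipped classifier): score each goal candidate by source and recency, so a fresh explicit user decision always outranks a goal inherited from an old compaction summary:

```typescript
// Hypothetical candidate model: where a goal signal came from and how old it is.
interface GoalCandidate {
  text: string;
  source: "user-message" | "compaction-summary";
  age: number; // compaction cycles since this signal appeared (0 = current cycle)
}

// User messages outrank summary-derived goals by a wide margin;
// within a source, newer signals win.
const SOURCE_WEIGHT: Record<GoalCandidate["source"], number> = {
  "user-message": 100,
  "compaction-summary": 10,
};

const score = (c: GoalCandidate): number => SOURCE_WEIGHT[c.source] - c.age;

export const currentGoal = (candidates: GoalCandidate[]): GoalCandidate | undefined =>
  [...candidates].sort((a, b) => score(b) - score(a))[0];
```

With these weights, a summary goal fifteen compactions old scores -5 while a user decision from the current cycle scores 100 — the stale goal can never win, which is exactly the failure mode observed in the real session.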
+ +### Full MRC prompt budget + +With all sections rendered (MVS + KEEP chunks + REF index + recall note), a realistic Tier 1 prompt: + +| Section | Typical size | +|---|---| +| MVS paragraph | ~100-200 chars | +| KEEP chunks rendered | ~800-1,500 chars | +| REF index (actionable one-liners) | ~150-300 chars | +| Recall note | ~130 chars | +| **Total MRC summary** | **~1,200-2,100 chars (~300-525 tokens)** | + +Plus system prompt, tool definitions, project instructions, and raw tail for a full prompt of ~1,500-2,000 tokens, versus Pi's 10,000-12,000 token equivalent.The model never invents paths, commits, or identifiers — it only picks from real ones. + +--- + +## Implementation phases + +### Phase 1: Benchmark scaffold +1. Add `src/core/chunk-model.ts` — chunk types, stable ID generation, extraction from NormalizedBlock[]. +2. Add `bench/compaction/model-reference-selector.ts` — compactor entry that: + - Chunks fresh messages. + - Calls a mock model (heuristic: keep chunks containing known needles). + - Orders KEEP chunks. + - Stitches Tier 1 output. + - Writes/reads Tier 2 index to a temp file or in-memory store. +3. Add synthetic benchmark cases that exercise: + - KEEP vs REF vs DROP classification correctness. + - Promotion/demotion across compactions. + - Cache-prefix stability across repeated compactions. + - Tier 2 retrieval (missing context rescued by REF index). +4. Register `model-reference-selector` as a compactor in `bench/compaction/offline-runner.ts`. +5. Run head-to-head against `pi-vcc` on synthetic and real sessions. + +### Phase 2: Real model integration +1. Design the model prompt for classification — minimal, structured, expects parseable output. +2. Build a real model call path (configurable provider, e.g., Anthropic Messages API). +3. Add output parsing that recovers KEEP/REF/DROP/MVS from model response. +4. Add error handling for malformed model output. +5. Add optional cost/latency tracking per compaction. +6. 
+(Note: in the sentence above, "equivalent.The model" should read "equivalent. The model".)
Compare real model results vs mock model results on synthetic benchmarks. +7. Test with cheaper model variants (Haiku, Flash) to find the cheapest sufficient classifier. + +### Phase 3: Retrieval loop +1. Implement Tier 2 index read-before-compaction. +2. Model prompt includes REF index entries as candidate promotion targets. +3. Model can promote REF → KEEP or keep REF → REF or drop REF → DROP. +4. Algorithm rebuilds KEEP order after promotions. +5. Add benchmark case: context recovered after simulated memory loss. + +### Phase 4: Cache ordering optimization +1. Implement the ordering algorithm proper: + - Dependency-aware topological sort. + - Stability-weighted positioning. + - Type-priority ordering. +2. Add cache-stability assertions to benchmark: + - `firstChangedPromptLayer` check. + - `stablePrefixTokens` threshold. + - `fullPromptLcpTokenRatioWithPrevious`. +3. Compare ordering quality against pure `pi-vcc` ordering. + +### Phase 5: Live Pi integration (deferred) +1. Wire as a pi-vcc compactor variant behind a config flag. +2. Use real provider credentials. +3. Measure real cache-hit ratios via provider-reported usage. +4. Tune thresholds and ordering parameters on real sessions. +5. Add `/pi-vcc-report` integration for the model-reference compactor's reports. + +--- + +## Evaluation + +### Correctness +- Can the agent continue correctly after model-reference compaction? +- Does the MVS capture enough state for continuity? +- Can promoted REF chunks restore missing context? + +### Cache stability +- `firstChangedPromptLayer` — which layer changes first across compactions? +- `stablePrefixTokens` — how many tokens before the first change? +- `fullPromptLcpTokenRatioWithPrevious` — how much of the prompt is cache-hit? + +### Cost +- Output tokens per compaction. +- Cache-write tokens per compaction. +- Total input + output cost per compaction cycle. +- Comparison against pi-vcc (zero model cost) and Anthropic compaction (full model cost). 
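The cache-stability metrics above (`stablePrefixTokens`, `fullPromptLcpTokenRatioWithPrevious`) reduce to longest-common-prefix arithmetic; a minimal sketch, with token arrays standing in for real tokenizer output:

```typescript
// Longest common prefix length between two token sequences —
// the basis for stablePrefixTokens across consecutive compactions.
export const lcpTokens = (a: string[], b: string[]): number => {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
};

// Fraction of the new prompt that is a cache-hit against the previous prompt,
// i.e. fullPromptLcpTokenRatioWithPrevious.
export const lcpRatio = (prev: string[], next: string[]): number =>
  next.length === 0 ? 1 : lcpTokens(prev, next) / next.length;
```

Real usage would tokenize each rendered prompt layer and report the layer index where the LCP first falls short of the layer boundary (`firstChangedPromptLayer`).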
+ +### Retrieval effectiveness +- Does the model promote REF chunks when conversation returns to a topic? +- Does the REF index actually help recovery vs having nothing? +- False positive/negative rates on REF → KEEP promotions. + +### Comparison against pi-vcc +Run `scripts/compare-compaction-refs.mjs` with `--compactors pi-vcc,model-reference-selector` on: +- Synthetic benchmark cases. +- Real session replay (10-20 sessions, 3 cycles each). +- Cache-stability metrics. +- Correctness assertions. + +--- + +## Risks + +| Risk | Mitigation | +|---|---| +| Model output unparseable | Strict output format, fallback to pi-vcc on parse failure | +| Model too expensive for classification | Start with cheapest model (Haiku); mock model for benchmarking | +| Chunk granularity wrong | Benchmark multiple chunking strategies; start with section-item granularity | +| KEEP set too large (over-budget) | Algorithmic cap: keep top-N by stability score, overflow to REF | +| REF index grows unbounded | Cap by time or count; drop oldest/lowest-promotion-rate entries | +| Cache ordering breaks dependencies | Topological sort as first pass; only stability-weight within dependency groups | +| Provider availability | Mock model enables full benchmarking without provider dependency | + +--- + +## Decision heuristics + +### Favor model-reference over pure algorithmic when +- Semantic importance of content matters more than heuristics capture. +- Hallucination risk from model-written summaries is unacceptable. +- Cheap model API calls are available (Haiku, Flash, local). +- Cache-prefix stability is a primary cost concern. + +### Favor pi-vcc (pure algorithmic) over model-reference when +- Cost or latency of any model call is unacceptable. +- Heuristic extraction is good enough for the domain. +- Provider is unavailable or unreliable. +- Real-time compaction latency must be near-zero. 
+ +### Favor Anthropic compaction over model-reference when +- Provider already offers compaction as a first-party feature. +- You trust the provider's summary quality. +- Integration simplicity matters more than cost optimization. + +--- + +## Status +Benchmark scaffold built and committed. Real DeepSeek Flash classifier tested on a 14K-message production session (promshim-ch, 80 compactions). Key findings: + +- Model-reference (DeepSeek Flash) produces a 1,958-char active prompt vs Pi's 41,659-char summary — **21× smaller**. +- Real classifier correctly identifies current goal (PR #14) while Pi's summary preserves a stale goal from 15 compactions ago. +- Cost: ~$0.001 per classification vs $0.18 for Pi's LLM summary — **180× cheaper**. +- Actionable REF summaries and goal-bundle parking designed but not yet implemented. +- Full prompt with system/tools/project/raw-tail: MRC ~1,789 tokens vs Pi ~11,714 tokens — **6.5× smaller**. + +Next: implement actionable REF summaries, goal-bundle parking, and recent-user-message weighting in the classifier prompt. Then re-test on the same session. + +## Sources +- `AGENTS.md` — pi-vcc project north star and design principles. +- `.pi/plans/cache-aware-compaction.md` — original cache-aware compaction plan. +- `bench/compaction/README.md` — existing benchmark harness design. +- Anthropic compaction docs — https://platform.claude.com/docs/en/build-with-claude/compaction +- Anthropic effective context engineering — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents +- AWS Bedrock AgentCore compaction — https://towardsai.net/p/machine-learning/long-context-compaction-for-ai-agents-part-2-implementation-and-evaluation +- ContextPilot (arxiv 2511.03475v3) — context reuse via block ordering and deduplication for KV-cache. +- MemGPT/Letta — tiered memory architecture with model-managed memory blocks. 
+- OpenCode compaction epic — https://github.com/sst/opencode/issues/4102 +- Victor Dibia context engineering — https://newsletter.victordibia.com/p/context-engineering-101-how-agents +- `src/core/classifier.ts` — realClassify() via OpenAI-compatible API +- `bench/compaction/model-reference-selector.ts` — compactor with env-var-driven real/mock classifier +- `src/core/dump-context.ts` — session context extraction for classifier input +- DeepSeek Flash real-session test — promshim-ch session, 74 chunks classified in 5.1s, ~$0.001 diff --git a/bench/compaction/model-reference-selector.ts b/bench/compaction/model-reference-selector.ts index 295b38b..129a8c4 100644 --- a/bench/compaction/model-reference-selector.ts +++ b/bench/compaction/model-reference-selector.ts @@ -17,6 +17,7 @@ import { buildSections } from "../../src/core/build-sections"; import { buildCompactionState } from "../../src/core/compaction-state"; import { chunkCompactionState, type CompactionChunk } from "../../src/core/chunk-model"; import { mockClassify } from "../../src/core/mock-classifier"; +import { realClassify } from "../../src/core/classifier"; import type { CompactorContext, CompactorResult, LayerSnapshot } from "./offline-runner"; /** Rendered chunk as a text line for the final prompt */ @@ -103,10 +104,16 @@ export const createModelReferenceCompactor = (helpers: { renderedDocuments: (messages: Message[]) => Array<{ id: string; text: string; source: string }>; }) => ({ name: "model-reference-selector", - compact: (ctx: CompactorContext): CompactorResult => { + compact: async (ctx: CompactorContext): Promise<CompactorResult> => { const { messages, allMessages, previous } = ctx; const inputTokens = helpers.estimateTokens(helpers.sourceTextOf(messages)); + // Check env for real classifier config + const apiKey = process.env.DEEPSEEK_API_KEY || process.env.OPENAI_API_KEY; + const classifierModel = process.env.CLASSIFIER_MODEL || "deepseek-chat"; + const classifierBaseUrl = process.env.CLASSIFIER_BASE_URL ||
"https://api.deepseek.com/v1"; + const useRealClassifier = !!(apiKey && classifierModel); + // 0. Recover previous classification for merge-awareness const prevRefIndex = (previous as any)?.refIndex; const previousKeepIds = new Set(prevRefIndex?.keepIds ?? []); @@ -137,14 +144,24 @@ export const createModelReferenceCompactor = (helpers: { } } - // 4. Classify via mock model (pass previous IDs for merge-awareness) + // 4. Classify (real API if env vars set, else mock) const start = performance.now(); - const classification: ChunkClassification = mockClassify(chunks, messages.length, { - previousIds: { - keepIds: [...previousKeepIds], - refIds: [...previousRefIds], - }, - }); + let classification: any; + if (useRealClassifier) { + classification = await realClassify(chunks, messages.length, { + baseUrl: classifierBaseUrl, + apiKey, + model: classifierModel, + maxTokens: 1024, + }); + } else { + classification = mockClassify(chunks, messages.length, { + previousIds: { + keepIds: [...previousKeepIds], + refIds: [...previousRefIds], + }, + }); + } // 5. 
Build KEEP chunk objects const keepChunks = chunks.filter((c) => classification.keepIds.includes(c.id)); diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts index bc5ac75..b698ee1 100644 --- a/bench/compaction/offline-runner.ts +++ b/bench/compaction/offline-runner.ts @@ -56,7 +56,7 @@ export interface CompactorContext { export interface OfflineCompactor { name: string; - compact(context: CompactorContext): CompactorResult; + compact(context: CompactorContext): CompactorResult | Promise<CompactorResult>; } export interface TermProbeResult { @@ -841,7 +841,7 @@ export const runOfflineCompactionBenchmark = (options: { testCase.compactionPoints.forEach((point, index) => { const sourceMessages = testCase.messages.slice(0, point); const cycleMessages = testCase.messages.slice(previousPoint, point); - const result = compactor.compact({ + const result = await compactor.compact({ messages: cycleMessages, allMessages: sourceMessages, previous, diff --git a/src/core/classifier.ts b/src/core/classifier.ts new file mode 100644 index 0000000..2204573 --- /dev/null +++ b/src/core/classifier.ts @@ -0,0 +1,211 @@ +/** + * Real LLM classifier using an OpenAI-compatible chat API. + * + * Sends conversation chunks to a cheap model (default DeepSeek Flash) which + * classifies them into KEEP (critical, keep in active prompt), REF (useful, + * store in retrievable index), or DROP (archive only). The model also writes + * a short Minimum Viable Summary paragraph. + * + * The model's job is classification, not content creation. Chunk text is + * preserved verbatim; the model only picks which to keep and writes one-line + * summaries for REF chunks and the MVS paragraph. + */ + +import type { CompactionChunk, ChunkClassification } from "./chunk-model"; + +export interface ClassifierConfig { + /** API base URL (OpenAI-compatible) */ + baseUrl: string; + /** API key */ + apiKey: string; + /** Model name (e.g.
"deepseek-chat", "gpt-4o-mini") */ + model: string; + /** Maximum output tokens */ + maxTokens?: number; + /** Timeout in ms */ + timeoutMs?: number; +} + +const CLASSIFIER_SYSTEM_PROMPT = `You are a context compaction classifier. Your job is to classify conversation chunks into three tiers so a future LLM can continue the work efficiently. + +DO NOT rewrite or summarize the chunk content. You only: +1. Decide which chunks to KEEP, REF, or DROP +2. Write a one-line summary for each REF chunk +3. Write a short Minimum Viable Summary (MVS) paragraph + +Classification rules: +- KEEP: Critical for continuing the work. File paths, commit hashes, error signatures, key decisions, active goals, constraints, identifiers needed for tool calls. +- REF: Useful context but not critical. One-line summary so it can be retrieved later if needed. Example: "discussed auth token refresh pattern" +- DROP: Conversational fluff, status updates, repeated content, lunch discussions, greetings. + +Output format (strict): +--- +KEEP: id1, id2, id3 +REF: id4 | discussed auth token refresh +REF: id5 | looked at benchmark results +DROP: id6, id7, id8 +MVS: Working on PR #14 for feat/broad-sweep. Added native range auto-chunking instrumentation. Next: clean PR artifacts before merge. +--- + +Only output the classification block. No other text.`; + +/** + * Build the user prompt presenting chunks to the model. + */ +const buildChunkPrompt = (chunks: CompactionChunk[]): string => { + const lines: string[] = []; + lines.push("Classify these conversation chunks:\n"); + for (const chunk of chunks) { + const prefix = chunk.kind.toUpperCase(); + const text = chunk.text.substring(0, 300).replace(/\n/g, " "); + lines.push(`${chunk.id} [${prefix}] ${text}`); + } + return lines.join("\n"); +}; + +/** + * Parse the model's classification output. 
+ */ +const parseClassification = ( + output: string, +): ChunkClassification | undefined => { + const keepIds: string[] = []; + const refs: Array<{ id: string; summary: string }> = []; + const dropIds: string[] = []; + let mvs = "Continuing work from conversation."; + + for (const line of output.split("\n")) { + const trimmed = line.trim(); + if (!trimmed) continue; + + const keepMatch = trimmed.match(/^KEEP:\s*(.+)/i); + if (keepMatch) { + keepIds.push( + ...keepMatch[1] + .split(",") + .map((s) => s.trim()) + .filter(Boolean), + ); + continue; + } + + const refMatch = trimmed.match(/^REF:\s*(\S+)\s*\|\s*(.+)/i); + if (refMatch) { + refs.push({ id: refMatch[1].trim(), summary: refMatch[2].trim() }); + continue; + } + + const dropMatch = trimmed.match(/^DROP:\s*(.+)/i); + if (dropMatch) { + dropIds.push( + ...dropMatch[1] + .split(",") + .map((s) => s.trim()) + .filter(Boolean), + ); + continue; + } + + const mvsMatch = trimmed.match(/^MVS:\s*(.+)/i); + if (mvsMatch) { + mvs = mvsMatch[1].trim(); + continue; + } + } + + if (keepIds.length === 0 && refs.length === 0) return undefined; + + return { keepIds, refs, dropIds, mvs }; +}; + +/** + * Classify chunks using an OpenAI-compatible chat API. 
+ */ +export const realClassify = async ( + chunks: CompactionChunk[], + messageCount: number, + config: ClassifierConfig, +): Promise<ChunkClassification> => { + const { baseUrl, apiKey, model, maxTokens = 1024, timeoutMs = 30000 } = config; + + const userPrompt = buildChunkPrompt(chunks); + + const controller = new AbortController(); + const timeout = setTimeout(() => controller.abort(), timeoutMs); + + try { + const response = await fetch(`${baseUrl}/chat/completions`, { + method: "POST", + headers: { + "Content-Type": "application/json", + Authorization: `Bearer ${apiKey}`, + }, + body: JSON.stringify({ + model, + messages: [ + { role: "system", content: CLASSIFIER_SYSTEM_PROMPT }, + { role: "user", content: userPrompt }, + ], + max_tokens: maxTokens, + temperature: 0, + }), + signal: controller.signal, + }); + + if (!response.ok) { + const text = await response.text().catch(() => ""); + throw new Error( + `Classifier API error ${response.status}: ${text.substring(0, 200)}`, + ); + } + + const data = (await response.json()) as any; + const content = data?.choices?.[0]?.message?.content; + if (!content) { + throw new Error("Classifier returned empty response"); + } + + const result = parseClassification(content); + if (!result) { + throw new Error( + `Failed to parse classifier output: ${content.substring(0, 200)}`, + ); + } + + return result; + } finally { + clearTimeout(timeout); + } +}; + +/** + * Classify chunks using real API with fallback to mock classifier.
+ */ +export const classifyWithFallback = async ( + chunks: CompactionChunk[], + messageCount: number, + config?: Partial<ClassifierConfig>, +): Promise<ChunkClassification & { usedMock: boolean }> => { + if (config?.apiKey && config?.baseUrl) { + try { + const fullConfig: ClassifierConfig = { + baseUrl: config.baseUrl, + apiKey: config.apiKey, + model: config.model || "deepseek-chat", + maxTokens: config.maxTokens, + timeoutMs: config.timeoutMs, + }; + const result = await realClassify(chunks, messageCount, fullConfig); + return { ...result, usedMock: false }; + } catch (err) { + console.error( + `Classifier API call failed, falling back to mock: ${err instanceof Error ? err.message : String(err)}`, + ); + } + } + + // Fallback to mock + const { mockClassify } = await import("./mock-classifier"); + const mockResult = mockClassify(chunks, messageCount); + return { ...mockResult, usedMock: true }; +}; From 3c5ce1f9c0ccc9ce96718b23be742229e1d1718a Mon Sep 17 00:00:00 2001 From: Fredrik Larsson Date: Sun, 3 May 2026 21:10:24 +0200 Subject: [PATCH 34/65] feat: add actionable REF, goal bundles, and acronym expansion to classifier MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three improvements to the classifier prompt and parser: 1. Actionable REF summaries: REF entries now use "Recall if <condition>" format so the agent knows WHEN to pull context, not just what's stored. 2. Goal-bundle parking: Classifier can group parked old-goal chunks into BUNDLE entries with labels and recall conditions. Bundled chunk IDs are excluded from active KEEP rendering. Bundles appear in the REF index with file/chunk counts. 3. Acronym expansion: MVS and REF summaries expand domain acronyms on first occurrence (RMV→Refreshing Materialized View). Chunk text is never rewritten. Parser updated to handle new BUNDLE: id | label | trigger-condition | chunk-ids format. GoalBundle type added to chunk model.
Tested on promshim-ch session (DeepSeek Flash): MVS correctly captures "recording rule MV optimization" as current goal with user's explicit decision, correctly parks PR #14 broad-sweep context as a bundle, and writes actionable REF recall conditions. --- bench/compaction/model-reference-selector.ts | 24 ++++++--- src/core/chunk-model.ts | 10 ++++ src/core/classifier.ts | 53 ++++++++++++++++---- 3 files changed, 69 insertions(+), 18 deletions(-) diff --git a/bench/compaction/model-reference-selector.ts b/bench/compaction/model-reference-selector.ts index 129a8c4..19faa9e 100644 --- a/bench/compaction/model-reference-selector.ts +++ b/bench/compaction/model-reference-selector.ts @@ -163,8 +163,11 @@ export const createModelReferenceCompactor = (helpers: { }); } - // 5. Build KEEP chunk objects - const keepChunks = chunks.filter((c) => classification.keepIds.includes(c.id)); + // 5. Build KEEP chunk objects (exclude bundled chunks) + const bundledIds = new Set(classification.bundles?.flatMap((b) => b.chunkIds) ?? []); + const keepChunks = chunks.filter( + (c) => classification.keepIds.includes(c.id) && !bundledIds.has(c.id), + ); // 6. Order KEEP chunks for stability const ordered = orderKeepChunks(keepChunks, previousKeepIds); @@ -183,11 +186,18 @@ export const createModelReferenceCompactor = (helpers: { { name: "Model-Ref Recall Note", role: "recall", text: RECALL_NOTE }, ]; - const refDocs = classification.refs.map((r) => ({ - id: r.id, - text: r.summary, - source: `model-ref-tier2` as const, - })); + const refDocs = [ + ...classification.refs.map((r) => ({ + id: r.id, + text: `${r.summary} (use vcc_recall)`, + source: `model-ref-tier2` as const, + })), + ...(classification.bundles ?? []).map((b) => ({ + id: `bundle:${b.id}`, + text: `[${b.label}] ${b.recallCondition}. 
Files: ${b.chunkIds.filter((id) => id.startsWith("F")).length}, Chunks: ${b.chunkIds.length} (use vcc_recall with bundle:${b.id})`, + source: `model-ref-bundle` as const, + })), + ]; return { activePromptState, diff --git a/src/core/chunk-model.ts b/src/core/chunk-model.ts index fb6a3b0..313ced2 100644 --- a/src/core/chunk-model.ts +++ b/src/core/chunk-model.ts @@ -90,6 +90,16 @@ export interface ChunkClassification { refs: Array<{ id: string; summary: string }>; dropIds: string[]; mvs: string; + /** Parked goal bundles for later revival */ + bundles?: GoalBundle[]; +} + +/** A parked goal context bundle */ +export interface GoalBundle { + id: string; + label: string; + recallCondition: string; + chunkIds: string[]; } /** A single REF index entry stored in Tier 2 */ diff --git a/src/core/classifier.ts b/src/core/classifier.ts index 2204573..25795db 100644 --- a/src/core/classifier.ts +++ b/src/core/classifier.ts @@ -26,25 +26,37 @@ export interface ClassifierConfig { timeoutMs?: number; } -const CLASSIFIER_SYSTEM_PROMPT = `You are a context compaction classifier. Your job is to classify conversation chunks into three tiers so a future LLM can continue the work efficiently. +const CLASSIFIER_SYSTEM_PROMPT = `You are a context compaction classifier. Your job is to classify conversation chunks into tiers so a future LLM can continue the work efficiently. DO NOT rewrite or summarize the chunk content. You only: 1. Decide which chunks to KEEP, REF, or DROP -2. Write a one-line summary for each REF chunk -3. Write a short Minimum Viable Summary (MVS) paragraph +2. Write actionable REF summaries with recall conditions +3. Group parked old-goal chunks into BUNDLE entries +4. Write a short Minimum Viable Summary (MVS) paragraph Classification rules: -- KEEP: Critical for continuing the work. File paths, commit hashes, error signatures, key decisions, active goals, constraints, identifiers needed for tool calls. -- REF: Useful context but not critical. 
One-line summary so it can be retrieved later if needed. Example: "discussed auth token refresh pattern" -- DROP: Conversational fluff, status updates, repeated content, lunch discussions, greetings. +- KEEP: Critical for continuing the CURRENT work. Limit to 15-20 most important chunks. Prioritize: the user's most recent explicit decisions, active files, current goal, key constraints. A user saying "Alright, lets do it" about a topic IS the current goal — weigh it above older summaries. +- REF: Useful context to index for later retrieval. Write "Recall if <condition>" so the agent knows WHEN to pull this. Example: "Recall if user asks about MV/RMV tradeoffs" or "Recall if returning to workload-virtual-rule-optimizations". +- DROP: Conversational fluff, status updates, repeated content, lunch discussions, greetings, stale metadata. + +BUNDLE format (for parked old goals): +- When chunks belong to a previous goal that is no longer active, group them into a named bundle. +- Format: BUNDLE: |