diff --git a/.dockerignore b/.dockerignore new file mode 100644 index 0000000..4e20976 --- /dev/null +++ b/.dockerignore @@ -0,0 +1,10 @@ +.git +node_modules +dist +*.tsbuildinfo +bun.lock +bench-results*.jsonl +bench-results*.json +.pi* +research +docs diff --git a/.pi/plans/model-reference-compactor.md b/.pi/plans/model-reference-compactor.md new file mode 100644 index 0000000..c97cd34 --- /dev/null +++ b/.pi/plans/model-reference-compactor.md @@ -0,0 +1,384 @@ +# Model-Reference Compactor Plan + +## Objective +Design a compaction strategy where a model classifies conversation chunks into three tiers (KEEP, REF, DROP) without writing rewritten content, and an algorithmic stitcher orders the kept chunks for maximum cache prefix stability. Combine model classification cheapness with algorithmic cache optimization. + +## Why this plan exists +Every current compaction system either: +- has the model **write** the summary (hallucination risk, expensive output tokens, cache-churning rewrites) +- uses purely algorithmic heuristics (misses semantic importance, brittle rules) + +This plan explores a third path: the model only **classifies**, writing only minimal structured output (IDs + one-liners + a short MVS paragraph). The algorithmic side stitches, orders for cache stability, and manages the Tier 2 retrievable index. + +## Core insight +The model's output for a classification task is ~10× cheaper (in tokens) than for a summary-generation task. And since the model processes the same conversation context (which is almost entirely cache-hit), the additional latency is proportional only to the tiny output. + +## Core design + +### Three tiers + +``` +┌──────────────────────────────────────────────────┐ +│ Tier 1: ACTIVE PROMPT (always in context) │ +│ │ +│ [MVS] Minimum Viable Summary - model writes │ +│ Working on cache compaction. Added probes... 
│ +│ │ +│ [Critical References] - KEEP chunks │ +│ C12: src/core/compaction-state.ts (file) │ +│ C17: f36b837 fix: bound verbose recent... │ +│ C42: CACHE_LONG_SCOPE request_id=scope_alpha │ +├──────────────────────────────────────────────────┤ +│ Tier 2: RETRIEVABLE INDEX (file/DB, pullable) │ +│ │ +│ C3: "discussed auth token refresh pattern" │ +│ C8: "explored benchmark framework options" │ +│ C22: "identified perf bottleneck in state.ts" │ +├──────────────────────────────────────────────────┤ +│ Tier 3: RAW ARCHIVE (session JSONL, vcc_recall) │ +│ │ +│ Everything. Dropped chunks still here. │ +│ Searchable but not in context. │ +└──────────────────────────────────────────────────┘ +``` + +### What the model outputs per compaction + +``` +KEEP: C12, C15, C17, C42 +REF: C3 "discussed auth token refresh" +REF: C8 "benchmark framework design options" +REF: C22 "perf bottleneck in compaction-state" +DROP: C1, C2, C4, C5, C6, C7, C9, C10, C11 +MVS: Working on cache compaction. Added cache-boundary + probes for commit growth and long evidence lines. + Real-session comparison shows +113 stable prefix + tokens vs baseline 53dc551. Next: investigate + remaining Commits churn outliers. +``` + +Total output: ~200-500 tokens. Compare to Anthropic compaction: ~2,000-5,000 tokens. + +### What the algorithm does + +1. **Chunk** — split fresh messages into referenceable units, each with a stable ID. +2. **Send** — current context (cache-hit) + chunk inventory to the model. +3. **Receive** — model returns KEEP/REF/DROP classification with one-liners + MVS. +4. **Order** — arrange KEEP chunks to maximize cache-prefix stability (context ordering algorithm). +5. **Stitch** — assemble Tier 1 prompt: MVS + ordered KEEP chunks + recent raw tail. +6. **Index** — write/update Tier 2 REF index: chunk ID → one-line summary. +7. **Drop** — dropped chunks go to Tier 3 raw archive only. 
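The structured reply in step 3 can be parsed mechanically, with no free-text interpretation. A minimal TypeScript sketch, assuming the line-oriented format shown above; the exact wire format and the `parseClassifierOutput` name are illustrative, not decided:

```typescript
// Parse the classifier's line-oriented KEEP / REF / DROP / MVS output.
// The format is an assumption based on the example above, not a fixed
// wire format; a parse failure would fall back to pi-vcc (see Risks).

interface Classification {
  keep: string[];                            // chunk IDs kept verbatim
  refs: { id: string; summary: string }[];   // chunk ID plus one-liner
  drop: string[];                            // chunk IDs archived to Tier 3 only
  mvs: string;                               // model-written minimum viable summary
}

function parseClassifierOutput(text: string): Classification {
  const result: Classification = { keep: [], refs: [], drop: [], mvs: "" };
  let inMvs = false;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (inMvs) {
      // MVS is the final free-text block; continuation lines are appended.
      result.mvs += " " + line;
      continue;
    }
    if (line.startsWith("KEEP:")) {
      result.keep.push(...line.slice(5).split(",").map(s => s.trim()).filter(Boolean));
    } else if (line.startsWith("REF:")) {
      // Expected shape: REF: C3 "one-line summary"
      const m = line.slice(4).trim().match(/^(\S+)\s+"(.*)"$/);
      if (m) result.refs.push({ id: m[1], summary: m[2] });
    } else if (line.startsWith("DROP:")) {
      result.drop.push(...line.slice(5).split(",").map(s => s.trim()).filter(Boolean));
    } else if (line.startsWith("MVS:")) {
      result.mvs = line.slice(4).trim();
      inMvs = true;
    }
  }
  return { ...result, mvs: result.mvs.trim() };
}
```

Because the model only emits IDs and short quoted strings, malformed output is cheap to detect: anything that fails this parse triggers the pi-vcc fallback rather than a best-effort guess.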
+ +### Chunk model + +Each chunk has: +- **Stable ID** — survives across compactions (e.g., `msg:42`, `evidence:3`, `transcript:17`). +- **Type** — section item, transcript line, tool result, user message, assistant message, etc. +- **Content** — the full text, kept verbatim when in KEEP tier. +- **Metadata** — timestamp, role, tool name if applicable. + +Chunks are extracted from the same `NormalizedBlock[]` that `compileWithReport(...)` already consumes. + +### Ordering algorithm + +The goal: maximize stable prefix length across compactions. + +1. **Dependency graph** — some chunks reference each other (e.g., a tool result references a tool call). Preserve reference order. +2. **Stability score** — chunks that have been in KEEP tier across multiple compactions get higher stability weight. Position them earlier. +3. **Type ordering** — goal-like chunks before file-path chunks before transcript chunks. +4. **Deterministic tiebreak** — sorting by stability score, then by type priority, then by stable ID. + +Algorithm sketch: +``` +function orderKeepChunks(chunks, previousKEEP, dependencyEdges): + # Topological sort respecting dependencies + # Weighted by stability score (times in previous KEEP / total compactions) + # Type priority: goal > constraint > decision > file > commit > evidence > transcript + # Final tiebreak: stable ID lexicographic +``` + +### Retrieval loop + +On the **next** compaction, the model also sees the Tier 2 REF index and can promote chunks: + +``` +# Current Tier 2 index shown to model: +# C3: "discussed auth token refresh pattern" +# C8: "explored benchmark framework options" + +# Model output: +KEEP: C8, C12, C42 ← C8 promoted back because conversation returned to benchmarking +REF: C15 "added probes for commit growth" ← C15 demoted +DROP: C3, C17, C22 +MVS: Still working on cache compaction. Conversation shifted back + to benchmark framework architecture... 
+``` + +### Cost architecture + +| | Anthropic compaction | Model-reference compactor | Ratio | +|---|---|---|---| +| Model call | Yes (separate sampling step) | Yes | Same count | +| Input tokens | Full conversation (cache-read) | Full conversation (cache-read) | Same | +| Output tokens | ~3,000 (prose summary) | ~400 (IDs + one-liners + MVS) | **7.5× less** | +| Cache-write penalty | 3,000 new tokens to cache | ~200 new tokens (MVS only) | **15× less** | +| Next-turn cache stability | Summary changes every compaction | KEEP chunks ordered for stability | **Much better** | + +### Why this avoids hallucination better + +| Content type | Who creates it | Hallucination risk | +|---|---|---| +| File paths | Algorithm extracts, model only selects | None (model picks from real paths) | +| Commit hashes | Algorithm extracts, model only selects | None | +| Error signatures | Algorithm extracts, model only selects | None | +| Preference text | Algorithm extracts, model only selects | None | +| MVS paragraph | Model writes free text | Low (short, bounded, reviewable) | +| REF one-liners | Model writes one sentence per chunk | Low (short, anchored to known chunk) | + +### Actionable REF summaries + +REF entries should tell the agent **when** to retrieve, not just **what** is stored. 
Instead of passive descriptions: + +``` +REF: D8 "candidate decision reporting preference" +``` + +Write recall conditions: + +``` +REF: D8 "Recall if revisiting how physical decisions are captured in benchmark output" +REF: join-shapes-bundle "Recall if returning to workload-virtual-rule-optimizations (Phase 3: join enrichment)" +REF: recording-rules-bundle "Recall if user asks about MV/RMV tradeoffs or static analysis for recording rules" +``` + +The classifier prompt includes this rule: + +``` +For each REF chunk or bundle, write a one-line summary that tells +the agent WHEN to recall it: "Recall if " +``` + +### Goal-bundle parking + +When conversation shifts to a new goal, the old goal's context shouldn't be dropped — it should be **parked** as a retrievable bundle with revival instructions. + +``` +Session has 4 goals over its lifetime: + +┌─────────────────────────────────────────────────────┐ +│ ACTIVE PROMPT (Tier 1) │ +│ │ +│ MVS: Working on recording rule MV optimization │ +│ KEEP: files, decisions, evidence for THIS goal │ +├─────────────────────────────────────────────────────┤ +│ RETRIEVABLE GOAL BUNDLES (Tier 2) │ +│ │ +│ [goal:broad-sweep] │ +│ PR #14, native range chunking, benchmark profiling │ +│ "Recall if user asks about range query performance │ +│ or PR #14 benchmark results" │ +│ Files: internal/promshim/native/range_*.go │ +│ Decisions: chunking bounds, operator caps │ +│ │ +│ [goal:join-enrichment] │ +│ Phase 3 metadata-enrichment join shapes │ +│ "Recall if user returns to workload-virtual-rule- │ +│ optimizations or PromQL semantic preservation" │ +│ Files: internal/promshim/local/planner_*.go │ +│ Decisions: strict PromQL semantics, lowerer contracts│ +│ │ +│ [goal:bootstrap-stabilization] │ +│ Chart-only Helm bootstrap, CRD sequencing │ +│ "Recall if user asks about deployment or CI" │ +│ Files: scripts/bootstrap-kind.sh, chart/... 
│ +│ Decisions: ArgoCD-style, namespace-aware │ +└─────────────────────────────────────────────────────┘ +``` + +When the user says "actually, go back to join shapes," the model sees the bundle entry in the REF index, calls `vcc_recall` with the bundle ID, and recovers the full parked context. + +Bundle model: + +```typescript +interface GoalBundle { + id: string; + label: string; // "join-enrichment" + recallCondition: string; // "Recall if returning to workload-virtual-rule-optimizations" + chunks: CompactionChunk[]; // all chunks parked with this goal + status: "active" | "parked" | "completed"; + parkedAt: number; // compaction cycle when parked + promotionCount: number; // times this bundle was revived +} +``` + +The classifier promotes goal bundles back to active when recent user messages trigger their recall conditions. + +### Recent-user-message weighting + +The classifier must **weigh the user's most recent explicit decisions above goals extracted from older compaction summaries.** A user saying "Alright, lets do it" about a topic IS the current goal — even if older summaries still reference previous work. + +This prevents the stale-goal problem observed in real sessions where Pi's iterative summary merge preserved "Phase 3: join enrichment" as the goal 15 compactions after the conversation had moved on to recording rule MV optimization. 
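The promotion trigger can be sketched algorithmically. This is an illustrative heuristic only: keyword overlap against the last few user messages stands in for the classifier's semantic judgment, and `bundlesToOffer` is a hypothetical name:

```typescript
// Illustrative sketch: decide which parked goal bundles to offer for
// promotion based on the most recent user messages. Keyword overlap is
// a stand-in heuristic; the plan leaves the real trigger to the model.

interface ParkedBundle {
  id: string;
  recallCondition: string; // e.g. "Recall if returning to join enrichment"
  status: "active" | "parked" | "completed";
}

function bundlesToOffer(bundles: ParkedBundle[], recentUserMessages: string[]): string[] {
  // Only the last few user messages are scanned, so recent explicit
  // decisions outweigh goals carried forward in older summaries.
  const recentText = recentUserMessages.slice(-3).join(" ").toLowerCase();
  return bundles
    .filter(b => b.status === "parked")
    .filter(b => {
      const keywords = b.recallCondition
        .toLowerCase()
        .split(/[^a-z0-9-]+/)
        .filter(w => w.length > 4 && w !== "recall");
      return keywords.some(w => recentText.includes(w));
    })
    .map(b => b.id);
}
```

Scanning only the recent tail is the point: a bundle parked fifteen compactions ago can only revive because the user just mentioned it, never because an old summary still does.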
+ +### Full MRC prompt budget + +With all sections rendered (MVS + KEEP chunks + REF index + recall note), a realistic Tier 1 prompt: + +| Section | Typical size | +|---|---| +| MVS paragraph | ~100-200 chars | +| KEEP chunks rendered | ~800-1,500 chars | +| REF index (actionable one-liners) | ~150-300 chars | +| Recall note | ~130 chars | +| **Total MRC summary** | **~1,200-2,100 chars (~300-525 tokens)** | + +Plus system prompt, tool definitions, project instructions, and raw tail for a full prompt of ~1,500-2,000 tokens, versus Pi's 10,000-12,000 token equivalent.The model never invents paths, commits, or identifiers — it only picks from real ones. + +--- + +## Implementation phases + +### Phase 1: Benchmark scaffold +1. Add `src/core/chunk-model.ts` — chunk types, stable ID generation, extraction from NormalizedBlock[]. +2. Add `bench/compaction/model-reference-selector.ts` — compactor entry that: + - Chunks fresh messages. + - Calls a mock model (heuristic: keep chunks containing known needles). + - Orders KEEP chunks. + - Stitches Tier 1 output. + - Writes/reads Tier 2 index to a temp file or in-memory store. +3. Add synthetic benchmark cases that exercise: + - KEEP vs REF vs DROP classification correctness. + - Promotion/demotion across compactions. + - Cache-prefix stability across repeated compactions. + - Tier 2 retrieval (missing context rescued by REF index). +4. Register `model-reference-selector` as a compactor in `bench/compaction/offline-runner.ts`. +5. Run head-to-head against `pi-vcc` on synthetic and real sessions. + +### Phase 2: Real model integration +1. Design the model prompt for classification — minimal, structured, expects parseable output. +2. Build a real model call path (configurable provider, e.g., Anthropic Messages API). +3. Add output parsing that recovers KEEP/REF/DROP/MVS from model response. +4. Add error handling for malformed model output. +5. Add optional cost/latency tracking per compaction. +6. 
Compare real model results vs mock model results on synthetic benchmarks. +7. Test with cheaper model variants (Haiku, Flash) to find the cheapest sufficient classifier. + +### Phase 3: Retrieval loop +1. Implement Tier 2 index read-before-compaction. +2. Model prompt includes REF index entries as candidate promotion targets. +3. Model can promote REF → KEEP or keep REF → REF or drop REF → DROP. +4. Algorithm rebuilds KEEP order after promotions. +5. Add benchmark case: context recovered after simulated memory loss. + +### Phase 4: Cache ordering optimization +1. Implement the ordering algorithm proper: + - Dependency-aware topological sort. + - Stability-weighted positioning. + - Type-priority ordering. +2. Add cache-stability assertions to benchmark: + - `firstChangedPromptLayer` check. + - `stablePrefixTokens` threshold. + - `fullPromptLcpTokenRatioWithPrevious`. +3. Compare ordering quality against pure `pi-vcc` ordering. + +### Phase 5: Live Pi integration (deferred) +1. Wire as a pi-vcc compactor variant behind a config flag. +2. Use real provider credentials. +3. Measure real cache-hit ratios via provider-reported usage. +4. Tune thresholds and ordering parameters on real sessions. +5. Add `/pi-vcc-report` integration for the model-reference compactor's reports. + +--- + +## Evaluation + +### Correctness +- Can the agent continue correctly after model-reference compaction? +- Does the MVS capture enough state for continuity? +- Can promoted REF chunks restore missing context? + +### Cache stability +- `firstChangedPromptLayer` — which layer changes first across compactions? +- `stablePrefixTokens` — how many tokens before the first change? +- `fullPromptLcpTokenRatioWithPrevious` — how much of the prompt is cache-hit? + +### Cost +- Output tokens per compaction. +- Cache-write tokens per compaction. +- Total input + output cost per compaction cycle. +- Comparison against pi-vcc (zero model cost) and Anthropic compaction (full model cost). 
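The cache-stability metrics reduce to longest-common-prefix computations over token streams. A minimal sketch, with function names mirroring the metric names above and whitespace tokenization standing in for the provider tokenizer the real benchmark would use:

```typescript
// Prefix-stability metrics over token arrays. Whitespace splitting is a
// stand-in for a real tokenizer; the benchmark uses provider-accurate
// token counts.

function lcpTokens(a: string[], b: string[]): number {
  // Length of the shared leading run of identical tokens.
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

function stablePrefixTokens(prevPrompt: string, currPrompt: string): number {
  return lcpTokens(prevPrompt.split(/\s+/), currPrompt.split(/\s+/));
}

function lcpTokenRatioWithPrevious(prevPrompt: string, currPrompt: string): number {
  // Fraction of the current prompt that is a cache-hit against the
  // previous prompt's prefix.
  const curr = currPrompt.split(/\s+/);
  return lcpTokens(prevPrompt.split(/\s+/), curr) / curr.length;
}
```

A compactor that only rewrites a late volatile section scores near 1.0 on the ratio; one that rewrites an early summary scores near 0 even if the total token count barely changed.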
+ +### Retrieval effectiveness +- Does the model promote REF chunks when conversation returns to a topic? +- Does the REF index actually help recovery vs having nothing? +- False positive/negative rates on REF → KEEP promotions. + +### Comparison against pi-vcc +Run `scripts/compare-compaction-refs.mjs` with `--compactors pi-vcc,model-reference-selector` on: +- Synthetic benchmark cases. +- Real session replay (10-20 sessions, 3 cycles each). +- Cache-stability metrics. +- Correctness assertions. + +--- + +## Risks + +| Risk | Mitigation | +|---|---| +| Model output unparseable | Strict output format, fallback to pi-vcc on parse failure | +| Model too expensive for classification | Start with cheapest model (Haiku); mock model for benchmarking | +| Chunk granularity wrong | Benchmark multiple chunking strategies; start with section-item granularity | +| KEEP set too large (over-budget) | Algorithmic cap: keep top-N by stability score, overflow to REF | +| REF index grows unbounded | Cap by time or count; drop oldest/lowest-promotion-rate entries | +| Cache ordering breaks dependencies | Topological sort as first pass; only stability-weight within dependency groups | +| Provider availability | Mock model enables full benchmarking without provider dependency | + +--- + +## Decision heuristics + +### Favor model-reference over pure algorithmic when +- Semantic importance of content matters more than heuristics capture. +- Hallucination risk from model-written summaries is unacceptable. +- Cheap model API calls are available (Haiku, Flash, local). +- Cache-prefix stability is a primary cost concern. + +### Favor pi-vcc (pure algorithmic) over model-reference when +- Cost or latency of any model call is unacceptable. +- Heuristic extraction is good enough for the domain. +- Provider is unavailable or unreliable. +- Real-time compaction latency must be near-zero. 
+ +### Favor Anthropic compaction over model-reference when +- Provider already offers compaction as a first-party feature. +- You trust the provider's summary quality. +- Integration simplicity matters more than cost optimization. + +--- + +## Status +Benchmark scaffold built and committed. Real DeepSeek Flash classifier tested on a 14K-message production session (promshim-ch, 80 compactions). Key findings: + +- Model-reference (DeepSeek Flash) produces a 1,958-char active prompt vs Pi's 41,659-char summary — **21× smaller**. +- Real classifier correctly identifies current goal (PR #14) while Pi's summary preserves a stale goal from 15 compactions ago. +- Cost: ~$0.001 per classification vs $0.18 for Pi's LLM summary — **180× cheaper**. +- Actionable REF summaries and goal-bundle parking designed but not yet implemented. +- Full prompt with system/tools/project/raw-tail: MRC ~1,789 tokens vs Pi ~11,714 tokens — **6.5× smaller**. + +Next: implement actionable REF summaries, goal-bundle parking, and recent-user-message weighting in the classifier prompt. Then re-test on the same session. + +## Sources +- `AGENTS.md` — pi-vcc project north star and design principles. +- `.pi/plans/cache-aware-compaction.md` — original cache-aware compaction plan. +- `bench/compaction/README.md` — existing benchmark harness design. +- Anthropic compaction docs — https://platform.claude.com/docs/en/build-with-claude/compaction +- Anthropic effective context engineering — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents +- AWS Bedrock AgentCore compaction — https://towardsai.net/p/machine-learning/long-context-compaction-for-ai-agents-part-2-implementation-and-evaluation +- ContextPilot (arxiv 2511.03475v3) — context reuse via block ordering and deduplication for KV-cache. +- MemGPT/Letta — tiered memory architecture with model-managed memory blocks. 
+- OpenCode compaction epic — https://github.com/sst/opencode/issues/4102 +- Victor Dibia context engineering — https://newsletter.victordibia.com/p/context-engineering-101-how-agents +- `src/core/classifier.ts` — realClassify() via OpenAI-compatible API +- `bench/compaction/model-reference-selector.ts` — compactor with env-var-driven real/mock classifier +- `src/core/dump-context.ts` — session context extraction for classifier input +- DeepSeek Flash real-session test — promshim-ch session, 74 chunks classified in 5.1s, ~$0.001 diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..0314a1b --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,107 @@ +# AGENTS.md + +## Project North Star + +`pi-vcc` is an algorithmic conversation compactor for Pi. Its goal is not merely to make summaries shorter; it is to maximize expected continuation value after compaction. + +Optimize compaction across these objectives: + +1. **Recall fidelity** — important goals, constraints, files, identifiers, evidence handles, decisions, blockers, and next actions remain available either in active context or recall. +2. **Semantic coherence** — the compacted state should let the agent understand what is happening, why it matters, and what to do next. +3. **Post-compaction working room** — active prompt state should stay compact enough to leave useful room for future work. +4. **Retrieval dependence** — bulky or older detail may move out of active context only when it remains recoverable through transcript, recall, files, or artifacts. +5. **Cache preservation** — stable prompt prefixes should remain byte/token stable across ordinary compactions; volatile updates should be isolated into late recent/volatile sections. + +A shorter summary is not better if it loses continuity, exact identifiers, recoverability, or cache reuse. + +## Compaction Design Principles + +- Prefer stable structured state over full-summary rewrites. +- Keep durable facts before volatile facts. 
+- Keep volatile updates in explicit recent/volatile sections. +- Preserve exact paths, identifiers, error signatures, request IDs, span/probe IDs, and commit references when they are relevant evidence. +- Offload bulky re-fetchable details to recall/history with pointers rather than active prompt bodies. +- Separate current truth from historical transcript. Stale or corrected facts may remain recallable, but must not remain current guidance. +- Treat prompt-cache churn as a first-class performance and cost concern. + +## Current Cache-Aware Layout + +Stable/current sections should remain as stable as possible: + +```text +Session Goal +Files And Changes +Commits +Evidence Handles +User Preferences +Current Scope +``` + +Recent/volatile sections may change more often and should stay bounded: + +```text +Recent Commits +Recent Scope Updates +Recent User Preferences +Recent Evidence Handles +Outstanding Context +Brief Transcript +Kept Raw Tail +``` + +Do not move volatile content back into stable sections without benchmark-backed evidence. + +## Benchmarking Expectations + +Use the Docker benchmark path as the primary validation route: + +```bash +docker build -t pi-vcc-bench . 
+docker run --rm pi-vcc-bench --compactors pi-vcc --assert +docker run --rm pi-vcc-bench --compactors pi-vcc --assert-cache +``` + +For original-vs-current comparisons: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --out /tmp/pi-vcc-compaction-compare +``` + +For real-session cache behavior: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 5 \ + --show-layer-diff \ + --out /tmp/pi-vcc-real-compare +``` + +## Interpreting Results + +Good changes should generally: + +- preserve or improve correctness assertions +- preserve or improve cache-boundary assertions +- move `firstChangedPromptLayer` later, not earlier +- increase stable-prefix tokens for repeated compactions +- avoid growing full prompt tokens unless the added state is justified +- keep recent/volatile sections bounded + +If a change improves one metric while hurting another, judge it by expected continuation value, not by any single metric alone. + +## Development Guidance + +- Add a focused RED probe before or alongside compaction behavior changes. +- Keep synthetic probes for exact correctness and cache-boundary behavior. +- Use real-session replay to find outliers and avoid overfitting synthetic cases. +- Prefer small semantic commits that can be reviewed and reverted independently. +- Do not claim cache improvements without fresh benchmark evidence. 
diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000..1490f34 --- /dev/null +++ b/Dockerfile @@ -0,0 +1,22 @@ +# syntax=docker/dockerfile:1 + +# renovate: datasource=docker depName=oven/bun versioning=semver +ARG BUN_VERSION=1.3.13 + +FROM oven/bun:${BUN_VERSION} AS source +WORKDIR /app + +COPY --link package.json README.md index.ts ./ +COPY --link src ./src +COPY --link bench ./bench +COPY --link scripts ./scripts + +FROM oven/bun:${BUN_VERSION} AS final +ENV NODE_ENV=production + +COPY --link --from=source --chown=1000:1000 /app /app +WORKDIR /app +USER bun + +ENTRYPOINT ["bun", "scripts/bench-compaction.ts"] +CMD ["--jsonl"] diff --git a/README.md b/README.md index 66c184a..5aad72b 100644 --- a/README.md +++ b/README.md @@ -1,214 +1,296 @@ -# pi-vcc +# pi-mrc -[![npm](https://img.shields.io/npm/v/@sting8k/pi-vcc)](https://www.npmjs.com/package/@sting8k/pi-vcc) +This is a fork of `@sting8k/pi-vcc`, currently installed from GitHub or a local clone. -Algorithmic conversation compactor for [Pi](https://github.com/badlogic/pi-mono). No LLM calls — produces a brief transcript via extraction and formatting. +`pi-mrc` is a Model-Reference Compactor for [Pi](https://github.com/badlogic/pi-mono). It compacts conversation history into a small continuation state, stashes recoverable detail behind exact handles, and appends only the latest needed lookup index at the end of the model context. -Inspired by [VCC](https://github.com/lllyasviel/VCC) **(View-oriented Conversation Compiler)**. +The goal is not fuzzy transcript search or the shortest possible summary. 
The goal is: **after compaction, the next agent should know what to do, have room to work, and recover exact hidden context by handle when needed.** -## Demo +## What pi-mrc optimizes -![pi-vcc demo](./demo.gif) - -## Why pi-vcc - -| | Pi default | pi-vcc | -|---|---|---| -| **Method** | LLM-generated summary | Algorithmic extraction, no LLM | -| **Determinism** | Non-deterministic, can hallucinate | Same input = same output, always | -| **Token reduction** | Varies | 35-99% on real sessions (higher on longer sessions) | -| **Compaction latency** | Waits for LLM call | 30-470ms, no API calls | -| **History after compaction** | Gone — agent only sees summary | Active lineage searchable via `vcc_recall` (`scope:"all"` available) | -| **Repeated compactions** | Each rewrite risks losing more | Sections merge and accumulate | -| **Cost** | Burns tokens on summarization call | Zero — no API calls | -| **Structure** | Free-form prose | Brief transcript + 4 semantic sections | - -### Real session metrics - -Measured on real session JSONLs under `~/.pi/agent/sessions` (chars = rendered message text). 
- -| Session | Messages | Before | After | Reduction | Time | -|---|---|---|---|---|---| -| Session A | 2,943 | 997,162 | 7,959 | 99.2% | 64ms | -| Session B | 1,703 | 428,334 | 7,762 | 98.2% | 29ms | -| Session C | 1,657 | 424,183 | 9,577 | 97.7% | 54ms | -| Session D | 1,004 | 2,258,477 | 4,439 | 99.8% | 30ms | -| Session E | 486 | 295,006 | 11,163 | 96.2% | 30ms | -| Session F | 46 | 5,234 | 3,364 | 35.7% | 5ms | -| Session G | 27 | 8,595 | 2,489 | 71.0% | 2ms | - -## Features - -- **No LLM** — purely algorithmic, zero extra API cost -- **Brief transcript** — chronological conversation flow, each tool call collapsed to a one-liner with `(#N)` refs, text truncated to keep it compact -- **5 semantic sections** — session goal, files & changes, commits, outstanding context, user preferences -- **Bounded merge** — rolling sections re-capped after merge instead of growing unbounded -- **Lossless recall** — `vcc_recall` reads raw session JSONL, so active-lineage history stays searchable across compactions -- **Scoped recall** — default search is active lineage; use `scope:"all"` / `scope:all` to intentionally search across all lineages -- **Regex search** — `vcc_recall` supports regex patterns (`hook|inject`, `fail.*build`) and OR-ranked multi-word queries -- **Result ranking** — search results ranked by term relevance, rare terms weighted higher than common ones -- **`/pi-vcc-recall`** — slash command to search history directly, results shown as collapsible message and auto-fed to agent as context -- **Fallback cut** — still works when Pi core returns nothing to summarize -- **`/pi-vcc`** — manual compaction on demand +- **Continuation fidelity** — active goals, constraints, decisions, evidence handles, blockers, and next actions survive compaction. +- **Working room** — bulky old context is moved out of the active prompt. +- **Exact recoverability** — stashed details are resolved through `mrc_lookup`, not broad fuzzy search. 
+- **Cache stability** — stable guidance and KEEP chunks stay in the summary; volatile reference lists are appended as a late ephemeral suffix. +- **Source recoverability** — repository source is authoritative and rereadable, so source refs preserve locators instead of stale copied code bodies. ## Install +Install this fork directly from GitHub: + ```bash -pi install npm:@sting8k/pi-vcc +pi install https://github.com/BadLiveware/pi-model-reference-compactor ``` -Or from GitHub: +Or clone the fork and install/use the local checkout: ```bash -pi install https://github.com/sting8k/pi-vcc +git clone https://github.com/BadLiveware/pi-model-reference-compactor.git +cd pi-model-reference-compactor +pi install . ``` -Or try without installing: +For one-off local testing from the checkout: ```bash -pi -e https://github.com/sting8k/pi-vcc +pi -e . ``` -## Usage +## Quick use -Once installed, pi-vcc registers a `session_before_compact` hook. +Manual MRC compaction: -- Run `/pi-vcc` to trigger pi-vcc compaction manually. -- By default, `/compact` and auto-threshold compactions still go through pi core (LLM-based). Set `overrideDefaultCompaction: true` in the config to let pi-vcc handle all compaction paths. -- To search older active-lineage history after compaction, use `vcc_recall`. -- To intentionally search across all lineages, pass `scope:"all"` to `vcc_recall` or run `/pi-vcc-recall scope:all`. -- To search and feed results to agent yourself, run `/pi-vcc-recall [page:N]`. - - Tip: type `/recall` and Pi will autocomplete to `/pi-vcc-recall`. +```text +/pi-mrc +``` -### How compaction works +Disable automatic pi-mrc interception for this session: -Pi splits the conversation at the **last user message**. Everything after — the **kept tail** — stays intact and untouched. pi-vcc only summarizes the older portion before that cut point. 
+```text +/pi-mrc-off +``` -### Compacted message structure +Re-enable it: +```text +/pi-mrc-on ``` -[Session Goal] -- Fix the authentication bug in login flow -- [Scope change] -- Also update the session token refresh logic -[Files And Changes] -- Modified: src/auth/session.ts -- Created: tests/auth-refresh.test.ts +Inspect compaction reports: -[Commits] -- a1b2c3d: fix(auth): refresh token after password reset +```text +/pi-mrc-report show +/pi-mrc-report json inline +/pi-mrc-report list +``` -[Outstanding Context] -- lint check still failing on line 42 +Resolve an exact handle: -[User Preferences] -- Prefer Vietnamese responses -- Always run tests before committing +```text +mrc_lookup({ ref: "evidence:79dq9m" }) +mrc_lookup({ ref: "ref:read-context:df20oq" }) +``` -[user] -Fix the auth bug, users can't log in after password reset +List recent known handles: -[assistant] -Root cause is a missing token refresh after password reset... -* bash "bun test tests/auth.test.ts" (#12) -* edit "src/auth/session.ts" (#14) -* bash "bun test tests/auth.test.ts" (#16) -...(28 earlier lines omitted) +```text +mrc_lookup({ list: true, limit: 10 }) ``` -Sections appear only when relevant — a session with no git commits won't have `[Commits]`. +`mrc_lookup` is exact lookup over MRC references in the active lineage. It is intentionally not fuzzy transcript search. -**Sections:** +## How MRC compaction works -| Section | Description | -|---|---| -| `[Session Goal]` | Initial goal + scope changes (regex-based extraction) | -| `[Files And Changes]` | Modified/created files from tool calls (capped, paths trimmed to common root) | -| `[Commits]` | Git commits made during the session (last 8, hash + first line) | -| `[Outstanding Context]` | Unresolved items — errors, pending questions | -| `[User Preferences]` | Regex-extracted from user messages (`always`, `never`, `prefer`...) 
| -| Brief transcript | Chronological conversation flow — rolling window of ~120 recent lines, tool calls collapsed to one-liners with `(#N)` refs | +pi-mrc turns conversation state into referenceable chunks and classifies them into three tiers: -**Merge policy:** -- `Session Goal`, `User Preferences`: concise sticky sections -- `Outstanding Context`: fresh-only (replaced each compaction) -- `Files And Changes`, `Commits`: unique union across compactions -- Brief transcript: rolling window, older lines drop off +- **KEEP** — directly needed for the next read/edit/bash call. +- **REF** — useful later, but recoverable by handle. +- **DROP** — stale, duplicate, source-visible, or otherwise not worth preserving. -## Recall (Lossless History) +The compaction summary contains: -Pi's default compaction discards old messages permanently. After compaction, the agent only sees the summary. +1. a minimum viable summary (MVS), +2. selected KEEP chunks, +3. stable instructions for interpreting refs, +4. no dynamic full ref inventory. -`vcc_recall` bypasses this by reading the raw session JSONL file directly. By default it searches only the active conversation lineage, regardless of how many compactions have happened. Use `scope:"all"` only when you intentionally want to include off-lineage branches. +Dynamic refs are deliberately kept out of the summary. If the summary rewrote a changing list of refs on every compaction, it would churn early prompt context and reduce provider cache reuse. -### Search +## Context shape -Queries support **regex** and **multi-word OR logic** ranked by relevance: +During normal turns, pi-mrc stores full reference bodies in non-context session state and adds tiny handle anchors near the turn. After compaction, it advertises only refs that were stashed by the latest compaction and are not already visible. 
-``` -vcc_recall({ query: "auth token" }) // active-lineage OR search, ranked -vcc_recall({ query: "auth token", page: 2 }) // paginated (5 results/page) -vcc_recall({ query: "hook|inject" }) // regex pattern -vcc_recall({ query: "fail.*build" }) // regex pattern -vcc_recall({ query: "auth token", scope: "all" }) // search all lineages +Provider payload after a compaction looks like: + +```text +SYSTEM / tools / AGENTS.md / skills ++ +Compaction summary with MVS, KEEP chunks, and stable ref guidance ++ +Kept recent transcript tail ++ +User: Continue the implementation ++ +[MRC refs] +Internal latest-compaction stash. Prefer visible context; use mrc_lookup only if needed. Source refs are locators; reread files for code. Do not expose handles unless asked. +- ref:evidence:79dq9m — lookup if evidence details are needed: Error signatures: ERR_FOO_123 +- ref:read-context:df20oq — lookup if recent read-file locator is needed: Source locator: src/core/foo.ts; symbols: buildFoo, parseFoo; reread the repo... ``` -Manual slash command: +Before compaction, tiny anchors may appear near prior turns: -``` -/pi-vcc-recall auth token scope:all +```text +Assistant: I patched src/core/foo.ts and reran the focused test. +[MRC anchors: ref:evidence:79dq9m ref:read-context:df20oq] ``` -### Browse +Those anchors are intentionally small. They let a future compaction preserve lookup continuity without copying large hidden bodies into prompt text. -Without a query, returns the last 25 entries as brief summaries: +## Reference lifecycle +| Piece | Persisted? | Sent to model? | Purpose | +| --- | --- | --- | --- | +| Hidden ref state | Yes, non-context custom entries | No | Stores exact bodies for `mrc_lookup`. | +| `[MRC anchors: ...]` | Yes, tiny custom messages | Yes, near the turn | Gives compaction handle breadcrumbs. | +| Compaction stash | Yes, in compaction details | No direct prompt body | Records refs cut away by the latest compaction. 
| +| `[MRC refs]` suffix | No, rebuilt per model call | Yes, always last | Advertises latest-compaction stashed refs only. | + +Design decisions: + +- **Exact handles beat fuzzy search.** The model should recover known stashed facts by handle, not search the whole transcript. +- **Anchors are not user-facing.** The model is told not to mention or expose handles unless explicitly asked about compaction internals. +- **A handle is not evidence.** The model should call `mrc_lookup` before relying on hidden contents. +- **The suffix is ephemeral.** It is appended after the current user message so earlier context remains cacheable. + +## Source recoverability + +Repository source can be reread and may change. pi-mrc therefore stores source refs as locators, not copied source bodies. + +Example hidden body for a read-file ref: + +```text +Source locator: src/core/foo.ts; symbols: veryImportantHandler, helper; reread the repository file for authoritative source. ``` -vcc_recall() -vcc_recall({ scope: "all" }) // browse recent entries across all lineages -``` -### Expand +This preserves the route back to the source without making stale snippets look authoritative. + +pi-mrc keeps full hidden bodies for context that is not cheaply recoverable from files: -Returns full untruncated content for specific indices found via search: +- exact error output, +- benchmark results, +- request IDs, span IDs, trace IDs, and probe IDs, +- user decisions and constraints, +- deleted or dirty edits not present in current files, +- non-obvious investigation conclusions. +## `mrc_lookup` + +`mrc_lookup` resolves exact handles from hidden ref state and latest compaction stash details. 
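An exact-match resolver can be as small as a map lookup. This sketch assumes an in-memory index keyed without the `ref:` prefix; the type and function names are hypothetical:

```typescript
// Hypothetical in-memory index; real handles resolve from hidden session state.
type RefBody = { kind: string; summary: string; body: string };

function lookupRef(index: Map<string, RefBody>, ref: string): RefBody | undefined {
  // Accept both "evidence:79dq9m" and "ref:evidence:79dq9m".
  const key = ref.startsWith("ref:") ? ref.slice("ref:".length) : ref;
  return index.get(key); // exact match only; no fuzzy fallback
}
```

Misses return `undefined` rather than a best-effort match, which is what makes handles trustworthy as continuity metadata.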
+ +Lookup by handle: + +```text +mrc_lookup({ ref: "evidence:79dq9m" }) ``` -vcc_recall({ expand: [41, 42] }) // active-lineage expand -vcc_recall({ expand: [41, 42], scope: "all" }) // expand across all lineages + +Example result: + +```text +## ref:evidence:79dq9m +kind: evidence +source: compaction +entry: 42 @ 2026-05-10T12:34:56.000Z +summary: lookup if evidence details are needed: Error signatures: ERR_FOO_123 + +Error signatures: ERR_FOO_123 ``` -Typical workflow: **search → find relevant entry indices → expand those indices for full content**. +List recent refs: -> Some tool results are truncated by Pi core at save time. `expand` returns everything in the JSONL but can't recover what Pi already cut. +```text +mrc_lookup({ list: true, limit: 10 }) +``` -## Pipeline +No fuzzy query mode is provided. If broad transcript search is wanted later, it should be a separate tool with a separate name and policy. -1. **Normalize** — raw Pi messages → uniform blocks (user, assistant, tool_call, tool_result, thinking) -2. **Filter noise** — strip system messages, empty blocks -3. **Build sections** — extract goal, file paths, blockers, preferences -4. **Brief transcript** — chronological conversation flow, tool calls collapsed to one-liners, text truncated -5. **Format** — render into bracketed sections + transcript -6. **Merge** — if previous summary exists: sticky sections merge, volatile sections replace, transcript rolls +## Commands and tools -## Config +| Name | Kind | Description | +| --- | --- | --- | +| `/pi-mrc` | command | Run MRC compaction manually. | +| `/pi-mrc-off` | command | Disable pi-mrc interception for this session. | +| `/pi-mrc-on` | command | Re-enable pi-mrc interception for this session. | +| `/pi-mrc-report` | command | Show or write latest compaction report artifacts. | +| `/pi-mrc-dump-context` | command | Debug current real context buffer or extracted session context. 
| +| `mrc_lookup` | tool | Resolve exact MRC `ref:*` handles and hidden bodies. | -Config lives at `~/.pi/agent/pi-vcc-config.json` (auto-scaffolded on first load with safe defaults): +## Configuration + +Config lives at `~/.pi/agent/pi-mrc-config.json` and is scaffolded on first load: ```json { - "overrideDefaultCompaction": false, + "overrideDefaultCompaction": true, "debug": false } ``` -- **`overrideDefaultCompaction`** *(default `false`)*: when `false`, pi-vcc only runs for `/pi-vcc`; `/compact` and auto-threshold compactions fall through to pi core. Set `true` to make pi-vcc handle all compaction paths. -- **`debug`** *(default `false`)*: when `true`, each compaction writes detailed info to `/tmp/pi-vcc-debug.json` — message counts, cut boundary, summary preview, sections. +| Key | Default | Meaning | +| --- | --- | --- | +| `overrideDefaultCompaction` | `true` | When true, pi-mrc handles `/compact`, auto-threshold, overflow retry compactions, and `/pi-mrc`. When false, only `/pi-mrc` is intercepted. | +| `debug` | `false` | Write `/tmp/pi-mrc-debug.json` after compaction with cut boundary, counts, summary preview, and stash stats. | + +## Compaction reports + +After pi-mrc compacts, it emits a report card with: + +- source and kept message counts, +- skipped internal message counts, +- summary size and total MRC compaction timing, +- compaction details containing the hidden `modelReferenceIndex` stash. + +Artifacts are written under `/tmp/pi-mrc-reports`. + +## Benchmarking and validation + +Build the benchmark image: + +```bash +docker build -t pi-mrc-bench . 
+``` + +Run MRC assertion gates: + +```bash +docker run --rm pi-mrc-bench --compactors model-reference-selector --assert +``` + +The old structured compactor remains in the benchmark harness as an internal baseline, not the public product surface: + +```bash +docker run --rm pi-mrc-bench --compactors pi-vcc --assert +docker run --rm pi-mrc-bench --compactors pi-vcc --assert-cache +``` + +Compare revisions: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --out /tmp/pi-mrc-compaction-compare +``` + +Real-session replay: + +```bash +docker run --rm \ + -v ~/.pi/agent/sessions:/sessions:ro \ + pi-mrc-bench \ + --real-only \ + --real-sessions-dir /sessions \ + --real-limit 5 \ + --compactors pi-vcc \ + --jsonl +``` + +Recent validation for the MRC path passed: + +- `model-reference-selector --assert`, +- focused smokes for anchors, latest-compaction stash, no-precompaction refs, guidance, exact lookup, and source-locator refs, +- legacy structured `pi-vcc --assert` and `pi-vcc --assert-cache` while that baseline remains in the harness. + +`53dc551` is the pre-MRC structured baseline used for repo-local comparisons. Pi's built-in compactor is not exported as a callable API, so this benchmark does not directly compare against Pi internal compaction. -## Related Work +## Design principles -- [VCC](https://github.com/lllyasviel/VCC) — the original transcript-preserving conversation compiler -- [Pi](https://github.com/badlogic/pi-mono) — the AI coding agent this extension is built for +- **MRC + exact lookup is the product.** Fuzzy recall is intentionally out of scope. +- **Keep dynamic refs late.** The latest ref index is an ephemeral postfix, not summary text. +- **Keep handles internal.** Refs are agent continuity metadata, not user-facing prose. +- **Reread source.** File/symbol locators are safer than copied code snippets. 
+- **Preserve unrecoverable facts.** Exact errors, constraints, benchmark results, and user decisions must remain in prompt or lookup. +- **Validate cache behavior.** Use Docker gates and real-session replay before claiming continuation or cache wins. ## License diff --git a/bench/compaction/README.md b/bench/compaction/README.md new file mode 100644 index 0000000..41c084a --- /dev/null +++ b/bench/compaction/README.md @@ -0,0 +1,299 @@ +# Compaction Benchmark + +This benchmark evaluates conversation compaction as a continuation system, not only as a compression routine. It focuses on whether a compacted agent state preserves recoverable work while keeping cacheable prompt prefixes stable. + +The design borrows the pressure-test loop used for skill validation: first make the current behavior fail in a controlled scenario, then implement the smallest compaction change that fixes the observed failure, and rerun the same scenario plus nearby variants. + +## Evaluation loop + +Use the benchmark as a RED-GREEN-REFACTOR loop for compaction behavior: + +1. **RED**: run the current compactor and record exact failures such as missing identifiers, stale current facts, bulky active text, or unstable early layers. +2. **GREEN**: add the smallest targeted compaction change that fixes the observed failure. +3. **REFACTOR**: pressure-test adjacent cases so the fix does not only satisfy one string probe. +4. **ITERATE**: keep the failing scenario in the benchmark and repeat until the desired compactor passes or the intended semantics need to change. + +Do not implement broad cache-aware layering only from design intuition. Add or keep a failing probe for each behavior the implementation is meant to improve. + +## Compactors under comparison + +The runner uses a common offline interface: + +- `pi-vcc`: current deterministic `compile()` output. +- `full-rewrite-checkpoint`: deterministic stand-in for a regenerated structured summary plus transcript, without external recall. 
+- `cache-aware-layered`: deterministic layered prototype that separates stable schema, durable memory, structured checkpoint, rolling transcript, raw tail, and recall pointers. + +LLM-backed compactors can be added behind the same interface. Live model calls should be kept separate from the default offline run so local validation remains cheap and deterministic. + +## Benchmark levels + +The current harness covers the first level and some cache-churn signals. Later levels should be added before using benchmark results to claim end-to-end agent quality. + +1. **Offline state probes** + - exact active terms + - current-state terms + - recall-only terms + - forbidden current-state terms + - terms that must stay out of active prompt text + - layer churn and longest common prefix + +2. **Micro-continuation probes** + - compacted context plus a tiny disposable fixture + - agent gets a one-to-three action budget + - pass/fail by expected command, file, or decision + +3. **Hermetic Pi replay** + - isolated `PI_CODING_AGENT_DIR` + - actual compaction hook and session context construction + - optional default-model and small-model continuation probes + +4. 
**Live provider cache probes** + - provider-reported cached and uncached tokens + - latency to first token and total latency + - effective input cost over the next few turns + +## Scenario shape + +Each synthetic case contains: + +- an ordered message transcript +- one or more compaction points to replay repeated compactions +- exact terms that should remain somewhere in active prompt state +- exact terms that should be in current-state layers, not only historical transcript or raw tail +- exact terms that may be absent from active state but must be recoverable from recall +- terms that must not appear in current-state layers after corrections or branch-sensitive updates +- terms that must stay out of active prompt text because recall should carry them +- continuation terms that indicate the agent can resume the next action + +Real Pi sessions can be added later as fixtures or sampled from local session JSONL files, but synthetic cases provide gold expectations for regressions. + +## Scoped assertions + +The runner distinguishes scopes so historical fidelity is not confused with current state: + +- `activeTerms`: must appear anywhere in the active compacted prompt. +- `currentTerms`: must appear in current-state layers. +- `recallTerms`: must be recoverable from recall corpus search. +- `forbiddenTerms`: must not appear anywhere in the active compacted prompt. +- `forbiddenCurrentTerms`: must not appear in current-state layers, but may exist in historical transcript/tail or recall corpus. +- `activeAbsentTerms`: must not appear in active prompt text; they are expected to live in recall only. + +This matters for corrections. For example, an old preference may remain in historical transcript, but it must not remain in durable memory or the current checkpoint after a user correction. 
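As a sketch, scope checking is plain substring probing over two text surfaces. The case shape below is trimmed to two of the six scopes, and the field names mirror the list above:

```typescript
// Trimmed illustration: only two of the six assertion scopes are modeled.
interface ScopedCase {
  activeTerms: string[];           // must appear in the active compacted prompt
  forbiddenCurrentTerms: string[]; // must not appear in current-state layers
}

function probeFailures(c: ScopedCase, activeText: string, currentText: string): string[] {
  const has = (haystack: string, term: string) =>
    haystack.toLowerCase().includes(term.toLowerCase());
  const failures: string[] = [];
  for (const term of c.activeTerms) {
    if (!has(activeText, term)) failures.push(`missing active term: ${term}`);
  }
  for (const term of c.forbiddenCurrentTerms) {
    if (has(currentText, term)) failures.push(`stale current-state term: ${term}`);
  }
  return failures;
}
```

A corrected preference fails only the current-state check: it may still appear in the historical transcript surface without penalty.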
+ +## Metrics + +Each compaction cycle records: + +- active state size in characters and approximate tokens +- current-state size in characters and approximate tokens +- compaction latency +- longest common prefix with the previous compacted prompt +- first changed layer and changed layer names when a compactor exposes layers +- active exact-term recall against gold terms +- current-state exact-term recall against gold terms +- forbidden active and current-state leakage +- active leakage of terms expected to be recall-only +- recall top-k recovery for externalized terms +- continuation-term recovery + +The cache-oriented metrics are offline approximations. They do not replace provider-reported cached-token accounting, but they highlight prompt churn that is likely to hurt prefix-based caching. + +## Full-prompt cache simulation + +Each cycle also builds a simulated provider prompt so cache churn can be measured outside the compacted summary alone. The simulated prompt contains stable provider/tool/project layers, the compactor's rendered layers, and a small kept raw tail. For `pi-vcc`, current summary sections are split into separate simulated prompt layers so the report can identify which section changes first. This does not exactly reproduce Pi's production request, but it catches the main prefix-cache risk: a volatile update moving earlier than necessary. + +Additional cache fields include: + +- `fullPromptChars` and `fullPromptTokensEst` +- `fullPromptLcpTokensWithPrevious` +- `fullPromptLcpTokenRatioWithPrevious` +- `firstChangedPromptLayer` +- `changedPromptLayers` +- `stablePrefixTokens` +- `promptLayerSizes` +- `promptLayerTokenDeltas` + +Use these fields to compare section ordering and stable/volatile splits before adding live provider probes. A better cache-aware layout should generally increase `stablePrefixTokens`, push `firstChangedPromptLayer` later, and keep volatile deltas out of static/current prefix layers when the underlying facts did not change. 
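A token-level longest common prefix is the core of these offline cache approximations. This sketch uses naive whitespace tokenization, which is an assumption rather than the harness's real tokenizer:

```typescript
// Naive whitespace tokens; the real harness may tokenize differently.
const roughTokens = (text: string): string[] => text.match(/\S+/g) ?? [];

// Count how many leading tokens two simulated prompts share.
function promptLcpTokens(previousPrompt: string, nextPrompt: string): number {
  const a = roughTokens(previousPrompt);
  const b = roughTokens(nextPrompt);
  const limit = Math.min(a.length, b.length);
  let shared = 0;
  while (shared < limit && a[shared] === b[shared]) shared += 1;
  return shared;
}
```

Dividing the shared count by the previous prompt's token count gives the ratio field; the first layer whose text differs gives `firstChangedPromptLayer`.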
+ +## Running + +Run all offline compactors: + +```bash +bun scripts/bench-compaction.ts +``` + +Emit one JSON record per compaction cycle: + +```bash +bun scripts/bench-compaction.ts --jsonl > bench-results.jsonl +``` + +Limit the comparison to selected compactors: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc,cache-aware-layered +``` + +Run assertion mode. This exits non-zero if any selected compactor misses active/current/recall/continuation expectations or leaks forbidden/offloaded terms: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc --assert +``` + +Run cache assertion mode for synthetic cache-stability probes. This is separate from correctness assertions and checks that each cache probe first changes only at its intended recent/volatile boundary, with a minimum stable-prefix token floor: + +```bash +bun scripts/bench-compaction.ts --compactors pi-vcc --assert-cache +``` + +The current cache-boundary probes are: + +- `cache-bust-volatile-next-step`: first change should be `Pi VCC Outstanding Context` or later. +- `cache-bust-evidence-growth`: first change should be `Pi VCC Recent Evidence Handles` or later. +- `cache-bust-scope-growth`: first change should be `Pi VCC Recent Scope Updates` or later. +- `cache-bust-mutable-tail-growth`: first change should be in a recent/volatile layer and recent layer sizes must stay under their caps. +- `cache-bust-commit-growth`: new commits should first change `Pi VCC Recent Commits`, not the stable `Pi VCC Commits` section. +- `cache-bust-long-evidence-line`: long fresh evidence should first change `Pi VCC Recent Evidence Handles` while keeping that layer under its size cap. +- `cache-bust-long-scope-line`: verbose fresh scope should first change `Pi VCC Recent Scope Updates` while keeping that layer under its size cap. +- `cache-bust-long-preference-line`: verbose fresh preferences should first change `Pi VCC Recent User Preferences` while keeping that layer under its size cap. 
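The boundary checks above can be expressed directly against the probe configuration. This sketch mirrors the shape of `bench/compaction/cache-boundaries.json` entries; the evaluation logic is an illustrative assumption, not the runner's exact code:

```typescript
// Shape mirrors bench/compaction/cache-boundaries.json entries.
interface CacheBoundary {
  allowedFirstChangedLayers: string[];
  minStablePrefixTokens: number;
}

// A cycle passes if nothing changed, or the first change landed on an
// allowed (recent/volatile) layer, and the stable prefix met its floor.
function cacheProbePasses(
  boundary: CacheBoundary,
  firstChangedPromptLayer: string | null,
  stablePrefixTokens: number,
): boolean {
  const boundaryOk =
    firstChangedPromptLayer === null ||
    boundary.allowedFirstChangedLayers.includes(firstChangedPromptLayer);
  return boundaryOk && stablePrefixTokens >= boundary.minStablePrefixTokens;
}
```

Listing every acceptable layer from the intended boundary onward is what encodes "or later" in the JSON without a layer-ordering lookup.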
+ +Append sampled real Pi sessions from a local session directory. Real-session cases have no gold state assertions; they are useful for size, latency, growth, and cache-churn signals: + +```bash +bun scripts/bench-compaction.ts \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 2 \ + --compactors pi-vcc +``` + +Run only sampled real sessions: + +```bash +bun scripts/bench-compaction.ts \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 2 \ + --compactors pi-vcc \ + --jsonl +``` + +Filter cases and include concise layer diffs when investigating cache churn: + +```bash +bun scripts/bench-compaction.ts \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --case-filter ch-observability \ + --compactors pi-vcc \ + --show-layer-diff \ + --jsonl +``` + +Include pi-vcc's machine-readable compaction report in each JSON/JSONL cycle when you need section policies, stable/recent churn, caps, and warnings: + +```bash +bun scripts/bench-compaction.ts \ + --compactors pi-vcc \ + --case-filter cache-bust-scope-growth \ + --include-report \ + --jsonl +``` + +Print a human-readable report explanation instead of JSON: + +```bash +bun scripts/bench-compaction.ts \ + --compactors pi-vcc \ + --case-filter cache-bust-scope-growth \ + --explain +``` + +Run the same checks in Docker: + +```bash +docker build -t pi-vcc-bench . +docker run --rm pi-vcc-bench +docker run --rm pi-vcc-bench --compactors pi-vcc --assert +docker run --rm \ + -v ~/.pi/agent/sessions:/sessions:ro \ + pi-vcc-bench \ + --real-only \ + --real-sessions-dir /sessions \ + --real-limit 2 \ + --compactors pi-vcc \ + --jsonl +``` + +Assertion failures are expected for current baselines while the RED scenarios are documenting known gaps. Use selected compactors when checking one implementation at a time. + +## Comparing refs + +Use the ref comparison runner when you need an original-vs-implementation benchmark instead of a single working-tree run. 
It creates isolated git worktrees, builds each ref as its own Docker image, runs the same benchmark command in both images, and writes paired JSONL plus a Markdown delta report. + +A practical runnable baseline is `53dc551`, the cache-stability assertion checkpoint before the later production layout/extraction refinements. Compare it with the current checkout: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --out /tmp/pi-vcc-compaction-compare +``` + +Older refs can be useful historically, but they must contain a runnable version of the benchmark harness and its source dependencies. + +Include sampled real sessions with the same Docker-only benchmark path: + +```bash +node scripts/compare-compaction-refs.mjs \ + --baseline 53dc551 \ + --head HEAD \ + --compactors pi-vcc \ + --real-only \ + --real-sessions-dir ~/.pi/agent/sessions \ + --real-limit 1 \ + --show-layer-diff \ + --out /tmp/pi-vcc-compaction-compare-real +``` + +The output directory contains: + +- `baseline.jsonl`: per-cycle metrics for the baseline ref +- `head.jsonl`: per-cycle metrics for the implementation ref +- `comparison.md`: aggregate deltas and notable changed cycles +- `baseline.stderr.log` / `head.stderr.log`: benchmark diagnostics from each Docker run + +For cache-aware compaction, the most useful report signals are: + +- increased mean stable-prefix tokens +- later `firstChangedPromptLayer` in matched cycles +- fewer cache failure cycles +- no increase in correctness failure cycles +- lower or justified full-prompt token counts + +## Interpreting results + +A useful compactor should: + +- preserve exact identifiers, file paths, evidence handles, constraints, blockers, and next actions +- keep current state separate from historical transcript and raw tail +- avoid retaining corrected stale facts in current-state layers +- keep stable layers byte-identical across ordinary compactions +- move bulky re-fetchable details 
behind recall pointers without losing top-k recoverability +- reduce active prompt size without shifting too much cost into uncached post-compaction turns + +Shorter output is not sufficient if continuation or recall probes fail. + +## Future live-provider extension + +A live cache probe should replay the same compacted prompts against providers that report cache usage and capture: + +- cached input tokens +- uncached input tokens +- cache-write tokens +- latency to first token +- total request latency +- effective input cost over the next few turns + +That extension should be opt-in because it depends on credentials, provider-specific cache semantics, and billable requests. diff --git a/bench/compaction/cache-boundaries.json b/bench/compaction/cache-boundaries.json new file mode 100644 index 0000000..1192251 --- /dev/null +++ b/bench/compaction/cache-boundaries.json @@ -0,0 +1,86 @@ +{ + "cache-bust-volatile-next-step": { + "allowedFirstChangedLayers": [ + "Pi MRC Outstanding Context", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 90 + }, + "cache-bust-evidence-growth": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent Evidence Handles", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 110 + }, + "cache-bust-scope-growth": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent Scope Updates", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 110 + }, + "cache-bust-mutable-tail-growth": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent Scope Updates", + "Pi MRC Recent User Preferences", + "Pi MRC Recent Evidence Handles", + "Pi MRC Outstanding Context", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 140, + "maxPromptLayerSizes": { + "Pi MRC Recent Scope Updates": 420, + "Pi MRC Recent User Preferences": 360, + "Pi MRC Recent Evidence Handles": 260 + } + }, + "cache-bust-commit-growth": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent 
Commits", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 115, + "maxPromptLayerSizes": { + "Pi MRC Recent Commits": 520 + } + }, + "cache-bust-long-evidence-line": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent Evidence Handles", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 105, + "maxPromptLayerSizes": { + "Pi MRC Recent Evidence Handles": 260 + } + }, + "cache-bust-long-scope-line": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent Scope Updates", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 110, + "maxPromptLayerSizes": { + "Pi MRC Recent Scope Updates": 300 + } + }, + "cache-bust-long-preference-line": { + "allowedFirstChangedLayers": [ + "Pi MRC Recent User Preferences", + "Pi MRC Brief Transcript", + "Kept Raw Tail" + ], + "minStablePrefixTokens": 110, + "maxPromptLayerSizes": { + "Pi MRC Recent User Preferences": 300 + } + } +} diff --git a/bench/compaction/model-reference-selector.ts b/bench/compaction/model-reference-selector.ts new file mode 100644 index 0000000..11cb589 --- /dev/null +++ b/bench/compaction/model-reference-selector.ts @@ -0,0 +1,158 @@ +/** + * Model-reference compactor for benchmark harness. + * + * Architecture: + * 1. Extract chunks from built compaction state + * 2. Classify chunks via mock model → KEEP / REF / DROP + MVS + * 3. Order KEEP chunks for cache-prefix stability + * 4. Stitch Tier 1 active prompt: MVS + ordered KEEP sections + recall note + * + * Imported and registered in bench/compaction/offline-runner.ts. 
+ */
+
+import type { Message } from "@mariozechner/pi-ai";
+import { normalize } from "../../src/core/normalize";
+import { filterNoise } from "../../src/core/filter-noise";
+import { buildSections } from "../../src/core/build-sections";
+import { buildCompactionState } from "../../src/core/compaction-state";
+import { chunkCompactionState, type CompactionChunk } from "../../src/core/chunk-model";
+import { mockClassify } from "../../src/core/mock-classifier";
+import { realClassify } from "../../src/core/classifier";
+import { inlineSmallRefs } from "../../src/core/classifier";
+import {
+  MODEL_REFERENCE_RECALL_NOTE,
+  mergePriorChunks,
+  orderKeepChunks,
+  renderKeepSections,
+  renderModelReferenceSummary,
+} from "../../src/core/model-reference-stitch";
+import type { CompactorContext, CompactorResult, LayerSnapshot } from "./offline-runner";
+
+export const createModelReferenceCompactor = (helpers: {
+  sourceTextOf: (messages: Message[]) => string;
+  estimateTokens: (text: string) => number;
+  renderedDocuments: (messages: Message[]) => Array<{ id: string; text: string; source: string }>;
+}) => ({
+  name: "model-reference-selector",
+  compact: async (ctx: CompactorContext): Promise<CompactorResult> => {
+    const { messages, allMessages, previous } = ctx;
+    const inputTokens = helpers.estimateTokens(helpers.sourceTextOf(messages));
+
+    // Check env for real classifier config
+    const classifierModel = process.env.CLASSIFIER_MODEL || "deepseek-chat";
+    const classifierBaseUrl = process.env.CLASSIFIER_BASE_URL || "https://api.deepseek.com/v1";
+    let apiKey = process.env.DEEPSEEK_API_KEY || process.env.OPENAI_API_KEY;
+    if (!apiKey) {
+      try {
+        const auth = JSON.parse(require("fs").readFileSync(
+          require("path").join(require("os").homedir(), ".pi", "agent", "auth.json"), "utf-8"));
+        apiKey = auth?.deepseek?.key || auth?.deepseek?.apiKey;
+      } catch {}
+    }
+    const useRealClassifier = !!(apiKey && classifierModel);
+
+    // 0.
Recover previous classification for merge-awareness + const prevRefIndex = (previous as any)?.refIndex; + const previousKeepIds = new Set(prevRefIndex?.keepIds ?? []); + const previousRefIds = new Set(prevRefIndex?.refs?.map((r: any) => r.id) ?? []); + + // 1. Build compaction state (reuse existing pipeline) + const blocks = filterNoise(normalize(messages)); + const sectionData = buildSections({ blocks }); + const state = buildCompactionState(sectionData); + + // 2. Chunk the state, plus previous KEEP and REF chunks for merge-awareness. + // Previous chunks can share section-index IDs with fresh chunks; alias those + // collisions so still-relevant old goals/constraints remain classifiable. + const chunks = mergePriorChunks( + chunkCompactionState(state), + [ + ...((prevRefIndex?.keepChunks as CompactionChunk[] | undefined) ?? []), + ...((prevRefIndex?.refChunks as CompactionChunk[] | undefined) ?? []), + ], + ); + + // 4. Classify (real API if env vars set, else mock) + const start = performance.now(); + let classification: any; + let realTokenUsage: { promptTokens: number; completionTokens: number } | undefined; + if (useRealClassifier) { + const realResult = await realClassify(chunks, messages.length, { + baseUrl: classifierBaseUrl, + apiKey, + model: classifierModel, + maxTokens: 1024, + }); + classification = realResult; + // Auto-promote tiny REFs to KEEP + classification = inlineSmallRefs(classification, chunks); + // Store real token usage + realTokenUsage = realResult.usage; + } else { + classification = mockClassify(chunks, messages.length, { + previousIds: { + keepIds: [...previousKeepIds], + refIds: [...previousRefIds], + }, + }); + } + + // 5. Build KEEP chunk objects (exclude bundled chunks) + const bundledIds = new Set(classification.bundles?.flatMap((b) => b.chunkIds) ?? []); + const keepChunks = chunks.filter( + (c) => classification.keepIds.includes(c.id) && !bundledIds.has(c.id), + ); + + // 6. 
Order KEEP chunks for stability + const ordered = orderKeepChunks(keepChunks, previousKeepIds); + + // 7. Render Tier 1 active prompt + const keepText = renderKeepSections(ordered); + const activePromptState = renderModelReferenceSummary(classification, chunks, { + previousKeepIds, + }); + + const elapsed = performance.now() - start; + + // 8. Build layers for benchmark metrics + const layers: LayerSnapshot[] = [ + { name: "Model-Ref MVS", role: "current", text: classification.mvs }, + { name: "Model-Ref KEEP Chunks", role: "current", text: keepText }, + { name: "Model-Ref Recall Note", role: "recall", text: MODEL_REFERENCE_RECALL_NOTE }, + ]; + + const refDocs = [ + ...classification.refs.map((r) => ({ + id: r.id, + text: `${r.summary} (use mrc_lookup)`, + source: `model-ref-tier2` as const, + })), + ...(classification.bundles ?? []).map((b) => ({ + id: `bundle:${b.id}`, + text: `[${b.label}] ${b.recallCondition}. Files: ${b.chunkIds.filter((id) => id.startsWith("F")).length}, Chunks: ${b.chunkIds.length} (use mrc_lookup for listed refs)`, + source: `model-ref-bundle` as const, + })), + ]; + + return { + activePromptState, + layers, + recallCorpus: helpers.renderedDocuments(allMessages).concat(refDocs), + stats: { + compactionMs: elapsed, + estimatedInputTokens: inputTokens, + estimatedOutputTokens: helpers.estimateTokens(activePromptState), + // Real API token counts when available + classifierPromptTokens: realTokenUsage?.promptTokens, + classifierCompletionTokens: realTokenUsage?.completionTokens, + }, + // Store classification metadata for next compaction's stability ordering + refIndex: { + keepIds: classification.keepIds, + refs: classification.refs, + keepChunks: keepChunks.map((c) => ({ id: c.id, kind: c.kind, text: c.text, section: c.section, index: c.index })), + refChunks: chunks.filter((c) => classification.refs.some((r) => r.id === c.id)), + }, + } as any; + }, +}); diff --git a/bench/compaction/offline-runner.ts b/bench/compaction/offline-runner.ts 
new file mode 100644 index 0000000..0034a02 --- /dev/null +++ b/bench/compaction/offline-runner.ts @@ -0,0 +1,770 @@ +import { readFileSync } from "node:fs"; +import { join } from "node:path"; +import { fileURLToPath } from "node:url"; +import { performance } from "node:perf_hooks"; +import type { Message } from "@mariozechner/pi-ai"; +import { compileWithReport } from "../../src/core/summarize"; +import { buildSections } from "../../src/core/build-sections"; +import { normalize } from "../../src/core/normalize"; +import { renderMessage } from "../../src/core/render-entries"; +import { clip, textOf } from "../../src/core/content"; +import { summarizeToolResultForPrompt } from "../../src/core/tool-result-summary"; +import type { PiMrcCompactionReport } from "../../src/core/compaction-report"; +import { syntheticCompactionCases, type CompactionBenchmarkCase, type ExpectedTerm } from "./synthetic-cases"; +import { createModelReferenceCompactor } from "./model-reference-selector"; + +export type LayerRole = "static" | "current" | "history" | "recall"; + +export interface LayerSnapshot { + name: string; + role: LayerRole; + text: string; +} + +export interface RecallDocument { + id: string; + text: string; +} + +export interface PromptLayerSnapshot { + name: string; + text: string; +} + +export interface PromptSnapshot { + text: string; + layers: PromptLayerSnapshot[]; +} + +export interface CompactorResult { + activePromptState: string; + layers: LayerSnapshot[]; + recallCorpus: RecallDocument[]; + report?: PiMrcCompactionReport; + stats: { + compactionMs: number; + estimatedInputTokens?: number; + estimatedOutputTokens?: number; + }; +} + +export interface CompactorContext { + /** Messages newly summarized in this compaction cycle. */ + messages: Message[]; + /** Full replay prefix available up to this compaction point. 
*/ + allMessages: Message[]; + previous?: CompactorResult; + cycle: number; +} + +export interface OfflineCompactor { + name: string; + compact(context: CompactorContext): CompactorResult | Promise<CompactorResult>; +} + +export interface TermProbeResult { + label: string; + term: string; + applicable: boolean; + found: boolean; +} + +export interface RecallProbeResult extends TermProbeResult { + query: string; + topHitIds: string[]; +} + +export interface PromptLayerDiff { + layer: string; + previousPreview: string; + currentPreview: string; + addedLines: string[]; + removedLines: string[]; +} + +export interface CycleMetrics { + caseId: string; + compactor: string; + cycle: number; + compactionPoint: number; + activeChars: number; + activeTokensEst: number; + currentChars: number; + currentTokensEst: number; + fullPromptChars: number; + fullPromptTokensEst: number; + compactionMs: number; + lcpTokensWithPrevious: number | null; + lcpTokenRatioWithPrevious: number | null; + firstChangedLayer: string | null; + changedLayers: string[]; + fullPromptLcpTokensWithPrevious: number | null; + fullPromptLcpTokenRatioWithPrevious: number | null; + firstChangedPromptLayer: string | null; + changedPromptLayers: string[]; + stablePrefixTokens: number | null; + activeTermRecall: number | null; + currentTermRecall: number | null; + recallTermHitRate: number | null; + continuationTermRecall: number | null; + forbiddenLeakCount: number; + forbiddenCurrentLeakCount: number; + activeAbsentLeakCount: number; + missingActiveTerms: string[]; + missingCurrentTerms: string[]; + missingRecallTerms: string[]; + leakedForbiddenTerms: string[]; + leakedForbiddenCurrentTerms: string[]; + leakedActiveAbsentTerms: string[]; + layerSizes: Record<string, number>; + promptLayerSizes: Record<string, number>; + promptLayerTokenDeltas: Record<string, number>; + promptLayerDiffs?: PromptLayerDiff[]; + compactionReport?: PiMrcCompactionReport; +} + +export interface BenchmarkRunResult { + cycles: CycleMetrics[]; + aggregate: Record<string, Record<string, number | null>>; +} + +const SEPARATOR = 
"\n\n---\n\n"; + +const tokenize = (text: string): string[] => + text.match(/[\p{L}\p{N}_./:-]+|[^\s]/gu) ?? []; + +const estimateTokens = (text: string): number => Math.ceil(text.length / 4); + +const lowerIncludes = (haystack: string, needle: string): boolean => + haystack.toLowerCase().includes(needle.toLowerCase()); + +const lcpTokens = (a: string, b: string): number => { + const aa = tokenize(a); + const bb = tokenize(b); + const limit = Math.min(aa.length, bb.length); + let i = 0; + while (i < limit && aa[i] === bb[i]) i += 1; + return i; +}; + +const renderedDocuments = (messages: Message[]): RecallDocument[] => + messages.map((message, index) => { + const rendered = renderMessage(message, index, true); + return { + id: `${index}:${rendered.role}`, + text: `#${index} [${rendered.role}] ${rendered.summary}`, + }; + }); + +const sourceTextOf = (messages: Message[]): string => + renderedDocuments(messages).map((doc) => doc.text).join("\n"); + +const textForRoles = (result: CompactorResult, roles: LayerRole[]): string => { + const selected = result.layers.filter((layer) => roles.includes(layer.role)); + if (selected.length === 0) return ""; + return selected.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); +}; + +const renderPromptLayers = (layers: PromptLayerSnapshot[]): string => + layers.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); + +const simulatedPromptOf = (result: CompactorResult, sourceMessages: Message[]): PromptSnapshot => { + const recentTail = renderedDocuments(sourceMessages.slice(-2)) + .map((doc) => doc.text) + .join("\n"); + const layers: PromptLayerSnapshot[] = [ + { + name: "Provider Prefix", + text: [ + "system: You are an expert coding assistant operating inside Pi.", + "format: preserve compacted state sections and use recall before redoing prior work.", + ].join("\n"), + }, + { + name: "Tool Definitions", + text: "tools: read, bash, edit, write, mrc_lookup", + }, + { + name: "Project Instructions", + text: 
"project: follow local guidance, validate before claiming completion, avoid destructive actions.", + }, + ...result.layers.map((layer) => ({ name: layer.name, text: layer.text })), + { + name: "Kept Raw Tail", + text: recentTail || "- (none)", + }, + ]; + return { layers, text: renderPromptLayers(layers) }; +}; + +const summarizeChangedPromptLayers = ( + previous: PromptSnapshot | undefined, + current: PromptSnapshot, +): { firstChangedPromptLayer: string | null; changedPromptLayers: string[]; promptLayerTokenDeltas: Record<string, number> } => { + if (!previous) return { firstChangedPromptLayer: null, changedPromptLayers: [], promptLayerTokenDeltas: {} }; + const prevByName = new Map(previous.layers.map((layer) => [layer.name, layer.text])); + const changedPromptLayers = current.layers + .filter((layer) => prevByName.get(layer.name) !== layer.text) + .map((layer) => layer.name); + const promptLayerTokenDeltas = Object.fromEntries(current.layers.map((layer) => { + const previousTokens = tokenize(prevByName.get(layer.name) ?? "").length; + const currentTokens = tokenize(layer.text).length; + return [layer.name, currentTokens - previousTokens]; + })); + return { + firstChangedPromptLayer: changedPromptLayers[0] ?? null, + changedPromptLayers, + promptLayerTokenDeltas, + }; +}; + +const linePreview = (text: string, maxChars = 400): string => + text.length <= maxChars ? text : `${text.slice(0, maxChars)}...(truncated)`; + +const changedPromptLayerDiffs = ( + previous: PromptSnapshot | undefined, + current: PromptSnapshot, + changedLayers: string[], +): PromptLayerDiff[] => { + if (!previous) return []; + const prevByName = new Map(previous.layers.map((layer) => [layer.name, layer.text])); + const currentByName = new Map(current.layers.map((layer) => [layer.name, layer.text])); + return changedLayers.slice(0, 3).map((layer) => { + const previousText = prevByName.get(layer) ?? ""; + const currentText = currentByName.get(layer) ?? 
""; + const previousLines = previousText.split("\n").map((line) => line.trim()).filter(Boolean); + const currentLines = currentText.split("\n").map((line) => line.trim()).filter(Boolean); + const previousSet = new Set(previousLines); + const currentSet = new Set(currentLines); + return { + layer, + previousPreview: linePreview(previousText), + currentPreview: linePreview(currentText), + addedLines: currentLines.filter((line) => !previousSet.has(line)).slice(0, 12), + removedLines: previousLines.filter((line) => !currentSet.has(line)).slice(0, 12), + }; + }); +}; + +const termProbe = (terms: ExpectedTerm[] = [], sourceText: string, targetText: string): TermProbeResult[] => + terms.map((term) => { + const applicable = lowerIncludes(sourceText, term.term); + return { + label: term.label, + term: term.term, + applicable, + found: applicable && lowerIncludes(targetText, term.term), + }; + }); + +const leakProbe = termProbe; + +const scoreDocument = (doc: string, query: string): number => { + const terms = query + .toLowerCase() + .split(/\s+/) + .map((part) => part.trim()) + .filter(Boolean); + const hay = doc.toLowerCase(); + return terms.reduce((score, term) => score + (hay.includes(term) ? 1 : 0), 0); +}; + +const recallProbe = ( + terms: ExpectedTerm[] = [], + sourceText: string, + corpus: RecallDocument[], +): RecallProbeResult[] => + terms.map((term) => { + const query = term.query ?? 
term.term; + const applicable = lowerIncludes(sourceText, term.term); + const ranked = corpus + .map((doc) => ({ doc, score: scoreDocument(doc.text, query) })) + .filter((entry) => entry.score > 0) + .sort((a, b) => b.score - a.score) + .slice(0, 5); + const found = applicable && ranked.some((entry) => lowerIncludes(entry.doc.text, term.term)); + return { + label: term.label, + term: term.term, + query, + applicable, + found, + topHitIds: ranked.map((entry) => entry.doc.id), + }; + }); + +const ratioOf = (probes: TermProbeResult[]): number | null => { + const applicable = probes.filter((probe) => probe.applicable); + if (applicable.length === 0) return null; + return applicable.filter((probe) => probe.found).length / applicable.length; +}; + +const summarizeChangedLayers = ( + previous: CompactorResult | undefined, + current: CompactorResult, +): { firstChangedLayer: string | null; changedLayers: string[] } => { + if (!previous) return { firstChangedLayer: null, changedLayers: [] }; + const prevByName = new Map(previous.layers.map((layer) => [layer.name, layer.text])); + const changedLayers = current.layers + .filter((layer) => prevByName.get(layer.name) !== layer.text) + .map((layer) => layer.name); + return { + firstChangedLayer: changedLayers[0] ?? null, + changedLayers, + }; +}; + +const lines = (items: string[]): string => + items.length === 0 ? 
"- (none)" : items.map((item) => `- ${item}`).join("\n"); + +const stableUnique = (items: string[], limit = 12): string[] => + [...new Set(items.map((item) => item.trim()).filter(Boolean))].sort().slice(0, limit); + +const regexTerms = (text: string, regex: RegExp, limit = 12): string[] => + stableUnique([...text.matchAll(regex)].map((match) => match[0]), limit); + +const recentHumanLines = (messages: Message[], maxLines = 10): string[] => { + const out: string[] = []; + for (const message of messages.slice(-8)) { + if (message.role !== "user" && message.role !== "assistant") continue; + const text = textOf(message.content); + for (const line of text.split("\n")) { + const trimmed = line.trim(); + if (!trimmed) continue; + if (/\b(next step|current blocker|blocker update|continue|correction|hard constraint|decision)\b/i.test(trimmed)) { + out.push(trimmed); + } + } + } + return out.slice(-maxLines); +}; + +const bulkyPointers = (messages: Message[]): string[] => { + const out: string[] = []; + messages.forEach((message, index) => { + if (message.role !== "toolResult") return; + const text = textOf(message.content); + if (text.length < 500) return; + const paths = regexTerms(text, /\/(?:tmp|var|home|workspace)\/[\w./-]+/g, 4); + const signatures = regexTerms(text, /\b[A-Z][A-Z0-9_]{4,}\b(?:\s+request_id=[\w-]+)?/g, 4); + const details = [...paths, ...signatures].join("; ") || clip(text, 120); + out.push(`#${index} ${message.toolName}: ${details}`); + }); + return out; +}; + +const extractDurableMemory = (messages: Message[]): string[] => { + const memory: string[] = []; + for (const message of messages) { + if (message.role !== "user") continue; + const text = textOf(message.content); + for (const line of text.split("\n")) { + const trimmed = line.trim(); + if (!trimmed) continue; + if (/\b(correction|never|always|prefer|use npm test|node --test)\b/i.test(trimmed)) { + memory.push(trimmed); + } + } + } + + const hasNeverYarn = memory.some((item) => /never use 
yarn/i.test(item)); + const filtered = hasNeverYarn + ? memory.filter((item) => !/prefer yarn test/i.test(item)) + : memory; + return stableUnique(filtered, 10); +}; + +const makeLayeredCheckpoint = (messages: Message[]): LayerSnapshot[] => { + const blocks = normalize(messages); + const data = buildSections({ blocks }); + const source = sourceTextOf(messages); + const paths = regexTerms(source, /(?:^|[\s"'`])(?:\.?\/?[\w.-]+\/)+[\w.-]+(?:\.[\w.-]+)?/g) + .map((path) => path.trim().replace(/^["'`\s]+/, "")); + const identifiers = regexTerms(source, /\b(?:ERR|CACHE|CRITICAL|req|spn|cache|commit)[\w:-]{3,}\b/g, 16); + const commits = regexTerms(source, /\b[0-9a-f]{7,40}\b/g, 8); + + const stableCheckpoint = [ + "Objective:", + lines(data.sessionGoal), + "Hard constraints and decisions:", + lines(regexTerms(source, /(?:Hard constraint|Decision):[^\n]+/gi, 8)), + "Active files and artifacts:", + lines(stableUnique([...data.filesAndChanges, ...paths], 16)), + "Identifiers and evidence handles:", + lines(stableUnique([...identifiers, ...commits], 20)), + ].join("\n"); + + const volatileState = [ + "Outstanding context:", + lines(data.outstandingContext), + "Recent continuation cues:", + lines(recentHumanLines(messages)), + ].join("\n"); + + const transcriptLines = data.briefTranscript.split("\n").filter(Boolean).slice(-50).join("\n"); + const rawTail = messages.slice(-2).map((message, offset) => { + const index = messages.length - 2 + offset; + const rendered = renderMessage(message, index, true); + if (message.role === "toolResult") { + return `#${index} [${rendered.role}] ${summarizeToolResultForPrompt(textOf(message.content))}`; + } + return `#${index} [${rendered.role}] ${clip(rendered.summary, 700)}`; + }).join("\n"); + + const recallPointers = bulkyPointers(messages); + + return [ + { + name: "Layer 0 Static Prefix Contract", + role: "static", + text: [ + "Compacted state schema v1.", + "Keep section names and order stable.", + "Stable facts appear before volatile 
facts.", + ].join("\n"), + }, + { + name: "Layer 1 Durable Memory", + role: "current", + text: lines(extractDurableMemory(messages)), + }, + { + name: "Layer 2A Stable Checkpoint", + role: "current", + text: stableCheckpoint, + }, + { + name: "Layer 2B Volatile State", + role: "current", + text: volatileState, + }, + { + name: "Layer 3 Rolling Brief Transcript", + role: "history", + text: transcriptLines || "- (none)", + }, + { + name: "Layer 4 Raw Recent Tail", + role: "history", + text: rawTail || "- (none)", + }, + { + name: "Layer 5 Recall Pointers", + role: "recall", + text: lines(recallPointers), + }, + ]; +}; + +const renderLayers = (layers: LayerSnapshot[]): string => + layers.map((layer) => `[${layer.name}]\n${layer.text}`).join("\n\n"); + +export const offlineCompactors: OfflineCompactor[] = [ + { + name: "pi-vcc", + compact: ({ messages, allMessages, previous }) => { + const inputTokens = estimateTokens(sourceTextOf(messages)); + const keptTail = allMessages.slice(-2); + const start = performance.now(); + const summary = compileWithReport({ messages, previousSummary: previous?.activePromptState }, { + sourceMessageCount: messages.length, + keptMessageCount: keptTail.length, + keptTokensEst: estimateTokens(sourceTextOf(keptTail)), + tokensBefore: estimateTokens(sourceTextOf(allMessages)), + }); + const elapsed = performance.now() - start; + return { + activePromptState: summary.text, + layers: summary.layers, + recallCorpus: renderedDocuments(allMessages), + report: summary.report, + stats: { + compactionMs: elapsed, + estimatedInputTokens: inputTokens, + estimatedOutputTokens: estimateTokens(summary.text), + }, + }; + }, + }, + { + name: "full-rewrite-checkpoint", + compact: ({ allMessages }) => { + const start = performance.now(); + const data = buildSections({ blocks: normalize(allMessages) }); + const current = [ + "Objective:", + lines(data.sessionGoal), + "Files and artifacts:", + lines(data.filesAndChanges), + "Outstanding context:", + 
lines(data.outstandingContext), + "User preferences:", + lines(data.userPreferences), + ].join("\n"); + const history = data.briefTranscript || "- (none)"; + const layers: LayerSnapshot[] = [ + { name: "Regenerated Current Checkpoint", role: "current", text: current }, + { name: "Regenerated Transcript", role: "history", text: history }, + ]; + const summary = renderLayers(layers); + const elapsed = performance.now() - start; + return { + activePromptState: summary, + layers, + recallCorpus: [], + stats: { + compactionMs: elapsed, + estimatedInputTokens: estimateTokens(sourceTextOf(allMessages)), + estimatedOutputTokens: estimateTokens(summary), + }, + }; + }, + }, + { + name: "cache-aware-layered", + compact: ({ allMessages }) => { + const start = performance.now(); + const layers = makeLayeredCheckpoint(allMessages); + const activePromptState = renderLayers(layers); + const elapsed = performance.now() - start; + return { + activePromptState, + layers, + recallCorpus: renderedDocuments(allMessages), + stats: { + compactionMs: elapsed, + estimatedInputTokens: estimateTokens(sourceTextOf(allMessages)), + estimatedOutputTokens: estimateTokens(activePromptState), + }, + }; + }, + }, + createModelReferenceCompactor({ + sourceTextOf, + estimateTokens, + renderedDocuments, + }), +]; + +const forbiddenLeaksOf = ( + terms: Array<ScopedTerm> = [], + sourceText: string, + targetText: string, +): string[] => + terms + .filter((term) => { + const enforce = !term.afterTerm || lowerIncludes(sourceText, term.afterTerm); + return enforce && lowerIncludes(targetText, term.term); + }) + .map((term) => term.label); + +const cycleMetrics = ( + testCase: CompactionBenchmarkCase, + compactor: OfflineCompactor, + cycle: number, + compactionPoint: number, + sourceMessages: Message[], + result: CompactorResult, + previous: CompactorResult | undefined, + prompt: PromptSnapshot, + previousPrompt: PromptSnapshot | undefined, + includeDiagnostics: boolean, + includeReports: boolean, +): CycleMetrics => { 
+ const sourceText = sourceTextOf(sourceMessages); + const activeText = result.activePromptState; + const currentText = textForRoles(result, ["current"]); + const activeProbes = termProbe(testCase.gold.activeTerms, sourceText, activeText); + const currentProbes = termProbe(testCase.gold.currentTerms ?? [], sourceText, currentText); + const recallProbes = recallProbe(testCase.gold.recallTerms, sourceText, result.recallCorpus); + const continuationProbes = termProbe(testCase.gold.continuationTerms ?? [], sourceText, activeText); + const activeAbsentLeaks = leakProbe(testCase.gold.activeAbsentTerms ?? [], sourceText, activeText) + .filter((probe) => probe.applicable && probe.found); + const leakedForbiddenTerms = forbiddenLeaksOf(testCase.gold.forbiddenTerms, sourceText, activeText); + const leakedForbiddenCurrentTerms = forbiddenLeaksOf(testCase.gold.forbiddenCurrentTerms, sourceText, currentText); + const changed = summarizeChangedLayers(previous, result); + const previousTokens = previous ? tokenize(previous.activePromptState).length : 0; + const currentTokens = tokenize(activeText).length; + const lcp = previous ? lcpTokens(previous.activePromptState, activeText) : null; + const denominator = Math.min(previousTokens, currentTokens); + const promptChanged = summarizeChangedPromptLayers(previousPrompt, prompt); + const previousPromptTokens = previousPrompt ? tokenize(previousPrompt.text).length : 0; + const currentPromptTokens = tokenize(prompt.text).length; + const fullPromptLcp = previousPrompt ? lcpTokens(previousPrompt.text, prompt.text) : null; + const fullPromptDenominator = Math.min(previousPromptTokens, currentPromptTokens); + const stablePrefixTokens = previousPrompt ? 
fullPromptLcp : null; + + return { + caseId: testCase.id, + compactor: compactor.name, + cycle, + compactionPoint, + activeChars: activeText.length, + activeTokensEst: estimateTokens(activeText), + currentChars: currentText.length, + currentTokensEst: estimateTokens(currentText), + fullPromptChars: prompt.text.length, + fullPromptTokensEst: estimateTokens(prompt.text), + compactionMs: Number(result.stats.compactionMs.toFixed(3)), + lcpTokensWithPrevious: lcp, + lcpTokenRatioWithPrevious: lcp === null || denominator === 0 ? null : Number((lcp / denominator).toFixed(4)), + firstChangedLayer: changed.firstChangedLayer, + changedLayers: changed.changedLayers, + fullPromptLcpTokensWithPrevious: fullPromptLcp, + fullPromptLcpTokenRatioWithPrevious: fullPromptLcp === null || fullPromptDenominator === 0 ? null : Number((fullPromptLcp / fullPromptDenominator).toFixed(4)), + firstChangedPromptLayer: promptChanged.firstChangedPromptLayer, + changedPromptLayers: promptChanged.changedPromptLayers, + stablePrefixTokens, + activeTermRecall: ratioOf(activeProbes), + currentTermRecall: ratioOf(currentProbes), + recallTermHitRate: ratioOf(recallProbes), + continuationTermRecall: ratioOf(continuationProbes), + forbiddenLeakCount: leakedForbiddenTerms.length, + forbiddenCurrentLeakCount: leakedForbiddenCurrentTerms.length, + activeAbsentLeakCount: activeAbsentLeaks.length, + missingActiveTerms: activeProbes.filter((probe) => probe.applicable && !probe.found).map((probe) => probe.label), + missingCurrentTerms: currentProbes.filter((probe) => probe.applicable && !probe.found).map((probe) => probe.label), + missingRecallTerms: recallProbes.filter((probe) => probe.applicable && !probe.found).map((probe) => probe.label), + leakedForbiddenTerms, + leakedForbiddenCurrentTerms, + leakedActiveAbsentTerms: activeAbsentLeaks.map((term) => term.label), + layerSizes: Object.fromEntries(result.layers.map((layer) => [layer.name, layer.text.length])), + promptLayerSizes: 
Object.fromEntries(prompt.layers.map((layer) => [layer.name, layer.text.length])), + promptLayerTokenDeltas: promptChanged.promptLayerTokenDeltas, + ...(includeDiagnostics && promptChanged.changedPromptLayers.length > 0 + ? { promptLayerDiffs: changedPromptLayerDiffs(previousPrompt, prompt, promptChanged.changedPromptLayers) } + : {}), + ...(includeReports && result.report ? { compactionReport: result.report } : {}), + }; +}; + +const mean = (values: number[]): number | null => { + if (values.length === 0) return null; + return values.reduce((sum, value) => sum + value, 0) / values.length; +}; + +const meanRounded = (values: number[]): number => + Number((values.reduce((sum, value) => sum + value, 0) / Math.max(values.length, 1)).toFixed(3)); + +const aggregate = (cycles: CycleMetrics[]): BenchmarkRunResult["aggregate"] => { + const byCompactor = new Map<string, CycleMetrics[]>(); + for (const cycle of cycles) { + const bucket = byCompactor.get(cycle.compactor) ?? []; + bucket.push(cycle); + byCompactor.set(cycle.compactor, bucket); + } + + return Object.fromEntries([...byCompactor].map(([name, items]) => { + const nullableMean = (selector: (item: CycleMetrics) => number | null): number | null => { + const values = items.map(selector).filter((value): value is number => value !== null); + const result = mean(values); + return result === null ? 
null : Number(result.toFixed(4)); + }; + return [name, { + cycles: items.length, + meanActiveTokensEst: meanRounded(items.map((item) => item.activeTokensEst)), + meanCurrentTokensEst: meanRounded(items.map((item) => item.currentTokensEst)), + meanFullPromptTokensEst: meanRounded(items.map((item) => item.fullPromptTokensEst)), + meanCompactionMs: meanRounded(items.map((item) => item.compactionMs)), + meanActiveTermRecall: nullableMean((item) => item.activeTermRecall), + meanCurrentTermRecall: nullableMean((item) => item.currentTermRecall), + meanRecallTermHitRate: nullableMean((item) => item.recallTermHitRate), + meanContinuationTermRecall: nullableMean((item) => item.continuationTermRecall), + totalForbiddenLeaks: items.reduce((sum, item) => sum + item.forbiddenLeakCount, 0), + totalForbiddenCurrentLeaks: items.reduce((sum, item) => sum + item.forbiddenCurrentLeakCount, 0), + totalActiveAbsentLeaks: items.reduce((sum, item) => sum + item.activeAbsentLeakCount, 0), + meanLcpTokenRatio: nullableMean((item) => item.lcpTokenRatioWithPrevious), + meanFullPromptLcpTokenRatio: nullableMean((item) => item.fullPromptLcpTokenRatioWithPrevious), + meanStablePrefixTokens: nullableMean((item) => item.stablePrefixTokens), + }]; + })); +}; + +export const failedGatesOf = (cycle: CycleMetrics): string[] => { + const failures: string[] = []; + if (cycle.activeTermRecall !== null && cycle.activeTermRecall < 1) failures.push("active-term-recall"); + if (cycle.currentTermRecall !== null && cycle.currentTermRecall < 1) failures.push("current-term-recall"); + if (cycle.recallTermHitRate !== null && cycle.recallTermHitRate < 1) failures.push("recall-hit-rate"); + if (cycle.continuationTermRecall !== null && cycle.continuationTermRecall < 1) failures.push("continuation-term-recall"); + if (cycle.forbiddenLeakCount > 0) failures.push("forbidden-active-leak"); + if (cycle.forbiddenCurrentLeakCount > 0) failures.push("forbidden-current-leak"); + if (cycle.activeAbsentLeakCount > 0) 
failures.push("active-absent-leak"); + return failures; +}; + +interface CacheBoundary { + allowedFirstChangedLayers: string[]; + minStablePrefixTokens: number; + maxPromptLayerSizes?: Record<string, number>; +} + +const cacheBoundaryPath = join(fileURLToPath(new URL(".", import.meta.url)), "cache-boundaries.json"); +export const CACHE_BOUNDARIES: Record<string, CacheBoundary> = JSON.parse(readFileSync(cacheBoundaryPath, "utf8")); + +export const failedCacheGatesOf = (cycle: CycleMetrics): string[] => { + const boundary = CACHE_BOUNDARIES[cycle.caseId]; + if (!boundary || cycle.cycle <= 1) return []; + const failures: string[] = []; + if (!cycle.firstChangedPromptLayer) { + failures.push("missing-first-changed-layer"); + } else if (!boundary.allowedFirstChangedLayers.includes(cycle.firstChangedPromptLayer)) { + failures.push("unexpected-first-changed-layer"); + } + if ((cycle.stablePrefixTokens ?? 0) < boundary.minStablePrefixTokens) failures.push("stable-prefix-too-small"); + for (const [layer, maxSize] of Object.entries(boundary.maxPromptLayerSizes ?? {})) { + if ((cycle.promptLayerSizes[layer] ?? 0) > maxSize) failures.push(`recent-layer-too-large:${layer}`); + } + return failures; +}; + +export const runOfflineCompactionBenchmark = async (options: { + cases?: CompactionBenchmarkCase[]; + compactors?: OfflineCompactor[]; + includeDiagnostics?: boolean; + includeReports?: boolean; +} = {}): Promise<BenchmarkRunResult> => { + const cases = options.cases ?? syntheticCompactionCases; + const compactors = options.compactors ?? 
offlineCompactors; + const cycles: CycleMetrics[] = []; + + for (const testCase of cases) { + for (const compactor of compactors) { + let previous: CompactorResult | undefined; + let previousPrompt: PromptSnapshot | undefined; + let previousPoint = 0; + for (const [index, point] of testCase.compactionPoints.entries()) { + const sourceMessages = testCase.messages.slice(0, point); + const cycleMessages = testCase.messages.slice(previousPoint, point); + const result = await compactor.compact({ + messages: cycleMessages, + allMessages: sourceMessages, + previous, + cycle: index + 1, + }); + const prompt = simulatedPromptOf(result, sourceMessages); + cycles.push(cycleMetrics(testCase, compactor, index + 1, point, sourceMessages, result, previous, prompt, previousPrompt, Boolean(options.includeDiagnostics), Boolean(options.includeReports))); + previous = result; + previousPrompt = prompt; + previousPoint = point; + } + } + } + + return { cycles, aggregate: aggregate(cycles) }; +}; diff --git a/bench/compaction/real-sessions.ts b/bench/compaction/real-sessions.ts new file mode 100644 index 0000000..3062732 --- /dev/null +++ b/bench/compaction/real-sessions.ts @@ -0,0 +1,83 @@ +import { readdir, readFile, stat } from "node:fs/promises"; +import { basename, join } from "node:path"; +import type { Message } from "@mariozechner/pi-ai"; +import type { CompactionBenchmarkCase } from "./synthetic-cases"; + +interface SessionFile { + path: string; + size: number; +} + +const walkJsonl = async (dir: string): Promise<SessionFile[]> => { + const entries = await readdir(dir, { withFileTypes: true }); + const out: SessionFile[] = []; + for (const entry of entries) { + const path = join(dir, entry.name); + if (entry.isDirectory()) { + out.push(...await walkJsonl(path)); + } else if (entry.isFile() && entry.name.endsWith(".jsonl")) { + const s = await stat(path); + out.push({ path, size: s.size }); + } + } + return out; +}; + +const isMessage = (value: unknown): value is Message => + Boolean(value && 
typeof value === "object" && typeof (value as any).role === "string" && "content" in (value as any)); + +const loadMessagesFromJsonl = async (path: string): Promise<Message[]> => { + const text = await readFile(path, "utf8"); + const messages: Message[] = []; + for (const line of text.split("\n")) { + if (!line.trim()) continue; + let entry: any; + try { + entry = JSON.parse(line); + } catch { + continue; + } + if (entry?.type !== "message") continue; + if (isMessage(entry.message)) messages.push(entry.message); + } + return messages; +}; + +const compactionPointsFor = (messageCount: number): number[] => { + if (messageCount <= 3) return []; + const raw = [ + Math.ceil(messageCount * 0.4), + Math.ceil(messageCount * 0.7), + messageCount, + ].filter((point) => point > 2 && point <= messageCount); + return [...new Set(raw)]; +}; + +export const loadRealSessionCases = async (options: { + sessionsDir: string; + limit?: number; +}): Promise<CompactionBenchmarkCase[]> => { + const limit = Math.max(1, options.limit ?? 2); + const files = (await walkJsonl(options.sessionsDir)) + .sort((a, b) => b.size - a.size) + .slice(0, limit); + + const cases: CompactionBenchmarkCase[] = []; + for (const file of files) { + const messages = await loadMessagesFromJsonl(file.path); + const compactionPoints = compactionPointsFor(messages.length); + if (compactionPoints.length === 0) continue; + cases.push({ + id: `real-session:${basename(file.path, ".jsonl")}`, + description: `Real Pi session replay sampled from ${file.path}`, + messages, + compactionPoints, + gold: { + activeTerms: [], + recallTerms: [], + }, + }); + } + + return cases; +}; diff --git a/bench/compaction/synthetic-cases.ts b/bench/compaction/synthetic-cases.ts new file mode 100644 index 0000000..41760c8 --- /dev/null +++ b/bench/compaction/synthetic-cases.ts @@ -0,0 +1,695 @@ +import type { Message } from "@mariozechner/pi-ai"; + +export interface ExpectedTerm { + label: string; + term: string; + /** Optional focused query for recall-style lookup. 
Defaults to the term. */ + query?: string; +} + +export interface ScopedTerm extends ExpectedTerm { + /** Enforce only after this term has appeared in the replayed source text. */ + afterTerm?: string; +} + +export interface CompactionGold { + /** Terms that should appear somewhere in the active prompt. */ + activeTerms: ExpectedTerm[]; + /** Terms that should appear in current-state layers, not only historical transcript/tail. */ + currentTerms?: ExpectedTerm[]; + /** Terms that should be recoverable from external recall. */ + recallTerms: ExpectedTerm[]; + /** Terms forbidden anywhere in the active prompt. */ + forbiddenTerms?: ScopedTerm[]; + /** Terms forbidden from current-state layers but allowed in historical layers or recall. */ + forbiddenCurrentTerms?: ScopedTerm[]; + /** Terms that must stay out of active prompt text because recall should carry them. */ + activeAbsentTerms?: ExpectedTerm[]; + continuationTerms?: ExpectedTerm[]; +} + +export interface CompactionBenchmarkCase { + id: string; + description: string; + messages: Message[]; + /** Message counts at which to run a compaction cycle. 
compactionPoints: number[]; + gold: CompactionGold; +} + +const ts = 1_700_000_000_000; +let toolId = 0; + +const assistantBase = { + api: "messages" as any, + provider: "anthropic" as any, + model: "benchmark-fixture", + usage: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 }, + timestamp: ts, +}; + +const user = (text: string): Message => ({ role: "user", content: text, timestamp: ts }); + +const assistant = (text: string): Message => ({ + role: "assistant", + content: [{ type: "text", text }], + ...assistantBase, + stopReason: "stop", +}); + +const toolCall = (name: string, args: Record<string, unknown>): Message => { + toolId += 1; + return { + role: "assistant", + content: [{ type: "toolCall", id: `bench_tool_${toolId}`, name, arguments: args }], + ...assistantBase, + stopReason: "toolUse", + }; +}; + +const toolResult = (name: string, text: string, isError = false): Message => ({ + role: "toolResult", + toolCallId: `bench_tool_${toolId}`, + toolName: name, + content: [{ type: "text", text }], + isError, + timestamp: ts, +}); + +const noisyLog = (needle: string): string => [ + ...Array.from({ length: 80 }, (_, i) => `debug ${String(i).padStart(2, "0")}: cache warmup shard ok`), + `CRITICAL ${needle}`, + ...Array.from({ length: 80 }, (_, i) => `debug ${String(i + 80).padStart(2, "0")}: retry window unchanged`), +].join("\n"); + +const longEvidencePayload = (needle: string): string => [ + ...Array.from({ length: 24 }, (_, i) => `/tmp/pi-vcc-cache-evidence/${needle}/very/deep/path/with/verbose/component/name/cache-proof-artifact-${String(i + 1).padStart(2, "0")}.json`), + `CACHE_LONG_EVIDENCE request_id=${needle}`, +].join("\n"); + +const longScope = (tag: string): string => + `Also add detailed scope requirement ${tag} covering dashboard drift checks, benchmark explain output, report artifact review, rollback notes, and validation evidence before broader replay.`; + +const longPreference = (tag: string): string => + `I prefer ${tag} notes to include dashboard 
drift checks, benchmark explain output, report artifact paths, rollback notes, and validation evidence before broader replay.`; + +const readFile = (path: string, text: string): Message[] => [ + toolCall("read", { path }), + toolResult("read", text), +]; + +export const syntheticCompactionCases: CompactionBenchmarkCase[] = [ + { + id: "boundary-loss-auth-refresh", + description: "A critical constraint and error signature appear immediately before a compaction cut.", + messages: [ + user("Fix password-reset login. Hard constraint: do not change the public login API."), + assistant("I will inspect the auth refresh path and keep the public login API unchanged."), + toolCall("read", { path: "src/auth/session.ts" }), + toolResult("read", "export function refreshSessionAfterPasswordReset() { return null; }"), + assistant("The likely fix belongs in src/auth/session.ts, not the public login handler."), + toolCall("bash", { command: "bun test tests/auth-refresh.test.ts" }), + toolResult("bash", "FAIL tests/auth-refresh.test.ts\nERR_REFRESH_AFTER_RESET expired refresh token after password reset", true), + user("Continue from here. 
The next step is to patch refreshSessionAfterPasswordReset, then rerun tests/auth-refresh.test.ts."), + assistant("I will patch refreshSessionAfterPasswordReset and rerun the focused auth-refresh test."), + ], + compactionPoints: [7, 9], + gold: { + activeTerms: [ + { label: "constraint", term: "do not change the public login API" }, + { label: "file", term: "src/auth/session.ts" }, + { label: "identifier", term: "ERR_REFRESH_AFTER_RESET" }, + ], + currentTerms: [ + { label: "constraint", term: "do not change the public login API" }, + { label: "file", term: "src/auth/session.ts" }, + { label: "identifier", term: "ERR_REFRESH_AFTER_RESET" }, + ], + recallTerms: [ + { label: "failing test", term: "tests/auth-refresh.test.ts", query: "auth-refresh" }, + ], + continuationTerms: [ + { label: "next edit", term: "patch refreshSessionAfterPasswordReset" }, + { label: "next validation", term: "rerun tests/auth-refresh.test.ts" }, + ], + }, + }, + { + id: "identifier-provenance", + description: "Similar identifiers make exact provenance and active entity recovery important.", + messages: [ + user("Audit cache invalidation. The target artifact is /tmp/cache-probe-A17.log, not /tmp/cache-probe-A71.log."), + assistant("I will keep the A17 artifact distinct from the A71 decoy and check the cache probe IDs."), + toolCall("read", { path: "/tmp/cache-probe-A17.log" }), + toolResult("read", "probe_id=cache_probe_A17\nspan=spn_cache_keep_91\ncommit=9f3a2b1\nstatus=prefix preserved"), + toolCall("read", { path: "/tmp/cache-probe-A71.log" }), + toolResult("read", "probe_id=cache_probe_A71\nspan=spn_cache_drop_19\nstatus=decoy"), + assistant("Decision: use cache_probe_A17 and span spn_cache_keep_91 as the evidence handle. 
Ignore cache_probe_A71."), + user("Continue the audit using commit 9f3a2b1 and evidence span spn_cache_keep_91."), + ], + compactionPoints: [6, 8], + gold: { + activeTerms: [ + { label: "artifact", term: "/tmp/cache-probe-A17.log" }, + { label: "probe", term: "cache_probe_A17" }, + { label: "span", term: "spn_cache_keep_91" }, + { label: "commit", term: "9f3a2b1" }, + ], + currentTerms: [ + { label: "artifact", term: "/tmp/cache-probe-A17.log" }, + { label: "probe", term: "cache_probe_A17" }, + { label: "span", term: "spn_cache_keep_91" }, + { label: "commit", term: "9f3a2b1" }, + ], + recallTerms: [ + { label: "decoy provenance", term: "cache_probe_A71", query: "cache_probe_A71" }, + ], + forbiddenCurrentTerms: [ + { label: "decoy as current target", term: "use cache_probe_A71", afterTerm: "Ignore cache_probe_A71" }, + ], + continuationTerms: [ + { label: "continue span", term: "spn_cache_keep_91" }, + ], + }, + }, + { + id: "recall-required-bulk-log", + description: "A bulky log should be externalized while retaining a pointer and recallable exact failure line.", + messages: [ + user("Investigate a flaky compaction benchmark. Store bulky logs as pointers when possible."), + assistant("I will inspect the benchmark log and keep only the evidence handle in active state."), + toolCall("bash", { command: "./run-benchmark > /tmp/pi-vcc-bench-482.log" }), + toolResult("bash", noisyLog("CACHE_MISS_AT_LAYER_2B request_id=req_cache_482"), true), + assistant("The important pointer is /tmp/pi-vcc-bench-482.log. 
The exact line CACHE_MISS_AT_LAYER_2B request_id=req_cache_482 can be recalled from the log."), + user("Continue with the pointer only; do not paste the whole log back into context."), + ], + compactionPoints: [4, 6], + gold: { + activeTerms: [ + { label: "log pointer", term: "/tmp/pi-vcc-bench-482.log" }, + ], + currentTerms: [ + { label: "log pointer", term: "/tmp/pi-vcc-bench-482.log" }, + ], + recallTerms: [ + { label: "critical line", term: "CACHE_MISS_AT_LAYER_2B request_id=req_cache_482", query: "CACHE_MISS_AT_LAYER_2B req_cache_482" }, + ], + activeAbsentTerms: [ + { label: "early bulky log line", term: "debug 00: cache warmup shard ok" }, + { label: "late bulky log line", term: "debug 120: retry window unchanged" }, + ], + continuationTerms: [ + { label: "pointer discipline", term: "do not paste the whole log" }, + ], + }, + }, + { + id: "correction-stale-memory", + description: "A corrected user preference should replace stale durable memory.", + messages: [ + user("For this repo, prefer yarn test when validating."), + assistant("Noted: yarn test for validation."), + toolCall("bash", { command: "yarn test" }), + toolResult("bash", "yarn: command not found", true), + user("Correction: never use yarn here. Use npm test for broad validation and node --test for focused checks."), + assistant("Understood. 
I will avoid yarn and use npm test or node --test depending on scope."), + user("Continue and choose the focused validation command first."), + ], + compactionPoints: [4, 7], + gold: { + activeTerms: [ + { label: "corrected preference", term: "never use yarn" }, + { label: "broad validation", term: "npm test" }, + { label: "focused validation", term: "node --test" }, + ], + currentTerms: [ + { label: "corrected preference", term: "never use yarn" }, + { label: "broad validation", term: "npm test" }, + { label: "focused validation", term: "node --test" }, + ], + recallTerms: [ + { label: "failed old tool", term: "yarn: command not found", query: "yarn command not found" }, + ], + forbiddenCurrentTerms: [ + { label: "stale positive preference", term: "prefer yarn test", afterTerm: "Correction: never use yarn here" }, + ], + continuationTerms: [ + { label: "focused command", term: "node --test" }, + ], + }, + }, + { + id: "realistic-scope-and-status", + description: "A real-session-shaped scope extension should be captured, but follow-up status should stay volatile.", + messages: [ + user("Build a local ClickHouse-based OpenTelemetry ingestion and query system."), + assistant("I will start with local ClickHouse, ingestion, and query scaffolding."), + user("Good, now lets add meta monitoring for the chart itself. 
This means metrics for our clickhouse instance and dashboards for grafana."), + assistant("I will extend the current work with meta monitoring and Grafana dashboards."), + user("Status update: meta monitoring wiring is started; next validate dashboard provisioning."), + assistant("Next step: validate dashboard provisioning without changing the stable objective."), + ], + compactionPoints: [2, 4, 6], + gold: { + activeTerms: [ + { label: "original objective", term: "OpenTelemetry ingestion and query system" }, + { label: "scope extension", term: "meta monitoring" }, + ], + currentTerms: [ + { label: "original objective", term: "OpenTelemetry ingestion and query system" }, + { label: "scope extension", term: "meta monitoring" }, + ], + recallTerms: [ + { label: "dashboard validation", term: "dashboard provisioning", query: "dashboard provisioning" }, + ], + continuationTerms: [ + { label: "volatile next step", term: "validate dashboard provisioning" }, + ], + }, + }, + { + id: "cache-bust-scope-growth", + description: "Stable objective and evidence remain fixed while additive scope updates change across compactions.", + messages: [ + user("Build cache-aware compaction. 
Stable objective: preserve cacheable prefix while keeping continuation state recoverable."), + assistant("Stable checkpoint: preserve cacheable prefix; canonical file src/core/compaction-state.ts; validation in Docker."), + user("Also add dashboard provisioning checks to the current scope."), + assistant("I will include dashboard provisioning checks in the current scope without changing the stable objective."), + user("Also add Grafana datasource validation to the current scope."), + assistant("I will include Grafana datasource validation as the latest scope update."), + user("Also add provider cache accounting notes to the current scope."), + assistant("I will include provider cache accounting notes while preserving the stable objective."), + ], + compactionPoints: [4, 6, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "preserve cacheable prefix" }, + { label: "canonical file", term: "src/core/compaction-state.ts" }, + { label: "first scope", term: "dashboard provisioning checks" }, + { label: "latest scope", term: "provider cache accounting notes" }, + ], + currentTerms: [ + { label: "stable objective", term: "preserve cacheable prefix" }, + { label: "canonical file", term: "src/core/compaction-state.ts" }, + { label: "first scope", term: "dashboard provisioning checks" }, + { label: "latest scope", term: "provider cache accounting notes" }, + ], + recallTerms: [ + { label: "middle scope", term: "Grafana datasource validation", query: "Grafana datasource validation" }, + ], + continuationTerms: [ + { label: "latest scope", term: "provider cache accounting notes" }, + ], + }, + }, + { + id: "cache-bust-evidence-growth", + description: "Stable work state remains unchanged while new evidence handles are discovered across compactions.", + messages: [ + user("Audit cache probes. Stable objective: preserve prefix cache while tracking evidence handles. 
Always keep benchmark validation in Docker."), + assistant("Stable checkpoint: preserve prefix cache; validation preference Docker; canonical file src/cache/probe.ts."), + toolCall("read", { path: "src/cache/probe.ts" }), + toolResult("read", "export const cacheProbe = 'cache_probe_alpha';\n// request_id=req_cache_alpha"), + assistant("Evidence handles so far: src/cache/probe.ts and cache_probe_alpha."), + toolCall("bash", { command: "grep -R cache_probe_beta /tmp/cache-evidence-beta.log" }), + toolResult("bash", "CACHE_LAYER_SHIFT request_id=req_cache_beta\ntrace_id=trace_cache_beta\n/tmp/cache-evidence-beta.log"), + assistant("Additional evidence handle: /tmp/cache-evidence-beta.log with req_cache_beta."), + toolCall("bash", { command: "grep -R cache_probe_gamma /tmp/cache-evidence-gamma.log" }), + toolResult("bash", "CACHE_LAYER_STABLE request_id=req_cache_gamma\ntrace_id=trace_cache_gamma\n/tmp/cache-evidence-gamma.log"), + assistant("Additional evidence handle: /tmp/cache-evidence-gamma.log with req_cache_gamma."), + ], + compactionPoints: [5, 8, 11], + gold: { + activeTerms: [ + { label: "stable objective", term: "preserve prefix cache" }, + { label: "canonical file", term: "src/cache/probe.ts" }, + { label: "validation preference", term: "Docker" }, + { label: "latest evidence", term: "req_cache_gamma" }, + ], + currentTerms: [ + { label: "stable objective", term: "preserve prefix cache" }, + { label: "canonical file", term: "src/cache/probe.ts" }, + { label: "validation preference", term: "Docker" }, + { label: "latest evidence", term: "req_cache_gamma" }, + ], + recallTerms: [ + { label: "earlier beta evidence", term: "CACHE_LAYER_SHIFT request_id=req_cache_beta", query: "CACHE_LAYER_SHIFT req_cache_beta" }, + ], + continuationTerms: [ + { label: "latest evidence", term: "req_cache_gamma" }, + ], + }, + }, + { + id: "cache-bust-mutable-tail-growth", + description: "Recent scope, preference, and evidence updates should stay bounded while latest items remain 
recoverable.", + messages: [ + user("Maintain cache-aware compaction. Stable objective: keep stable sections byte-stable while bounding recent mutable state."), + assistant("Stable checkpoint: keep stable sections byte-stable; canonical file src/core/summarize.ts."), + user("Also add scope item tail_scope_01 to the current scope. I prefer tail preference tail_pref_01."), + toolCall("bash", { command: "grep req_tail_ev_01 /tmp/tail-evidence-01.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_01 /tmp/tail-evidence-01.log"), + assistant("Recorded tail_scope_01, tail_pref_01, and req_tail_ev_01."), + user("Also add scope item tail_scope_02 to the current scope. I prefer tail preference tail_pref_02."), + toolCall("bash", { command: "grep req_tail_ev_02 /tmp/tail-evidence-02.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_02 /tmp/tail-evidence-02.log"), + assistant("Recorded tail_scope_02, tail_pref_02, and req_tail_ev_02."), + user("Also add scope item tail_scope_03 to the current scope. I prefer tail preference tail_pref_03."), + toolCall("bash", { command: "grep req_tail_ev_03 /tmp/tail-evidence-03.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_03 /tmp/tail-evidence-03.log"), + assistant("Recorded tail_scope_03, tail_pref_03, and req_tail_ev_03."), + user("Also add scope item tail_scope_04 to the current scope. I prefer tail preference tail_pref_04."), + toolCall("bash", { command: "grep req_tail_ev_04 /tmp/tail-evidence-04.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_04 /tmp/tail-evidence-04.log"), + assistant("Recorded tail_scope_04, tail_pref_04, and req_tail_ev_04."), + user("Also add scope item tail_scope_05 to the current scope. 
I prefer tail preference tail_pref_05."), + toolCall("bash", { command: "grep req_tail_ev_05 /tmp/tail-evidence-05.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_05 /tmp/tail-evidence-05.log"), + assistant("Recorded tail_scope_05, tail_pref_05, and req_tail_ev_05."), + user("Also add scope item tail_scope_06 to the current scope. I prefer tail preference tail_pref_06."), + toolCall("bash", { command: "grep req_tail_ev_06 /tmp/tail-evidence-06.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_06 /tmp/tail-evidence-06.log"), + assistant("Recorded tail_scope_06, tail_pref_06, and req_tail_ev_06."), + user("Also add scope item tail_scope_07 to the current scope. I prefer tail preference tail_pref_07."), + toolCall("bash", { command: "grep req_tail_ev_07 /tmp/tail-evidence-07.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_07 /tmp/tail-evidence-07.log"), + assistant("Recorded tail_scope_07, tail_pref_07, and req_tail_ev_07."), + user("Also add scope item tail_scope_08 to the current scope. 
I prefer tail preference tail_pref_08."), + toolCall("bash", { command: "grep req_tail_ev_08 /tmp/tail-evidence-08.log" }), + toolResult("bash", "CACHE_TAIL_EVENT request_id=req_tail_ev_08 /tmp/tail-evidence-08.log"), + assistant("Recorded tail_scope_08, tail_pref_08, and req_tail_ev_08."), + ], + compactionPoints: [10, 22, 34], + gold: { + activeTerms: [ + { label: "stable objective", term: "keep stable sections byte-stable" }, + { label: "latest scope", term: "tail_scope_08" }, + { label: "latest preference", term: "tail_pref_08" }, + { label: "latest evidence", term: "req_tail_ev_08" }, + ], + currentTerms: [ + { label: "stable objective", term: "keep stable sections byte-stable" }, + { label: "latest scope", term: "tail_scope_08" }, + { label: "latest preference", term: "tail_pref_08" }, + { label: "latest evidence", term: "req_tail_ev_08" }, + ], + recallTerms: [ + { label: "old scope", term: "tail_scope_01", query: "tail_scope_01" }, + { label: "old evidence", term: "req_tail_ev_01", query: "req_tail_ev_01" }, + ], + continuationTerms: [ + { label: "latest scope", term: "tail_scope_08" }, + ], + }, + }, + { + id: "cache-bust-commit-growth", + description: "New git commits should not rewrite the stable commit section across repeated compactions.", + messages: [ + user("Maintain cache-aware compaction. 
Stable objective: keep commit evidence visible without busting the stable prompt prefix."), + assistant("Stable checkpoint: objective keep commit evidence visible; canonical file src/extract/commits.ts."), + toolCall("bash", { command: "git commit -m \"test: add cache churn probe\"" }), + toolResult("bash", "[feat/cache a1b2c3d] test: add cache churn probe\n 2 files changed"), + assistant("Commit a1b2c3d recorded for the cache churn probe."), + toolCall("bash", { command: "git commit -m \"fix: keep commit section stable\"" }), + toolResult("bash", "[feat/cache b2c3d4e] fix: keep commit section stable\n 3 files changed"), + assistant("Commit b2c3d4e recorded while preserving the stable objective."), + toolCall("bash", { command: "git commit -m \"docs: explain commit cache boundary\"" }), + toolResult("bash", "[feat/cache c3d4e5f] docs: explain commit cache boundary\n 1 file changed"), + assistant("Commit c3d4e5f recorded; next compare commit cache boundary metrics."), + ], + compactionPoints: [5, 8, 11], + gold: { + activeTerms: [ + { label: "stable objective", term: "keep commit evidence visible" }, + { label: "canonical file", term: "src/extract/commits.ts" }, + { label: "latest commit", term: "c3d4e5f" }, + ], + currentTerms: [ + { label: "stable objective", term: "keep commit evidence visible" }, + { label: "canonical file", term: "src/extract/commits.ts" }, + { label: "latest commit", term: "c3d4e5f" }, + ], + recallTerms: [ + { label: "middle commit", term: "b2c3d4e", query: "b2c3d4e commit section stable" }, + ], + continuationTerms: [ + { label: "next proof", term: "compare commit cache boundary metrics" }, + ], + }, + }, + { + id: "cache-bust-long-evidence-line", + description: "A single fresh evidence line with many long paths should be clipped, not allowed to bloat the recent evidence layer.", + messages: [ + user("Audit evidence formatting. 
Stable objective: keep evidence useful while bounding recent evidence line length."), + assistant("Stable checkpoint: evidence must stay useful and bounded; canonical file src/extract/evidence.ts."), + toolCall("bash", { command: "grep req_long_ev_anchor /tmp/pi-vcc-cache-evidence/anchor.log" }), + toolResult("bash", "CACHE_LONG_EVIDENCE request_id=req_long_ev_anchor /tmp/pi-vcc-cache-evidence/anchor.log"), + assistant("Initial evidence handle req_long_ev_anchor is recorded."), + toolCall("bash", { command: "find /tmp/pi-vcc-cache-evidence/req_long_ev_latest -type f" }), + toolResult("bash", longEvidencePayload("req_long_ev_latest")), + assistant("Latest evidence handle req_long_ev_latest is recorded; keep the long path list bounded."), + ], + compactionPoints: [5, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "bounding recent evidence line length" }, + { label: "canonical file", term: "src/extract/evidence.ts" }, + { label: "latest evidence", term: "req_long_ev_latest" }, + ], + currentTerms: [ + { label: "stable objective", term: "bounding recent evidence line length" }, + { label: "canonical file", term: "src/extract/evidence.ts" }, + { label: "latest evidence", term: "req_long_ev_latest" }, + ], + recallTerms: [ + { label: "long path payload", term: "cache-proof-artifact-24.json", query: "cache-proof-artifact-24" }, + ], + continuationTerms: [ + { label: "bounded path list", term: "long path list bounded" }, + ], + }, + }, + { + id: "cache-bust-long-scope-line", + description: "Verbose fresh scope updates should stay bounded in the recent scope layer.", + messages: [ + user("Maintain cache-aware compaction. 
Stable objective: keep verbose scope updates useful but bounded."), + assistant("Stable checkpoint: objective keep verbose scope useful but bounded; canonical file src/extract/goals.ts."), + user("Also add compact scope baseline to the current scope."), + assistant("Baseline current scope is established."), + user([longScope("scope_long_alpha"), longScope("scope_long_beta"), longScope("scope_long_gamma")].join("\n")), + assistant("Recorded verbose scope updates; next verify the recent scope layer remains bounded."), + ], + compactionPoints: [4, 6], + gold: { + activeTerms: [ + { label: "stable objective", term: "verbose scope updates useful but bounded" }, + { label: "canonical file", term: "src/extract/goals.ts" }, + { label: "latest scope", term: "scope_long_beta" }, + ], + currentTerms: [ + { label: "stable objective", term: "verbose scope updates useful but bounded" }, + { label: "canonical file", term: "src/extract/goals.ts" }, + { label: "latest scope", term: "scope_long_beta" }, + ], + recallTerms: [ + { label: "third verbose scope", term: "scope_long_gamma", query: "scope_long_gamma" }, + ], + continuationTerms: [ + { label: "bounded recent scope", term: "recent scope layer remains bounded" }, + ], + }, + }, + { + id: "cache-bust-long-preference-line", + description: "Verbose fresh preferences should stay bounded in the recent preferences layer.", + messages: [ + user("Maintain cache-aware compaction. 
Stable objective: keep verbose preferences useful but bounded.\nAlways use Docker for broad validation."), + assistant("Stable checkpoint: objective keep verbose preferences useful but bounded; canonical file src/extract/preferences.ts."), + user(longPreference("pref_long_alpha")), + assistant("Recorded pref_long_alpha."), + user(longPreference("pref_long_beta")), + assistant("Recorded pref_long_beta."), + user(longPreference("pref_long_gamma")), + assistant("Recorded pref_long_gamma; next verify the recent preference layer remains bounded."), + ], + compactionPoints: [2, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "verbose preferences useful but bounded" }, + { label: "canonical file", term: "src/extract/preferences.ts" }, + { label: "latest preference", term: "pref_long_gamma" }, + ], + currentTerms: [ + { label: "stable objective", term: "verbose preferences useful but bounded" }, + { label: "canonical file", term: "src/extract/preferences.ts" }, + { label: "latest preference", term: "pref_long_gamma" }, + ], + recallTerms: [ + { label: "first verbose preference", term: "pref_long_alpha", query: "pref_long_alpha" }, + ], + continuationTerms: [ + { label: "bounded recent preference", term: "recent preference layer remains bounded" }, + ], + }, + }, + { + id: "model-ref-keep-ref-drop", + description: "Model classifies conversation into KEEP (critical identifiers), REF (useful context), and DROP (fluff). Subsequent compactions merge with previous classifications.", + messages: [ + user("Work on src/core/session.ts. The session module needs cache-aware state tracking."), + assistant("Working on src/core/session.ts. CACHE_SESSION probe request_id=sess-001. Added state tracking with commit abc1234."), + user("Also, what should I have for lunch? Thinking tacos or sushi."), + assistant("Tacos would be a great choice. There's a place nearby."), + user("OK back to work. Always use Docker for validation. 
Now continue on src/core/session.ts."), + assistant("Continuing on src/core/session.ts. Respecting Docker preference. Added validation config."), + ], + compactionPoints: [2, 6], + gold: { + // Only assert terms that should be present regardless of cycle. + // MRC re-classifies from scratch each cycle (does not accumulate). + activeTerms: [ + { label: "session file", term: "src/core/session.ts" }, + ], + currentTerms: [ + { label: "session file", term: "src/core/session.ts" }, + { label: "Docker preference", term: "Docker" }, + ], + recallTerms: [ + { label: "lunch discussion", term: "lunch", query: "lunch tacos" }, + ], + continuationTerms: [ + { label: "docker preference respected", term: "Docker" }, + ], + }, + }, + { + id: "multi-cycle-ref-promotion", + description: "Auth chunks become REF during database phase, promoted back when auth returns. Tests merge-awareness across 3 compactions.", + messages: [ + user("Work on auth module. Implement JWT refresh token rotation in src/auth/refresh.ts."), + assistant("Auth module: added token rotation to src/auth/refresh.ts, commit a1b2c3d. ERR_AUTH_REFRESH request_id=req-auth-001."), + user("Switch to database module. Add connection pooling to src/db/pool.ts. Always use PostgreSQL."), + assistant("DB module: added connection pooling to src/db/pool.ts, commit d4e5f6g. CACHE_DB_POOL request_id=req-db-001."), + user("Back to auth module. The refresh token rotation from earlier needs audit logging."), + assistant("Auth module: adding audit logging to src/auth/refresh.ts per earlier JWT rotation, commit a7b8c9d."), + ], + compactionPoints: [2, 4, 6], + gold: { + // No strict activeTerms on topics — the classifier correctly demotes non-current + // topics to REF. This IS the multi-cycle promotion behavior we're testing. 
+ currentTerms: [ + { label: "auth file tracked", term: "src/auth/refresh.ts" }, + { label: "db file tracked", term: "src/db/pool.ts" }, + ], + recallTerms: [ + { label: "JWT detail", term: "JWT refresh", query: "JWT refresh token" }, + { label: "DB pooling", term: "connection pooling", query: "PostgreSQL pooling" }, + ], + continuationTerms: [ + { label: "audit logging", term: "audit logging" }, + ], + }, + }, + { + id: "cache-bust-volatile-next-step", + description: "Stable objective and identifiers remain fixed while only volatile next-step state changes across cycles.", + messages: [ + user("Benchmark cache-aware compaction. Stable objective: preserve Layer 0 and Layer 1 prefixes."), + assistant("Stable checkpoint: objective preserve Layer 0 and Layer 1 prefixes; identifier cache_schema_v3."), + user("Current blocker: first run lacks cached input token accounting."), + assistant("Next step: add offline LCP token metrics for cache_schema_v3."), + user("Blocker update: offline LCP metrics are done; now add recall top-k metrics."), + assistant("Next step: add recall top-k metrics while preserving cache_schema_v3 stable text."), + user("Blocker update: recall top-k metrics are done; now document live provider limits."), + assistant("Next step: document live provider limits without changing Layer 0 or Layer 1 wording."), + ], + compactionPoints: [4, 6, 8], + gold: { + activeTerms: [ + { label: "stable objective", term: "preserve Layer 0 and Layer 1 prefixes" }, + { label: "schema", term: "cache_schema_v3" }, + ], + currentTerms: [ + { label: "stable objective", term: "preserve Layer 0 and Layer 1 prefixes" }, + { label: "schema", term: "cache_schema_v3" }, + ], + recallTerms: [ + { label: "old blocker", term: "first run lacks cached input token accounting", query: "cached input token accounting" }, + ], + continuationTerms: [ + { label: "latest next step", term: "document live provider limits" }, + ], + }, + }, +]; + +const readFileWorkingMapMessages: Message[] = [ 
+ user("Patch the plugin loader after reading the existing loader files. If the file-read working map survives compaction, continue without rereading."), + assistant("I will read loader, resolver, and package files, then patch only the plugin loader."), + ...readFile("src/runtime/loaders/node-loader.ts", [ + "import { createRequire } from 'node:module';", + "export function loadNodeModule(specifier: string) {", + " if (specifier.startsWith('node:')) return nativeLoad(specifier);", + " return loadViaCreateRequire(specifier);", + "}", + "export const supportsSyncLoad = true;", + ].join("\n")), + ...readFile("src/runtime/loaders/extension-loader.ts", [ + "import { createRequire } from 'node:module';", + "export function loadExtensionModule(specifier: string) {", + " const require = createRequire(import.meta.url);", + " return require(specifier);", + "}", + "export const extensionLoaderMode = 'create-require';", + ].join("\n")), + ...readFile("src/runtime/resolver.ts", [ + "import { loadExtensionModule } from './loaders/extension-loader';", + "import { loadNodeModule } from './loaders/node-loader';", + "resolver.registerScheme('pi-extension:', loadExtensionModule);", + "resolver.registerScheme('node:', loadNodeModule);", + ].join("\n")), + ...Array.from({ length: 12 }, (_, index) => { + const n = String(index + 1).padStart(2, "0"); + return readFile(`src/runtime/generated/noise-${n}.ts`, [ + `export const NOISE_READ_BODY_${n} = true;`, + "export function generatedResolverNoise() { return 'irrelevant generated fixture'; }", + ].join("\n")); + }).flat(), + assistant("I have enough context. 
Next patch src/runtime/loaders/plugin-loader.ts to match the existing loader conventions; reread only if compaction loses the code map."), + user("Compact now, then continue without rereading: implement src/runtime/loaders/plugin-loader.ts using the same scheme registration and sync-load convention."), +]; + +export const continuationProbeCases: CompactionBenchmarkCase[] = [ + { + id: "probe-read-file-working-map", + description: "A large read-file working map contains cross-file code patterns needed for the next edit.", + messages: readFileWorkingMapMessages, + compactionPoints: [readFileWorkingMapMessages.length], + gold: { + activeTerms: [ + { label: "createRequire pattern from read output", term: "createRequire(import.meta.url)" }, + { label: "scheme registration from read output", term: "resolver.registerScheme('pi-extension:'" }, + { label: "sync load convention from read output", term: "supportsSyncLoad" }, + { label: "target file", term: "src/runtime/loaders/plugin-loader.ts" }, + ], + currentTerms: [ + { label: "target file", term: "src/runtime/loaders/plugin-loader.ts" }, + ], + recallTerms: [ + { label: "node loader fallback body", term: "loadViaCreateRequire", query: "loadViaCreateRequire node loader" }, + { label: "extension loader createRequire body", term: "createRequire(import.meta.url)", query: "extension loader createRequire" }, + { label: "resolver scheme body", term: "resolver.registerScheme('pi-extension:'", query: "pi-extension resolver scheme" }, + ], + activeAbsentTerms: [ + { label: "irrelevant generated read body", term: "NOISE_READ_BODY_12" }, + ], + continuationTerms: [ + { label: "no reread continuation", term: "without rereading" }, + { label: "same scheme registration", term: "scheme registration" }, + ], + }, + }, +]; diff --git a/index.ts b/index.ts index 93a0e02..2cdcfd9 100644 --- a/index.ts +++ b/index.ts @@ -1,14 +1,46 @@ import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"; -import { scaffoldSettings } from 
"./src/core/settings"; +import { loadSettings, scaffoldSettings } from "./src/core/settings"; import { registerBeforeCompactHook } from "./src/hooks/before-compact"; -import { registerPiVccCommand } from "./src/commands/pi-vcc"; -import { registerVccRecallCommand } from "./src/commands/vcc-recall"; -import { registerRecallTool } from "./src/tools/recall"; +import { registerMrcReferenceJournalHook } from "./src/hooks/mrc-reference-journal"; +import { registerPiMrcCommand } from "./src/commands/pi-mrc"; +import { registerPiMrcReportCommand } from "./src/commands/pi-mrc-report"; +import { registerDumpContextCommand } from "./src/commands/pi-mrc-dump-context"; +import { registerPiMrcControlCommands } from "./src/commands/pi-mrc-control"; +import { registerLookupTool } from "./src/tools/lookup"; +import { registerCompactionReportCard } from "./src/ui/compaction-report-card"; +import { pushContextSlot, pushProviderRequestSlot } from "./src/core/context-buffer"; export default (pi: ExtensionAPI) => { scaffoldSettings(); + + // Always buffer real context for dump/mrc use. + pi.on("context", (event, ctx) => { + const sessionFile = ctx.sessionManager.getSessionFile(); + if (!sessionFile) return; + pushContextSlot(sessionFile, { + timestamp: new Date().toISOString(), + messages: event.messages as unknown[], + }); + }); + + // When debug mode is enabled, also buffer the final provider payload so users + // can audit what Pi sends after context conversion and provider shaping. 
+  pi.on("before_provider_request", (event, ctx) => {
+    if (!loadSettings().debug) return;
+    const sessionFile = ctx.sessionManager.getSessionFile();
+    if (!sessionFile) return;
+    pushProviderRequestSlot(sessionFile, {
+      timestamp: new Date().toISOString(),
+      payload: event.payload,
+    });
+  });
+
+  registerCompactionReportCard(pi);
+  registerMrcReferenceJournalHook(pi);
   registerBeforeCompactHook(pi);
-  registerPiVccCommand(pi);
-  registerVccRecallCommand(pi);
-  registerRecallTool(pi);
+  registerPiMrcCommand(pi);
+  registerPiMrcReportCommand(pi);
+  registerDumpContextCommand(pi);
+  registerPiMrcControlCommands(pi);
+  registerLookupTool(pi);
 };
diff --git a/package.json b/package.json
index dac40fb..2ea4150 100644
--- a/package.json
+++ b/package.json
@@ -1,21 +1,22 @@
 {
-  "name": "@sting8k/pi-vcc",
+  "name": "@badliveware/pi-mrc",
   "version": "0.3.12",
-  "description": "Algorithmic conversation compactor for pi - transcript-preserving structured summaries, no LLM calls",
+  "description": "Model-reference compactor for Pi with exact hidden lookup and cache-aware context stashing",
   "main": "index.ts",
   "keywords": [
     "pi-package",
     "pi-extension",
-    "vcc",
+    "mrc",
     "compact",
     "compaction"
   ],
   "repository": {
     "type": "git",
-    "url": "git+https://github.com/sting8k/pi-vcc.git"
+    "url": "git+https://github.com/BadLiveware/pi-model-reference-compactor.git"
   },
   "peerDependencies": {
     "@mariozechner/pi-coding-agent": "*",
+    "@mariozechner/pi-tui": "*",
     "@sinclair/typebox": "*"
   },
   "pi": {
diff --git a/scripts/bench-compaction.ts b/scripts/bench-compaction.ts
new file mode 100644
index 0000000..70390ef
--- /dev/null
+++ b/scripts/bench-compaction.ts
@@ -0,0 +1,114 @@
+#!/usr/bin/env node
+import { failedCacheGatesOf, failedGatesOf, offlineCompactors, runOfflineCompactionBenchmark } from "../bench/compaction/offline-runner";
+import { continuationProbeCases, syntheticCompactionCases } from "../bench/compaction/synthetic-cases";
+import { loadRealSessionCases } from "../bench/compaction/real-sessions";
+import { formatCompactionReportCard } from "../src/core/compaction-report";
+
+const args = process.argv.slice(2);
+
+const argValue = (name: string): string | undefined => {
+  const inline = args.find((arg) => arg.startsWith(`${name}=`));
+  if (inline) return inline.slice(name.length + 1);
+  const index = args.indexOf(name);
+  if (index >= 0) return args[index + 1];
+  return undefined;
+};
+
+const hasFlag = (name: string): boolean => args.includes(name);
+
+const realSessionsDir = argValue("--real-sessions-dir");
+const realLimitRaw = argValue("--real-limit");
+if (realLimitRaw !== undefined && !/^[1-9]\d*$/.test(realLimitRaw)) {
+  console.error(`Invalid --real-limit: ${realLimitRaw}`);
+  process.exit(1);
+}
+const realLimit = realLimitRaw ? Number.parseInt(realLimitRaw, 10) : undefined;
+const caseFilter = argValue("--case-filter");
+const includeDiagnostics = hasFlag("--show-layer-diff");
+const includeReports = hasFlag("--include-report") || hasFlag("--explain");
+const includeProbes = hasFlag("--include-probes");
+
+const selected = argValue("--compactors")
+  ?.split(",")
+  .map((name) => name.trim())
+  .filter(Boolean);
+
+const compactors = selected
+  ? offlineCompactors.filter((compactor) => selected.includes(compactor.name))
+  : offlineCompactors;
+
+if (selected && compactors.length !== selected.length) {
+  const found = new Set(compactors.map((compactor) => compactor.name));
+  const missing = selected.filter((name) => !found.has(name));
+  console.error(`Unknown compactor(s): ${missing.join(", ")}`);
+  console.error(`Available compactors: ${offlineCompactors.map((compactor) => compactor.name).join(", ")}`);
+  process.exit(1);
+}
+
+const cases = hasFlag("--real-only") ? [] : [...syntheticCompactionCases, ...(includeProbes ? continuationProbeCases : [])];
+if (realSessionsDir) {
+  cases.push(...await loadRealSessionCases({ sessionsDir: realSessionsDir, limit: realLimit }));
+}
+const filteredCases = caseFilter
+  ? cases.filter((testCase) => testCase.id.includes(caseFilter) || testCase.description.includes(caseFilter))
+  : cases;
+
+const result = await runOfflineCompactionBenchmark({ compactors, cases: filteredCases, includeDiagnostics, includeReports });
+const failures = result.cycles
+  .map((cycle) => ({ cycle, gates: failedGatesOf(cycle) }))
+  .filter((entry) => entry.gates.length > 0);
+const cacheFailures = result.cycles
+  .map((cycle) => ({ cycle, gates: failedCacheGatesOf(cycle) }))
+  .filter((entry) => entry.gates.length > 0);
+
+if (hasFlag("--explain")) {
+  for (const cycle of result.cycles) {
+    console.log(`## ${cycle.caseId} / ${cycle.compactor} / cycle ${cycle.cycle}`);
+    console.log(`compactionPoint=${cycle.compactionPoint} firstChangedPromptLayer=${cycle.firstChangedPromptLayer ?? "none"} stablePrefixTokens=${cycle.stablePrefixTokens ?? "n/a"}`);
+    if (cycle.compactionReport) {
+      console.log(formatCompactionReportCard(cycle.compactionReport, { expanded: true }));
+    } else {
+      console.log("No compaction report available for this compactor.");
+    }
+    console.log("");
+  }
+} else if (hasFlag("--jsonl")) {
+  for (const cycle of result.cycles) {
+    console.log(JSON.stringify(cycle));
+  }
+} else {
+  console.log(JSON.stringify(result, null, 2));
+}
+
+const printFailures = (title: string, entries: typeof failures) => {
+  console.error(`\n${title}: ${entries.length} cycle(s)`);
+  for (const { cycle, gates } of entries.slice(0, 20)) {
+    console.error(JSON.stringify({
+      caseId: cycle.caseId,
+      compactor: cycle.compactor,
+      cycle: cycle.cycle,
+      gates,
+      firstChangedPromptLayer: cycle.firstChangedPromptLayer,
+      stablePrefixTokens: cycle.stablePrefixTokens,
+      missingActiveTerms: cycle.missingActiveTerms,
+      missingCurrentTerms: cycle.missingCurrentTerms,
+      missingRecallTerms: cycle.missingRecallTerms,
+      leakedForbiddenTerms: cycle.leakedForbiddenTerms,
+      leakedForbiddenCurrentTerms: cycle.leakedForbiddenCurrentTerms,
+      leakedActiveAbsentTerms: cycle.leakedActiveAbsentTerms,
+    }));
+  }
+  if (entries.length > 20) {
+    console.error(`... ${entries.length - 20} additional failing cycle(s) omitted`);
+  }
+};
+
+if (hasFlag("--assert") && failures.length > 0) {
+  printFailures("Compaction benchmark assertions failed", failures);
+  process.exit(1);
+}
+
+if (hasFlag("--assert-cache") && cacheFailures.length > 0) {
+  printFailures("Compaction cache assertions failed", cacheFailures);
+  process.exit(1);
+}
diff --git a/scripts/compare-compaction-refs.mjs b/scripts/compare-compaction-refs.mjs
new file mode 100755
index 0000000..ee6f497
--- /dev/null
+++ b/scripts/compare-compaction-refs.mjs
@@ -0,0 +1,328 @@
+#!/usr/bin/env node
+import { spawnSync } from "node:child_process";
+import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { basename, join, resolve } from "node:path";
+
+const args = process.argv.slice(2);
+
+const valueOf = (name, fallback) => {
+  const inline = args.find((arg) => arg.startsWith(`${name}=`));
+  if (inline) return inline.slice(name.length + 1);
+  const index = args.indexOf(name);
+  return index >= 0 ? args[index + 1] : fallback;
+};
+
+const hasFlag = (name) => args.includes(name);
+
+const baselineRef = valueOf("--baseline", "53dc551");
+const headRef = valueOf("--head", "HEAD");
+const compactors = valueOf("--compactors", "pi-vcc");
+const realSessionsDir = valueOf("--real-sessions-dir");
+const realLimit = valueOf("--real-limit");
+const caseFilter = valueOf("--case-filter");
+const outDir = resolve(valueOf("--out", join(tmpdir(), `pi-vcc-compaction-compare-${Date.now()}`)));
+const keepWorktrees = hasFlag("--keep-worktrees");
+const includeRealOnly = hasFlag("--real-only");
+const includeLayerDiff = hasFlag("--show-layer-diff");
+const includeProbes = hasFlag("--include-probes");
+
+const run = (command, commandArgs, options = {}) => {
+  const result = spawnSync(command, commandArgs, {
+    cwd: options.cwd,
+    stdio: options.capture ? ["ignore", "pipe", "pipe"] : "inherit",
+    encoding: "utf8",
+  });
+  if (result.status !== 0) {
+    const rendered = `${command} ${commandArgs.join(" ")}`;
+    if (options.capture) {
+      process.stderr.write(result.stdout ?? "");
+      process.stderr.write(result.stderr ?? "");
+    }
+    throw new Error(`Command failed (${result.status}): ${rendered}`);
+  }
+  return result.stdout ?? "";
+};
+
+const repoRoot = run("git", ["rev-parse", "--show-toplevel"], { capture: true }).trim();
+
+const ensureRef = (ref) => {
+  run("git", ["rev-parse", "--verify", `${ref}^{commit}`], { cwd: repoRoot, capture: true });
+};
+
+const safeName = (value) => value.replace(/[^a-zA-Z0-9_.-]+/g, "-").replace(/^-+|-+$/g, "").slice(0, 60) || "ref";
+const runId = `${Date.now()}-${process.pid}`;
+const worktreeRoot = join(tmpdir(), `pi-vcc-ref-compare-${runId}`);
+const baselineWorktree = join(worktreeRoot, `baseline-${safeName(baselineRef)}`);
+const headWorktree = join(worktreeRoot, `head-${safeName(headRef)}`);
+
+const benchArgs = () => {
+  const out = ["--jsonl", "--compactors", compactors];
+  if (includeRealOnly) out.push("--real-only");
+  if (realSessionsDir) out.push("--real-sessions-dir", "/sessions");
+  if (realLimit) out.push("--real-limit", realLimit);
+  if (caseFilter) out.push("--case-filter", caseFilter);
+  if (includeLayerDiff) out.push("--show-layer-diff");
+  if (includeProbes) out.push("--include-probes");
+  return out;
+};
+
+const readJsonl = (path) => readFileSync(path, "utf8")
+  .split("\n")
+  .map((line) => line.trim())
+  .filter(Boolean)
+  .map((line) => JSON.parse(line));
+
+const correctnessFailures = (cycle) => [
+  ...(cycle.missingActiveTerms ?? []),
+  ...(cycle.missingCurrentTerms ?? []),
+  ...(cycle.missingRecallTerms ?? []),
+  ...(cycle.leakedForbiddenTerms ?? []),
+  ...(cycle.leakedForbiddenCurrentTerms ?? []),
+  ...(cycle.leakedActiveAbsentTerms ?? []),
+].length;
+
+const cacheBoundaries = JSON.parse(readFileSync(resolve(repoRoot, "bench/compaction/cache-boundaries.json"), "utf8"));
+
+const cacheFailures = (cycle) => {
+  const boundary = cacheBoundaries[cycle.caseId];
+  if (!boundary || cycle.cycle <= 1) return 0;
+  let count = 0;
+  if (!cycle.firstChangedPromptLayer || !boundary.allowedFirstChangedLayers.includes(cycle.firstChangedPromptLayer)) count += 1;
+  if ((cycle.stablePrefixTokens ?? 0) < boundary.minStablePrefixTokens) count += 1;
+  for (const [layer, maxSize] of Object.entries(boundary.maxPromptLayerSizes ?? {})) {
+    if ((cycle.promptLayerSizes?.[layer] ?? 0) > maxSize) count += 1;
+  }
+  return count;
+};
+
+const mean = (items, selector) => {
+  const values = items.map(selector).filter((value) => typeof value === "number" && Number.isFinite(value));
+  if (values.length === 0) return null;
+  return values.reduce((sum, value) => sum + value, 0) / values.length;
+};
+
+const fmt = (value, digits = 2) => value === null || value === undefined ? "n/a" : Number(value).toFixed(digits);
+const signed = (value, digits = 2) => value === null || value === undefined ? "n/a" : `${value >= 0 ? "+" : ""}${Number(value).toFixed(digits)}`;
+
+const RECENT_MUTABLE_LAYERS = [
+  "Pi MRC Recent Scope Updates",
+  "Pi MRC Recent User Preferences",
+  "Pi MRC Recent Evidence Handles",
+];
+
+const layerRank = (layer) => {
+  if (!layer) return 999;
+  if (layer === "Provider Prefix") return 0;
+  if (layer === "Tool Definitions") return 1;
+  if (layer === "Project Instructions") return 2;
+  if (layer.startsWith("Pi MRC Session Goal")) return 3;
+  if (layer.startsWith("Pi MRC Files")) return 4;
+  if (layer.startsWith("Pi MRC Commits")) return 5;
+  if (layer.startsWith("Pi MRC Evidence Handles")) return 6;
+  if (layer.startsWith("Pi MRC User Preferences")) return 7;
+  if (layer.startsWith("Pi MRC Current Scope")) return 8;
+  if (layer.startsWith("Pi MRC Recent")) return 9;
+  if (layer.startsWith("Pi MRC Outstanding")) return 10;
+  if (layer.startsWith("Pi MRC Brief")) return 11;
+  if (layer === "Kept Raw Tail") return 12;
+  return 50;
+};
+
+const rowLabel = (row) => `${row.caseId} / ${row.compactor} / cycle ${row.cycle}`;
+
+const summarize = (label, rows) => ({
+  label,
+  cycles: rows.length,
+  meanStablePrefixTokens: mean(rows, (row) => row.stablePrefixTokens),
+  meanFullPromptTokensEst: mean(rows, (row) => row.fullPromptTokensEst),
+  meanCurrentTokensEst: mean(rows, (row) => row.currentTokensEst),
+  correctnessFailureCycles: rows.filter((row) => correctnessFailures(row) > 0).length,
+  cacheFailureCycles: rows.filter((row) => cacheFailures(row) > 0).length,
+});
+
+const keyOf = (row) => `${row.caseId}\u0000${row.compactor}\u0000${row.cycle}`;
+
+const markdownReport = ({ baselineRows, headRows, baselinePath, headPath }) => {
+  const baseline = summarize("baseline", baselineRows);
+  const head = summarize("head", headRows);
+  const baselineByKey = new Map(baselineRows.map((row) => [keyOf(row), row]));
+  const pairs = headRows
+    .map((headRow) => ({ baselineRow: baselineByKey.get(keyOf(headRow)), headRow }))
+    .filter((pair) => pair.baselineRow);
+  const stableDeltas = pairs.map(({ baselineRow, headRow }) => (headRow.stablePrefixTokens ?? 0) - (baselineRow.stablePrefixTokens ?? 0));
+  const tokenDeltas = pairs.map(({ baselineRow, headRow }) => headRow.fullPromptTokensEst - baselineRow.fullPromptTokensEst);
+  const currentDeltas = pairs.map(({ baselineRow, headRow }) => headRow.currentTokensEst - baselineRow.currentTokensEst);
+  const improved = pairs.filter(({ baselineRow, headRow }) =>
+    (headRow.stablePrefixTokens ?? 0) > (baselineRow.stablePrefixTokens ?? 0)
+    || correctnessFailures(headRow) < correctnessFailures(baselineRow)
+    || cacheFailures(headRow) < cacheFailures(baselineRow)
+  );
+  const regressed = pairs.filter(({ baselineRow, headRow }) =>
+    (headRow.stablePrefixTokens ?? 0) < (baselineRow.stablePrefixTokens ?? 0)
+    || correctnessFailures(headRow) > correctnessFailures(baselineRow)
+    || cacheFailures(headRow) > cacheFailures(baselineRow)
+  );
+  const notable = pairs
+    .filter(({ baselineRow, headRow }) => baselineRow.firstChangedPromptLayer !== headRow.firstChangedPromptLayer
+      || correctnessFailures(baselineRow) !== correctnessFailures(headRow)
+      || cacheFailures(baselineRow) !== cacheFailures(headRow))
+    .slice(0, 20);
+  const worstStablePrefixDeltas = pairs
+    .filter(({ baselineRow, headRow }) => baselineRow.stablePrefixTokens != null && headRow.stablePrefixTokens != null)
+    .map(({ baselineRow, headRow }) => ({ baselineRow, headRow, delta: headRow.stablePrefixTokens - baselineRow.stablePrefixTokens }))
+    .sort((a, b) => a.delta - b.delta)
+    .slice(0, 10);
+  const largestPromptGrowth = pairs
+    .map(({ baselineRow, headRow }) => ({ baselineRow, headRow, delta: headRow.fullPromptTokensEst - baselineRow.fullPromptTokensEst }))
+    .sort((a, b) => b.delta - a.delta)
+    .slice(0, 10);
+  const earliestFirstChanged = headRows
+    .filter((row) => row.cycle > 1 && row.firstChangedPromptLayer)
+    .sort((a, b) => layerRank(a.firstChangedPromptLayer) - layerRank(b.firstChangedPromptLayer) || (a.stablePrefixTokens ?? 0) - (b.stablePrefixTokens ?? 0))
+    .slice(0, 10);
+  const largestRecentLayers = headRows
+    .flatMap((row) => RECENT_MUTABLE_LAYERS.map((layer) => ({ row, layer, size: row.promptLayerSizes?.[layer] ?? 0 })))
+    .filter((entry) => entry.size > 0)
+    .sort((a, b) => b.size - a.size)
+    .slice(0, 10);
+
+  const lines = [];
+  lines.push("# Compaction Ref Comparison");
+  lines.push("");
+  lines.push(`- Baseline ref: \`${baselineRef}\``);
+  lines.push(`- Head ref: \`${headRef}\``);
+  lines.push(`- Compactors: \`${compactors}\``);
+  if (realSessionsDir) lines.push(`- Real sessions: \`${realSessionsDir}\``);
+  if (realLimit) lines.push(`- Real session limit: \`${realLimit}\``);
+  if (caseFilter) lines.push(`- Case filter: \`${caseFilter}\``);
+  if (includeProbes) lines.push("- Probe cases: included");
+  lines.push(`- Baseline JSONL: \`${baselinePath}\``);
+  lines.push(`- Head JSONL: \`${headPath}\``);
+  lines.push("");
+  lines.push("## Aggregate");
+  lines.push("");
+  lines.push("| metric | baseline | head | delta |");
+  lines.push("| --- | ---: | ---: | ---: |");
+  lines.push(`| cycles | ${baseline.cycles} | ${head.cycles} | ${head.cycles - baseline.cycles} |`);
+  lines.push(`| mean stable prefix tokens | ${fmt(baseline.meanStablePrefixTokens)} | ${fmt(head.meanStablePrefixTokens)} | ${signed(mean(stableDeltas, (v) => v))} |`);
+  lines.push(`| mean full prompt tokens | ${fmt(baseline.meanFullPromptTokensEst)} | ${fmt(head.meanFullPromptTokensEst)} | ${signed(mean(tokenDeltas, (v) => v))} |`);
+  lines.push(`| mean current tokens | ${fmt(baseline.meanCurrentTokensEst)} | ${fmt(head.meanCurrentTokensEst)} | ${signed(mean(currentDeltas, (v) => v))} |`);
+  lines.push(`| correctness failure cycles | ${baseline.correctnessFailureCycles} | ${head.correctnessFailureCycles} | ${head.correctnessFailureCycles - baseline.correctnessFailureCycles} |`);
+  lines.push(`| cache failure cycles | ${baseline.cacheFailureCycles} | ${head.cacheFailureCycles} | ${head.cacheFailureCycles - baseline.cacheFailureCycles} |`);
+  lines.push("");
+  lines.push("## Matched-cycle signals");
+  lines.push("");
+  lines.push(`- Matched cycles: ${pairs.length}`);
+  lines.push(`- Improved cycles: ${improved.length}`);
+  lines.push(`- Regressed cycles: ${regressed.length}`);
+  lines.push("");
+  lines.push("## Notable changed cycles");
+  lines.push("");
+  if (notable.length === 0) {
+    lines.push("No notable first-layer, correctness, or cache-gate changes in matched cycles.");
+  } else {
+    lines.push("| case | compactor | cycle | baseline first layer | head first layer | stable prefix delta | correctness delta | cache delta |");
+    lines.push("| --- | --- | ---: | --- | --- | ---: | ---: | ---: |");
+    for (const { baselineRow, headRow } of notable) {
+      lines.push(`| ${headRow.caseId} | ${headRow.compactor} | ${headRow.cycle} | ${baselineRow.firstChangedPromptLayer ?? "n/a"} | ${headRow.firstChangedPromptLayer ?? "n/a"} | ${signed((headRow.stablePrefixTokens ?? 0) - (baselineRow.stablePrefixTokens ?? 0), 0)} | ${correctnessFailures(headRow) - correctnessFailures(baselineRow)} | ${cacheFailures(headRow) - cacheFailures(baselineRow)} |`);
    }
  }
+  lines.push("");
+  lines.push("## Outliers");
+  lines.push("");
+  lines.push("### Worst stable-prefix deltas");
+  lines.push("");
+  lines.push("| case | baseline | head | delta | head first layer |");
+  lines.push("| --- | ---: | ---: | ---: | --- |");
+  for (const { baselineRow, headRow, delta } of worstStablePrefixDeltas) {
+    lines.push(`| ${rowLabel(headRow)} | ${baselineRow.stablePrefixTokens ?? "n/a"} | ${headRow.stablePrefixTokens ?? "n/a"} | ${signed(delta, 0)} | ${headRow.firstChangedPromptLayer ?? "n/a"} |`);
+  }
+  lines.push("");
+  lines.push("### Largest full-prompt growth");
+  lines.push("");
+  lines.push("| case | baseline tokens | head tokens | delta | head first layer |");
+  lines.push("| --- | ---: | ---: | ---: | --- |");
+  for (const { baselineRow, headRow, delta } of largestPromptGrowth) {
+    lines.push(`| ${rowLabel(headRow)} | ${baselineRow.fullPromptTokensEst} | ${headRow.fullPromptTokensEst} | ${signed(delta, 0)} | ${headRow.firstChangedPromptLayer ?? "n/a"} |`);
+  }
+  lines.push("");
+  lines.push("### Earliest changed head layers");
+  lines.push("");
+  lines.push("| case | first changed layer | stable prefix tokens | full prompt tokens |");
+  lines.push("| --- | --- | ---: | ---: |");
+  for (const row of earliestFirstChanged) {
+    lines.push(`| ${rowLabel(row)} | ${row.firstChangedPromptLayer ?? "n/a"} | ${row.stablePrefixTokens ?? "n/a"} | ${row.fullPromptTokensEst} |`);
+  }
+  lines.push("");
+  lines.push("### Largest recent mutable layers");
+  lines.push("");
+  if (largestRecentLayers.length === 0) {
+    lines.push("No recent mutable layers were present in the head run.");
+  } else {
+    lines.push("| case | layer | chars |");
+    lines.push("| --- | --- | ---: |");
+    for (const { row, layer, size } of largestRecentLayers) {
+      lines.push(`| ${rowLabel(row)} | ${layer} | ${size} |`);
+    }
+  }
+  lines.push("");
+  return `${lines.join("\n")}\n`;
+};
+
+const builtImages = [];
+
+const runBench = ({ label, ref, worktree }) => {
+  console.error(`Adding ${label} worktree for ${ref}`);
+  run("git", ["worktree", "add", "--detach", worktree, ref], { cwd: repoRoot });
+  const image = `pi-mrc-bench-${safeName(label)}-${runId}`.toLowerCase();
+  console.error(`Building ${image}`);
+  run("docker", ["build", "-t", image, "."], { cwd: worktree });
+  builtImages.push(image);
+  const jsonlPath = join(outDir, `${label}.jsonl`);
+  const stderrPath = join(outDir, `${label}.stderr.log`);
+  const dockerArgs = ["run", "--rm"];
+  if (realSessionsDir) dockerArgs.push("-v", `${resolve(realSessionsDir)}:/sessions:ro`);
+  dockerArgs.push(image, ...benchArgs());
+  console.error(`Running ${label} benchmark`);
+  const result = spawnSync("docker", dockerArgs, { cwd: worktree, encoding: "utf8", stdio: ["ignore", "pipe", "pipe"] });
+  writeFileSync(jsonlPath, result.stdout ?? "");
+  writeFileSync(stderrPath, result.stderr ?? "");
+  if (result.status !== 0) {
+    process.stderr.write(result.stderr ?? "");
+    throw new Error(`${label} benchmark failed with status ${result.status}; see ${stderrPath}`);
+  }
+  return { jsonlPath, stderrPath };
+};
+
+try {
+  ensureRef(baselineRef);
+  ensureRef(headRef);
+  mkdirSync(outDir, { recursive: true });
+  mkdirSync(worktreeRoot, { recursive: true });
+
+  const baseline = runBench({ label: "baseline", ref: baselineRef, worktree: baselineWorktree });
+  const head = runBench({ label: "head", ref: headRef, worktree: headWorktree });
+  const report = markdownReport({
+    baselineRows: readJsonl(baseline.jsonlPath),
+    headRows: readJsonl(head.jsonlPath),
+    baselinePath: baseline.jsonlPath,
+    headPath: head.jsonlPath,
+  });
+  const reportPath = join(outDir, "comparison.md");
+  writeFileSync(reportPath, report);
+  console.log(report);
+  console.error(`Wrote ${reportPath}`);
+} finally {
+  if (!keepWorktrees && existsSync(worktreeRoot)) {
+    for (const worktree of [baselineWorktree, headWorktree]) {
+      if (existsSync(worktree)) {
+        spawnSync("git", ["worktree", "remove", "--force", worktree], { cwd: repoRoot, stdio: "ignore" });
+      }
+    }
+    rmSync(worktreeRoot, { recursive: true, force: true });
+    for (const image of builtImages) {
+      spawnSync("docker", ["rmi", image], { stdio: "ignore" });
+    }
+  }
+}
diff --git a/src/commands/pi-mrc-control.ts b/src/commands/pi-mrc-control.ts
new file mode 100644
index 0000000..d3a81d6
--- /dev/null
+++ b/src/commands/pi-mrc-control.ts
@@ -0,0 +1,37 @@
+import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
+
+const disabledSessions = new Set<string>();
+
+const sessionKeyOf = (ctx: { sessionManager?: { getSessionFile?: () => string | undefined } }): string | undefined =>
+  ctx.sessionManager?.getSessionFile?.();
+
+export const isPiMrcDisabled = (sessionFile?: string): boolean =>
+  !!sessionFile && disabledSessions.has(sessionFile);
+
+export const registerPiMrcControlCommands = (pi: ExtensionAPI) => {
+  pi.registerCommand("pi-mrc-off", {
+    description: "Disable pi-mrc compaction interception for this session",
+    handler: async (_args, ctx) => {
+      const sessionKey = sessionKeyOf(ctx);
+      if (!sessionKey) {
+        ctx.ui.notify("pi-mrc: No session file available; cannot disable this session.", "warning");
+        return;
+      }
+      disabledSessions.add(sessionKey);
+      ctx.ui.notify("pi-mrc disabled for this session. Pi's built-in compactor will handle /compact and auto-compaction.", "info");
+    },
+  });
+
+  pi.registerCommand("pi-mrc-on", {
+    description: "Enable pi-mrc compaction interception for this session",
+    handler: async (_args, ctx) => {
+      const sessionKey = sessionKeyOf(ctx);
+      if (!sessionKey) {
+        ctx.ui.notify("pi-mrc: No session file available; cannot enable this session.", "warning");
+        return;
+      }
+      disabledSessions.delete(sessionKey);
+      ctx.ui.notify("pi-mrc enabled for this session.", "info");
+    },
+  });
+};
diff --git a/src/commands/pi-mrc-dump-context.ts b/src/commands/pi-mrc-dump-context.ts
new file mode 100644
index 0000000..ce98cac
--- /dev/null
+++ b/src/commands/pi-mrc-dump-context.ts
@@ -0,0 +1,165 @@
+/**
+ * /pi-mrc-dump-context command.
+ *
+ * Extracts a structured context guide from the current session JSONL
+ * without triggering any compaction. Writes Markdown by default;
+ * supports --raw for session JSONL, --raw-context for Pi AgentMessage context,
+ * --raw-provider for the exact provider request payload, and --summary for inline display.
+ *
+ * Usage:
+ *   /pi-mrc-dump-context                     → writes to /tmp/pi-mrc-context-guide.md
+ *   /pi-mrc-dump-context /path/to/output.md  → writes to specified path
+ *   /pi-mrc-dump-context --raw               → dumps raw active branch as JSONL
+ *   /pi-mrc-dump-context --raw /path/to/out.jsonl → raw JSONL to specified path
+ *   /pi-mrc-dump-context --raw-context       → dumps latest captured Pi AgentMessage[] context
+ *   /pi-mrc-dump-context --raw-provider      → dumps latest provider request payload
+ *   /pi-mrc-dump-context --summary           → displays extracted context inline
+ */
+
+import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
+import { statSync, writeFileSync, mkdirSync, existsSync } from "fs";
+import { dirname } from "path";
+import {
+  extractContext,
+  extractContextFromBuffer,
+  formatContextGuide,
+  writeContextGuide,
+  dumpRawSessionJsonl,
+} from "../core/dump-context";
+
+export const registerDumpContextCommand = (pi: ExtensionAPI) => {
+  pi.registerCommand("pi-mrc-dump-context", {
+    description:
+      "Extract structured context guide from session JSONL. Args: [output path] [--raw] [--raw-context] [--raw-provider] [--summary]. No compaction is triggered.",
+    handler: async (args: string, ctx) => {
+      const sessionFile = ctx.sessionManager.getSessionFile();
+      if (!sessionFile) {
+        ctx.ui.notify("No session file available.", "error");
+        return;
+      }
+
+      const raw = args.trim();
+      const argv = raw.split(/\s+/).filter(Boolean);
+      const hasFlag = (flag: string): boolean => argv.includes(flag);
+      const isRawContext = hasFlag("--raw-context");
+      const isRawProvider = hasFlag("--raw-provider") || hasFlag("--raw-request") || hasFlag("--raw-model");
+      const isRaw = hasFlag("--raw");
+      const isSummary = hasFlag("--summary");
+
+      const pathArg = argv
+        .filter((arg) => !["--raw-context", "--raw-provider", "--raw-request", "--raw-model", "--raw", "--summary"].includes(arg))
+        .join(" ");
+
+      // --raw-provider: dump exactly the latest provider request payload seen by Pi.
+      if (isRawProvider) {
+        const { readProviderRequestBuffer, listBufferedSessions } = await import("../core/context-buffer");
+        const slots = readProviderRequestBuffer(sessionFile);
+        if (slots.length === 0) {
+          const sessions = listBufferedSessions();
+          if (sessions.length === 0) {
+            ctx.ui.notify("No provider request buffer found. Prompt the agent at least once first.", "warning");
+            return;
+          }
+          ctx.ui.notify(`No provider request buffer for this session. Available: ${sessions.map((s: any) => s.file).join(", ")}`, "warning");
+          return;
+        }
+        const latest = slots[slots.length - 1];
+        const payload = latest?.payload;
+        if (payload === undefined) {
+          ctx.ui.notify("No payload in latest provider request buffer slot.", "warning");
+          return;
+        }
+
+        const outPath = pathArg || `/tmp/pi-mrc-raw-provider-${Date.now()}.json`;
+        const dir = dirname(outPath);
+        if (!existsSync(dir)) mkdirSync(dir, { recursive: true });
+        writeFileSync(outPath, JSON.stringify(payload, null, 2));
+        const size = statSync(outPath).size;
+        ctx.ui.notify(`Raw provider request dumped: ${outPath} (${(size / 1024).toFixed(0)} KB, ${slots.length} buffer slots)`, "info");
+        return;
+      }
+
+      // --raw-context: dump just the latest Pi AgentMessage[] context payload.
+      if (isRawContext) {
+        // Look up buffer for this session
+        const { readContextBuffer, listBufferedSessions } = await import("../core/context-buffer");
+        const slots = readContextBuffer(sessionFile);
+        if (slots.length === 0) {
+          const sessions = listBufferedSessions();
+          if (sessions.length === 0) {
+            ctx.ui.notify("No context buffer found. Prompt the agent at least once first.", "warning");
+            return;
+          }
+          ctx.ui.notify(`No buffer for this session. Available: ${sessions.map((s: any) => s.file).join(", ")}`, "warning");
+          return;
+        }
+        const latest = slots[slots.length - 1];
+        const messages = latest?.messages;
+        if (!Array.isArray(messages)) {
+          ctx.ui.notify("No messages in latest buffer slot.", "warning");
+          return;
+        }
+
+        const outPath = pathArg || `/tmp/pi-mrc-raw-context-${Date.now()}.json`;
+        const dir = dirname(outPath);
+        if (!existsSync(dir)) mkdirSync(dir, { recursive: true });
+        writeFileSync(outPath, JSON.stringify(messages, null, 2));
+        const size = statSync(outPath).size;
+        ctx.ui.notify(`Raw context dumped: ${outPath} (${(size / 1024).toFixed(0)} KB, ${messages.length} messages, ${slots.length} buffer slots)`, "info");
+        return;
+      }
+
+      // --raw: dump raw JSONL. This does not require successful context extraction.
+      if (isRaw) {
+        const outPath = pathArg || undefined;
+        const written = dumpRawSessionJsonl(sessionFile, outPath);
+        const size = statSync(written).size;
+        ctx.ui.notify(`Raw session dumped: ${written} (${(size / 1024).toFixed(0)} KB)`, "info");
+        return;
+      }
+
+      // Try real context buffer first, fall back to session extraction
+      let extracted = extractContextFromBuffer(sessionFile);
+      let sourceLabel = "real context buffer";
+      if (!extracted) {
+        extracted = extractContext(sessionFile);
+        sourceLabel = "session file";
+      }
+      if (!extracted) {
+        ctx.ui.notify("Failed to extract context from buffer or session file.", "error");
+        return;
+      }
+
+      if (isSummary) {
+        const guide = formatContextGuide(extracted, sessionFile);
+        pi.sendMessage({
+          customType: "mrc-context-dump",
+          content: guide,
+          display: true,
+        });
+        return;
+      }
+
+      // Default: write context guide Markdown
+      const outPath = pathArg || undefined;
+      const written = writeContextGuide(extracted, sessionFile, outPath);
+      const size = statSync(written).size;
+      ctx.ui.notify(`Context guide written (${sourceLabel}): ${written} (${(size / 1024).toFixed(1)} KB)`, "info");
+
+      const summary = [
+        `Context guide for ${extracted.stats.sessionId} (${sourceLabel})`,
+        `  Goals: ${extracted.goal.length}`,
+        `  Decisions: ${extracted.decisions.length}`,
+        `  Preferences: ${extracted.preferences.length}`,
+        `  Modified files: ${extracted.filesModified.size}`,
+        `  Recent user messages: ${extracted.recentUserMessages.length}`,
+        `  Compaction summaries: ${extracted.compactionSummaries.length}`,
+      ];
+      pi.sendMessage({
+        customType: "mrc-context-dump",
+        content: summary.join("\n"),
+        display: true,
+      });
+    },
+  });
+};
diff --git a/src/commands/pi-mrc-report.ts b/src/commands/pi-mrc-report.ts
new file mode 100644
index 0000000..7fd4d70
--- /dev/null
+++ b/src/commands/pi-mrc-report.ts
@@ -0,0 +1,96 @@
+import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
+import { readFileSync } from "fs";
+import {
+  findCompactionReportRecords,
+  formatCompactionReportCommandSummary,
+  formatCompactionReportRecordList,
+  PI_MRC_REPORT_COMMAND_TYPE,
+  selectCompactionReportRecord,
+  writeCompactionReportArtifacts,
+} from "../core/compaction-report-history";
+import { formatCompactionReportCard } from "../core/compaction-report";
+
+const parseSessionFileEntries = (sessionFile: string | undefined): any[] => {
+  if (!sessionFile) return [];
+  try {
+    return readFileSync(sessionFile, "utf-8")
+      .split("\n")
+      .filter((line) => line.trim())
+      .map((line) => {
+        try { return JSON.parse(line); } catch { return undefined; }
+      })
+      .filter(Boolean);
+  } catch {
+    return [];
+  }
+};
+
+const sessionEntriesOf = (ctx: any): any[] => {
+  try {
+    const entries = ctx.sessionManager.getEntries?.();
+    if (Array.isArray(entries) && entries.length > 0) return entries;
+  } catch {
+    // Defensive fallback: session managers from older Pi versions or partially
+    // loaded sessions can throw; the JSONL parser below still gives a report view.
+  }
+  return parseSessionFileEntries(ctx.sessionManager.getSessionFile?.());
+};
+
+const entryIdFromArgs = (args: string): string | undefined =>
+  args.match(/\bentry:([^\s]+)/i)?.[1];
+
+export const registerPiMrcReportCommand = (pi: ExtensionAPI) => {
+  pi.registerCommand("pi-mrc-report", {
+    description: "Inspect latest pi-mrc compaction report; args: list, show, json, entry:",
+    handler: async (args: string, ctx) => {
+      const raw = args.trim();
+      const lower = raw.toLowerCase();
+      const records = findCompactionReportRecords(sessionEntriesOf(ctx));
+
+      if (lower.includes("list")) {
+        pi.sendMessage({
+          customType: PI_MRC_REPORT_COMMAND_TYPE,
+          content: formatCompactionReportRecordList(records),
+          display: true,
+        });
+        return;
+      }
+
+      const entryId = entryIdFromArgs(raw);
+      const record = selectCompactionReportRecord(records, entryId);
+      if (!record) {
+        const suffix = entryId ? ` for entry ${entryId}` : "";
+        ctx.ui.notify(`No pi-mrc compaction report found${suffix}.`, "warning");
+        return;
+      }
+
+      if (lower.includes("json") && lower.includes("inline")) {
+        pi.sendMessage({
+          customType: PI_MRC_REPORT_COMMAND_TYPE,
+          content: `\`\`\`json\n${JSON.stringify(record.report, null, 2)}\n\`\`\``,
+          display: true,
+          details: record.report,
+        });
+        return;
+      }
+
+      if (lower.includes("show") || lower.includes("inline")) {
+        pi.sendMessage({
+          customType: PI_MRC_REPORT_COMMAND_TYPE,
+          content: formatCompactionReportCard(record.report, { expanded: true }),
+          display: true,
+          details: record.report,
+        });
+        return;
+      }
+
+      const artifacts = writeCompactionReportArtifacts(record);
+      pi.sendMessage({
+        customType: PI_MRC_REPORT_COMMAND_TYPE,
+        content: formatCompactionReportCommandSummary(record, artifacts),
+        display: true,
+        details: { report: record.report, artifacts },
+      });
+    },
+  });
+};
diff --git a/src/commands/pi-vcc.ts b/src/commands/pi-mrc.ts
similarity index 67%
rename from src/commands/pi-vcc.ts
rename to src/commands/pi-mrc.ts
index 608d691..c472617 100644
--- a/src/commands/pi-vcc.ts
+++ b/src/commands/pi-mrc.ts
@@ -1,26 +1,26 @@
 import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
-import { getLastCompactionStats, PI_VCC_COMPACT_INSTRUCTION } from "../hooks/before-compact";
+import { getLastCompactionStats, PI_MRC_COMPACT_INSTRUCTION } from "../hooks/before-compact";
 
 const formatTokens = (n: number): string => {
   if (n >= 1000) return `${(n / 1000).toFixed(1)}k`;
   return String(n);
 };
 
-export const registerPiVccCommand = (pi: ExtensionAPI) => {
-  pi.registerCommand("pi-vcc", {
-    description: "Compact conversation with pi-vcc structured summary",
+export const registerPiMrcCommand = (pi: ExtensionAPI) => {
+  pi.registerCommand("pi-mrc", {
+    description: "Compact conversation with pi-mrc model-reference compaction",
     handler: async (_args, ctx) => {
       ctx.compact({
-        customInstructions: PI_VCC_COMPACT_INSTRUCTION,
+        customInstructions: PI_MRC_COMPACT_INSTRUCTION,
         onComplete: () => {
           const stats = getLastCompactionStats();
           if (stats) {
             ctx.ui.notify(
-              `pi-vcc: ${stats.summarized} source entries processed; tail kept ${stats.kept} (~${formatTokens(stats.keptTokensEst)} tok).`,
+              `pi-mrc: ${stats.summarized} source entries processed; tail kept ${stats.kept} (~${formatTokens(stats.keptTokensEst)} tok).`,
               "info",
             );
           } else {
-            ctx.ui.notify("Compacted with pi-vcc", "info");
+            ctx.ui.notify("Compacted with pi-mrc", "info");
           }
         },
         onError: (err) => {
diff --git a/src/commands/vcc-recall.ts b/src/commands/vcc-recall.ts
deleted file mode 100644
index 8dcb509..0000000
--- a/src/commands/vcc-recall.ts
+++ /dev/null
@@ -1,65 +0,0 @@
-import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
-import { loadAllMessages } from "../core/load-messages";
-import { searchEntries } from "../core/search-entries";
-import { formatRecallOutput } from "../core/format-recall";
-import { getActiveLineageEntryIds } from "../core/lineage";
-import { parseRecallScope } from "../core/recall-scope";
-
-const PAGE_SIZE = 5;
-const DEFAULT_RECENT = 25;
-
-export const registerVccRecallCommand = (pi: ExtensionAPI) => {
-  pi.registerCommand("pi-vcc-recall", {
-    description: "Search session history. Defaults to active lineage; add scope:all for off-lineage branches.",
-    handler: async (args: string, ctx) => {
-      const sessionFile = ctx.sessionManager.getSessionFile();
-      if (!sessionFile) {
-        ctx.ui.notify("No session file available.", "error");
-        return;
-      }
-
-      const raw = args.trim();
-      const parsed = parseRecallScope(raw);
-      const lineageEntryIds = parsed.scope === "lineage"
-        ? getActiveLineageEntryIds(ctx.sessionManager)
-        : undefined;
-      if (!parsed.text) {
-        // No query: show recent
-        const { rendered } = loadAllMessages(sessionFile, false, lineageEntryIds);
-        const recent = rendered.slice(-DEFAULT_RECENT);
-        const output = (parsed.scope === "all" ? "Scope: all\n\n" : "") + formatRecallOutput(recent);
-        pi.sendMessage({ customType: "vcc-recall", content: output, display: true }, { triggerTurn: true });
-        return;
-      }
-
-      // Parse page:N from args
-      const pageMatch = parsed.text.match(/\bpage:(\d+)\b/i);
-      const page = pageMatch ? Math.max(1, parseInt(pageMatch[1], 10)) : 1;
-      const query = parsed.text.replace(/\bpage:\d+\b/i, "").trim();
-
-      if (!query) {
-        const { rendered } = loadAllMessages(sessionFile, false, lineageEntryIds);
-        const recent = rendered.slice(-DEFAULT_RECENT);
-        const output = (parsed.scope === "all" ? "Scope: all\n\n" : "") + formatRecallOutput(recent);
-        pi.sendMessage({ customType: "vcc-recall", content: output, display: true }, { triggerTurn: true });
-        return;
-      }
-
-      const { rendered, rawMessages } = loadAllMessages(sessionFile, false, lineageEntryIds);
-      const allResults = searchEntries(rendered, rawMessages, query);
-
-      const start = (page - 1) * PAGE_SIZE;
-      const pageResults = allResults.slice(start, start + PAGE_SIZE);
-      const totalPages = Math.ceil(allResults.length / PAGE_SIZE);
-      const scopeSuffix = parsed.scope === "all" ?
" (scope: all)" : ""; - const header = totalPages > 1 - ? `Page ${page}/${totalPages} (${allResults.length} total matches${scopeSuffix})` - : `${allResults.length} matches${scopeSuffix}`; - const footer = page < totalPages - ? `\n--- /pi-vcc-recall ${query}${parsed.scope === "all" ? " scope:all" : ""} page:${page + 1} ---` - : ""; - const output = formatRecallOutput(pageResults, query, header) + footer; - pi.sendMessage({ customType: "vcc-recall", content: output, display: true }, { triggerTurn: true }); - }, - }); -}; diff --git a/src/core/brief.ts b/src/core/brief.ts index c53ce14..25a3b8b 100644 --- a/src/core/brief.ts +++ b/src/core/brief.ts @@ -1,5 +1,6 @@ import type { NormalizedBlock } from "../types"; -import { clip, firstLine } from "./content"; +import { clip } from "./content"; +import { summarizeToolResultForPrompt } from "./tool-result-summary"; import { extractPath } from "./tool-args"; import { collapseSkillText } from "./skill-collapse"; @@ -181,7 +182,7 @@ export const buildBriefSections = (blocks: NormalizedBlock[]): BriefLine[] => { } case "tool_result": { if (b.isError) { - const body = firstLine(b.text, 150); + const body = summarizeToolResultForPrompt(b.text); // Drop empty/placeholder error bodies — keep the line only if it carries info. if (!body || body === "(no output)") break; const ref = b.sourceIndex != null ? 
` (#${b.sourceIndex})` : ""; diff --git a/src/core/build-sections.ts b/src/core/build-sections.ts index 58c4bb1..e516fe5 100644 --- a/src/core/build-sections.ts +++ b/src/core/build-sections.ts @@ -1,11 +1,14 @@ import type { NormalizedBlock } from "../types"; -import { clip, clipSentence, firstLine, nonEmptyLines } from "./content"; +import { clip, clipSentence, nonEmptyLines } from "./content"; +import { summarizeToolResultForPrompt } from "./tool-result-summary"; import type { SectionData } from "../sections"; -import { extractGoals } from "../extract/goals"; +import { extractGoalState } from "../extract/goals"; import { extractFiles } from "../extract/files"; import { extractPreferences, dedupPreferencesAgainstGoals } from "../extract/preferences"; import { extractCommits, formatCommits } from "../extract/commits"; +import { extractEvidence, formatEvidence } from "../extract/evidence"; import { buildBriefSections, sectionsToTranscript, stringifyBrief } from "./brief"; +import { extractPath } from "./tool-args"; export interface BuildSectionsInput { blocks: NormalizedBlock[]; @@ -20,7 +23,7 @@ const extractOutstandingContext = (blocks: NormalizedBlock[]): string[] => { for (const b of tail) { if (b.kind === "tool_result" && b.isError) { - items.push(`[${b.name}] ${firstLine(b.text, 150)}`); + items.push(`[${b.name}] ${summarizeToolResultForPrompt(b.text)}`); continue; } @@ -51,7 +54,7 @@ const formatFileActivity = (blocks: NormalizedBlock[]): string[] => { const cap = (set: Set, limit: number) => { const arr = [...set]; if (arr.length <= limit) return arr.join(", "); - return arr.slice(0, limit).join(", ") + ` (+${arr.length - limit} more)`; + return arr.slice(0, limit).join(", ") + " (+more)"; }; if (act.modified.size > 0) lines.push(`Modified: ${cap(act.modified, 10)}`); if (act.created.size > 0) lines.push(`Created: ${cap(act.created, 10)}`); @@ -59,19 +62,84 @@ const formatFileActivity = (blocks: NormalizedBlock[]): string[] => { return lines; }; +const 
READ_TOOLS = new Set(["Read", "read", "read_file", "View"]); + +const readLineScore = (line: string): number => { + let score = 0; + if (/\b(createRequire|register[A-Z]\w*|supports\w+|handler|schema|strategy|compactor)\b/.test(line)) score += 5; + if (/\bexport\s+(function|class|const|interface|type)\b/.test(line)) score += 3; + if (/^import\b/.test(line)) score += 1; + if (/\b(return|if|else)\b/.test(line)) score += 1; + return score; +}; + +const importantReadLines = (text: string): string[] => { + const candidates = text + .split("\n") + .map((line, order) => ({ line: line.trim(), order })) + .filter((candidate) => candidate.line) + .map((candidate) => ({ ...candidate, score: readLineScore(candidate.line) })) + .filter((candidate) => candidate.score > 0) + .sort((a, b) => b.score - a.score || a.order - b.order) + .slice(0, 4) + .sort((a, b) => a.order - b.order); + return candidates.map((candidate) => clip(candidate.line, 110)); +}; + +const readContextScore = (path: string, lines: string[]): number => { + let score = 0; + if (/\b(loader|resolver|runtime|hook|strategy|compactor|session|auth|cache)\b/i.test(path)) score += 2; + if (/\b(generated|fixture|snapshot|noise)\b/i.test(path)) score -= 4; + const text = lines.join("\n"); + if (/\b(register[A-Z]\w*|createRequire|supports\w+|handler|schema|strategy|compactor)\b/.test(text)) score += 3; + if (/\b(export function|export class|export const|interface|type )\b/.test(text)) score += 1; + return score; +}; + +const extractReadContext = (blocks: NormalizedBlock[]): string[] => { + const readResults: { path: string; lines: string[]; score: number; order: number }[] = []; + const pendingReadPaths: Array = []; + + for (const [index, block] of blocks.entries()) { + if (block.kind === "tool_call") { + if (READ_TOOLS.has(block.name)) { + pendingReadPaths.push(extractPath(block.args)); + } + continue; + } + if (block.kind !== "tool_result" || !READ_TOOLS.has(block.name)) continue; + const readPath = 
pendingReadPaths.shift(); + if (!readPath || block.isError) continue; + const lines = importantReadLines(block.text); + if (lines.length === 0) continue; + const score = readContextScore(readPath, lines); + if (score <= 0) continue; + readResults.push({ path: readPath, lines, score, order: index }); + } + + return readResults + .sort((a, b) => b.score - a.score || a.order - b.order) + .slice(0, 4) + .sort((a, b) => a.order - b.order) + .map((result) => `${result.path}: ${clip(result.lines.join("; "), 220)}`); +}; + export const buildSections = (input: BuildSectionsInput): SectionData => { const { blocks } = input; const briefSections = buildBriefSections(blocks); - const sessionGoal = extractGoals(blocks); + const goalState = extractGoalState(blocks); const userPreferences = dedupPreferencesAgainstGoals( extractPreferences(blocks), - sessionGoal, + [...goalState.stableGoals, ...goalState.currentScope], ); return { - sessionGoal, + sessionGoal: goalState.stableGoals, + currentScope: goalState.currentScope, outstandingContext: extractOutstandingContext(blocks), filesAndChanges: formatFileActivity(blocks), + readContext: extractReadContext(blocks), commits: formatCommits(extractCommits(blocks)), + evidenceHandles: formatEvidence(extractEvidence(blocks)), userPreferences, briefTranscript: stringifyBrief(briefSections), transcriptEntries: sectionsToTranscript(briefSections), diff --git a/src/core/chunk-model.ts b/src/core/chunk-model.ts new file mode 100644 index 0000000..c919f39 --- /dev/null +++ b/src/core/chunk-model.ts @@ -0,0 +1,132 @@ +/** + * Chunk model for the model-reference compactor. + * + * Splits compaction state into referenceable chunks, each with a stable ID + * that survives across compactions. The model classifies these chunks into + * KEEP (active prompt), REF (retrievable index), or DROP (archive only). 
+ */
+
+import type { CompactionState } from "./compaction-state";
+
+export type ChunkKind =
+  | "goal"
+  | "scope"
+  | "recent-scope"
+  | "file"
+  | "read-context"
+  | "commit"
+  | "recent-commit"
+  | "evidence"
+  | "recent-evidence"
+  | "preference"
+  | "recent-preference"
+  | "outstanding-context"
+  | "transcript-line"
+  | "recall";
+
+export interface CompactionChunk {
+  /** Stable ID, e.g. "sessionGoal:0", "evidence:2", "transcript:15" */
+  id: string;
+  kind: ChunkKind;
+  /** Full text content, preserved verbatim when in KEEP tier */
+  text: string;
+  /** Source section name for reconstruction */
+  section: string;
+  /** 0-based index within the section */
+  index: number;
+}
+
+/**
+ * Build chunks from a CompactionState.
+ *
+ * Each section item becomes one chunk. Transcript lines are split per line.
+ * Chunk IDs use the pattern `section:index` and are stable as long as
+ * the section's items retain their identity across compactions.
+ */
+export const chunkCompactionState = (state: CompactionState): CompactionChunk[] => {
+  const chunks: CompactionChunk[] = [];
+
+  const items = (
+    kind: ChunkKind,
+    section: string,
+    source: string[],
+  ): void => {
+    for (let i = 0; i < source.length; i++) {
+      chunks.push({ id: `${section}:${i}`, kind, text: source[i], section, index: i });
+    }
+  };
+
+  items("goal", "sessionGoal", state.current.sessionGoal);
+  items("scope", "currentScope", state.current.currentScope);
+  items("recent-scope", "recentScope", state.current.recentScopeUpdates);
+  items("file", "files", state.current.filesAndChanges);
+  items("read-context", "readContext", state.current.readContext);
+  items("commit", "commits", state.current.commits);
+  items("recent-commit", "recentCommits", state.current.recentCommits);
+  items("evidence", "evidence", state.current.evidenceHandles);
+  items("recent-evidence", "recentEvidence", state.current.recentEvidenceHandles);
+  items("preference", "preferences", state.current.userPreferences);
+  items("recent-preference", "recentPreferences", state.current.recentUserPreferences);
+  items("outstanding-context", "outstanding", state.current.outstandingContext);
+
+  // Transcript lines
+  const transcriptLines = state.history.briefTranscript
+    .split("\n")
+    .filter((line) => line.trim().length > 0);
+  for (let i = 0; i < transcriptLines.length; i++) {
+    chunks.push({
+      id: `transcript:${i}`,
+      kind: "transcript-line",
+      text: transcriptLines[i],
+      section: "transcript",
+      index: i,
+    });
+  }
+
+  return chunks;
+};
+
+export interface SubGoal {
+  /** CURRENT subgoals are priority-ordered; COMPLETED subgoals prevent rework. */
+  status: "CURRENT" | "COMPLETED";
+  label: string;
+  /** Priority reason for CURRENT subgoals; outcome/rationale for COMPLETED subgoals. */
+  note: string;
+  recallCondition: string;
+  ref: string; // chunk IDs or bundle:name
+}
+
+/** Classification result from the model */
+export interface ChunkClassification {
+  keepIds: string[];
+  refs: Array<{ id: string; summary: string }>;
+  dropIds: string[];
+  mvs: string;
+  overarching?: string;
+  subGoals?: SubGoal[];
+  /** Parked goal bundles for later revival */
+  bundles?: GoalBundle[];
+}
+
+/** A parked goal context bundle */
+export interface GoalBundle {
+  id: string;
+  label: string;
+  recallCondition: string;
+  chunkIds: string[];
+}
+
+/** A single REF index entry stored in Tier 2 */
+export interface RefIndexEntry {
+  id: string;
+  summary: string;
+  /** Compaction cycle when this was last classified as REF */
+  cycle: number;
+  /** Times this chunk has been promoted from REF to KEEP */
+  promotionCount: number;
+}
+
+/** Tier 2 retrievable index */
+export interface RefIndex {
+  entries: RefIndexEntry[];
+}
diff --git a/src/core/classifier.ts b/src/core/classifier.ts
new file mode 100644
index 0000000..8e4421b
--- /dev/null
+++ b/src/core/classifier.ts
@@ -0,0 +1,376 @@
+/**
+ * Real LLM classifier using an OpenAI-compatible chat API.
+ *
+ * Sends conversation chunks to a cheap model (default DeepSeek Flash) which
+ * classifies them into KEEP (critical, keep in active prompt), REF (useful,
+ * store in retrievable index), or DROP (archive only). The model also writes
+ * a short Minimum Viable Summary paragraph.
+ *
+ * The model's job is classification, not content creation. Chunk text is
+ * preserved verbatim; the model only picks which to keep and writes one-line
+ * summaries for REF chunks and the MVS paragraph.
+ */
+
+import type { CompactionChunk, ChunkClassification } from "./chunk-model";
+
+export interface ClassifierConfig {
+  /** API base URL (OpenAI-compatible) */
+  baseUrl: string;
+  /** API key */
+  apiKey: string;
+  /** Model name (e.g. "deepseek-chat", "gpt-4o-mini") */
+  model: string;
+  /** Maximum output tokens */
+  maxTokens?: number;
+  /** Timeout in ms */
+  timeoutMs?: number;
+}
+
+export interface ClassifierResult extends ChunkClassification {
+  /** Real token usage from API response */
+  usage?: {
+    promptTokens: number;
+    completionTokens: number;
+  };
+}
+
+const CLASSIFIER_SYSTEM_PROMPT = `You are a context compaction classifier. Your job is to classify conversation chunks into tiers so a future LLM can continue the work efficiently.
+
+DO NOT rewrite or summarize the chunk content. You only:
+1. Decide which chunks to KEEP, REF, or DROP
+2. Write actionable REF summaries with recall conditions
+3. Group parked old-goal chunks into BUNDLE entries
+4. Write a short Minimum Viable Summary (MVS) paragraph
+
+Classification rules:
+
+DECISION PRINCIPLE: For each chunk, ask "Would a new agent need this to make its NEXT tool call or file edit?" If yes → KEEP. If it might help later but not now → REF. If no agent would ever need it → DROP.
+
+SOURCE RECOVERABILITY RULE: Repository source files are cheap, authoritative, and rereadable.
+- Do NOT KEEP or REF full source snippets, function bodies, type bodies, or config bodies when a path/symbol/line hint lets the agent reread the source.
+- For source-derived context, preserve only minimal locators: file path, symbol/function/class/type name, optional line hint, and why it matters.
+- DROP source body details that are easy to recover with read/rg/code-intel.
+- KEEP source-derived details only when they are not easily recoverable: uncommitted/deleted edits not present in files, generated/transient output, exact errors, benchmark results, user decisions, constraints, or non-obvious investigation conclusions.
+- Prefer conversation-only state over source-visible state.
+
+- KEEP: ONLY what is directly actionable for the IMMEDIATE next step. A new agent reading only KEEP chunks should know: 1) what to work on, 2) which files to touch, 3) what constraints are active, 4) what was just decided. If you can't explain why a chunk would directly affect the next read/edit/bash call, put it in REF.
+  Priority: user's last explicit decision > currently edited files > active constraints > current goal > recent evidence. Do NOT keep: old-phase goals, review meta-guidelines, generic evidence without identifiers, repeated goal variants, rereadable source bodies.
+
+- REF: Context an agent might need if the conversation returns to a topic. Write "Recall if <condition>" so the agent knows WHEN to retrieve this.
+  INLINING RULE: If the chunk content is shorter than ~120 chars — shorter than or close to the recall condition you would write — just KEEP it instead. Don't make the agent recall something it could just read.
+  RECOVERABLE SOURCE RULE: if the full content is in a repository file, the REF summary should be a locator/trigger (path + symbol + why), not a paraphrase of the source body.
+
+- DROP: Fluff, status updates, duplicates, greetings, stale metadata, and source-visible details that can be reread from a path/symbol locator.
+
+KEEP BUDGET: Target ~800-1,500 characters of KEEP output total (roughly 15-25 chunks depending on size). If you exceed the character budget, move lowest-priority items to REF. Prefer keeping 10 high-signal chunks over 25 low-signal ones.
+
+BUNDLE format (for parked old goals):
+- When chunks belong to a previous goal that is no longer active, group them into a named bundle.
+- Format: BUNDLE: |