
Fix/incremental destructive overwrite #251

Open
petercoxphoto wants to merge 2 commits into DeusData:main from petercoxphoto:fix/incremental-destructive-overwrite

Conversation

@petercoxphoto

Summary

  • index_repository was destructively overwriting a project's graph in the store on every call, regardless of mode. In a mode downgrade (full → fast), files inside FAST_SKIP_DIRS directories (tools, scripts, bin, build, docs, __tests__, migrations, …) were silently purged from the graph. Real incident 2026-04-13: a live Skyline graph lost 516 nodes mid-session when a concurrent process ran fast mode, including every file under packages/mcp/src/tools/.
  • Fix: find_deleted_files now stat()s each stored-but-missing file. Files that exist on disk are mode-skipped (nodes and hash rows preserved); only files that truly don't exist are purged. Mode-skipped hash rows are threaded through dump_and_persist → persist_hashes and re-upserted into the rebuilt DB so subsequent reindexes can still classify them.
  • Follow-up: persist_hashes now checks the return code of every cbm_store_upsert_file_hash call on both the current-files and mode-skipped loops, logs a warning on failure with scope/rel_path/rc, and continues (partial-failure policy). Without this, a silent upsert failure on the mode_skipped loop could revive the orphaned-node bug class the main fix is designed to prevent.

Why this matters

The upstream FAST_SKIP_DIRS heuristic in discover.c is reasonable in isolation — lesser index modes skip directories named tools/scripts/bin/etc. to index faster. The bug was that cbm_pipeline_run_incremental then classified the resulting absence as deletion, purging the nodes. Nothing in the index store distinguished "this file isn't in the current walk" from "this file doesn't exist anymore". One run of index_repository(mode='fast') on a project that had previously been indexed in full mode would silently destroy the full-mode data — no warning, no log above info level.

The additive-merge semantics were partially built: cbm_pipeline_run_incremental already loads the existing graph into a gbuf, mutates only the changed files, and dumps back. The missing piece was correctly classifying stored-but-absent files.

Fail-safe rules (preserve on uncertainty)

  • repo_path NULL → log error, preserve everything. A NULL is a misconfigured pipeline, not a deletion signal.
  • snprintf path truncation (repo_path + rel_path ≥ 4KB) → preserve. Can't reliably stat a truncated path.
  • stat() errno != ENOENT/ENOTDIR (EACCES, EIO, ELOOP, transient NFS, …) → preserve. File may exist; we just can't see it right now.
  • strdup OOM mid-record → drop the entry rather than persist a row with a NULL rel_path that would silently fail the NOT NULL constraint.
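The fail-safe rules above can be sketched as a small classifier. This is a minimal illustration, not the project's code: `classify_missing`, the enum names, and the `SZ_4K` stand-in for `CBM_SZ_4K` are all hypothetical; the real logic lives inside `find_deleted_files`.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical sketch of the stat()-based classification.
 * Default on any uncertainty is to preserve, never to purge. */
enum missing_class { CLASS_DELETED, CLASS_PRESERVE };

#define SZ_4K 4096 /* stand-in for CBM_SZ_4K */

static enum missing_class classify_missing(const char *repo_path,
                                           const char *rel_path)
{
    char full[SZ_4K];
    struct stat st;
    int n;

    /* NULL repo_path is a misconfigured pipeline, not a deletion signal. */
    if (repo_path == NULL)
        return CLASS_PRESERVE;

    n = snprintf(full, sizeof full, "%s/%s", repo_path, rel_path);
    if (n < 0 || (size_t)n >= sizeof full)
        return CLASS_PRESERVE;  /* truncated path: can't stat reliably */

    if (stat(full, &st) == 0)
        return CLASS_PRESERVE;  /* exists on disk: mode-skipped, not deleted */

    if (errno == ENOENT || errno == ENOTDIR)
        return CLASS_DELETED;   /* truly gone: safe to purge */

    return CLASS_PRESERVE;      /* EACCES, EIO, ...: can't see it right now */
}
```

Only the ENOENT/ENOTDIR branch ever purges; every other path through the function preserves the stored nodes.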

Test plan

  • New test incremental_fast_preserves_mode_skipped_tools_dir in tests/test_pipeline.c walks the real failure mode end-to-end in four phases:
    1. Full mode index — both files present
    2. Fast mode reindex — tools/util.go nodes preserved (was 0 before fix)
    3. Mutate main.go and fast reindex — forces dump_and_persist to run instead of the noop early-return, exercising the hash-preservation path
    4. Delete tools/util.go from disk and full reindex — confirms the deletion-detection contract still holds (would fail with orphaned nodes without hash-row preservation)
  • New test store_file_hash_upsert_rejects_null_required_fields in tests/test_store_nodes.c pins the upstream API contract that cbm_store_upsert_file_hash returns CBM_STORE_ERR on NULL required fields. If a future schema change relaxes any NOT NULL, this test catches it at the producer layer so the consumer's warning doesn't go silent.
  • Full test suite (2722 tests) passes under ASan + UBSan + LeakSanitizer.
  • Adversarial audit cycle 1: 4 must-fix findings (ENOENT-only stat semantics, snprintf truncation check, hash-row drift, NULL repo_path branch) — all fixed.
  • Adversarial audit cycle 2: verified all 4 cycle-1 fixes in code and test, found 1 minor (strdup NULL check) — fixed inline.
  • Adversarial audit on follow-up (return-code checking): zero CRITICAL/MAJOR/MINOR, 6 observations all confirmed design choices. Ship verdict.
  • Live end-to-end verification: built production binary, ran against a synthetic repo with packages/mcp/src/tools/*.go files, observed log line incremental.classify changed=0 unchanged=1 deleted=0 mode_skipped=2 (deleted would have been 2 before the fix).
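The contract pinned by `store_file_hash_upsert_rejects_null_required_fields` can be sketched with a stub in place of the real store call. `upsert_file_hash_stub` is illustrative only; the point it demonstrates is that a NULL in any required field must come back as `CBM_STORE_ERR`, never a silent OK.

```c
#include <stddef.h>

/* Stand-ins for the real store return codes. */
#define CBM_STORE_OK  0
#define CBM_STORE_ERR 1

/* Hypothetical stub of cbm_store_upsert_file_hash: reject a NULL in any
 * NOT NULL column before SQL ever sees it. */
static int upsert_file_hash_stub(const char *project,
                                 const char *rel_path,
                                 const char *sha256)
{
    if (project == NULL || rel_path == NULL || sha256 == NULL)
        return CBM_STORE_ERR;
    return CBM_STORE_OK;
}
```

If a schema change ever relaxed a NOT NULL constraint, the real function could start returning OK for one of these cases, and the consumer's warning path would go silent; the test exists to fail at that moment.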

Note

This fix removes the destructive failure mode but does NOT relax any policy around always using mode='full'. Full mode remains the only mode that provides complete directory coverage (no FAST_SKIP_DIRS at discover time) and semantic edges (SIMILAR_TO, SEMANTICALLY_RELATED). This fix is defense-in-depth for the case where a session running with stale instructions or any future tooling accidentally calls a lesser mode.

petercoxphoto and others added 2 commits April 13, 2026 21:02
`index_repository` was destructively overwriting a project's graph in the
store. Specifically, `find_deleted_files` classified ANY stored file
absent from the current discovery list as "deleted" and purged its nodes
via `cbm_gbuf_delete_by_file`. In a mode downgrade (full → fast),
directories matching `FAST_SKIP_DIRS` (`tools`, `scripts`, `bin`, `build`,
`docs`, `__tests__`, `migrations`, ...) are excluded from discovery but
still exist on disk. They were getting silently purged.

Real-world incident, 2026-04-13: the Skyline graph was built at 03:45 by
the nightly full-mode cron (2819 nodes, packages/mcp/src/tools/ included).
At 14:36 a concurrent /develop session ran mode='fast' per the then-
current skill text. The Skyline graph collapsed to 2303 nodes mid-day and
the entire packages/mcp/src/tools/ directory (18 files / ~500 nodes)
vanished from the live graph while a diagnosing session was running.

Fix
---

`find_deleted_files` now stat()s each stored file that is absent from
current discovery and classifies it as one of:

 - "deleted"      — stat() returns ENOENT or ENOTDIR. Nodes purged,
                    hash row dropped.
 - "mode-skipped" — stat() succeeds (or fails with a non-ENOENT errno,
                    or the path can't be reliably constructed). Nodes
                    AND hash row preserved.

Mode-skipped hash rows are now carried through `dump_and_persist` →
`persist_hashes` and re-upserted into the new DB. Without that, an
orphaned-node bug emerges:

 1. full mode indexes everything
 2. fast mode runs and (with only the node-preservation half of the fix)
    drops mode-skipped hash rows
 3. mode-skipped file is then deleted on disk
 4. next reindex's stored hashes don't include the file → noop or can't
    detect the deletion → graph nodes for the deleted file remain forever

The new test exercises exactly this sequence and the deletion-detection
contract holds end-to-end.
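A toy model of why dropped hash rows orphan nodes: deletion detection only
walks the stored hash rows, so a file with no hash row can never be
classified as deleted and its graph nodes survive forever. The struct and
function names below are illustrative, not the real cbm_ API.

```c
#include <string.h>

#define MAX 4
struct store {
    const char *hash_rows[MAX]; int n_hashes; /* stored file-hash rows   */
    const char *nodes[MAX];     int n_nodes;  /* graph nodes, by file    */
};

/* Purge nodes for every stored hash row whose file is gone from disk.
 * Files with no hash row are invisible to this pass. */
static void detect_deletions(struct store *s,
                             const char **on_disk, int n_disk)
{
    for (int i = 0; i < s->n_hashes; i++) {
        int present = 0;
        for (int j = 0; j < n_disk; j++)
            if (strcmp(s->hash_rows[i], on_disk[j]) == 0)
                present = 1;
        if (!present)
            for (int k = 0; k < s->n_nodes; k++)
                if (s->nodes[k] && strcmp(s->nodes[k], s->hash_rows[i]) == 0)
                    s->nodes[k] = NULL; /* purge the file's nodes */
    }
}
```

With the hash row present, deleting the file on disk purges its nodes; with
the row dropped (step 2 above), the same deletion leaves the nodes in place.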

Fail-safe rules (preserve nodes on uncertainty)
-----------------------------------------------

 - repo_path NULL → log error, return 0 deletions, preserve everything.
   The caller contract is that repo_path is required; a NULL means a
   misconfigured pipeline, not a deletion signal.
 - snprintf truncation (combined path ≥ CBM_SZ_4K) → preserve. We can't
   reliably stat a truncated path.
 - stat() errno != ENOENT/ENOTDIR (EACCES, EIO, ELOOP, transient NFS,
   etc.) → preserve. The file may exist; we just can't see it right now.
 - strdup OOM mid-record → drop the entry rather than persist a row with
   a NULL rel_path that would silently fail the NOT NULL constraint and
   reintroduce the orphaned-node bug.

These rules came out of two adversarial audit cycles:
 - Cycle 1 caught the original ENOENT-only gap, the snprintf truncation
   gap, the silent NULL repo_path branch, and the hash-row drift gap.
 - Cycle 2 caught the strdup NULL handling gap.

Test
----

`incremental_fast_preserves_mode_skipped_tools_dir` in test_pipeline.c
walks the real failure mode end-to-end:

 1. Full mode index — both files present
 2. Fast mode reindex — tools/util.go nodes preserved (was 0 before fix)
 3. Mutate main.go and fast reindex — forces dump_and_persist to run
    instead of the noop early-return, exercising the hash-preservation
    path that step 2 doesn't cover
 4. Delete tools/util.go from disk and full reindex — confirms the
    deletion-detection contract still holds (would fail with orphaned
    nodes without the hash-row preservation half of the fix)

All 2741 existing tests still pass under ASan + UBSan + LeakSanitizer.

Diagnosed in session a1fc365c-d9fd-4228-a86a-ef92e9e97984. Tracked as
PM task claude-connectors/codebase-memory-index-repository-is-destructive
(mC_AtUBH9DAh).

Note: this fix removes the destructive failure mode but does NOT relax
the policy that `mode='full'` is the only acceptable mode for our own
projects. Full mode remains required for complete visibility and
searchability (semantic edges, SIMILAR_TO, full directory coverage).
This fix is defense-in-depth for the case where a session running with
older instructions accidentally calls a lesser mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cycle 2 of the additive-merge audit (mC_AtUBH9DAh, commit fab42f9)
flagged that persist_hashes was discarding the return code of
cbm_store_upsert_file_hash on both the current-files loop and the new
mode_skipped loop. A constraint violation, I/O error, or any other
upsert failure silently dropped one or more hash rows.

For the current-files loop this was a pre-existing pattern and only a perf
drag: the next reindex sees the file as new and re-parses it. For the
mode_skipped loop it is a new risk: a silent failure leaves the file's
nodes preserved in the graph but the hash row missing from the new DB, so
the next reindex can't classify the file. If the file is later deleted on
disk, the deletion isn't detected and the orphaned graph nodes survive
forever — exactly the failure mode mC_AtUBH9DAh was designed to prevent,
just behind a different trigger.

Fix
---

persist_hashes now:
  - checks rc on every upsert call in both loops
  - on rc != CBM_STORE_OK, logs a warning with scope (current vs
    mode_skipped), rel_path, and the rc value
  - continues to the next file (partial-failure policy: partial
    preservation is better than total loss)
  - at end, if any failures, emits a summary line with both counts

Function header documents the partial-failure policy explicitly so a
future reader doesn't mistake "continue on error" for "ignore error".
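The warn-and-continue shape can be sketched as below. `persist_rows`,
`upsert_hash`, and the OK/ERR codes are stand-ins for the real
persist_hashes loop, cbm_store_upsert_file_hash, and
CBM_STORE_OK/CBM_STORE_ERR; the stub fails on a NULL sha256 to simulate a
constraint violation.

```c
#include <stdio.h>
#include <stddef.h>

#define OK  0
#define ERR (-1)

struct hash_row { const char *rel_path; const char *sha256; };

/* Stub upsert: simulate a NOT NULL violation on a missing sha256. */
static int upsert_hash(const struct hash_row *r)
{
    return (r->sha256 == NULL) ? ERR : OK;
}

/* Partial-failure policy: check rc on every upsert, warn with
 * scope/rel_path/rc, keep going. Returns the failure count so the
 * caller can emit a summary line when it is nonzero. */
static int persist_rows(const char *scope,
                        const struct hash_row *rows, int n)
{
    int failures = 0;
    for (int i = 0; i < n; i++) {
        int rc = upsert_hash(&rows[i]);
        if (rc != OK) {
            fprintf(stderr,
                    "warn: hash upsert failed scope=%s rel_path=%s rc=%d\n",
                    scope, rows[i].rel_path, rc);
            failures++; /* partial preservation beats total loss */
        }
    }
    return failures;
}
```

One bad row costs one warning and one missing hash; the remaining rows
still land, which is the whole point of the policy.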

Test
----

New test `store_file_hash_upsert_rejects_null_required_fields` in
test_store_nodes.c pins the API contract that persist_hashes depends on:
cbm_store_upsert_file_hash returns CBM_STORE_ERR (not silent OK) when a
NOT NULL column would receive SQL NULL. Tests three NOT NULL violations
(rel_path, sha256, project) and asserts that a previously-valid row is
still present afterwards (partial-failure isolation at the store level).

If a future schema change relaxes any of these NOT NULL constraints, the
downstream warning becomes silent and the orphaned-node bug class can
re-emerge — this test catches that change at the right layer.

Audit verdict: ships clean. No CRITICAL/MAJOR/MINOR findings. All
2742 tests pass under ASan + UBSan + LeakSanitizer.

PM task: claude-connectors/codebase-memory-persist-hashes-ignores-upsert-return-codes
(KepDJhBEQNoW)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>