Fix/incremental destructive overwrite#251
Open
petercoxphoto wants to merge 2 commits into DeusData:main from
Conversation
`index_repository` was destructively overwriting a project's graph in the
store. Specifically, `find_deleted_files` classified ANY stored file
absent from the current discovery list as "deleted" and purged its nodes
via `cbm_gbuf_delete_by_file`. In a mode downgrade (full → fast),
directories matching `FAST_SKIP_DIRS` (`tools`, `scripts`, `bin`, `build`,
`docs`, `__tests__`, `migrations`, ...) are excluded from discovery but
still exist on disk, so their files were silently purged.
Real-world incident, 2026-04-13: the Skyline graph was built at 03:45 by
the nightly full-mode cron (2819 nodes, packages/mcp/src/tools/ included).
At 14:36 a concurrent /develop session ran mode='fast' per the then-
current skill text. The Skyline graph collapsed to 2303 nodes mid-day and
the entire packages/mcp/src/tools/ directory (18 files / ~500 nodes)
vanished from the live graph while a diagnosing session was running.
Fix
---
`find_deleted_files` now stat()s each stored file that is absent from
current discovery and classifies it as one of:
- "deleted" — stat() returns ENOENT or ENOTDIR. Nodes purged,
hash row dropped.
- "mode-skipped" — stat() succeeds (or fails with a non-ENOENT errno,
or the path can't be reliably constructed). Nodes
AND hash row preserved.
Mode-skipped hash rows are now carried through `dump_and_persist` →
`persist_hashes` and re-upserted into the new DB. Without that, an
orphaned-node bug emerges:
1. full mode indexes everything
2. fast mode runs and (with only the node-preservation half of the fix)
drops mode-skipped hash rows
3. mode-skipped file is then deleted on disk
4. next reindex's stored hashes don't include the file → noop or can't
detect the deletion → graph nodes for the deleted file remain forever
The new test exercises exactly this sequence and the deletion-detection
contract holds end-to-end.
Fail-safe rules (preserve nodes on uncertainty)
-----------------------------------------------
- repo_path NULL → log error, return 0 deletions, preserve everything.
The caller contract is that repo_path is required; a NULL means a
misconfigured pipeline, not a deletion signal.
- snprintf truncation (combined path ≥ CBM_SZ_4K) → preserve. We can't
reliably stat a truncated path.
- stat() errno != ENOENT/ENOTDIR (EACCES, EIO, ELOOP, transient NFS,
etc.) → preserve. The file may exist; we just can't see it right now.
- strdup OOM mid-record → drop the entry rather than persist a row with
a NULL rel_path that would silently fail the NOT NULL constraint and
reintroduce the orphaned-node bug.
These rules came out of two adversarial audit cycles:
- Cycle 1 caught the original ENOENT-only gap, the snprintf truncation
gap, the silent NULL repo_path branch, and the hash-row drift gap.
- Cycle 2 caught the strdup NULL handling gap.
Test
----
`incremental_fast_preserves_mode_skipped_tools_dir` in test_pipeline.c
walks the real failure mode end-to-end:
1. Full mode index — both files present
2. Fast mode reindex — tools/util.go nodes preserved (was 0 before fix)
3. Mutate main.go and fast reindex — forces dump_and_persist to run
instead of the noop early-return, exercising the hash-preservation
path that step 2 doesn't cover
4. Delete tools/util.go from disk and full reindex — confirms the
deletion-detection contract still holds (would fail with orphaned
nodes without the hash-row preservation half of the fix)
All 2741 existing tests still pass under ASan + UBSan + LeakSanitizer.
Diagnosed in session a1fc365c-d9fd-4228-a86a-ef92e9e97984. Tracked as
PM task claude-connectors/codebase-memory-index-repository-is-destructive
(mC_AtUBH9DAh).
Note: this fix removes the destructive failure mode but does NOT relax
the policy that `mode='full'` is the only acceptable mode for our own
projects. Full mode remains required for complete visibility and
searchability (semantic edges, SIMILAR_TO, full directory coverage).
This fix is defense-in-depth for the case where a session running with
older instructions accidentally calls a lesser mode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cycle 2 of the additive-merge audit (mC_AtUBH9DAh, commit fab42f9) flagged
that `persist_hashes` was discarding the return code of
`cbm_store_upsert_file_hash` in both the current-files loop and the new
mode_skipped loop. A constraint violation, I/O error, or any other upsert
failure silently dropped one or more hash rows.

For the current-files loop this was a pre-existing pattern and only a perf
drag: the next reindex sees the file as new and re-parses it. For the
mode_skipped loop it was a new risk: a silent failure leaves the file's
nodes preserved in the graph but the hash row missing from the new DB, so
the next reindex can't classify the file. If the file is later deleted on
disk, the deletion isn't detected and orphaned graph nodes survive
forever — exactly the failure mode mC_AtUBH9DAh was designed to prevent,
just behind a different trigger.

Fix
---
`persist_hashes` now:
- checks rc on every upsert call in both loops
- on rc != CBM_STORE_OK, logs a warning with scope (current vs
  mode_skipped), rel_path, and the rc value
- continues to the next file (partial-failure policy: partial
  preservation is better than total loss)
- at the end, if there were any failures, emits a summary line with both
  counts

The function header documents the partial-failure policy explicitly so a
future reader doesn't mistake "continue on error" for "ignore error".

Test
----
New test `store_file_hash_upsert_rejects_null_required_fields` in
test_store_nodes.c pins the API contract that `persist_hashes` depends on:
`cbm_store_upsert_file_hash` returns CBM_STORE_ERR (not a silent OK) when
a NOT NULL column would receive SQL NULL. It exercises three NOT NULL
violations (rel_path, sha256, project) and asserts that a previously-valid
row is still present afterwards (partial-failure isolation at the store
level). If a future schema change relaxes any of these NOT NULL
constraints, the downstream warning becomes silent and the orphaned-node
bug class can re-emerge — this test catches that change at the right
layer.

Audit verdict: ships clean. No CRITICAL/MAJOR/MINOR findings. All 2742
tests pass under ASan + UBSan + LeakSanitizer.

PM task: claude-connectors/codebase-memory-persist-hashes-ignores-upsert-return-codes
(KepDJhBEQNoW)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
`index_repository` was destructively overwriting a project's graph in the store on every call, regardless of mode. In a mode downgrade (full → fast), files inside `FAST_SKIP_DIRS` directories (`tools`, `scripts`, `bin`, `build`, `docs`, `__tests__`, `migrations`, …) were silently purged from the graph. Real incident 2026-04-13: a live Skyline graph lost 516 nodes mid-session when a concurrent process ran fast mode, including every file under `packages/mcp/src/tools/`.

`find_deleted_files` now stat()s each stored-but-missing file. Files that exist on disk are mode-skipped (nodes and hash rows preserved); only files that truly don't exist are purged. Mode-skipped hash rows are threaded through `dump_and_persist` → `persist_hashes` and re-upserted into the rebuilt DB so subsequent reindexes can still classify them.

`persist_hashes` now checks the return code of every `cbm_store_upsert_file_hash` call in both the current-files and mode-skipped loops, logs a warning on failure with scope/rel_path/rc, and continues (partial-failure policy). Without this, a silent upsert failure in the mode_skipped loop could revive the orphaned-node bug class the main fix is designed to prevent.

Why this matters
The upstream `FAST_SKIP_DIRS` heuristic in `discover.c` is reasonable in isolation — lesser index modes skip directories named `tools/`, `scripts/`, `bin/`, etc. to index faster. The bug was that `cbm_pipeline_run_incremental` then classified the resulting absence as deletion, purging the nodes. Nothing in the index store distinguished "this file isn't in the current walk" from "this file doesn't exist anymore". One run of `index_repository(mode='fast')` on a project that had previously been indexed in full mode would silently destroy the full-mode data — no warning, no log above info level.

The additive-merge semantics were partially built: `cbm_pipeline_run_incremental` already loads the existing graph into a gbuf, mutates only the changed files, and dumps back. The missing piece was correctly classifying stored-but-absent files.

Fail-safe rules (preserve on uncertainty)
- `repo_path` NULL → log error, preserve everything. A NULL is a misconfigured pipeline, not a deletion signal.
- `snprintf` path truncation (`repo_path` + `rel_path` ≥ 4KB) → preserve. Can't reliably stat a truncated path.
- `stat()` errno != ENOENT/ENOTDIR (EACCES, EIO, ELOOP, transient NFS, …) → preserve. The file may exist; we just can't see it right now.
- `strdup` OOM mid-record → drop the entry rather than persist a row with a NULL rel_path that would silently fail the NOT NULL constraint.

Test plan
- `incremental_fast_preserves_mode_skipped_tools_dir` in `tests/test_pipeline.c` walks the real failure mode end-to-end in four phases:
  1. Full mode index — both files present
  2. Fast mode reindex — `tools/util.go` nodes preserved (was 0 before fix)
  3. Mutate `main.go` and fast reindex — forces `dump_and_persist` to run instead of the noop early-return, exercising the hash-preservation path
  4. Delete `tools/util.go` from disk and full reindex — confirms the deletion-detection contract still holds (would fail with orphaned nodes without hash-row preservation)
- `store_file_hash_upsert_rejects_null_required_fields` in `tests/test_store_nodes.c` pins the upstream API contract that `cbm_store_upsert_file_hash` returns `CBM_STORE_ERR` on NULL required fields. If a future schema change relaxes any NOT NULL, this test catches it at the producer layer so the consumer's warning doesn't go silent.
- … (`strdup` NULL check) — fixed inline.
- … `packages/mcp/src/tools/*.go` files, observed log line `incremental.classify changed=0 unchanged=1 deleted=0 mode_skipped=2` (`deleted` would have been 2 before the fix).

Note
This fix removes the destructive failure mode but does NOT relax any policy around always using `mode='full'`. Full mode remains the only mode that provides complete directory coverage (no `FAST_SKIP_DIRS` at discover time) and semantic edges (SIMILAR_TO, SEMANTICALLY_RELATED). This fix is defense-in-depth for the case where a session running with stale instructions or any future tooling accidentally calls a lesser mode.