
Fix/incremental destructive overwrite #251

Open
petercoxphoto wants to merge 2 commits into DeusData:main from petercoxphoto:fix/incremental-destructive-overwrite

Conversation

@petercoxphoto

Summary

  • index_repository was destructively overwriting a project's graph in the store on every call, regardless of mode. In a mode downgrade (full → fast), files inside FAST_SKIP_DIRS directories (tools, scripts, bin, build, docs, __tests__, migrations, …) were silently purged from the graph. Real incident 2026-04-13: a live Skyline graph lost 516 nodes mid-session when a concurrent process ran fast mode, including every file under packages/mcp/src/tools/.
  • Fix: find_deleted_files now stat()s each stored-but-missing file. Files that exist on disk are mode-skipped (nodes and hash rows preserved); only files that truly don't exist are purged. Mode-skipped hash rows are threaded through dump_and_persist → persist_hashes and re-upserted into the rebuilt DB so subsequent reindexes can still classify them.
  • Follow-up: persist_hashes now checks the return code of every cbm_store_upsert_file_hash call on both the current-files and mode-skipped loops, logs a warning on failure with scope/rel_path/rc, and continues (partial-failure policy). Without this, a silent upsert failure on the mode_skipped loop could revive the orphaned-node bug class the main fix is designed to prevent.

Why this matters

The upstream FAST_SKIP_DIRS heuristic in discover.c is reasonable in isolation — lesser index modes skip directories named tools/scripts/bin/etc. to index faster. The bug was that cbm_pipeline_run_incremental then classified the resulting absence as deletion, purging the nodes. Nothing in the index store distinguished "this file isn't in the current walk" from "this file doesn't exist anymore". One run of index_repository(mode='fast') on a project that had previously been indexed in full mode would silently destroy the full-mode data — no warning, no log above info level.

The additive-merge semantics were partially built: cbm_pipeline_run_incremental already loads the existing graph into a gbuf, mutates only the changed files, and dumps back. The missing piece was correctly classifying stored-but-absent files.

Fail-safe rules (preserve on uncertainty)

  • repo_path NULL → log error, preserve everything. A NULL is a misconfigured pipeline, not a deletion signal.
  • snprintf path truncation (repo_path + rel_path ≥ 4KB) → preserve. Can't reliably stat a truncated path.
  • stat() errno != ENOENT/ENOTDIR (EACCES, EIO, ELOOP, transient NFS, …) → preserve. File may exist; we just can't see it right now.
  • strdup OOM mid-record → drop the entry rather than persist a row with a NULL rel_path that would silently fail the NOT NULL constraint.
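The fail-safe rules above can be sketched as a small classifier. This is a minimal illustration, not the project's code: `classify_missing`, the enum names, and the `SZ_4K` stand-in for `CBM_SZ_4K` are all hypothetical; the real logic lives inside `find_deleted_files`.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical sketch of the stat()-based classification.
 * Default on any uncertainty is to preserve, never to purge. */
enum missing_class { CLASS_DELETED, CLASS_PRESERVE };

#define SZ_4K 4096 /* stand-in for CBM_SZ_4K */

static enum missing_class classify_missing(const char *repo_path,
                                           const char *rel_path)
{
    char full[SZ_4K];
    struct stat st;
    int n;

    /* NULL repo_path is a misconfigured pipeline, not a deletion signal. */
    if (repo_path == NULL)
        return CLASS_PRESERVE;

    n = snprintf(full, sizeof full, "%s/%s", repo_path, rel_path);
    if (n < 0 || (size_t)n >= sizeof full)
        return CLASS_PRESERVE;  /* truncated path: can't stat reliably */

    if (stat(full, &st) == 0)
        return CLASS_PRESERVE;  /* exists on disk: mode-skipped, not deleted */

    if (errno == ENOENT || errno == ENOTDIR)
        return CLASS_DELETED;   /* truly gone: safe to purge */

    return CLASS_PRESERVE;      /* EACCES, EIO, ...: can't see it right now */
}
```

Only the ENOENT/ENOTDIR branch ever purges; every other path through the function preserves the stored nodes.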

Test plan

  • New test incremental_fast_preserves_mode_skipped_tools_dir in tests/test_pipeline.c walks the real failure mode end-to-end in four phases:
    1. Full mode index — both files present
    2. Fast mode reindex — tools/util.go nodes preserved (was 0 before fix)
    3. Mutate main.go and fast reindex — forces dump_and_persist to run instead of the noop early-return, exercising the hash-preservation path
    4. Delete tools/util.go from disk and full reindex — confirms the deletion-detection contract still holds (would fail with orphaned nodes without hash-row preservation)
  • New test store_file_hash_upsert_rejects_null_required_fields in tests/test_store_nodes.c pins the upstream API contract that cbm_store_upsert_file_hash returns CBM_STORE_ERR on NULL required fields. If a future schema change relaxes any NOT NULL, this test catches it at the producer layer so the consumer's warning doesn't go silent.
  • Full test suite (2722 tests) passes under ASan + UBSan + LeakSanitizer.
  • Adversarial audit cycle 1: 4 must-fix findings (ENOENT-only stat semantics, snprintf truncation check, hash-row drift, NULL repo_path branch) — all fixed.
  • Adversarial audit cycle 2: verified all 4 cycle-1 fixes in code and test, found 1 minor (strdup NULL check) — fixed inline.
  • Adversarial audit on follow-up (return-code checking): zero CRITICAL/MAJOR/MINOR, 6 observations all confirmed design choices. Ship verdict.
  • Live end-to-end verification: built production binary, ran against a synthetic repo with packages/mcp/src/tools/*.go files, observed log line incremental.classify changed=0 unchanged=1 deleted=0 mode_skipped=2 (deleted would have been 2 before the fix).
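The contract pinned by `store_file_hash_upsert_rejects_null_required_fields` can be sketched with a stub in place of the real store call. `upsert_file_hash_stub` is illustrative only; the point it demonstrates is that a NULL in any required field must come back as `CBM_STORE_ERR`, never a silent OK.

```c
#include <stddef.h>

/* Stand-ins for the real store return codes. */
#define CBM_STORE_OK  0
#define CBM_STORE_ERR 1

/* Hypothetical stub of cbm_store_upsert_file_hash: reject a NULL in any
 * NOT NULL column before SQL ever sees it. */
static int upsert_file_hash_stub(const char *project,
                                 const char *rel_path,
                                 const char *sha256)
{
    if (project == NULL || rel_path == NULL || sha256 == NULL)
        return CBM_STORE_ERR;
    return CBM_STORE_OK;
}
```

If a schema change ever relaxed a NOT NULL constraint, the real function could start returning OK for one of these cases, and the consumer's warning path would go silent; the test exists to fail at that moment.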

Note

This fix removes the destructive failure mode but does NOT relax any policy around always using mode='full'. Full mode remains the only mode that provides complete directory coverage (no FAST_SKIP_DIRS at discover time) and semantic edges (SIMILAR_TO, SEMANTICALLY_RELATED). This fix is defense-in-depth for the case where a session running with stale instructions or any future tooling accidentally calls a lesser mode.

petercoxphoto and others added 2 commits April 13, 2026 21:02
`index_repository` was destructively overwriting a project's graph in the
store. Specifically, `find_deleted_files` classified ANY stored file
absent from the current discovery list as "deleted" and purged its nodes
via `cbm_gbuf_delete_by_file`. In a mode downgrade (full → fast),
directories matching `FAST_SKIP_DIRS` (`tools`, `scripts`, `bin`, `build`,
`docs`, `__tests__`, `migrations`, ...) are excluded from discovery but
still exist on disk. They were getting silently purged.

Real-world incident, 2026-04-13: the Skyline graph was built at 03:45 by
the nightly full-mode cron (2819 nodes, packages/mcp/src/tools/ included).
At 14:36 a concurrent /develop session ran mode='fast' per the then-
current skill text. The Skyline graph collapsed to 2303 nodes mid-day and
the entire packages/mcp/src/tools/ directory (18 files / ~500 nodes)
vanished from the live graph while a diagnosing session was running.

Fix
---

`find_deleted_files` now stat()s each stored file that is absent from
current discovery and classifies it as one of:

 - "deleted"      — stat() returns ENOENT or ENOTDIR. Nodes purged,
                    hash row dropped.
 - "mode-skipped" — stat() succeeds (or fails with a non-ENOENT errno,
                    or the path can't be reliably constructed). Nodes
                    AND hash row preserved.

Mode-skipped hash rows are now carried through `dump_and_persist` →
`persist_hashes` and re-upserted into the new DB. Without that, an
orphaned-node bug emerges:

 1. full mode indexes everything
 2. fast mode runs and (with only the node-preservation half of the fix)
    drops mode-skipped hash rows
 3. mode-skipped file is then deleted on disk
 4. next reindex's stored hashes don't include the file → noop or can't
    detect the deletion → graph nodes for the deleted file remain forever

The new test exercises exactly this sequence and the deletion-detection
contract holds end-to-end.
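A toy model of why dropped hash rows orphan nodes: deletion detection only
walks the stored hash rows, so a file with no hash row can never be
classified as deleted and its graph nodes survive forever. The struct and
function names below are illustrative, not the real cbm_ API.

```c
#include <string.h>

#define MAX 4
struct store {
    const char *hash_rows[MAX]; int n_hashes; /* stored file-hash rows   */
    const char *nodes[MAX];     int n_nodes;  /* graph nodes, by file    */
};

/* Purge nodes for every stored hash row whose file is gone from disk.
 * Files with no hash row are invisible to this pass. */
static void detect_deletions(struct store *s,
                             const char **on_disk, int n_disk)
{
    for (int i = 0; i < s->n_hashes; i++) {
        int present = 0;
        for (int j = 0; j < n_disk; j++)
            if (strcmp(s->hash_rows[i], on_disk[j]) == 0)
                present = 1;
        if (!present)
            for (int k = 0; k < s->n_nodes; k++)
                if (s->nodes[k] && strcmp(s->nodes[k], s->hash_rows[i]) == 0)
                    s->nodes[k] = NULL; /* purge the file's nodes */
    }
}
```

With the hash row present, deleting the file on disk purges its nodes; with
the row dropped (step 2 above), the same deletion leaves the nodes in place.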

Fail-safe rules (preserve nodes on uncertainty)
-----------------------------------------------

 - repo_path NULL → log error, return 0 deletions, preserve everything.
   The caller contract is that repo_path is required; a NULL means a
   misconfigured pipeline, not a deletion signal.
 - snprintf truncation (combined path ≥ CBM_SZ_4K) → preserve. We can't
   reliably stat a truncated path.
 - stat() errno != ENOENT/ENOTDIR (EACCES, EIO, ELOOP, transient NFS,
   etc.) → preserve. The file may exist; we just can't see it right now.
 - strdup OOM mid-record → drop the entry rather than persist a row with
   a NULL rel_path that would silently fail the NOT NULL constraint and
   reintroduce the orphaned-node bug.

These rules came out of two adversarial audit cycles:
 - Cycle 1 caught the original ENOENT-only gap, the snprintf truncation
   gap, the silent NULL repo_path branch, and the hash-row drift gap.
 - Cycle 2 caught the strdup NULL handling gap.

Test
----

`incremental_fast_preserves_mode_skipped_tools_dir` in test_pipeline.c
walks the real failure mode end-to-end:

 1. Full mode index — both files present
 2. Fast mode reindex — tools/util.go nodes preserved (was 0 before fix)
 3. Mutate main.go and fast reindex — forces dump_and_persist to run
    instead of the noop early-return, exercising the hash-preservation
    path that step 2 doesn't cover
 4. Delete tools/util.go from disk and full reindex — confirms the
    deletion-detection contract still holds (would fail with orphaned
    nodes without the hash-row preservation half of the fix)

All 2741 existing tests still pass under ASan + UBSan + LeakSanitizer.

Diagnosed in session a1fc365c-d9fd-4228-a86a-ef92e9e97984. Tracked as
PM task claude-connectors/codebase-memory-index-repository-is-destructive
(mC_AtUBH9DAh).

Note: this fix removes the destructive failure mode but does NOT relax
the policy that `mode='full'` is the only acceptable mode for our own
projects. Full mode remains required for complete visibility and
searchability (semantic edges, SIMILAR_TO, full directory coverage).
This fix is defense-in-depth for the case where a session running with
older instructions accidentally calls a lesser mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cycle 2 of the additive-merge audit (mC_AtUBH9DAh, commit fab42f9)
flagged that persist_hashes was discarding the return code of
cbm_store_upsert_file_hash on both the current-files loop and the new
mode_skipped loop. A constraint violation, I/O error, or any other
upsert failure silently dropped one or more hash rows.

For the current-files loop this was a pre-existing pattern and only a perf
drag: the next reindex sees the file as new and re-parses it. For the
mode_skipped loop it is a new risk: a silent failure leaves the file's
nodes preserved in the graph but the hash row missing from the new DB, so
the next reindex can't classify the file. If the file is later deleted on
disk, the deletion isn't detected and the orphaned graph nodes survive
forever — exactly the failure mode mC_AtUBH9DAh was designed to prevent,
just behind a different trigger.

Fix
---

persist_hashes now:
  - checks rc on every upsert call in both loops
  - on rc != CBM_STORE_OK, logs a warning with scope (current vs
    mode_skipped), rel_path, and the rc value
  - continues to the next file (partial-failure policy: partial
    preservation is better than total loss)
  - at end, if any failures, emits a summary line with both counts

Function header documents the partial-failure policy explicitly so a
future reader doesn't mistake "continue on error" for "ignore error".
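The warn-and-continue shape can be sketched as below. `persist_rows`,
`upsert_hash`, and the OK/ERR codes are stand-ins for the real
persist_hashes loop, cbm_store_upsert_file_hash, and
CBM_STORE_OK/CBM_STORE_ERR; the stub fails on a NULL sha256 to simulate a
constraint violation.

```c
#include <stdio.h>
#include <stddef.h>

#define OK  0
#define ERR (-1)

struct hash_row { const char *rel_path; const char *sha256; };

/* Stub upsert: simulate a NOT NULL violation on a missing sha256. */
static int upsert_hash(const struct hash_row *r)
{
    return (r->sha256 == NULL) ? ERR : OK;
}

/* Partial-failure policy: check rc on every upsert, warn with
 * scope/rel_path/rc, keep going. Returns the failure count so the
 * caller can emit a summary line when it is nonzero. */
static int persist_rows(const char *scope,
                        const struct hash_row *rows, int n)
{
    int failures = 0;
    for (int i = 0; i < n; i++) {
        int rc = upsert_hash(&rows[i]);
        if (rc != OK) {
            fprintf(stderr,
                    "warn: hash upsert failed scope=%s rel_path=%s rc=%d\n",
                    scope, rows[i].rel_path, rc);
            failures++; /* partial preservation beats total loss */
        }
    }
    return failures;
}
```

One bad row costs one warning and one missing hash; the remaining rows
still land, which is the whole point of the policy.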

Test
----

New test `store_file_hash_upsert_rejects_null_required_fields` in
test_store_nodes.c pins the API contract that persist_hashes depends on:
cbm_store_upsert_file_hash returns CBM_STORE_ERR (not silent OK) when a
NOT NULL column would receive SQL NULL. Tests three NOT NULL violations
(rel_path, sha256, project) and asserts that a previously-valid row is
still present afterwards (partial-failure isolation at the store level).

If a future schema change relaxes any of these NOT NULL constraints, the
downstream warning becomes silent and the orphaned-node bug class can
re-emerge — this test catches that change at the right layer.

Audit verdict: ships clean. No CRITICAL/MAJOR/MINOR findings. All
2742 tests pass under ASan + UBSan + LeakSanitizer.

PM task: claude-connectors/codebase-memory-persist-hashes-ignores-upsert-return-codes
(KepDJhBEQNoW)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>