
fix(generation): stabilize prompt hashes across re-runs #62

Open
dmikushin wants to merge 1 commit into repowise-dev:main from dmikushin:fix/stabilize-prompt-hashes

Conversation

@dmikushin

Problem

Every repowise init re-generates all wiki pages from scratch, even when the codebase hasn't changed. The root cause is non-deterministic source_hash values: the SHA-256 is computed over the rendered Jinja2 prompt, and two context variables were unstable across runs.

Source of non-determinism 1: graph edge ordering

Files are parsed in parallel via ProcessPoolExecutor + as_completed, so the order in which nodes and edges are inserted into the NetworkX graph is non-deterministic. graph.predecessors() / graph.successors() return nodes in insertion order, so the dependents and dependencies lists in FilePageContext were shuffled between runs → different rendered prompt → different source_hash.

Fix: sort predecessors/successors before building FilePageContext in ContextAssembler.assemble_file_page.
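A minimal sketch of this fix (the function name stable_neighbors is illustrative; the real change lives inside ContextAssembler.assemble_file_page):

```python
import networkx as nx

def stable_neighbors(graph: nx.DiGraph, node: str) -> tuple[list[str], list[str]]:
    """Return (dependents, dependencies) in a deterministic order.

    predecessors()/successors() yield nodes in insertion order, which
    varies when files are parsed in parallel; sorting removes the
    dependence on graph construction order.
    """
    dependents = sorted(graph.predecessors(node))
    dependencies = sorted(graph.successors(node))
    return dependents, dependencies

# Same edges, inserted in different orders:
g1 = nx.DiGraph([("a.py", "b.py"), ("c.py", "b.py")])
g2 = nx.DiGraph([("c.py", "b.py"), ("a.py", "b.py")])
assert stable_neighbors(g1, "b.py") == stable_neighbors(g2, "b.py")
```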

Source of non-determinism 2: Louvain community IDs

nx.community.louvain_communities already receives seed=42, but the adjacency traversal order inside Louvain still depends on node insertion order (same root cause). Additionally, the community list returned by louvain_communities has no guaranteed order, so enumerate() assigned different integer IDs to the same community across runs.

Fix: before calling louvain_communities, rebuild a sorted copy of the undirected graph (g_stable) with nodes and edges added in alphabetical order. After the call, sort the returned community list by each community's lexicographically smallest member before enumerate().
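Sketched out, the two steps look roughly like this (stable_communities is a hypothetical name; only the g_stable rebuild, seed=42, and the sort-by-smallest-member step come from the description above):

```python
import networkx as nx

def stable_communities(g: nx.Graph) -> dict[str, int]:
    """Assign integer community IDs that do not depend on the order
    in which nodes/edges were originally inserted into g."""
    # Rebuild an undirected copy with nodes and edges added in sorted
    # order so Louvain's adjacency traversal is reproducible.
    g_stable = nx.Graph()
    g_stable.add_nodes_from(sorted(g.nodes()))
    g_stable.add_edges_from(sorted(tuple(sorted(e)) for e in g.edges()))

    communities = nx.community.louvain_communities(g_stable, seed=42)
    # louvain_communities gives no ordering guarantee, so sort by each
    # community's lexicographically smallest member before numbering.
    communities = sorted(communities, key=min)
    return {node: i for i, comm in enumerate(communities) for node in comm}
```

With the same seed and an identical g_stable, Louvain returns the same partition regardless of how the original graph was built.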

Impact

These two fixes make source_hash stable across re-runs for unchanged files, enabling the DB content cache (_db_content_cache keyed by source_hash) to skip redundant LLM calls and save API costs.
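The caching path this enables can be sketched as follows (generate_page and call_llm are hypothetical illustrations, not the project's API; only the _db_content_cache keyed by source_hash is taken from the description above):

```python
import hashlib

# Hypothetical cache: source_hash -> generated page content.
_db_content_cache: dict[str, str] = {}

def generate_page(rendered_prompt: str, call_llm) -> str:
    """With a stable hash, an unchanged file renders the same prompt
    on every run, hits the cache, and skips the LLM call entirely."""
    h = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()
    if h in _db_content_cache:
        return _db_content_cache[h]  # cache hit: no API cost
    content = _db_content_cache[h] = call_llm(rendered_prompt)
    return content
```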

Testing

Added scripts/diagnose_hash_mismatch.py — a diagnostic script that:

  • Calls betweenness_centrality() and community_detection() twice and reports any differing values
  • For each cached file_page in wiki.db, renders the prompt fresh and compares SHA-256 with the stored source_hash
  • Reports whether a mismatch is caused by dep_summaries or another factor, with a unified diff

Run from the target repo directory:

python3 scripts/diagnose_hash_mismatch.py /path/to/repo --max-pages 20
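As a standalone illustration of the second check (not the script itself, which reads wiki.db), the hash comparison plus unified diff can be done with the stdlib alone; hash_report is a hypothetical name:

```python
import difflib
import hashlib

def hash_report(stored_prompt: str, fresh_prompt: str) -> str:
    """Compare the SHA-256 of a freshly rendered prompt against the
    stored one; on mismatch, return a unified diff of the two prompts
    so the unstable context variable is visible."""
    stored = hashlib.sha256(stored_prompt.encode("utf-8")).hexdigest()
    fresh = hashlib.sha256(fresh_prompt.encode("utf-8")).hexdigest()
    if stored == fresh:
        return f"match {fresh[:12]}"
    diff = difflib.unified_diff(
        stored_prompt.splitlines(), fresh_prompt.splitlines(),
        fromfile="stored", tofile="fresh", lineterm="")
    return "MISMATCH\n" + "\n".join(diff)
```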

Checklist

  • context_assembler.py: sort predecessors/successors
  • graph.py: sorted graph copy + sorted community list before enumerate
  • scripts/diagnose_hash_mismatch.py: diagnostic tool

Graph edge ordering and community IDs were non-deterministic because
files are parsed in parallel (ProcessPoolExecutor + as_completed),
causing NetworkX node insertion order to vary between runs.

Changes:
- context_assembler: sort predecessors/successors before including them
  in FilePageContext so dependents/dependencies lists are identical
  across runs regardless of graph construction order
- graph: rebuild a sorted copy of the undirected graph before passing it
  to louvain_communities so adjacency traversal order is reproducible;
  also sort the returned community list by each community's smallest
  member before assigning integer IDs via enumerate()

Adds scripts/diagnose_hash_mismatch.py to verify the fix and identify
any remaining sources of hash instability (dep_summaries, betweenness
sampling, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dmikushin force-pushed the fix/stabilize-prompt-hashes branch from da425be to 6fdf4ee on April 10, 2026 at 12:32
@RaghavChamadiya
Collaborator

Nice analysis and clean fix. The sorted predecessors/successors and stabilized Louvain ordering both make sense, and the PR description is really well written.

A few things before I merge:

  1. Betweenness centrality: for large repos (above the threshold), betweenness_centrality uses k=500 random samples without a seed. If betweenness values feed into the rendered prompt, that's a third source of non-determinism you haven't addressed. Can you check whether that's the case? If so, adding seed=42 there too would complete the fix.

  2. Unit tests: the core changes (sorted edges, sorted communities) don't have tests that verify stability across different insertion orders. Something like building the same graph in two different orders and asserting identical community IDs would really lock this in and catch regressions in CI.

  3. Diagnostic script: scripts/diagnose_hash_mismatch.py imports private internals like _run_ingestion which will break on refactors. Fine to keep it as an unsupported debugging tool in scripts/, but a proper unit test would be more valuable long term.
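A regression test along the lines of point (2) could look like this (community_ids is a hypothetical stand-in for the project's community detection, inlined here so the test is self-contained; the real module layout will differ):

```python
import networkx as nx

def community_ids(g: nx.Graph) -> dict[str, int]:
    """Stand-in for the fixed pipeline: sorted rebuild, seeded
    Louvain, communities sorted by smallest member before numbering."""
    gs = nx.Graph()
    gs.add_nodes_from(sorted(g))
    gs.add_edges_from(sorted(tuple(sorted(e)) for e in g.edges()))
    comms = sorted(nx.community.louvain_communities(gs, seed=42), key=min)
    return {n: i for i, c in enumerate(comms) for n in c}

def test_community_ids_insertion_order_invariant():
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
    # Build the same graph with edges inserted in opposite orders.
    assert community_ids(nx.Graph(edges)) == \
        community_ids(nx.Graph(list(reversed(edges))))
```

(For point (1): networkx's betweenness_centrality does accept a seed parameter, so nx.betweenness_centrality(g, k=500, seed=42) would pin the sampling.)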

Happy to merge once (1) and (2) are addressed.

