fix(generation): stabilize prompt hashes across re-runs #62
dmikushin wants to merge 1 commit into repowise-dev:main
Graph edge ordering and community IDs were non-deterministic because files are parsed in parallel (ProcessPoolExecutor + as_completed), causing NetworkX node insertion order to vary between runs.

Changes:
- context_assembler: sort predecessors/successors before including them in FilePageContext so dependents/dependencies lists are identical across runs regardless of graph construction order
- graph: rebuild a sorted copy of the undirected graph before passing it to louvain_communities so adjacency traversal order is reproducible; also sort the returned community list by each community's smallest member before assigning integer IDs via enumerate()

Adds scripts/diagnose_hash_mismatch.py to verify the fix and identify any remaining sources of hash instability (dep_summaries, betweenness sampling, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
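To make the first change concrete, here is a minimal sketch of why list order alone changes the hash, and why sorting removes the difference. It is not the project's code: render_prompt is a hypothetical stand-in for the Jinja2 template render, and the file names and the two "runs" are made up for illustration.

```python
import hashlib


def render_prompt(path, dependents, dependencies):
    # Hypothetical stand-in for the real Jinja2 render: the lists are
    # written into the prompt in the order they are given.
    return (
        f"File: {path}\n"
        f"Dependents: {', '.join(dependents)}\n"
        f"Dependencies: {', '.join(dependencies)}\n"
    )


def source_hash(prompt):
    # SHA-256 over the rendered prompt text, as described in the PR.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# The same edges, arriving in two different orders (illustrative data).
run_a = {"dependents": ["b.py", "a.py"], "dependencies": ["z.py", "y.py"]}
run_b = {"dependents": ["a.py", "b.py"], "dependencies": ["y.py", "z.py"]}


def hash_run(run, stable):
    # With stable=True the lists are sorted before rendering.
    deps_in = sorted(run["dependents"]) if stable else run["dependents"]
    deps_out = sorted(run["dependencies"]) if stable else run["dependencies"]
    return source_hash(render_prompt("core.py", deps_in, deps_out))


unstable_match = hash_run(run_a, stable=False) == hash_run(run_b, stable=False)
stable_match = hash_run(run_a, stable=True) == hash_run(run_b, stable=True)
print(unstable_match, stable_match)  # prints: False True
```

Sorting at the point where the context is assembled keeps the graph construction code untouched, at the cost of an O(n log n) sort per file page.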
da425be to 6fdf4ee
Nice analysis and clean fix. The sorted predecessors/successors and stabilized Louvain ordering both make sense, and the PR description is really well written. A few things before I merge:
Happy to merge once (1) and (2) are addressed.
Problem
Every repowise init run regenerates all wiki pages from scratch, even when the codebase hasn't changed. The root cause is non-deterministic source_hash values: the SHA-256 is computed over the rendered Jinja2 prompt, and two context variables were unstable across runs.
Source of non-determinism 1: graph edge ordering
Files are parsed in parallel via ProcessPoolExecutor + as_completed, so the order in which nodes and edges are inserted into the NetworkX graph is non-deterministic. graph.predecessors() / graph.successors() return nodes in insertion order, so the dependents and dependencies lists in FilePageContext were shuffled between runs, which produced a different rendered prompt and therefore a different source_hash.
Fix: sort predecessors/successors before building FilePageContext in ContextAssembler.assemble_file_page.
Source of non-determinism 2: Louvain community IDs
nx.community.louvain_communities already receives seed=42, but the adjacency traversal order inside Louvain still depends on node insertion order (same root cause). Additionally, the community list returned by louvain_communities has no guaranteed order, so enumerate() assigned different integer IDs to the same community across runs.
Fix: before calling louvain_communities, rebuild a sorted copy of the undirected graph (g_stable) with nodes and edges added in alphabetical order. After the call, sort the returned community list by each community's lexicographically smallest member before enumerate().
Impact
These two fixes make source_hash stable across re-runs for unchanged files, enabling the DB content cache (_db_content_cache, keyed by source_hash) to skip redundant LLM calls and save API costs.
Testing
Added scripts/diagnose_hash_mismatch.py, a diagnostic script that:
- Runs betweenness_centrality() and community_detection() twice and reports any differing values
- For each file_page in wiki.db, renders the prompt fresh and compares its SHA-256 with the stored source_hash
- Reports whether a mismatch comes from dep_summaries or another factor, with a unified diff
Run from the target repo directory:
Checklist
- context_assembler.py: sort predecessors/successors
- graph.py: sorted graph copy + sorted community list before enumerate
- scripts/diagnose_hash_mismatch.py: diagnostic tool
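As a rough illustration of the graph.py item, the sketch below rebuilds a sorted undirected copy before calling Louvain and then orders the communities by their smallest member before numbering them. The function name stable_communities and the example graphs are assumptions made for this sketch, not the code in this PR; it also assumes node names are strings and drops edge attributes, which the real code may need to keep.

```python
import networkx as nx


def stable_communities(graph):
    """Map node -> community ID independently of insertion order (sketch)."""
    undirected = graph.to_undirected() if graph.is_directed() else graph

    # Rebuild the graph with nodes and edges inserted in sorted order so
    # adjacency traversal inside Louvain does not depend on how the
    # original graph was constructed. Edge attributes are dropped here.
    g_stable = nx.Graph()
    g_stable.add_nodes_from(sorted(undirected.nodes()))
    g_stable.add_edges_from(
        sorted(tuple(sorted(edge)) for edge in undirected.edges())
    )

    communities = nx.community.louvain_communities(g_stable, seed=42)

    # The returned list has no guaranteed order, so order it by each
    # community's lexicographically smallest member before enumerate().
    ordered = sorted(communities, key=min)
    return {
        node: community_id
        for community_id, members in enumerate(ordered)
        for node in sorted(members)
    }


# Same edges, inserted in two different orders (illustrative data).
edges = [
    ("a.py", "b.py"), ("b.py", "c.py"), ("a.py", "c.py"),
    ("d.py", "e.py"), ("e.py", "f.py"), ("d.py", "f.py"),
    ("c.py", "d.py"),
]
g1 = nx.Graph()
g1.add_edges_from(edges)
g2 = nx.Graph()
g2.add_edges_from((v, u) for u, v in reversed(edges))

ids_1 = stable_communities(g1)
ids_2 = stable_communities(g2)
print(ids_1 == ids_2)
```

Because both inputs are reduced to the same g_stable before Louvain runs, the two mappings agree within a process. Whether any other inputs still vary between runs is what the diagnostic script is meant to check.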