Summary
Add a lightweight sentence embedding model (all-MiniLM-L6-v2, 22M params) as the third signal in the Reciprocal Rank Fusion (RRF) locator, catching semantic matches that BM25 and keywords miss.
Problem
The current locator uses BM25 + keyword overlap. On the 1.3MB large-doc test, Q15 fails: "temnein" (a Greek word meaning "to cut") is in chunk 531, but BM25 picks chunk 553 (which discusses "Stegocephalia" = "roof-headed"). The semantic connection between the query "what does temnein mean" and the chunk containing "temnein (to cut)" is obvious to a human reader but invisible to keyword matching.
Proposed Solution
# Three-signal RRF (currently two)
rrf[cid] = (1/(60+rank_keyword) +
            1/(60+rank_bm25) +
            1/(60+rank_semantic))  # NEW
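The formula above can be sketched as a small standalone function. This is only an illustration under assumptions: the names `rrf_scores` and `rrf_rank`, the tie-break by chunk id, and the toy chunk ids are hypothetical and not taken from the locator's actual code. Only the constant 60 and the three signals come from the proposal.

```python
from collections import defaultdict

RRF_K = 60  # smoothing constant used in the formula above


def rrf_scores(rankings, k=RRF_K):
    """Sum 1/(k + rank) for every chunk id over all ranked lists.

    Each list in `rankings` is ordered best-first, so index 0 is rank 1.
    A chunk absent from a list contributes nothing for that signal.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] += 1.0 / (k + rank)
    return dict(scores)


def rrf_rank(rankings, k=RRF_K):
    """Return chunk ids ordered by fused score, best first."""
    scores = rrf_scores(rankings, k)
    # Tie-break on chunk id so the order is deterministic (an assumption).
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Toy rankings with made-up chunk ids, one list per signal.
rank_keyword = [10, 20, 30]
rank_bm25 = [20, 10, 30]
rank_semantic = [10, 30, 20]  # the proposed third signal

scores = rrf_scores([rank_keyword, rank_bm25, rank_semantic])
fused = rrf_rank([rank_keyword, rank_bm25, rank_semantic])
print(fused)
```

Because the function takes a list of rankings, adding the semantic signal is just one more list in the call; the fusion logic itself does not change.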
Embedding model selection
| Model | Params | CPU latency | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | ~30ms/query | Good |
| BGE-small-en | 33M | ~50ms/query | Better |
| nomic-embed-text | 137M | ~200ms/query | Best |
all-MiniLM-L6-v2 is recommended: at ~30ms per query on CPU it is the fastest of the three, and it needs no GPU.
Pre-computation
Chunk embeddings are computed once during `quantcpp index` and stored alongside the KV caches. At query time only the query itself has to be embedded (~30ms with MiniLM on CPU).
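A minimal sketch of the index-time / query-time split described above. Everything here is an assumption for illustration: `build_index`, `semantic_ranking`, and the `embed` callable are hypothetical names, and `toy_embed` is a trivial word-count stand-in so the example runs without a model. In practice `embed` would wrap all-MiniLM-L6-v2, and the vectors would be persisted with the index rather than kept in a dict.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(chunks, embed):
    """Index time: embed every chunk once and keep the unit vectors."""
    return {cid: normalize(embed(text)) for cid, text in chunks.items()}


def semantic_ranking(query, index, embed):
    """Query time: one embed call, then one dot product per stored chunk."""
    q = normalize(embed(query))
    sims = {cid: sum(a * b for a, b in zip(q, v)) for cid, v in index.items()}
    return sorted(sims, key=lambda cid: (-sims[cid], cid))


# Stand-in embedder (NOT semantic): counts words from a tiny fixed vocabulary.
VOCAB = ["cut", "roof", "head", "greek"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


chunks = {1: "greek verb meaning cut", 2: "roof head"}
index = build_index(chunks, toy_embed)  # done once, at index time
ranking = semantic_ranking("cut greek", index, toy_embed)  # per query
print(ranking)
```

The resulting `ranking` list is what would feed the RRF fusion as `rank_semantic`.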
Expected Impact
- Q15 (temnein): semantic similarity catches the correct chunk
- 19/20 → 20/20 on the large-doc test
- General: better handling of paraphrased/synonym queries
Priority: P2