
KV cache pre-build for RLV: question latency 35s → 5s #83

@unamedkr


Summary

Pre-compute and persist KV caches for each document chunk during indexing, eliminating prefill overhead at query time. This is the single highest-impact speed optimization for RLV.

Current State

Each RLV question requires:

Locator (BM25):    0.01s  ← fast
Lookup (LLM):     15-20s  ← prefill 8s + generate 7s
Verifier:          0-8s
Total:            ~35s/question

The 8s prefill is spent re-reading the same chunk text every time. For 20 questions on the same document, we prefill the same chunks ~20 times.

Proposed Solution

# One-time indexing (slow, ~5min for 1.3MB doc)
quantcpp index document.txt --output document.kv/

# Per-question (fast, ~5s)
quantcpp rlv --index document.kv/ "Who directed Mercury Fur?"

Implementation:

# During indexing: prefill each chunk once and persist its KV cache
for chunk in gist.chunks:
    ctx = quant_new(model, config)
    # Prefill only: no token callback. Assumes config caps generation at zero new tokens.
    quant_generate(ctx, chunk.text, null_callback, null)
    quant_save_context(ctx, f"document.kv/chunk_{chunk.id}.kv")

# During query: restore the cache for the chunk the locator picked
ctx = quant_new(model, config)
quant_load_context(ctx, f"document.kv/chunk_{best_id}.kv")  # disk read, no prefill
quant_generate(ctx, question, on_token, data)  # generate only (~5s)
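
The save/load pattern above can be exercised without the real engine. The following is a minimal sketch in which `StubContext`, `build_index`, and `context_for_query` are hypothetical names invented for illustration and are not part of the quant.cpp API; the "KV cache" is just a list of words pickled to disk. It only demonstrates the control flow: prefill runs once per chunk at indexing time and never at query time.

```python
import pickle
import tempfile
from pathlib import Path


class StubContext:
    """Hypothetical stand-in for an inference context; not the quant.cpp API."""

    def __init__(self):
        self.kv = []            # pretend KV cache: one entry per prefilled token
        self.prefill_calls = 0  # counts how often the expensive step ran

    def prefill(self, text):
        self.prefill_calls += 1
        self.kv.extend(text.split())

    def save(self, path):
        Path(path).write_bytes(pickle.dumps(self.kv))

    def load(self, path):
        self.kv = pickle.loads(Path(path).read_bytes())


def build_index(chunks, index_dir):
    """Prefill every chunk once and persist its cache under index_dir."""
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    for chunk_id, text in chunks.items():
        ctx = StubContext()
        ctx.prefill(text)
        ctx.save(index_dir / f"chunk_{chunk_id}.kv")


def context_for_query(best_id, index_dir):
    """Restore the cached state for the chunk the locator picked; no prefill."""
    ctx = StubContext()
    ctx.load(Path(index_dir) / f"chunk_{best_id}.kv")
    return ctx


chunks = {0: "Mercury Fur is a play", 1: "an unrelated passage of text"}
index_dir = Path(tempfile.mkdtemp()) / "document.kv"
build_index(chunks, index_dir)

ctx = context_for_query(0, index_dir)
print(len(ctx.kv), ctx.prefill_calls)
```

In the real implementation the restored context would then be passed to the generate call with the question appended; the stub stops at the point where the cached state is back in memory.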

Impact

| Metric | Before | After |
|---|---|---|
| Per-question latency | 35s | ~5s |
| 20-question benchmark | 12min | ~2min |
| First-question latency | 35s | 35s (indexing amortized) |
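
To see when indexing pays for itself, the figures quoted in this issue (35s before, ~5s after, ~5min of one-time indexing for the 1.3MB document) can be plugged into a small calculation. This is a back-of-the-envelope sketch; the helper names are made up for illustration, and the after-latency is the projected value, not a measurement. Note that the ~2min benchmark figure in the table excludes the one-time indexing cost.

```python
import math


def break_even_questions(index_s, before_s, after_s):
    """Number of questions at which time saved equals the indexing cost."""
    saved_per_question = before_s - after_s
    if saved_per_question <= 0:
        raise ValueError("the cached path must be faster than the baseline")
    return math.ceil(index_s / saved_per_question)


def total_minutes(n_questions, per_question_s, index_s=0.0):
    """Wall-clock minutes for a session, optionally including indexing."""
    return (index_s + n_questions * per_question_s) / 60


INDEX_S = 5 * 60  # ~5min indexing, as quoted in the proposal

print(break_even_questions(INDEX_S, before_s=35, after_s=5))
print(round(total_minutes(20, 35), 1))           # baseline, no index
print(round(total_minutes(20, 5), 1))            # cached, index prebuilt
print(round(total_minutes(20, 5, INDEX_S), 1))   # cached, indexing included
```

Under these numbers the index is worth building once a document is expected to receive more than about ten questions, or whenever indexing can run ahead of time.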

quant.cpp Advantage

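quant.cpp already exposes save_context/load_context, so persistence needs no new engine work. Other engines offer comparable state persistence (llama.cpp, for example, can save and restore session state), but combined with quant.cpp's KV compression (6.4x), each chunk's cache is only a few hundred KB on disk.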
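
The on-disk size per chunk depends on the model's shape and the chunk length, so it is worth estimating for the actual model before committing to a storage budget. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head, per token). All model dimensions in the example call are hypothetical placeholders, and it assumes the 6.4x ratio is measured against a 2-byte (fp16) baseline, which the issue does not state.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_elem=2, compression=1.0):
    """Estimated KV cache size in bytes for one chunk.

    The factor 2 accounts for storing both keys and values.
    """
    raw = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem
    return raw / compression


# Hypothetical model shape and chunk length; substitute the real values.
size = kv_cache_bytes(n_layers=16, n_kv_heads=8, head_dim=64,
                      n_tokens=512, compression=6.4)
print(f"{size / 1024:.0f} KiB per chunk")
```

Multiplying the per-chunk figure by the number of chunks in a document gives the total index size to expect under the chosen assumptions.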

Priority: P0

This is the difference between "demo" and "usable product". 35s/question is a demo; 5s/question is a tool people actually use.


Proposed by ClawTeam based on RLV Day 5 benchmarking

Labels: enhancement