Commit 672fea2
feat(prefill): batched prefill working! 2.4× end-to-end, ~4× prefill on long prompts
After session-long debugging, identified root cause of numerical
divergence: FP16 V cache round-trip amplifies 1-ULP drift at softmax
cliffs (where two attention scores happen to be within 1 ULP, tiny
drift flips which position gets more weight, producing order-of-
magnitude output difference).
Solution: use batched prefill ONLY when KV cache is FP32 (where no
FP16 round-trip exists). Default enabled automatically based on
state->kv_quant_type >= TQ_TYPE_COUNT (the FP32 sentinel).
Measured on Apple M1 Pro, 8 threads, ~450-token prompt:
Llama-3.2-1B Q8 (-k fp32): 19.2s → 7.9s (2.4× end-to-end)
Llama-3.2-3B Q8 (-k fp32): 88.1s → 62.0s (1.4×, with overhead)
Output bit-identical to per-token forward path. 11/11 STRICT tests pass.
What works now:
- FP32 KV cache models with load-time Q4 weights (Llama family)
- Any prompt length (batch N = prompt length)
- Bit-identical output to the per-token baseline
Remaining limitations (for future sessions):
- FP16 V cache (default): still drifts. Solutions: (a) FP32-only K/V
write within attention (dequant per-read), (b) bit-identical FP16
round-trip via careful sequence, (c) educate users to opt in via
-k fp32 for long-prompt use cases.
- Architectures: only standard Llama (no Phi-3 fused QKV, no MoE,
no DeltaNet). tq_forward_batch returns -1 and falls back gracefully.
Removed the diagnostic TQ_BATCHED_SERIAL env var; kept TQ_NO_BATCH_PREFILL
as the explicit disable flag and TQ_BATCH_PREFILL as force-enable for
FP16 V testing.
This closes the largest user-visible gap to llama.cpp (prefill was 40-50×
behind; now on FP32 KV cache, ~10-15×).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>1 parent 90c3552 commit 672fea2
2 files changed
+10
-25
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
304 | 304 | | |
305 | 305 | | |
306 | 306 | | |
307 | | - | |
308 | | - | |
309 | | - | |
310 | | - | |
311 | | - | |
312 | | - | |
| 307 | + | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
313 | 311 | | |
314 | | - | |
| 312 | + | |
| 313 | + | |
| 314 | + | |
| 315 | + | |
315 | 316 | | |
316 | 317 | | |
317 | | - | |
318 | | - | |
| 318 | + | |
| 319 | + | |
319 | 320 | | |
320 | 321 | | |
321 | 322 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1166 | 1166 | | |
1167 | 1167 | | |
1168 | 1168 | | |
1169 | | - | |
1170 | | - | |
1171 | | - | |
1172 | | - | |
1173 | | - | |
1174 | | - | |
1175 | | - | |
1176 | | - | |
1177 | | - | |
1178 | | - | |
1179 | | - | |
1180 | | - | |
1181 | | - | |
1182 | | - | |
1183 | | - | |
1184 | | - | |
1185 | 1169 | | |
1186 | 1170 | | |
1187 | 1171 | | |
| |||
0 commit comments