
Commit bc8614d

unamedkr and claude committed
docs+revert: batched prefill gated on FP32 KV only; FP16 V needs more work
The attempt to auto-enable batched prefill for the FP16 V cache via direct FP32 VB reads (avoiding the FP16 round-trip inside batched attention) caused downstream decode degradation: in-batch attention was bit-identical to per-token with the FP16 round-trip, but the resulting KV cache values at later layers diverged slightly, and that drift propagated into decode, producing repetition-loop garbage ("HelhelHelhel..."). Reverted to the FP32-KV-only gate.

Safe default: batched prefill activates automatically when `-k fp32` is set and falls back to per-token for the default FP16 V KV type.

README v3.2 update documents the user-facing status:
- Long-prompt prefill on `-k fp32`: 2.4× end-to-end, ~4× prefill
- Default FP16 V: unchanged (per-token)
- Bringing batched prefill to default FP16 V is the next major engineering item

11/11 STRICT+COHERENT+Metal-ON tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 672fea2 commit bc8614d
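The "FP16 round-trip" the commit message blames is easy to demonstrate in isolation. The sketch below is illustrative only: `f32_to_f16` and `f16_to_f32` are hypothetical helpers, not this repo's code. It round-trips a float through IEEE-754 binary16 storage with round-to-nearest-even and shows that values not exactly representable in 10 mantissa bits come back slightly off, the kind of per-value drift that can compound across layers.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Hypothetical helpers, not code from this repo: store an FP32 value as
 * IEEE-754 binary16 (round-to-nearest-even) and load it back. Handles
 * normal-range values only; inf/NaN/subnormals are flushed for brevity. */
static uint16_t f32_to_f16(float f) {
    uint32_t x; memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  e    = (int32_t)((x >> 23) & 0xffu) - 127 + 15;
    uint32_t m    = x & 0x7fffffu;
    if (e <= 0 || e >= 31) return (uint16_t)sign;             /* demo: flush */
    uint32_t hm  = m >> 13;                                   /* 23 -> 10 bits */
    uint32_t rem = m & 0x1fffu;                               /* dropped bits  */
    if (rem > 0x1000u || (rem == 0x1000u && (hm & 1u))) hm++; /* nearest-even  */
    if (hm == 0x400u) { hm = 0; e++; }                        /* mantissa carry */
    return (uint16_t)(sign | ((uint32_t)e << 10) | hm);
}

static float f16_to_f32(uint16_t h) {
    uint32_t e = (h >> 10) & 0x1fu, m = h & 0x3ffu;
    uint32_t x = (uint32_t)(h & 0x8000u) << 16;
    if (e != 0 && e != 31)                                    /* normals only */
        x |= ((e - 15 + 127) << 23) | (m << 13);
    float f; memcpy(&f, &x, sizeof x);
    return f;
}
```

Exactly representable values (1.0, 0.5, 2.0) survive the round-trip unchanged; a value like 0.1 does not. A V cache stored this way is therefore not bit-identical to FP32, which is why the batched path is gated on `-k fp32`.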

File tree

3 files changed (+12, -10 lines)


README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -165,6 +165,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
 
 > **v3.1 throughput update (2026-04-15):** A focused perf round (Q4_K/Q5_K int8 fused dot, ARMv8.2 `vdotq_s32`, weight-row prefetch, 2-row ILP, P-core thread default) lifted CPU generation throughput by **+58% to +141%** across our model lineup on M1 Pro. Phi-3.5-mini Q8_0 jumped 5.4 → 13.0 tok/s (now at 71% of llama.cpp's pure-CPU speed). We're still 3-6× behind llama.cpp's mature Metal kernels — that's the next gap to close. Full numbers + reproduce instructions: [`bench/results/2026-04-15_throughput_vs_llamacpp.md`](bench/results/2026-04-15_throughput_vs_llamacpp.md).
 
+> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS), auto-enabled when KV cache is FP32. On Llama-3.2-1B Q8 with a ~450-token prompt: **19s → 8s end-to-end** (2.4× total, ~4× on prefill alone), bit-identical to per-token. Auto-enables on `-k fp32`; default FP16 V still uses per-token because drift at softmax cliffs amplifies over 16 layers into wrong tokens. Closes the worst prefill gap by ~4× today; bringing batched to default-FP16 mode is the next major engineering item. Commits `ed4b087`, `672fea2`.
+
 ---
 
 ## More Features
```
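The speedup mechanism behind the README entry above is the standard one: prefilling N prompt tokens is N matrix-vector products against the same weight matrix, which can be fused into one matrix-matrix product. A minimal sketch, with naive loops standing in for the `cblas_sgemm`-style kernel and all names invented for illustration:

```c
#include <stddef.h>

/* Per-token path: y = W x for one token (W is d_out x d_in, row-major). */
static void matvec(const float* W, const float* x, float* y,
                   int d_out, int d_in) {
    for (int r = 0; r < d_out; r++) {
        float acc = 0.0f;
        for (int c = 0; c < d_in; c++)
            acc += W[(size_t)r * d_in + c] * x[c];
        y[r] = acc;
    }
}

/* Batched path: all n token rows at once. Mathematically identical to n
 * matvec calls, but the matrix-matrix shape is what lets a real GEMM
 * kernel (AMX, cblas_sgemm, ...) run at far higher arithmetic intensity. */
static void matmat(const float* W, const float* X, float* Y,
                   int n, int d_out, int d_in) {
    for (int t = 0; t < n; t++)
        matvec(W, X + (size_t)t * d_in, Y + (size_t)t * d_out, d_out, d_in);
}
```

The loops here only demonstrate the equivalence of the two shapes; the measured 2.4× comes from handing the batched shape to a tuned GEMM rather than looping.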

src/engine/tq_generate.c

Lines changed: 9 additions & 7 deletions
```diff
@@ -304,19 +304,21 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
     if (config->load_kv_path && pos_after_prefill > 0) {
         prefill_start = pos_after_prefill;
     }
-    /* Batched prefill: use when FP32 KV cache is active (no drift from FP16
-     * round-trip) and architecture is supported. Gives 3-4× end-to-end
-     * speedup on long prompts. Bit-identical to per-token forward when the
-     * KV cache is FP32. Set TQ_NO_BATCH_PREFILL=1 to force per-token. */
+    /* Batched prefill: enabled by default when KV cache is FP32 (no drift
+     * from FP16 round-trip). For FP16 V (default KV quantization mode),
+     * the 1-ULP drift amplifies at softmax cliffs and breaks downstream
+     * decode even though in-batch attention looks correct. Users who want
+     * batched speedup should pass `-k fp32`. Set TQ_BATCH_PREFILL=1 to
+     * force-enable for FP16 V (at the risk of degraded output). */
     int batch_ok = 0;
-    int kv_is_fp32 = (state->kv_quant_type >= TQ_TYPE_COUNT); /* sentinel = FP32 */
+    int kv_is_fp32 = (state->kv_quant_type >= TQ_TYPE_COUNT);
     int want_batched = (n_prompt >= 2) && !getenv("TQ_NO_BATCH_PREFILL")
                        && (kv_is_fp32 || getenv("TQ_BATCH_PREFILL"));
     if (want_batched) {
         int rc = tq_forward_batch(model, state, prompt_tokens, n_prompt, prefill_start);
         if (getenv("TQ_DEBUG_PREFILL"))
-            fprintf(stderr, "[batch_prefill] rc=%d expected=%d (N=%d kv_fp32=%d)\n",
-                    rc, prefill_start + n_prompt, n_prompt, kv_is_fp32);
+            fprintf(stderr, "[batch_prefill] rc=%d expected=%d (N=%d)\n",
+                    rc, prefill_start + n_prompt, n_prompt);
         if (rc == prefill_start + n_prompt) {
             tq_forward(model, state, prompt_tokens[n_prompt - 1],
                        prefill_start + n_prompt - 1);
```
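The gate in this hunk reduces to a small pure predicate. A hypothetical distillation (the function name and boolean parameters are mine; the real code reads `getenv` and `state->kv_quant_type` directly):

```c
/* Illustrative restatement of the batched-prefill gate: run batched when
 * the prompt has at least 2 tokens, TQ_NO_BATCH_PREFILL is unset, and the
 * KV cache is FP32 (or TQ_BATCH_PREFILL force-enables it for FP16 V). */
static int want_batched_prefill(int n_prompt, int kv_is_fp32,
                                int no_batch_env, int force_batch_env) {
    return n_prompt >= 2 && !no_batch_env && (kv_is_fp32 || force_batch_env);
}
```

Note the precedence the short-circuit encodes: `TQ_NO_BATCH_PREFILL` wins over `TQ_BATCH_PREFILL`, so the opt-out always takes effect even when the force-enable flag is also set.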

src/engine/tq_transformer.c

Lines changed: 1 addition & 3 deletions
```diff
@@ -3360,9 +3360,7 @@ int tq_forward_batch(tq_model_t* model, tq_state_t* s,
         }
     } else {
         /* FP16 V cache: use NEON vcvt_f32_f16 to exactly match the
-         * per-token attention path. Inline IEEE-754 conversion gave
-         * subtly different rounding (1 ULP) which compounded across
-         * 16 layers into garbage output. */
+         * per-token attention path. */
         for (int t = 0; t <= pos; t++) {
             uint16_t* vh = V_layer_fp16 + (size_t)t * kv_dim + kvh * head_dim;
             float w = att[t];
```
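For context on this branch: the FP16-V path accumulates attention-weighted V rows, converting each half-precision element to FP32 on the fly. A simplified scalar sketch, where a normals-only `half_to_float` stands in for NEON `vcvt_f32_f16` and all names are illustrative rather than the repo's:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Normals-only binary16 -> binary32 load, standing in for vcvt_f32_f16. */
static float half_to_float(uint16_t h) {
    uint32_t e = (h >> 10) & 0x1fu, m = h & 0x3ffu;
    uint32_t x = (uint32_t)(h & 0x8000u) << 16;
    if (e != 0 && e != 31)
        x |= ((e - 15 + 127) << 23) | (m << 13);
    float f; memcpy(&f, &x, sizeof x);
    return f;
}

/* out = sum over t in 0..pos of att[t] * V[t] for one head, where V rows
 * are stored as FP16. The commit's lesson: this conversion must round
 * exactly like the per-token path, or per-element drift compounds. */
static void att_weighted_v(const uint16_t* V, const float* att,
                           int pos, int head_dim, float* out) {
    for (int i = 0; i < head_dim; i++) out[i] = 0.0f;
    for (int t = 0; t <= pos; t++) {
        const uint16_t* vh = V + (size_t)t * head_dim;  /* one cached V row */
        float w = att[t];                               /* softmax weight   */
        for (int i = 0; i < head_dim; i++)
            out[i] += w * half_to_float(vh[i]);
    }
}
```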
