
Commit bc8614d

unamedkr and claude committed
docs+revert: batched prefill gated on FP32 KV only; FP16 V needs more work
The attempt to auto-enable batched prefill for the FP16 V cache via direct FP32 VB reads (avoiding the FP16 round-trip inside batched attention) caused downstream decode degradation: in-batch attention was bit-identical to per-token with the FP16 round-trip, but the resulting KV cache values at later layers diverged slightly, and that drift propagated into decode, producing repetition-loop garbage ("HelhelHelhel..."). Reverted to the FP32-KV-only gate.

Safe default: batched prefill activates automatically when `-k fp32` is set and falls back to per-token for the default FP16 V KV type.

README v3.2 update documents the user-facing status:
- Long-prompt prefill on `-k fp32`: 2.4× end-to-end, ~4× prefill
- Default FP16 V: unchanged (per-token)
- Bringing batched prefill to default FP16 V is the next major engineering item

11/11 STRICT+COHERENT+Metal-ON tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 672fea2 commit bc8614d
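The "FP16 round-trip" the commit message blames is easy to demonstrate in isolation. The sketch below is illustrative only: `f32_to_f16` and `f16_to_f32` are hypothetical helpers, not this repo's code. It round-trips a float through IEEE-754 binary16 storage with round-to-nearest-even and shows that values not exactly representable in 10 mantissa bits come back slightly off, the kind of per-value drift that can compound across layers.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Hypothetical helpers, not code from this repo: store an FP32 value as
 * IEEE-754 binary16 (round-to-nearest-even) and load it back. Handles
 * normal-range values only; inf/NaN/subnormals are flushed for brevity. */
static uint16_t f32_to_f16(float f) {
    uint32_t x; memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  e    = (int32_t)((x >> 23) & 0xffu) - 127 + 15;
    uint32_t m    = x & 0x7fffffu;
    if (e <= 0 || e >= 31) return (uint16_t)sign;             /* demo: flush */
    uint32_t hm  = m >> 13;                                   /* 23 -> 10 bits */
    uint32_t rem = m & 0x1fffu;                               /* dropped bits  */
    if (rem > 0x1000u || (rem == 0x1000u && (hm & 1u))) hm++; /* nearest-even  */
    if (hm == 0x400u) { hm = 0; e++; }                        /* mantissa carry */
    return (uint16_t)(sign | ((uint32_t)e << 10) | hm);
}

static float f16_to_f32(uint16_t h) {
    uint32_t e = (h >> 10) & 0x1fu, m = h & 0x3ffu;
    uint32_t x = (uint32_t)(h & 0x8000u) << 16;
    if (e != 0 && e != 31)                                    /* normals only */
        x |= ((e - 15 + 127) << 23) | (m << 13);
    float f; memcpy(&f, &x, sizeof x);
    return f;
}
```

Exactly representable values (1.0, 0.5, 2.0) survive the round-trip unchanged; a value like 0.1 does not. A V cache stored this way is therefore not bit-identical to FP32, which is why the batched path is gated on `-k fp32`.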

File tree

3 files changed (+12, -10 lines)


README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -165,6 +165,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
 
 > **v3.1 throughput update (2026-04-15):** A focused perf round (Q4_K/Q5_K int8 fused dot, ARMv8.2 `vdotq_s32`, weight-row prefetch, 2-row ILP, P-core thread default) lifted CPU generation throughput by **+58% to +141%** across our model lineup on M1 Pro. Phi-3.5-mini Q8_0 jumped 5.4 → 13.0 tok/s (now at 71% of llama.cpp's pure-CPU speed). We're still 3-6× behind llama.cpp's mature Metal kernels — that's the next gap to close. Full numbers + reproduce instructions: [`bench/results/2026-04-15_throughput_vs_llamacpp.md`](bench/results/2026-04-15_throughput_vs_llamacpp.md).
 
+> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS), auto-enabled when KV cache is FP32. On Llama-3.2-1B Q8 with a ~450-token prompt: **19s → 8s end-to-end** (2.4× total, ~4× on prefill alone), bit-identical to per-token. Auto-enables on `-k fp32`; default FP16 V still uses per-token because drift at softmax cliffs amplifies over 16 layers into wrong tokens. Closes the worst prefill gap by ~4× today; bringing batched to default-FP16 mode is the next major engineering item. Commits `ed4b087`, `672fea2`.
+
 ---
 
 ## More Features
```
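The speedup mechanism behind the README entry above is the standard one: prefilling N prompt tokens is N matrix-vector products against the same weight matrix, which can be fused into one matrix-matrix product. A minimal sketch, with naive loops standing in for the `cblas_sgemm`-style kernel and all names invented for illustration:

```c
#include <stddef.h>

/* Per-token path: y = W x for one token (W is d_out x d_in, row-major). */
static void matvec(const float* W, const float* x, float* y,
                   int d_out, int d_in) {
    for (int r = 0; r < d_out; r++) {
        float acc = 0.0f;
        for (int c = 0; c < d_in; c++)
            acc += W[(size_t)r * d_in + c] * x[c];
        y[r] = acc;
    }
}

/* Batched path: all n token rows at once. Mathematically identical to n
 * matvec calls, but the matrix-matrix shape is what lets a real GEMM
 * kernel (AMX, cblas_sgemm, ...) run at far higher arithmetic intensity. */
static void matmat(const float* W, const float* X, float* Y,
                   int n, int d_out, int d_in) {
    for (int t = 0; t < n; t++)
        matvec(W, X + (size_t)t * d_in, Y + (size_t)t * d_out, d_out, d_in);
}
```

The loops here only demonstrate the equivalence of the two shapes; the measured 2.4× comes from handing the batched shape to a tuned GEMM rather than looping.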

src/engine/tq_generate.c

Lines changed: 9 additions & 7 deletions
```diff
@@ -304,19 +304,21 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
     if (config->load_kv_path && pos_after_prefill > 0) {
         prefill_start = pos_after_prefill;
     }
-    /* Batched prefill: use when FP32 KV cache is active (no drift from FP16
-     * round-trip) and architecture is supported. Gives 3-4× end-to-end
-     * speedup on long prompts. Bit-identical to per-token forward when the
-     * KV cache is FP32. Set TQ_NO_BATCH_PREFILL=1 to force per-token. */
+    /* Batched prefill: enabled by default when KV cache is FP32 (no drift
+     * from FP16 round-trip). For FP16 V (default KV quantization mode),
+     * the 1-ULP drift amplifies at softmax cliffs and breaks downstream
+     * decode even though in-batch attention looks correct. Users who want
+     * batched speedup should pass `-k fp32`. Set TQ_BATCH_PREFILL=1 to
+     * force-enable for FP16 V (at the risk of degraded output). */
     int batch_ok = 0;
-    int kv_is_fp32 = (state->kv_quant_type >= TQ_TYPE_COUNT); /* sentinel = FP32 */
+    int kv_is_fp32 = (state->kv_quant_type >= TQ_TYPE_COUNT);
     int want_batched = (n_prompt >= 2) && !getenv("TQ_NO_BATCH_PREFILL")
                        && (kv_is_fp32 || getenv("TQ_BATCH_PREFILL"));
     if (want_batched) {
         int rc = tq_forward_batch(model, state, prompt_tokens, n_prompt, prefill_start);
         if (getenv("TQ_DEBUG_PREFILL"))
-            fprintf(stderr, "[batch_prefill] rc=%d expected=%d (N=%d kv_fp32=%d)\n",
-                    rc, prefill_start + n_prompt, n_prompt, kv_is_fp32);
+            fprintf(stderr, "[batch_prefill] rc=%d expected=%d (N=%d)\n",
+                    rc, prefill_start + n_prompt, n_prompt);
         if (rc == prefill_start + n_prompt) {
             tq_forward(model, state, prompt_tokens[n_prompt - 1],
                        prefill_start + n_prompt - 1);
```
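The gate in this hunk reduces to a small pure predicate. A hypothetical distillation (the function name and boolean parameters are mine; the real code reads `getenv` and `state->kv_quant_type` directly):

```c
/* Illustrative restatement of the batched-prefill gate: run batched when
 * the prompt has at least 2 tokens, TQ_NO_BATCH_PREFILL is unset, and the
 * KV cache is FP32 (or TQ_BATCH_PREFILL force-enables it for FP16 V). */
static int want_batched_prefill(int n_prompt, int kv_is_fp32,
                                int no_batch_env, int force_batch_env) {
    return n_prompt >= 2 && !no_batch_env && (kv_is_fp32 || force_batch_env);
}
```

Note the precedence the short-circuit encodes: `TQ_NO_BATCH_PREFILL` wins over `TQ_BATCH_PREFILL`, so the opt-out always takes effect even when the force-enable flag is also set.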

src/engine/tq_transformer.c

Lines changed: 1 addition & 3 deletions
```diff
@@ -3360,9 +3360,7 @@ int tq_forward_batch(tq_model_t* model, tq_state_t* s,
         }
     } else {
         /* FP16 V cache: use NEON vcvt_f32_f16 to exactly match the
-         * per-token attention path. Inline IEEE-754 conversion gave
-         * subtly different rounding (1 ULP) which compounded across
-         * 16 layers into garbage output. */
+         * per-token attention path. */
         for (int t = 0; t <= pos; t++) {
             uint16_t* vh = V_layer_fp16 + (size_t)t * kv_dim + kvh * head_dim;
             float w = att[t];
```
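For context on this branch: the FP16-V path accumulates attention-weighted V rows, converting each half-precision element to FP32 on the fly. A simplified scalar sketch, where a normals-only `half_to_float` stands in for NEON `vcvt_f32_f16` and all names are illustrative rather than the repo's:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Normals-only binary16 -> binary32 load, standing in for vcvt_f32_f16. */
static float half_to_float(uint16_t h) {
    uint32_t e = (h >> 10) & 0x1fu, m = h & 0x3ffu;
    uint32_t x = (uint32_t)(h & 0x8000u) << 16;
    if (e != 0 && e != 31)
        x |= ((e - 15 + 127) << 23) | (m << 13);
    float f; memcpy(&f, &x, sizeof x);
    return f;
}

/* out = sum over t in 0..pos of att[t] * V[t] for one head, where V rows
 * are stored as FP16. The commit's lesson: this conversion must round
 * exactly like the per-token path, or per-element drift compounds. */
static void att_weighted_v(const uint16_t* V, const float* att,
                           int pos, int head_dim, float* out) {
    for (int i = 0; i < head_dim; i++) out[i] = 0.0f;
    for (int t = 0; t <= pos; t++) {
        const uint16_t* vh = V + (size_t)t * head_dim;  /* one cached V row */
        float w = att[t];                               /* softmax weight   */
        for (int i = 0; i < head_dim; i++)
            out[i] += w * half_to_float(vh[i]);
    }
}
```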
