Commit 442c2d7

unamedkr and claude committed
refactor(batched): vector-accumulator matmul + deeper drift analysis
bm_q4_worker now uses NEON vector accumulators (float32x4_t sumv[N] with vmlaq_n_f32 + vaddvq_f32 reduce) to match matmul_q4_rows' FP rounding. This brings it architecturally in line with baseline's per-token quantized matmul.

However, integration-level drift persists. Even TQ_BATCHED_SERIAL=1 (which forces bit-for-bit identical per-token matmul via the same tq_matmul_q4_preq call baseline uses) still produces wrong output. The bug is therefore NOT in the matmul accumulator but in the surrounding tq_forward_batch orchestration.

Divergence is highly specific: Layer 3 tok1 (pos=1) diverges at indices 1, 5, 6, 7 but matches at 0, 2, 3, 4 — a pattern-based drift, not random noise.

Updated handoff doc with concrete next-session experiments:
- Dump Layer 3 tok0 wo-matmul output byte-for-byte
- Dump Layer 3 tok1 attention scores att[0], att[1]
- If scores differ: trace back to K-cache at layer 3 pos=0
- If K-cache differs: trace back to WK matmul output for tok0

11/11 STRICT tests still pass (batched still TQ_BATCH_PREFILL-gated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent bd063e0 commit 442c2d7

File tree

2 files changed: +73 −16 lines changed

docs/dev/batched_prefill_handoff.md
Lines changed: 43 additions & 0 deletions

````diff
@@ -27,6 +27,49 @@ DYLD_LIBRARY_PATH=build \
 # prints: " I'm so excited" ← baseline (correct)
 ```
 
+## Session 2 findings (2026-04-16 early AM)
+
+After extensive layer-by-layer diff between batched and per-token paths:
+
+**What's bit-identical**:
+- tok0 (pos=0) through Layer 15 — every sub-op, every layer
+- tok1 (pos=1) through Layer 2 final Xres
+- Layer 3 attention-residual value at indices 0, 2, 3, 4 (partial match)
+
+**What diverges**:
+- Layer 3 tok1 attention-residual at indices 1, 5, 6, 7 — exactly 1 ULP off
+- This 1-ULP drift compounds ~1%/layer → wrong token by Layer 15
+
+**Surprising**: Setting `TQ_BATCHED_SERIAL=1` (which replaces my bm_q4_worker
+with literal per-token `tq_matmul_q4_preq` calls — the SAME function baseline
+uses) STILL produces the divergence. So the bug is not in the batched matmul
+accumulator order; it's somewhere in the broader orchestration of
+tq_forward_batch when processing multi-token.
+
+**Fixed along the way** (each moved Layer 0 closer to bit-identical):
+- Q8 quantization: `roundf` → `tq_quantize_row_q8` (NEON RNE)
+- FP16 conversion (write): inline → `f32_to_fp16_vec`
+- FP16 conversion (read): inline → NEON `vcvt_f32_f16`
+- Attention score accumulation: scalar → `vfmaq_f32` NEON
+- bm_q4_worker: scalar accumulator → NEON `float32x4_t sumv[]` + `vaddvq_f32`
+
+**Remaining hypothesis** (to test next session):
+The drift is at specific indices, not random. Index 1 of Layer 3 tok1 diverges
+but indices 0, 2, 3 don't. This is consistent with a SPECIFIC memory location
+being read slightly off. Possibilities:
+- Aliasing: my X buffer might be accidentally read before fully written in
+  some multi-token iteration (out-of-order thread effect)
+- FP16 round-trip on a specific value whose LSB happens to sit on a boundary
+- The `tq_forward` final call (after batched) reads K-cache positions [0..pos-1]
+  written by batched; if ANY of those are 1 ULP off for any layer, the final
+  attention sees slightly wrong history. Could be a compounding effect.
+
+**Concrete next-session experiments**:
+1. Dump Layer 3 tok0 wo matmul output (OB→X) byte-for-byte vs baseline
+2. Dump Layer 3 tok1 attention scores (att[0], att[1]) vs baseline
+3. If scores differ, dump K-cache at layer 3 pos=0 vs baseline
+4. If K-cache differs, dump the WK matmul output for tok0 at layer 3
+
 ## Latest session findings (2026-04-15 evening)
 
 - **SANITY mode confirms orchestration is correct**. Setting
````
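The dump-and-compare experiments above need a way to quantify "exactly 1 ULP off". A minimal sketch of an integer-ordinal ULP distance in portable C — the helper names `ulp_dist` and `first_drift` are hypothetical, not part of the codebase:

```c
#include <stdint.h>
#include <string.h>

/* Map a finite float's bit pattern onto a monotonically ordered integer
 * scale, so adjacent representable floats differ by exactly 1. Negative
 * floats (sign bit set) map to the negated magnitude; ±0 both map to 0. */
static int64_t f32_ordinal(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
}

/* ULP distance between two finite floats. */
int64_t ulp_dist(float a, float b) {
    int64_t d = f32_ordinal(a) - f32_ordinal(b);
    return d < 0 ? -d : d;
}

/* Index of the first element pair more than max_ulp apart, or -1 if the
 * buffers match to that tolerance — usable on dumped activation buffers. */
long first_drift(const float *a, const float *b, long n, int64_t max_ulp) {
    for (long i = 0; i < n; i++)
        if (ulp_dist(a[i], b[i]) > max_ulp) return i;
    return -1;
}
```

Running `first_drift` with `max_ulp = 0` on the Layer 3 tok1 attention-residual dumps would report index 1 first; with `max_ulp = 1` it would report a clean -1, confirming the drift is exactly one ULP.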

src/engine/tq_ops.c
Lines changed: 30 additions & 16 deletions

````diff
@@ -1081,26 +1081,31 @@ static void* bm_q4_worker(void* arg) {
 #ifdef __ARM_NEON
     const uint8x16_t mask_0f = vdupq_n_u8(0x0F);
     const uint8x16_t v8 = vdupq_n_u8(8);
+#endif
+    /* Per-row, per-token NEON vector accumulators. To match matmul_q4_rows'
+     * FP rounding bit-for-bit, we must use vmlaq_n_f32 into a 4-lane
+     * float32x4_t accumulator and reduce with vaddvq_f32 at the end. A
+     * scalar `acc[n] += ...` path produces 1-ULP drift per block which
+     * compounds 1% per transformer layer and flips output tokens. */
+#ifdef __ARM_NEON
+    if (N > 64) { /* safety limit for stack-alloc'd sumv array */
+        /* Fallback: serial per-token via tq_matmul_q4_preq would be needed
+         * here. For now reject large batches to keep the hot path simple. */
+        return NULL;
+    }
+    float32x4_t sumv[64];
 #endif
     for (int i = t->start_row; i < t->end_row; i++) {
         const uint8_t* wi = t->w_qs + (size_t)i * n_blocks * 16;
         const float* si = t->w_scales + (size_t)i * n_blocks;
 
-        /* Per-row N-element accumulator (FP32, on stack — N usually small). */
-        /* For very large N callers will need a different design (chunk N). */
-        float acc[256];
-        if (N > 256) { /* shouldn't happen at sane batch sizes */ continue; }
-        memset(acc, 0, sizeof(float) * N);
+#ifdef __ARM_NEON
+        for (int n = 0; n < N; n++) sumv[n] = vdupq_n_f32(0.0f);
 
         for (int b = 0; b < n_blocks; b++) {
-#ifdef __ARM_NEON
-            /* Unpack 16 packed bytes → 32 signed int8 nibbles, range [-8, 7]. */
             uint8x16_t pk = vld1q_u8(wi + b * 16);
             int8x16_t lo = vreinterpretq_s8_u8(vsubq_u8(vandq_u8(pk, mask_0f), v8));
             int8x16_t hi = vreinterpretq_s8_u8(vsubq_u8(vshrq_n_u8(pk, 4), v8));
-            /* The packed layout interleaves (lo,hi) pairs. Use vld2q_s8 on
-             * x_q to deinterleave to the same scheme: x_q[0,2,4,...] vs
-             * x_q[1,3,5,...]. matmul_q4_rows uses this; we match it. */
 
             const float wd = si[b];
             for (int n = 0; n < N; n++) {
@@ -1115,12 +1120,23 @@ static void* bm_q4_worker(void* arg) {
                 a0 = vaddq_s32(a0, vpaddlq_s16(vmull_s8(vget_low_s8(hi), vget_low_s8(xd.val[1]))));
                 a0 = vaddq_s32(a0, vpaddlq_s16(vmull_s8(vget_high_s8(hi), vget_high_s8(xd.val[1]))));
 #endif
-                int32_t s = vaddvq_s32(a0);
                 float xd_n = t->X_d[(size_t)n * n_blocks + b];
-                acc[n] += wd * xd_n * (float)s;
+                /* Match matmul_q4_rows exactly: vmlaq_n_f32 with combined scale.
+                 * vcvtq_f32_s32(a0) on the int32 accumulator, scalar scale =
+                 * wd * xd_n, accumulate into sumv[n]. */
+                sumv[n] = vmlaq_n_f32(sumv[n], vcvtq_f32_s32(a0), wd * xd_n);
             }
+        }
+
+        for (int n = 0; n < N; n++) {
+            t->out[(size_t)n * n_rows + i] = vaddvq_f32(sumv[n]);
+        }
 #else
-            /* Scalar fallback */
+        /* Scalar fallback (x86 / no NEON). */
+        float acc[256];
+        if (N > 256) continue;
+        memset(acc, 0, sizeof(float) * N);
+        for (int b = 0; b < n_blocks; b++) {
             const float wd = si[b];
             int8_t lo[32], hi[32];
             for (int j = 0; j < 16; j++) {
@@ -1134,13 +1150,11 @@ static void* bm_q4_worker(void* arg) {
                 float xd_n = t->X_d[(size_t)n * n_blocks + b];
                 acc[n] += wd * xd_n * (float)s;
             }
-#endif
         }
-
-        /* Scatter accumulator into output row */
         for (int n = 0; n < N; n++) {
             t->out[(size_t)n * n_rows + i] = acc[n];
         }
+#endif
     }
     return NULL;
 }
````