Commit 442c2d7
refactor(batched): vector-accumulator matmul + deeper drift analysis
bm_q4_worker now accumulates in NEON vector registers (float32x4_t
sumv[N], updated with vmlaq_n_f32 and reduced with vaddvq_f32) so that
its FP rounding matches matmul_q4_rows. This brings it architecturally
in line with baseline's per-token quantized matmul.
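Why accumulator shape matters for rounding can be shown with a small portable sketch. It is not the kernel code: dot_scalar and dot_lanes4 are hypothetical names, the four lanes are emulated with a plain array instead of NEON intrinsics so it runs on any target, and the pairwise order of the final reduce is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: one running sum, strictly left to right. */
float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Emulated 4-lane accumulator (hypothetical stand-in for float32x4_t):
 * lane k sums elements k, k+4, k+8, ... and the lanes are added
 * together only once, at the end. */
float dot_lanes4(const float *a, const float *b, size_t n) {
    assert(n % 4 == 0); /* sketch only handles whole vectors */
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            lane[k] += a[i + k] * b[i + k];
        }
    }
    /* Assumed pairwise horizontal reduce. */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

With a = {1e8f, 1.0f, -1e8f, 1.0f} and b set to all ones, single-precision IEEE-754 evaluation gives 1.0f from dot_scalar and 0.0f from dot_lanes4. The inputs are the same and the summation order is the only difference, which is why two kernels must share the same accumulator layout before their outputs can be compared bit for bit.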
However, integration-level drift persists. Even TQ_BATCHED_SERIAL=1
(which forces a bit-for-bit identical per-token matmul via the same
tq_matmul_q4_preq call baseline uses) still produces wrong output.
The bug is therefore NOT in the matmul accumulator but in the
surrounding tq_forward_batch orchestration. The divergence is highly
specific: Layer 3 tok1 (pos=1) diverges at indices 1, 5, 6, 7 but
matches at 0, 2, 3, 4, which points to patterned drift rather than
random noise.
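An index pattern like the one above can be collected with a bitwise comparison of the baseline and batched buffers. The helper below is a minimal sketch: diff_indices is a hypothetical name, not a function from this repository, and it assumes both buffers hold the same number of 32-bit floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compares two float buffers bit for bit. Stores up to max_out
 * mismatching indices in out and returns the total mismatch count.
 * Bitwise comparison treats +0.0f and -0.0f as different and does
 * not trip over NaN, unlike the == operator on floats. */
size_t diff_indices(const float *ref, const float *got, size_t n,
                    size_t *out, size_t max_out) {
    assert(ref != NULL && got != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t r, g;
        memcpy(&r, &ref[i], sizeof r); /* read raw bits without aliasing UB */
        memcpy(&g, &got[i], sizeof g);
        if (r != g) {
            if (count < max_out) {
                out[count] = i;
            }
            count++;
        }
    }
    return count;
}
```

Running this on the same slice for each layer and token makes it easy to see whether the mismatching indices stay fixed, as they would for a stride or offset bug, or move around, as they would for rounding noise.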
Updated handoff doc with concrete next-session experiments:
- Dump Layer 3 tok0 wo-matmul output byte-for-byte
- Dump Layer 3 tok1 attention scores att[0], att[1]
- If scores differ: trace back to K-cache at layer 3 pos=0
- If K-cache differs: trace back to WK matmul output for tok0
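The trace-back in the list above amounts to finding the earliest pipeline stage whose dumped bytes differ from baseline. A minimal sketch follows, assuming the dumps are already in memory; stage_t and first_divergent_stage are hypothetical names, and the caller is assumed to order the stages upstream to downstream (for example WK output, K-cache, attention scores, wo output).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One captured buffer pair: baseline (ref) vs batched (got). */
typedef struct {
    const char *name;  /* label for reporting, e.g. "kcache_l3_pos0" */
    const void *ref;   /* bytes dumped from the baseline path */
    const void *got;   /* bytes dumped from the batched path */
    size_t bytes;      /* size of each dump in bytes */
} stage_t;

/* Returns the index of the first stage whose dumps differ
 * byte for byte, or -1 if every stage matches. */
int first_divergent_stage(const stage_t *stages, size_t n) {
    assert(stages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (memcmp(stages[i].ref, stages[i].got, stages[i].bytes) != 0) {
            return (int)i;
        }
    }
    return -1;
}
```

Walking forward from the most upstream stage reaches the same answer as tracing backward from the wrong output, but it stops at the root divergence in one pass.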
11/11 STRICT tests still pass (the batched path remains gated behind
TQ_BATCH_PREFILL).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent bd063e0 commit 442c2d7
2 files changed, +73 -16 lines changed
File 1: hunk at new lines 27-75, lines 30-72 added (+43)
File 2: hunks at new lines 1081-1111, 1120-1142, 1150-1160 (+30 -16)