Summary
Investigate creating or fine-tuning a small (1.7-3.8B) model with a 32K vocabulary optimized for English document QA on quant.cpp. This addresses the fundamental speed/quality tension discovered in our benchmarking.
The Vocab Size Dilemma
The 2025 models we benchmarked or estimated have all moved to large vocabularies, largely for multilingual support:
| Model | Year | Vocab | Our tok/s (M3 Q8) |
|---|---|---|---|
| Phi-3.5-mini | 2024 | 32K | 6.5 |
| SmolLM2-1.7B | 2024 | 49K | 23 |
| Qwen3-4B | 2025 | 152K | ~2 (est) |
| Phi-4-mini | 2025 | 200K | ~1 (est) |
| Gemma-3-4B | 2025 | 262K | ~0.8 (est) |
The industry trend (bigger vocab) is the opposite of what local CPU inference needs (smaller vocab).
Phi-3.5-mini is the last model in this list with a small, English-focused 32K vocabulary, and its benchmarks date from 2024.
Options
Option A: Vocabulary pruning
Take Qwen3-4B (best quality) and prune its 152K vocab to ~32K English-only tokens. Re-train the embedding/lm_head layers.
- Pro: Best underlying model quality
- Con: Requires GPU training, may degrade quality
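The row-selection part of Option A can be illustrated with a minimal sketch. It assumes the embedding matrix is available as a list of per-token rows and that a list of English-only token ids has already been chosen; `prune_vocab`, `keep_ids`, and the toy data are hypothetical names for illustration, not part of quant.cpp or any model's tooling. Rebuilding the tokenizer itself and re-training the pruned layers are not shown.

```python
def prune_vocab(embeddings, keep_ids):
    """Keep only the embedding rows listed in keep_ids.

    Returns the pruned matrix and a mapping from old token id to new token id.
    The same row selection would apply to an lm_head stored as [vocab, hidden].
    """
    keep = sorted(set(keep_ids))
    old_to_new = {old: new for new, old in enumerate(keep)}
    pruned = [embeddings[old] for old in keep]
    return pruned, old_to_new


# Toy example: a 6-token vocabulary with hidden size 2 (illustrative values only).
toy_embeddings = [[float(i), float(i) * 10.0] for i in range(6)]
pruned, old_to_new = prune_vocab(toy_embeddings, keep_ids=[0, 2, 5])
print(len(pruned), old_to_new)
```

Any text that tokenizes to a dropped id would need a fallback (for example, re-tokenizing into kept pieces), which is one reason quality may degrade.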
Option B: Knowledge distillation
Distill Qwen3-4B's knowledge into a Phi-3.5-architecture student with 32K vocab.
- Pro: Purpose-built architecture
- Con: Significant training effort
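For reference, a common form of distillation objective is a temperature-softened KL divergence between teacher and student token distributions. The sketch below is a generic illustration, not a method this issue commits to. It assumes both sets of logits are already expressed over the same vocabulary, which is itself an open problem here because the teacher and student tokenizers differ. The function names and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: both logit vectors index the same vocabulary.
    The T^2 factor is the usual scaling that keeps gradient magnitudes comparable.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(same, diff)
```

A tokenizer-independent alternative is to train the student on text generated by the teacher, which avoids the vocabulary mismatch at the cost of a weaker training signal.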
Option C: Fine-tune Phi-3.5 on document QA
Keep Phi-3.5's 32K vocab but fine-tune on document QA tasks (SQuAD, NaturalQuestions, etc.).
- Pro: No vocabulary changes, just quality improvement
- Con: Limited by Phi-3.5's 2024-era pre-training
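As a rough illustration of the data preparation Option C would need, the sketch below turns a SQuAD-style record (with `context`, `question`, and `answers.text` fields) into a prompt/completion pair. The prompt template, the `"unanswerable"` fallback, and the function name are assumptions made for this example, not a format Phi-3.5 or quant.cpp prescribes.

```python
def squad_to_example(record):
    """Convert one SQuAD-style record into a prompt/completion training pair."""
    answers = record["answers"]["text"]
    # Hypothetical convention: records without an answer get a fixed label.
    answer = answers[0] if answers else "unanswerable"
    prompt = (
        f"Document:\n{record['context']}\n\n"
        f"Question: {record['question']}\nAnswer:"
    )
    return {"prompt": prompt, "completion": " " + answer}


# Toy record, written by hand for illustration.
record = {
    "context": "quant.cpp runs quantized models on CPU.",
    "question": "Where does quant.cpp run models?",
    "answers": {"text": ["on CPU"], "answer_start": [32]},
}
example = squad_to_example(record)
print(example["prompt"] + example["completion"])
```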
Option D: Community model search
Monitor HuggingFace for new models with small vocabularies. Some research groups may release English-focused models.
- Pro: Zero effort
- Con: May never appear (industry trend is opposite)
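The monitoring in Option D could be reduced to a periodic filter over model configs. This sketch assumes each candidate's `config.json` has already been downloaded and parsed into a dict, and relies only on the `vocab_size` field. The 50K threshold and the function name are hypothetical, and the vocabulary sizes are the rounded figures from the table above, not exact values.

```python
def small_vocab_candidates(configs, max_vocab=50_000):
    """Return model names whose vocab_size is at most max_vocab, smallest first."""
    hits = [name for name, cfg in configs.items() if cfg["vocab_size"] <= max_vocab]
    return sorted(hits, key=lambda name: configs[name]["vocab_size"])


# Rounded vocabulary sizes taken from the benchmark table (not exact config values).
configs = {
    "Phi-3.5-mini": {"vocab_size": 32_000},
    "SmolLM2-1.7B": {"vocab_size": 49_000},
    "Qwen3-4B": {"vocab_size": 152_000},
    "Phi-4-mini": {"vocab_size": 200_000},
    "Gemma-3-4B": {"vocab_size": 262_000},
}
candidates = small_vocab_candidates(configs)
print(candidates)
```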
Why This Matters
The speed formula for local inference is approximately:
tok/s ∝ 1 / (vocab_size × params^0.5 × quant_overhead)
A 3.8B model with a 32K vocab is roughly 6x faster than the same model with a 200K vocab (200K / 32K ≈ 6.25 under this formula, with parameter count and quantization held fixed). This is not an optimization; it is a fundamental architectural advantage for the English-only use case.
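The proportionality can be evaluated directly as a sanity check. The sketch below takes the formula at face value and uses the nominal 32K and 200K sizes. `relative_speed` and `speedup` are hypothetical helper names, and `quant_overhead` defaults to 1.0 as an assumption because it cancels when both models use the same quantization. The result is a ratio, not a measured throughput, and the exact figure depends on the true vocabulary sizes and on effects the formula does not capture.

```python
import math


def relative_speed(vocab_size, params_b, quant_overhead=1.0):
    """Unnormalized speed score: 1 / (vocab_size * sqrt(params) * quant_overhead)."""
    return 1.0 / (vocab_size * math.sqrt(params_b) * quant_overhead)


def speedup(vocab_small, vocab_large, params_b=3.8):
    """Ratio of speed scores for two vocab sizes at the same parameter count."""
    return relative_speed(vocab_small, params_b) / relative_speed(vocab_large, params_b)


ratio = speedup(32_000, 200_000)
print(f"{ratio:.2f}x")
```

Because parameter count and quantization overhead are held fixed, the ratio reduces to the ratio of vocabulary sizes.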
Priority: P3
Long-term research direction. Immediate impact comes from #83 (KV cache) and #84 (coherence API).