Research: 32K-vocab English-optimized small model for quant.cpp #92

## Summary

Investigate creating or fine-tuning a small (1.7-3.8B) model with a 32K vocabulary optimized for English document QA on quant.cpp. This addresses the fundamental speed/quality tension discovered in our benchmarking.

## The Vocab Size Dilemma

Small models released in 2025-2026 have moved to large vocabularies, mainly for multilingual support:

| Model | Year | Vocab | Our tok/s (M3 Q8) |
|-------|------|------:|-------------------:|
| Phi-3.5-mini | 2024 | 32K | 6.5 |
| SmolLM2-1.7B | 2024 | 49K | 23 |
| Qwen3-4B | 2025 | 152K | ~2 (est) |
| Phi-4-mini | 2025 | 200K | ~1 (est) |
| Gemma-3-4B | 2025 | 262K | ~0.8 (est) |

**The industry trend (bigger vocab) is the opposite of what local CPU inference needs (smaller vocab).**

Phi-3.5-mini is the last model we know of with a small (32K), English-focused vocabulary. Its benchmarks date from 2024 and are now outdated.

## Options

### Option A: Vocabulary pruning
Take Qwen3-4B (best quality) and prune its 152K vocab to ~32K English-only tokens. Re-train the embedding/lm_head layers.
- Pro: Best underlying model quality
- Con: Requires GPU training, may degrade quality
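
A minimal sketch of the pruning step, assuming the keep set is chosen by token frequency on an English corpus (that selection rule, and the helper names `select_keep_ids`, `prune_rows`, and `remap`, are illustrative assumptions, not anything quant.cpp or Qwen3 provides). It uses plain Python lists in place of real weight tensors; the same row slicing would apply to both the embedding matrix and `lm_head`.

```python
from collections import Counter


def select_keep_ids(corpus_token_ids, target_size, always_keep=()):
    """Pick the most frequent token ids in a corpus, plus ids that must survive."""
    keep = list(dict.fromkeys(always_keep))  # e.g. special tokens, order-preserving dedupe
    for tok, _count in Counter(corpus_token_ids).most_common():
        if len(keep) >= target_size:
            break
        if tok not in keep:
            keep.append(tok)
    return sorted(keep)


def prune_rows(matrix, keep_ids):
    """Slice a [vocab, hidden] matrix down to keep_ids and return the id remap."""
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    pruned = [matrix[old] for old in keep_ids]
    return pruned, old_to_new


def remap(token_ids, old_to_new, unk_id):
    """Translate old token ids to pruned ids; dropped tokens fall back to unk_id."""
    return [old_to_new.get(t, unk_id) for t in token_ids]


# Toy example: 8-token vocab, hidden size 2, pruned to 3 tokens.
embedding = [[float(i), float(i) * 2] for i in range(8)]
keep_ids = select_keep_ids([5, 5, 5, 2, 2, 7, 3], target_size=3, always_keep=(0,))
pruned, old_to_new = prune_rows(embedding, keep_ids)
print(keep_ids)                            # [0, 2, 5]
print(remap([5, 2, 7, 0], old_to_new, 0))  # [2, 1, 0, 0]
```

Slicing rows is the cheap part. A real pruning pass would also need to rebuild the tokenizer so that text no longer produces the dropped tokens, which is why the embedding and `lm_head` layers would still need re-training afterwards.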

### Option B: Knowledge distillation
Distill Qwen3-4B's knowledge into a Phi-3.5-architecture student with 32K vocab.
- Pro: Purpose-built architecture
- Con: Significant training effort
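
For reference, the usual distillation objective is a temperature-softened KL divergence between teacher and student token distributions. The sketch below computes it for a single position in pure Python. It assumes the teacher's logits have already been restricted to the student's 32K vocabulary so both vectors have the same length; how to do that alignment across different tokenizers is an open question for this option and is not addressed here. The function names and the default temperature are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) for one token position."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student logits must share one vocabulary")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(distillation_loss(teacher, student))  # small positive value
print(distillation_loss(teacher, teacher))  # 0.0
```

In training this term would typically be averaged over positions and mixed with the ordinary next-token cross-entropy loss.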

### Option C: Fine-tune Phi-3.5 on document QA
Keep Phi-3.5's 32K vocab but fine-tune on document QA tasks (SQuAD, NaturalQuestions, etc.).
- Pro: No vocabulary changes, just quality improvement
- Con: Limited by Phi-3.5's 2024-era pre-training
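
As a rough illustration of the data preparation this option implies, the sketch below turns one SQuAD-style record into a prompt/completion pair. The record layout (`context`, `question`, `answers.text`) follows the Hugging Face `squad` dataset; the prompt template, the `squad_to_example` name, and the `"unanswerable"` fallback are assumptions made for the example, not a format Phi-3.5 requires.

```python
def squad_to_example(record):
    """Convert one SQuAD-style record into a prompt/completion training pair."""
    answers = record["answers"]["text"]
    # SQuAD v2-style records can have no answer; the fallback string is a placeholder.
    answer = answers[0] if answers else "unanswerable"
    prompt = (
        "Answer the question using only the document.\n\n"
        f"Document:\n{record['context']}\n\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    return {"prompt": prompt, "completion": " " + answer}


record = {
    "context": "Paris is the capital and most populous city of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
example = squad_to_example(record)
print(example["prompt"])
print(example["completion"])
```

In a real run the prompt would be wrapped in the model's own chat template before tokenization.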

### Option D: Community model search
Monitor HuggingFace for new models with small vocabularies. Some research groups may release English-focused models.
- Pro: Zero effort
- Con: May never appear (industry trend is opposite)

## Why This Matters

The speed formula for local inference is approximately:
```
tok/s ∝ 1 / (vocab_size × params^0.5 × quant_overhead)
```

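Under this approximation, with parameter count and quantization held fixed, a 3.8B model with a 32K vocab is about **6.25x faster** than the same model with a 200K vocab (200K / 32K). That is consistent with the gap between Phi-3.5-mini (6.5 tok/s measured) and Phi-4-mini (~1 tok/s estimated) in the table above. This is not an optimization; it is a fundamental architectural advantage for the English-only use case.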
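
The proportionality can be evaluated directly. The sketch below is only a transcription of the approximate formula, not a benchmark. `relative_speed` and `speedup` are hypothetical helper names, and `quant_overhead` defaults to 1.0 because the issue gives no value for it. With parameter count and quantization held fixed, the ratio reduces to the vocabulary ratio.

```python
import math


def relative_speed(vocab_size, params_b, quant_overhead=1.0):
    """Unnormalized speed score: 1 / (vocab_size * sqrt(params) * quant_overhead)."""
    return 1.0 / (vocab_size * math.sqrt(params_b) * quant_overhead)


def speedup(vocab_small, vocab_large, params_b=3.8, quant_overhead=1.0):
    """How many times faster the small-vocab model is, all else being equal."""
    fast = relative_speed(vocab_small, params_b, quant_overhead)
    slow = relative_speed(vocab_large, params_b, quant_overhead)
    return fast / slow


print(round(speedup(32_000, 200_000), 2))  # 6.25
```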

## Priority: P3

Long-term research direction. Immediate impact comes from #83 (KV cache) and #84 (coherence API).
