Lessons Learned

Mistakes made during configuration and optimization, root causes, and rules to prevent them in the future. This document is a living reference — add new entries as issues are discovered.


1. Assumed GLM-4.7-Flash was a dense model

What happened: GLM-4.7-Flash was configured and documented as a "dense, compact model" throughout models.conf. GPU strategies were designed assuming all parameters are active per token. This led to using FIT=on (which auto-distributes but doesn't understand expert/attention priorities) and --tensor-split (designed for dense models) instead of the correct MoE approach with -ot.

Root cause: The model name and file size (~18 GB for Q4) "felt" small, so it was assumed to be dense. The model card was never checked. It clearly states: "GLM-4.7-Flash is a 30B-A3B MoE model."

Actual architecture (from GGUF metadata):

  • 30B total parameters, 3B active per token (A3B)
  • 47 layers (1 dense lead + 46 MoE)
  • 64 experts per MoE layer, 4 active + 1 shared
  • Expert FFN: 1536, Dense FFN: 10240
  • GQA: 20 attention heads, 1 KV head

Impact: The model worked (it always loads and runs), but GPU placement was suboptimal. With FIT=on, experts were spread across both GPUs unnecessarily. With --split-mode none, it happened to work for Q4 because the full model fits on the 4090, but this was not a reasoned decision.

Prevention rule: Always verify model architecture from the model card (models/documentation/) before making any GPU placement decisions. Check dense vs MoE, expert count, active parameters. Never infer architecture from model name or file size. See docs/gpu-strategy-guide.md for the decision tree.
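
In practice this can be a one-command check before any GPU work. A minimal sketch, assuming the gguf Python package (which ships with llama.cpp under gguf-py/) is installed; the file path is illustrative and exact metadata key names vary by architecture:

```bash
# Dump the GGUF header and look for MoE-related keys.
# 'gguf-dump' is the console script installed by the gguf Python package;
# the model path below is illustrative.
gguf-dump models/GLM-4.7-Flash-Q4_K_M.gguf \
  | grep -Ei 'expert_count|expert_used_count|block_count|head_count'
# A nonzero expert_count means MoE; no expert keys at all suggests a dense model.
```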


2. Applied "exps=CPU" as a universal MoE best practice

What happened: After discovering GLM is MoE, the immediate reaction was "it's MoE, so we need exps=CPU" — the same approach used for GPT-OSS and Qwen3. This was applied without checking whether the model actually exceeds GPU VRAM.

Root cause: Pattern matching from other MoE models (GPT-OSS at 61 GB, Qwen3 at 57-64 GB) where exps=CPU is genuinely necessary. The rule "MoE = exps=CPU" was treated as universal when it's actually conditional.

The correct rule: exps=CPU is a trade-off, not a default. Expert weights in VRAM are accessed at ~1 TB/s (GPU memory bandwidth). On CPU, they require PCIe transfer at ~32 GB/s plus slower CPU compute. Keeping experts on GPU is always faster if they fit. Only offload to CPU when GPU VRAM is insufficient to hold the full model.

Decision logic:

  • Model fits on GPU entirely → keep everything on GPU (fastest)
  • Model doesn't fit → offload experts to CPU, keep attention on GPU
  • The threshold is total VRAM available, not the model architecture

Prevention rule: After verifying architecture, calculate total VRAM needed. Only add exps=CPU when the model exceeds available GPU VRAM. Document the reasoning in models.conf comments.
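
A sketch of that arithmetic and the resulting flag choice, using this repo's numbers (file names are illustrative, and the -ot pattern must be checked against the model's actual tensor names):

```bash
# Total GPU VRAM: 24 GB (4090) + 16 GB (5070 Ti) = 40 GB.
#   GLM Q4  ~18 GB + KV cache + compute buffers -> fits    -> keep everything on GPU
#   GPT-OSS 120B ~61 GB                         -> too big -> offload experts to CPU
# Hypothetical invocation for the "too big" case; the regex targets the
# per-expert FFN tensors (ffn_*_exps) common to MoE GGUFs.
llama-server -m models/gpt-oss-120b.gguf \
  -ngl 99 \
  -ot '\.ffn_.*_exps\.=CPU'
```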


3. Wrote incorrect documentation, then treated it as truth

What happened: models.conf contained comments like "Dense model, FIT=on auto-optimizes" for GLM. Later, when optimizing bench profiles, these comments were referenced as facts instead of being verified.

Root cause: The original comments were written based on assumptions (see #1). Once written, they became "trusted documentation" that was referenced without question. This created a self-reinforcing error: wrong assumption → wrong docs → wrong docs used as source → more wrong decisions.

Impact: Multiple rounds of incorrect optimization: first FIT=on, then --tensor-split, then exps=CPU — each based on the previous wrong assumption.

Prevention rule:

  • When writing documentation about model architecture, always verify against the model card or GGUF metadata first.
  • When referencing existing documentation for decisions, verify the key facts independently — especially architecture claims.
  • Include the source of architecture information in comments (e.g., "MoE per model card" not just "MoE").
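
For example, a sourced comment in models.conf might look like the sketch below (the exact comment conventions of models.conf are an assumption here):

```bash
# GLM-4.7-Flash: MoE, 30B total / 3B active, 47 layers, 64 experts per layer
# Source: model card in models/documentation/, cross-checked against GGUF metadata
```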

4. Estimated model sizes without checking actual file sizes

What happened: GLM Q8 was estimated at "~16-17 GB" based on mental math. The actual file is 30 GB. This led to confidently claiming "Q8 fits on 4090 alone" when it doesn't.

Root cause: The estimate was vague mental math about how Q8 relates to Q4 in size, made without checking the actual GGUF file. Q4_K_M averages ~4.5 bits/param and Q8_0 uses 8 bits/param, so the ratio is 8/4.5 ≈ 1.78x, not 2x. Applied to the 18 GB Q4 file: 18 × 1.78 ≈ 32 GB, close to the actual 30 GB.

Prevention rule: Always check actual file sizes with ls -lh before making VRAM calculations. Never estimate from quantization names alone. File sizes for all models should be documented in models.conf comments or in the model quick reference table in docs/gpu-strategy-guide.md.
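
A minimal sketch of the check, with the bits-per-parameter sanity math as comments (paths illustrative):

```bash
# Read the real file sizes instead of estimating from quantization names.
ls -lh models/*.gguf
# Rough cross-check from bits per parameter:
#   Q4_K_M ~ 4.5 bpw, Q8_0 = 8 bpw  ->  Q8 size ~ Q4 size x (8 / 4.5)
#   18 GB x 1.78 ~ 32 GB, close to the actual 30 GB file.
```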


5. Set -ub equal to -b, wasting VRAM on oversized compute buffers

What happened: Production profiles for GPT-OSS 120B used -b 2048 -ub 2048, and the gpu-optimizer agent recommended -b 1024 -ub 1024 for GLM Q8. The GLM Q8 aggressive profile (33 + 14 = 47/47 layers on GPU) OOMed on CUDA0 by 372 MiB: the compute buffer was 897 MiB with -ub 1024. With the default -ub 512 it would have been ~448 MiB, and the profile would have fit.

Root cause: Treating -b and -ub as a single parameter. They control different things:

  • -b (logical batch) = how many tokens per prompt processing step → speed
  • -ub (micro-batch) = GPU compute chunk size → determines compute buffer VRAM

Setting both to the same value wastes VRAM on a compute buffer sized for the full batch when the GPU only needs a buffer sized for one micro-batch. The default -ub 512 is what both llama.cpp and Ollama use.

Measured impact:

  • -ub 512: ~448 MiB compute buffer (GLM Q8)
  • -ub 1024: ~897 MiB compute buffer (+449 MiB wasted)
  • -ub 2048: ~1,500-2,400 MiB compute buffer (+1,000-2,000 MiB wasted)

Prevention rule: Always use -ub 512 (or omit it to use the default). Only set -b explicitly when you need to control prompt processing speed. The correct production setting is -b 2048 -ub 512 (or just -b 2048). Never write -b X -ub X with the same value. See docs/gpu-strategy-guide.md "Batch size and VRAM" for the full reference.
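
A sketch of the intended production pattern (model path illustrative, all other flags omitted):

```bash
# Large logical batch for prompt-processing speed, default micro-batch so the
# compute buffer stays around the ~448 MiB measured for -ub 512 on GLM Q8.
llama-server -m models/glm-4.7-flash-Q8_0.gguf -b 2048 -ub 512
# Equivalent: drop -ub entirely and let it default to 512.
```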


General prevention rules

  1. Read the model card first. Before any configuration work on a model, read models/documentation/README_modelcard_*.md. If it doesn't exist, download the card from the model's source.

  2. Verify, don't assume. Check GGUF metadata, file sizes, and architecture independently. Don't trust existing comments without verification.

  3. Follow the decision tree. Use docs/gpu-strategy-guide.md step by step. Don't skip steps or take shortcuts based on pattern matching from other models.

  4. Document the reasoning. In models.conf comments, explain WHY a particular strategy was chosen, not just WHAT it is. Include the source of key facts.

  5. Test and measure. After making changes, verify with actual load logs and nvidia-smi (see the sketch after this list). Check that model buffers, KV cache, and compute buffers are on the expected devices.

  6. Test llama.cpp updates with real workloads before committing. A successful build and a short prompt don't prove correctness. Always test with longer prompts (100+ tokens) on all model types and GPU configurations before declaring an update safe. See lesson #6 below.
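
For rule 5, a minimal verification sketch (the log file name and grep patterns are approximations; exact log wording varies between llama.cpp versions):

```bash
# Where is VRAM actually being used, per device?
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
# What did the server report offloading, and how big are the buffers?
grep -E 'offloaded|buffer size' llama-server.log
```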


6. Updated llama.cpp without testing multi-GPU inference with real prompts

What happened: Pulled 93 new commits from llama.cpp upstream, rebuilt, tested with a 20-token prompt via curl — worked fine. Started the benchmark (164 HumanEval problems) and it crashed immediately. Every request with 100+ tokens caused CUDA error: an illegal memory access was encountered on the first generated token. Both Qwen3-Next and Qwen3-Coder-Next were affected.

Root cause: A regression in the new upstream code that breaks qwen3next model inference with -ot multi-GPU tensor splits. Prompt processing succeeds, but the first decode step crashes. Short prompts (<~100 tokens) happen to work, which masked the issue during manual testing.

Resolution: Reverted to previous commit (b48e80f67, b8022, 2026-02-13), rebuilt, verified all models work. Filed upstream: ggml-org/llama.cpp#19816. Similar issue was previously reported as #18580 and fixed by PR #18593.

Prevention rule: When updating llama.cpp, always test with the following (a minimal curl sketch follows the list):

  1. A short prompt (smoke test — does it load?)
  2. A long prompt (100+ tokens — does inference actually work?)
  3. All model architectures, especially MoE models with multi-GPU placement
  4. Only then consider the update safe
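
A minimal sketch of steps 1 and 2, assuming llama-server's built-in /completion endpoint on localhost:8080 (adjust host, port, and payload to the actual deployment):

```bash
# 1) Smoke test: short prompt, does the server respond at all?
curl -s http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 16}'

# 2) Real test: 100+ token prompt, does the first decode step survive?
LONG_PROMPT=$(printf 'Summarize this repeated sentence. %.0s' {1..40})
curl -s http://localhost:8080/completion \
  -d "{\"prompt\": \"${LONG_PROMPT}\", \"n_predict\": 64}"
```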

7. Hardcoded N_GPU_LAYERS=99 prevented FIT from working

What happened: docker-compose.yml had N_GPU_LAYERS defaulting to 99 and Dockerfile set ENV N_GPU_LAYERS=99. When testing --fit on without -ot on Qwen3-Next (53 GB model, 40 GB total GPU VRAM), it OOMed immediately. The conclusion at the time was "FIT without -ot exps=CPU doesn't fit".

During investigation of issue #19816, pwilkin's suggestion to try --fit without -ot was tested again — but with N_GPU_LAYERS=auto. FIT then worked correctly: it placed layers on GPU as long as VRAM allowed, then offloaded experts to CPU. Result: 32.9 t/s at 262K context, 55 graph splits.

Root cause: With N_GPU_LAYERS=99, llama.cpp interpreted the flag as "try to put 99 layers on GPU" — overriding FIT's automatic capacity-aware placement logic. FIT's offloading logic only activates when it is allowed to set the layer count itself (--n-gpu-layers auto). With a hardcoded 99, it tried to load all layers on GPU before FIT could decide to offload, causing the OOM.
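
In infrastructure files the fix is to make auto the default, for example (a sketch; how the variable reaches llama-server depends on this repo's entrypoint):

```bash
# docker-compose.yml / entrypoint default: never bake in a fixed layer count.
# 'auto' lets FIT choose how many layers fit on GPU and offload the rest.
N_GPU_LAYERS="${N_GPU_LAYERS:-auto}"
```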

Impact:

  • All previous FIT tests with MoE models were invalid — they all ran with N_GPU_LAYERS=99
  • The comparison "FIT gives 21.7 t/s, -ot gives 28 t/s" was comparing a broken FIT setup (hardcoded 99 layers) against -ot (manual but correct). The fair comparison is FIT at 32.9 t/s vs -ot at 28 t/s (with the patch).
  • The old -ot approach was replaced in all profiles on 2026-02-23.

Prevention rule:

  • Never hardcode N_GPU_LAYERS to a specific number in infrastructure files (Dockerfile, docker-compose.yml). Use auto as the default.
  • When testing auto-placement features like FIT, verify that no other config variable is overriding the auto behavior.
  • When a feature "doesn't work", check for env var overrides before concluding the feature is broken.

8. Used uniform FIT_TARGET on asymmetric GPU hardware

What happened: After migrating to FIT auto, production speeds were good but not optimal. GLM Q8 had 33 graph splits even with FIT. After testing with a tuned FIT_TARGET=128,1024, speeds improved across all models and GLM Q8 dropped from 33 to 5 graph splits.

Root cause: FIT_TARGET sets the VRAM headroom FIT reserves per device before placing tensors. The default was either unset or a single uniform value. On this hardware:

  • CUDA0 (RTX 4090) is a dedicated compute GPU — nothing else uses its VRAM
  • CUDA1 (RTX 5070 Ti) shares VRAM with the OS/display stack (~3 GB consumed)

A uniform conservative margin wastes headroom on CUDA0 while being appropriate for CUDA1. With FIT_TARGET=128,1024, CUDA0 gets a minimal 128 MiB margin (enough for safety, not more) and CUDA1 gets 1024 MiB to account for display.

Measured impact:

  • GLM Q4: ~140 → ~147 t/s
  • GLM Q8: ~105 t/s / 33 splits → ~112 t/s / 5 splits
  • GPT-OSS 120B: ~21 → ~22 t/s
  • Qwen3-Coder-Next: ~28 → ~33 t/s
  • Qwen3-Next: ~27 → ~33 t/s

Prevention rule: For asymmetric GPU setups where one GPU is dedicated and another shares with display/OS, always set FIT_TARGET as a comma-separated per-device list in docker-compose.yml. Use a small margin for dedicated GPUs (128-256 MiB) and a larger margin for GPUs that share with the display stack (1024+ MiB, depending on the display workload). See docs/gpu-strategy-guide.md for the current hardware configuration.