Tested 2026-02-15. All profiles use `CTX_SIZE=10240`, `-b 512 -ub 512` (except GLM Q4, which uses the default batch size), and `--cache-type-k q8_0 --cache-type-v q8_0`.
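A minimal sketch of these shared flags as a `llama-server` invocation, assuming the profiles wrap `llama-server`; the model path is a placeholder:

```bash
# Common bench flags from the note above. The GLM Q4 profile omits
# -b/-ub and keeps llama.cpp's default batch size.
llama-server \
  -m /models/model.gguf \
  -c 10240 \
  -b 512 -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```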
Note: UD-Q6_K_XL was removed from `models.conf` on 2026-02-16 after benchmarks showed UD-Q5_K_XL is both faster (30 vs 24 t/s) and higher scoring (93.9% vs 92.1% HumanEval). UD-Q6 data below is historical.
GPU placement note (2026-02-23): All results below were measured with the old `-ot` explicit layer assignment approach (`FIT=off`, `N_GPU_LAYERS=99`); a sketch of that style follows the hardware list. All bench profiles were converted to `--fit` with `--n-gpu-layers auto` on 2026-02-23, and `FIT_TARGET=128,1024` was set as the default in `docker-compose.yml` to tune per-device VRAM headroom for this asymmetric GPU setup (CUDA0 dedicated, CUDA1 shares with display). Speeds and VRAM usage will differ under FIT auto with a tuned `FIT_TARGET`. A re-benchmark is needed to update these figures — see `benchmarks/evalplus/results/REPORT.md` for current scores (benchmark scores are unaffected by GPU placement changes).
- CUDA0: RTX 4090 (24 GB) — nothing else running
- CUDA1: RTX 5070 Ti (16 GB, ~12.5 GB usable) — runs display/OS
- CPU: 64 GB DDR4
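For reference, a sketch of the retired explicit-placement style for the Qwen3 UD-Q5 19+9=28/48 split below; the regexes are illustrative, not copied from the old profiles:

```bash
# FIT=off, N_GPU_LAYERS=99: offload everything, then pin blocks by hand.
# Blocks 0-18 -> CUDA0 (4090), 19-27 -> CUDA1 (5070 Ti), 28-47 -> CPU.
llama-server \
  -m /models/model.gguf \
  -ngl 99 \
  -ot 'blk\.(\d|1[0-8])\.=CUDA0' \
  -ot 'blk\.(19|2[0-7])\.=CUDA1' \
  -ot 'blk\.(2[89]|3\d|4[0-7])\.=CPU'
```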
| Model | Quant | Strategy | Split | Speed | CUDA0 | CUDA1 | Graph splits (bs=1) |
|---|---|---|---|---|---|---|---|
| GLM-4.7-Flash | Q4_K_M | A (4090 only) | 47/47 | ~140 t/s | 74% | — | 2 |
| GLM-4.7-Flash | Q8_0 | C (both GPUs) | 35+12=47/47 | ~105 t/s | 93% | 65% | 33 |
| GPT-OSS 120B | F16 | D (GPU+CPU) | 13+5=18/36 | ~22 t/s | 96% | 83% | 68 |
| Qwen3-Coder-Next | UD-Q5_K_XL | D (GPU+CPU) | 19+9=28/48 | ~30 t/s | 93% | 89% | 136 |
| Qwen3-Coder-Next | UD-Q6_K_XL | D (GPU+CPU) | 16+8=24/48 | ~24 t/s | 93% | 96% | 132 |
| Model | Attempted split | Result | Reason |
|---|---|---|---|
| GLM Q8 | 37+10 | ~102 t/s (slower) | Graph splits jumped 33→53, overhead outweighed faster GPU |
| GPT-OSS 120B | 13+6=19/36 | OOM on CUDA1 | 15.4/16.3 GB after load, no room for runtime allocations |
| Qwen3 UD-Q6 | 17+8=25/48 | OOM (load loop) | Layers too large (~1.33 GB each) for both GPUs at +1 |
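For scale, the back-of-envelope arithmetic behind the UD-Q6 failure row, using the ~1.33 GB/layer figure from the table; the size of the runtime reserve is a rough assumption:

```bash
# Layer size comes from the table above; usable VRAM on CUDA1 from the
# hardware notes. KV cache + compute buffers need the remainder.
layer_gb=1.33
usable_gb=12.5
layers=8                                      # what actually fit on CUDA1
weights=$(echo "$layers * $layer_gb" | bc)    # 10.64 GB of weights
free=$(echo "$usable_gb - $weights" | bc)     # ~1.9 GB left for buffers
echo "free after $layers layers: $free GB"
# A 9th layer would leave ~0.5 GB, below what runtime allocations need.
```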
- Strategy A is king. When a model fits on one GPU: no splits, no transfers, maximum speed. GLM Q4 at 140 t/s vs Q8 at 105 t/s is partly quant size but also 2 graph splits vs 33.
- Graph splits matter more for GPU↔GPU (Strategy C) than GPU↔CPU (Strategy D). GLM Q8 lost speed with more layers on the faster GPU because of the extra splits. Qwen3 Q5 gained speed with more GPU layers despite more splits — because avoiding the CPU is a bigger win than avoiding GPU↔GPU splits.
- +1 layer per GPU is the sweet spot when headroom allows. Going from 18+8 to 19+9 on Qwen3 Q5 improved speed (29→30 t/s). Going further risks OOM for marginal gains.
- CUDA1 (5070 Ti) is the limiting factor. OOM always happened on CUDA1 first due to display/OS overhead eating ~3.5 GB. The 4090 consistently had more room.
- Rule of thumb: if there's room for 2+ layers of headroom, add 1 per GPU. If only 1 layer of headroom, leave it — the gain is minimal and the extra layer risks OOM (a sketch of this check follows the list).
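A hypothetical helper encoding that rule; the threshold of two layers' worth of free VRAM is just the heuristic above, nothing measured:

```bash
# Add one layer per GPU only if free VRAM covers at least two more
# layers; otherwise leave the split alone. Inputs are in GB.
should_add_layer() {
  local free_gb=$1 layer_gb=$2
  if [ "$(echo "$free_gb >= 2 * $layer_gb" | bc)" -eq 1 ]; then
    echo "add 1 layer"
  else
    echo "leave as is"
  fi
}

should_add_layer 3.0 1.33   # -> add 1 layer  (2+ layers of headroom)
should_add_layer 1.5 1.33   # -> leave as is  (~1 layer of headroom)
```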
These numbers help estimate how to optimize the production profiles. Production uses larger context (64K-256K) and larger batch sizes (2048-4096), consuming more VRAM for KV cache and compute buffers, so production splits carry fewer GPU layers than their bench counterparts unless the model fits entirely on GPU (a KV-cache sizing sketch follows the table).
| Model | Bench split | Production split | Difference |
|---|---|---|---|
| GLM Q4 | 47/47 (all GPU) | all GPU | Same — fits either way |
| GLM Q8 | 35+12=47/47 | TBD (review needed) | — |
| GPT-OSS 120B | 13+5=18/36 | 12+4=16/36 | +2 layers at bench |
| Qwen3 UD-Q5 | 19+9=28/48 | 15+7=22/48 | +6 layers at bench |
| Qwen3 UD-Q6 | 16+8=24/48 | 13+6=19/48 | +5 layers at bench |
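To see why large contexts cost GPU layers, a rough KV-cache estimate; the head count and head dim below are placeholders, not the real model configs:

```bash
# Rough KV-cache size vs context length, to show where the VRAM goes.
# q8_0 stores ~1.0625 bytes per element (34 bytes per 32-element block).
n_layers=48; n_kv_heads=8; head_dim=128
bytes_per_elem=1.0625
for n_ctx in 10240 65536 262144; do
  gb=$(echo "2 * $n_layers * $n_ctx * $n_kv_heads * $head_dim \
       * $bytes_per_elem / 1024^3" | bc -l)
  printf 'ctx=%-6s -> KV cache ~ %.1f GB\n' "$n_ctx" "$gb"
done
```

Under these placeholder dimensions, the bench context (10K) costs about 1 GB of KV cache, while 256K costs tens of GB — the VRAM that production splits give back in layers.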