| layout | title | nav_order | has_children | parent |
|---|---|---|---|---|
| default | llama.cpp Tutorial - Chapter 5: GPU Acceleration | 5 | false | llama.cpp Tutorial |
Welcome to Chapter 5: GPU Acceleration. In this part of llama.cpp Tutorial: Local LLM Inference, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Enable GPU acceleration with CUDA, Metal, and ROCm for dramatically faster inference.
GPU acceleration can provide 5-10x speed improvements over CPU-only inference. llama.cpp supports multiple GPU platforms: NVIDIA CUDA, Apple Metal, and AMD ROCm.
**NVIDIA CUDA requirements:**
- GPU: Pascal architecture or newer (GTX 10 series, RTX series, Tesla, etc.)
- VRAM: at least 4 GB for 7B models, 8 GB+ recommended
- Driver: recent NVIDIA drivers
- CUDA Toolkit: 11.0 or higher

**Apple Metal requirements:**
- Hardware: Apple Silicon (M1, M2, M3 chips)
- macOS: 12.3 or higher
- Xcode: Command Line Tools installed

**AMD ROCm requirements:**
- GPU: RDNA architecture GPUs (RX 5000/6000/7000 series)
- Linux: Ubuntu 20.04+ or RHEL/CentOS
- ROCm: version 5.0 or higher
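Before building, it helps to confirm which toolchains your machine can actually see. The sketch below is a best-effort, illustrative check (the function name `detect_gpu_backends` is our own, not part of llama.cpp): it only looks for the standard vendor CLI tools on `PATH`, so a positive result does not guarantee a working install.

```python
import platform
import shutil

def detect_gpu_backends():
    """Best-effort check for which GPU toolchains are visible on this machine.

    Finding a CLI tool on PATH does not guarantee a working install, but it
    is a quick first signal before attempting a build.
    """
    return {
        # NVIDIA CUDA: driver tool plus the CUDA compiler
        "cuda": shutil.which("nvidia-smi") is not None and shutil.which("nvcc") is not None,
        # AMD ROCm: system management / device info tools
        "rocm": shutil.which("rocm-smi") is not None or shutil.which("rocminfo") is not None,
        # Apple Metal: available on Apple Silicon Macs
        "metal": platform.system() == "Darwin" and platform.machine() == "arm64",
    }

if __name__ == "__main__":
    for backend, found in detect_gpu_backends().items():
        print(f"{backend}: {'available' if found else 'not detected'}")
```

If none of the backends are detected, the builds below will still succeed but fall back to CPU-only inference.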
```bash
# Install CUDA toolkit (Ubuntu)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run
```

```bash
# Build llama.cpp with CUDA
cd llama.cpp
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON
make -j$(nproc)
```

```bash
# Build with Metal support (automatically enabled on Apple Silicon)
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Copy the Metal shader next to the binary
cp build/bin/ggml-metal.metal .
```

```bash
# Install ROCm (Ubuntu)
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/5.4/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dev
```

```bash
# Build with ROCm
cd llama.cpp
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030,gfx1031,gfx1032
make -j$(nproc)
```

```bash
# Basic GPU inference
# --gpu-layers: number of layers to offload to the GPU
# --main-gpu:   primary GPU device index
./llama-cli -m model.gguf --gpu-layers 35 --main-gpu 0

# Multi-GPU setup: split tensors evenly across two GPUs
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --main-gpu 0 \
  --tensor-split 0.5,0.5
```

```bash
# Metal acceleration (automatic on Apple Silicon)
./llama-cli -m model.gguf --gpu-layers 35

# Check Metal usage in the startup log
./llama-cli -m model.gguf --prompt "Hello" --gpu-layers 35 --verbose-prompt
```

```bash
# ROCm acceleration
./llama-cli -m model.gguf --gpu-layers 35 --main-gpu 0
```

```python
def calculate_optimal_gpu_layers(model_size_gb, vram_gb, safety_margin=0.8):
    """
    Calculate the optimal number of layers to offload.

    model_size_gb: model size in GB
    vram_gb: available VRAM in GB
    safety_margin: fraction of VRAM to use, leaving room for the KV cache and overhead
    """
    available_vram = vram_gb * safety_margin
    # Rough estimate: each layer needs about model_size / num_layers GB.
    # For 7B models (32 layers), each layer is ~0.2-0.3 GB.
    # For 13B models (40 layers), each layer is ~0.3-0.4 GB.
    if model_size_gb <= 4:  # 7B models
        layers_per_gb = 32 / model_size_gb
        optimal_layers = int(available_vram * layers_per_gb)
    elif model_size_gb <= 8:  # 13B models
        layers_per_gb = 40 / model_size_gb
        optimal_layers = int(available_vram * layers_per_gb)
    else:  # larger models: conservative estimate
        optimal_layers = int(available_vram * 2.5)
    return max(0, min(optimal_layers, 60))  # cap at a reasonable maximum

# Usage
model_size = 4.5  # GB for a 7B Q4_K model
vram = 8          # GB of VRAM
gpu_layers = calculate_optimal_gpu_layers(model_size, vram)
print(f"Optimal GPU layers: {gpu_layers}")
```

```bash
# Conservative (good for most cases)
./llama-cli -m model.gguf \
  --gpu-layers 28 \
  --batch-size 512 \
  --ubatch-size 512

# Aggressive (for high-end GPUs)
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --flash-attn

# Multi-GPU (NVIDIA): 70% of the work on GPU 0, 30% on GPU 1
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --tensor-split 0.7,0.3

# Memory efficient: smaller context and fewer offloaded layers save VRAM
./llama-cli -m model.gguf \
  --gpu-layers 20 \
  --ctx-size 2048 \
  --memory-f32
```

```bash
# GPU-accelerated server
./llama-server -m model.gguf \
  --gpu-layers 35 \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 8 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --flash-attn \
  --ctx-size 4096
```

```bash
#!/bin/bash
# gpu_benchmark.sh
model="$1"
gpu_layers="$2"

echo "Benchmarking $model with $gpu_layers GPU layers"

# CPU baseline
echo "CPU baseline:"
./llama-bench -m "$model" -p 10 -n 128 -t $(nproc) 2>/dev/null | grep "t/s"

# GPU test
echo "GPU ($gpu_layers layers):"
./llama-bench -m "$model" -p 10 -n 128 -t 1 -ngl "$gpu_layers" 2>/dev/null | grep "t/s"

# Memory usage
echo "Memory usage:"
./llama-cli -m "$model" -ngl "$gpu_layers" --prompt "Hello" --n-predict 1 2>&1 | grep "VRAM\|RAM"
```

Typical performance improvements:
| Configuration | Tokens/sec | Speedup |
|---|---|---|
| CPU only (i7-12700K) | 25-35 | 1x |
| CUDA RTX 3060 | 60-80 | 2.5x |
| CUDA RTX 4070 | 120-150 | 5x |
| CUDA RTX 4090 | 180-220 | 7x |
| Apple M2 Metal | 40-60 | 2x |
| Apple M3 Metal | 80-100 | 3.5x |
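These speedups assume the offloaded layers fit comfortably in VRAM. When budgeting `--gpu-layers`, remember the KV cache also consumes GPU memory for offloaded layers. The sketch below is a back-of-the-envelope formula under stated assumptions (K and V tensors per layer, f16 storage, model shape parameters you supply); real allocations vary by backend and build.

```python
def kv_cache_bytes(ctx_size, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, one vector of
    head_dim values per position per KV head, stored at f16 (2 bytes each)."""
    return 2 * n_layers * ctx_size * n_kv_heads * head_dim * bytes_per_value

# Example: a Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128)
gib = kv_cache_bytes(ctx_size=4096, n_layers=32, n_kv_heads=32, head_dim=128) / 1024**3
print(f"KV cache at 4096 context: {gib:.1f} GiB")  # -> 2.0 GiB
```

Halving the context halves this figure, which is why `--ctx-size 2048` is one of the first knobs to try when VRAM is tight.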
```bash
# Monitor VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Memory-efficient settings:
#   --gpu-layers 25   leave VRAM for the KV cache
#   --ctx-size 2048   smaller context saves VRAM
#   --memory-f32      use f32 for the KV cache
#   --no-mmap         avoid memory-mapping overhead
./llama-cli -m model.gguf \
  --gpu-layers 25 \
  --ctx-size 2048 \
  --memory-f32 \
  --no-mmap

# Large context optimization
./llama-cli -m model.gguf \
  --gpu-layers 20 \
  --ctx-size 8192 \
  --rope-scaling yarn \
  --rope-scale 2.0
```

```bash
# Load balancing: 60/40 split across two GPUs
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --tensor-split 0.6,0.4

# Dedicated GPU: use GPU 1 as primary and give it all the work
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --main-gpu 1 \
  --tensor-split 0,1

# PCIe optimization: only expose specific GPUs
export CUDA_VISIBLE_DEVICES=0,1
./llama-cli -m model.gguf --gpu-layers 35 --tensor-split 0.5,0.5
```

"CUDA out of memory":

```bash
# Reduce GPU layers
./llama-cli -m model.gguf --gpu-layers 20

# Use a smaller context
./llama-cli -m model.gguf --ctx-size 2048

# Check VRAM usage
nvidia-smi
```

"CUDA error":

```bash
# Update NVIDIA drivers
sudo apt update && sudo apt install nvidia-driver-XXX

# Check the CUDA installation
nvcc --version
nvidia-smi
```

Slow CUDA performance:

```bash
# Use flash attention
./llama-cli -m model.gguf --gpu-layers 35 --flash-attn
```

"Metal not available":

```bash
# Check the macOS version
sw_vers

# Ensure Xcode Command Line Tools are installed
xcode-select --install

# Check Metal support
system_profiler SPDisplaysDataType | grep Metal
```

Poor Metal performance:

```bash
# Limit GPU layers
./llama-cli -m model.gguf --gpu-layers 25

# Check Activity Monitor for GPU usage
```

ROCm installation problems:

```bash
# Add your user to the video group
sudo usermod -a -G video $USER

# Check the ROCm installation
/opt/rocm/bin/rocminfo

# Set environment variables
export ROCM_PATH=/opt/rocm
export HIP_VISIBLE_DEVICES=0
```

CUDA-specific tuning:

```bash
# Multi-stream processing: serve 4 requests in parallel
# with continuous batching
./llama-server -m model.gguf \
  --gpu-layers 35 \
  --parallel 4 \
  --cont-batching
```

Metal-specific tuning:

```bash
# Apple Silicon uses unified memory; --mlock keeps the model resident
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --ctx-size 4096 \
  --mlock
```

ROCm-specific tuning:

```bash
# Select the GPU device
export HIP_VISIBLE_DEVICES=0

# ROCm-specific run settings
./llama-cli -m model.gguf \
  --gpu-layers 35 \
  --main-gpu 0 \
  --threads 8
```

```bash
# Real-time GPU stats
nvidia-smi -l 1

# GPU utilization during inference
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv

# Power consumption
nvidia-smi --query-gpu=power.draw,power.limit --format=csv
```

```bash
# ROCm system management
rocm-smi

# GPU usage
rocm-smi --showuse

# Temperature and power
rocm-smi --showtemp --showpower
```

```bash
# macOS: use Activity Monitor (GUI), or from the command line:
sudo powermetrics --samplers gpu_power -n 1
```

```python
def gpu_vs_cpu_comparison(model_size_gb, vram_gb, electricity_cost_per_kwh=0.12):
    """
    Compare GPU vs CPU costs for inference.
    """
    # Assumptions
    cpu_tokens_per_second = 25
    gpu_tokens_per_second = 120  # RTX 4070 example
    gpu_power_watts = 200        # GPU power consumption
    cpu_power_watts = 125        # CPU power consumption

    # Hourly electricity costs
    gpu_hourly_power_cost = (gpu_power_watts / 1000) * electricity_cost_per_kwh
    cpu_hourly_power_cost = (cpu_power_watts / 1000) * electricity_cost_per_kwh

    # Performance comparison
    speedup = gpu_tokens_per_second / cpu_tokens_per_second
    gpu_efficiency = speedup / (gpu_power_watts / cpu_power_watts)

    print(f"GPU speedup: {speedup:.1f}x")
    print(f"GPU power cost per hour: ${gpu_hourly_power_cost:.3f}")
    print(f"CPU power cost per hour: ${cpu_hourly_power_cost:.3f}")
    print(f"GPU efficiency: {gpu_efficiency:.1f}x tokens per watt")

    # Break-even analysis
    gpu_initial_cost = 600       # example GPU cost
    gpu_daily_usage_hours = 8
    gpu_daily_power_cost = gpu_hourly_power_cost * gpu_daily_usage_hours

    # Days to break even on power savings alone
    daily_savings = cpu_hourly_power_cost * gpu_daily_usage_hours - gpu_daily_power_cost
    if daily_savings > 0:
        breakeven_days = gpu_initial_cost / daily_savings
        print(f"Break-even period: {breakeven_days:.0f} days")
    else:
        print("The GPU draws more power; focus on the performance benefits instead")

gpu_vs_cpu_comparison(4.5, 12)
```

- Right-size GPU layers: Balance VRAM usage with performance
- Monitor temperatures: GPUs can throttle when hot
- Use appropriate precision: F16 for speed, F32 for accuracy
- Batch efficiently: Larger batches improve GPU utilization
- Update drivers: Keep GPU drivers and CUDA/ROCm current
- Test thoroughly: Validate outputs match CPU results
- Power management: Use appropriate power limits for stability
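As a small aid for the `--tensor-split` examples above, the following hypothetical helper (the name `tensor_split_arg` is our own) turns per-GPU free VRAM into a proportional split string; llama.cpp treats the comma-separated values as relative weights, so normalizing is just a convenience.

```python
def tensor_split_arg(vram_free_gb):
    """Build a --tensor-split value proportional to each GPU's free VRAM."""
    total = sum(vram_free_gb)
    if total <= 0:
        raise ValueError("no free VRAM reported")
    return ",".join(f"{v / total:.2f}" for v in vram_free_gb)

# Example: 12 GB free on GPU 0, 8 GB free on GPU 1
print(tensor_split_arg([12, 8]))  # -> 0.60,0.40
```

Feed the result straight to the CLI, e.g. `--tensor-split $(python split.py)` in a wrapper script.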
GPU acceleration dramatically improves inference speed and makes larger models practical. Choose the platform that matches your hardware and tune layer offloading for maximum performance.
Most teams struggle here not because they need more code, but because they have not drawn clear boundaries — how many layers to offload, which GPU owns which tensors, and how much VRAM to reserve for the KV cache — so that behavior stays predictable as workloads grow.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without a clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 5: GPU Acceleration as an operating subsystem inside llama.cpp Tutorial: Local LLM Inference, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes on GGUF model sizing, layer offloading, and the CUDA/Metal/ROCm builds as your checklist when adapting these patterns to your own setup.
Under the hood, Chapter 5: GPU Acceleration usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for layer offloading.
- Input normalization: shape incoming data so the model receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the model.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit the logs and metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- The llama.cpp repository (github.com): the authoritative reference for build flags, runtime options, and backend support.
- Awesome Code Docs (github.com): the documentation collection this tutorial belongs to.
Suggested trace strategy:
- search the upstream code for `gpu-layers` and `tensor-split` handling to map concrete implementation paths
- compare doc claims against the actual runtime/config code before reusing patterns in production