| layout | default |
|---|---|
| title | llama.cpp Tutorial - Chapter 2: Model Formats |
| nav_order | 2 |
| has_children | false |
| parent | llama.cpp Tutorial |
Welcome to Chapter 2: Model Formats and GGUF. In this part of llama.cpp Tutorial: Local LLM Inference, you will build an intuitive mental model first, then move into concrete implementation details and practical tradeoffs.
This chapter explains what GGUF is, how the available quantization levels compare, and how to choose a model that fits your hardware constraints.
GGUF (GPT-Generated Unified Format) is the successor to the legacy GGML format and is designed specifically for llama.cpp. A GGUF file contains:
- Model weights in optimized format
- Metadata about architecture, vocabulary, and quantization
- Tokenizer information for text encoding/decoding
- Hyperparameters such as context length and embedding size
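To make the container concrete, the sketch below reads just the fixed-size GGUF header (magic, version, tensor count, and metadata key/value count) with the Python standard library; `model.gguf` is a placeholder path, and the layout shown is the GGUF v2+ header:

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header (version 2+ layout)."""
    with open(path, "rb") as f:
        # magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
        magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", f.read(24))
    if magic != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

print(read_gguf_header("model.gguf"))  # placeholder path
```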
| Format | Pros | Cons | Use Case |
|---|---|---|---|
| GGUF | Optimized for llama.cpp, fast loading | llama.cpp only | Production inference |
| GGML | Predecessor of GGUF | Deprecated; not loadable by current llama.cpp | Migration only |
| PyTorch | Flexible, easy conversion | Large files, slow inference | Development/training |
| SafeTensors | Safe loading, multi-platform | Requires conversion | Model storage |
Quantization reduces the numeric precision of the weights to save memory and improve speed:
| Type | Bits (approx.) | Memory (vs F16) | Speed | Quality |
|---|---|---|---|---|
| F32 | 32 | 200% | Slowest | Reference |
| F16 | 16 | 100% | Slow | Excellent |
| Q8_0 | 8 | ~50% | Moderate | Near-lossless |
| Q6_K | 6 | ~37.5% | Fast | Very good |
| Q5_K | 5 | ~31.25% | Fast | Good |
| Q4_K | 4 | ~25% | Faster | Good |
| Q3_K | 3 | ~18.75% | Very fast | Acceptable |
| Q2_K | 2 | ~12.5% | Fastest | Basic |
K-quantization ("K-quants") uses block-wise quantization with per-block scales (sketched below), so the most sensitive weights keep more precision:
- Q4_K: balances quality and speed; the recommended default for most use
- Q5_K: better quality, slightly larger and slower than Q4_K
- Q6_K: high quality, useful when output fidelity matters (e.g. creative tasks)
- Q8_0: not a K-quant, but near-lossless quality with minimal compression
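To build intuition, here is a deliberately simplified sketch of block-wise 4-bit quantization with one scale per block. It is not llama.cpp's actual Q4_K kernel (real K-quants add super-blocks, per-block minimums, and bit-packing); it only illustrates the scale-and-round idea:

```python
import numpy as np

def quantize_block_4bit(block):
    """Toy symmetric 4-bit quantization of one block: one float scale + int4 values."""
    max_abs = float(np.abs(block).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0   # map values into the int4 range
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block_4bit(q, scale):
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)      # weights are quantized in small blocks
q, scale = quantize_block_4bit(block)
error = np.abs(block - dequantize_block_4bit(q, scale)).max()
print(f"max reconstruction error: {error:.4f} (scale = {scale:.4f})")
```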
Calculate VRAM/RAM needed:
```python
def calculate_memory_gb(parameters, quantization, context_length=2048,
                        n_layers=32, n_embd=4096):
    """Estimate memory requirements in GB (rough approximation).

    n_layers and n_embd default to 7B-class values; pass the real
    architecture values for other model sizes.
    """
    # Approximate bytes per weight for each quantization type
    bytes_per_param = {
        "F16": 2,
        "Q8_0": 1,
        "Q6_K": 0.75,
        "Q5_K": 0.625,
        "Q4_K": 0.5,
        "Q3_K": 0.375,
        "Q2_K": 0.25,
    }.get(quantization, 2)  # default to F16

    # Model weights
    model_memory = (parameters * bytes_per_param) / (1024**3)

    # KV cache: K and V tensors per layer, stored as F16 (2 bytes per element)
    kv_memory = (2 * n_layers * context_length * n_embd * 2) / (1024**3)

    # Runtime overhead (scratch buffers, tokenizer, etc.)
    overhead = 0.5  # GB

    return model_memory + kv_memory + overhead

# Examples (layer/embedding sizes are per-architecture; 70B uses grouped-query
# attention, so its real KV cache is smaller than this estimate)
print(f"Llama-2-7B Q4_K:  {calculate_memory_gb(7e9, 'Q4_K'):.1f} GB")
print(f"Llama-2-13B Q4_K: {calculate_memory_gb(13e9, 'Q4_K', n_layers=40, n_embd=5120):.1f} GB")
print(f"Llama-2-70B Q4_K: {calculate_memory_gb(70e9, 'Q4_K', n_layers=80, n_embd=8192):.1f} GB")
```

Typical memory requirements:
- 7B model, Q4_K: ~4-5 GB
- 13B model, Q4_K: ~8-9 GB
- 30B model, Q4_K: ~18-20 GB
```python
def recommend_quantization(ram_gb, has_gpu=False, gpu_vram_gb=None):
    """Recommend a quantization level for a ~7B model based on available memory.

    More memory allows higher-precision (better quality) quants;
    tighter memory forces more aggressive compression.
    """
    if has_gpu and gpu_vram_gb:
        available_memory = gpu_vram_gb
    else:
        available_memory = ram_gb

    if available_memory >= 24:
        return "Q8_0"  # Near-lossless quality
    elif available_memory >= 16:
        return "Q6_K"  # High quality
    elif available_memory >= 12:
        return "Q5_K"  # Good balance
    elif available_memory >= 8:
        return "Q4_K"  # Recommended default
    elif available_memory >= 6:
        return "Q3_K"  # Acceptable quality
    else:
        return "Q2_K"  # Basic functionality

# Usage
ram_gb = 16
quant = recommend_quantization(ram_gb)
print(f"With {ram_gb}GB RAM, use {quant} quantization")
```

| Use Case | Recommended Quant | Model Size | Notes |
|---|---|---|---|
| Code completion | Q4_K - Q6_K | 7B - 13B | Quality matters |
| Chat/conversation | Q4_K | 7B - 13B | Balance speed/quality |
| Creative writing | Q5_K - Q6_K | 13B+ | Higher quality helps |
| Simple Q&A | Q3_K - Q4_K | 7B | Speed prioritized |
| Analysis/research | Q4_K - Q5_K | 13B+ | Complex reasoning |
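If you want this decision table in code, here is a minimal sketch; the `USE_CASE_DEFAULTS` mapping and `pick_model` helper are hypothetical names that simply mirror the table above, so adjust the values to your own workloads:

```python
# Hypothetical defaults mirroring the use-case table above
USE_CASE_DEFAULTS = {
    "code":     {"quant": "Q5_K", "size": "7B-13B"},   # quality matters
    "chat":     {"quant": "Q4_K", "size": "7B-13B"},   # balance speed/quality
    "creative": {"quant": "Q6_K", "size": "13B+"},     # higher quality helps
    "qa":       {"quant": "Q4_K", "size": "7B"},       # speed prioritized
    "analysis": {"quant": "Q5_K", "size": "13B+"},     # complex reasoning
}

def pick_model(use_case):
    """Return a human-readable recommendation for a given use case."""
    cfg = USE_CASE_DEFAULTS.get(use_case, USE_CASE_DEFAULTS["chat"])
    return f"{cfg['size']} model at {cfg['quant']}"

print(pick_model("creative"))  # -> "13B+ model at Q6_K"
```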
```bash
# Download Llama 2 models (requires license agreement)
# Visit: https://huggingface.co/meta-llama

# 7B Chat Model
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

# 13B Chat Model
wget https://huggingface.co/TheBloke/Llama-2-13B-Chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf

# Mistral 7B Instruct (excellent quality/speed balance)
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

# Mixtral 8x7B (MoE model, high quality)
wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

# CodeLlama 7B Instruct
wget https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q4_K_M.gguf

# DeepSeek Coder 6.7B
wget https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF/resolve/main/deepseek-coder-6.7b-instruct.Q4_K_M.gguf

# Phi-2 (2.7B parameters, great quality for size)
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf

# TinyLlama 1.1B (very fast, basic quality)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```
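As an alternative to `wget`, you can fetch individual GGUF files from Python with the `huggingface_hub` package; a minimal sketch, assuming `huggingface_hub` is installed and the target directory is your choice:

```python
from huggingface_hub import hf_hub_download

# Download one GGUF file from a Hugging Face repo into a local directory
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models/llama",  # assumption: matches the directory layout used later in this chapter
)
print(path)
```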
Check model information:

```bash
# Use llama.cpp to inspect model metadata (printed while loading)
./llama-cli -m model.gguf --verbose-prompt --prompt ""

# Or use a Python script (llama-cpp-python)
python3 -c "
import sys
sys.path.insert(0, '.')
from llama_cpp import Llama

model = Llama('model.gguf', verbose=False)
print('Model info:')
print(f'  Vocab size: {model.n_vocab()}')
print(f'  Context length: {model.n_ctx()}')
print(f'  Embedding size: {model.n_embd()}')
"
```

Convert from PyTorch/SafeTensors (covered in Chapter 6):
```bash
# Basic conversion (simplified)
python convert.py model_path \
  --outfile model.gguf \
  --outtype f16        # Convert to F16 first

# Then quantize
./llama-quantize model.gguf model-Q4_K.gguf Q4_K
```

Organize your model collection:
```bash
# Create organized structure
mkdir -p models/{llama,mistral,code,small}

# Download with descriptive names
cd models/llama
wget -O llama-2-7b-chat-Q4_K.gguf \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
cd ../..

# Create inventory (size and path of every GGUF file)
find models -name "*.gguf" -exec ls -lh {} \; | \
  awk '{print $5, $9}' > model_inventory.txt
```

Test model quality:
```bash
#!/bin/bash
# test_models.sh

MODELS=("models/small/phi-2-Q4_K.gguf" "models/llama/llama-2-7b-chat-Q4_K.gguf")
TEST_PROMPT="Explain quantum computing in one sentence."

for model in "${MODELS[@]}"; do
  if [ -f "$model" ]; then
    echo "Testing $model..."
    timeout 30s ./llama-cli -m "$model" \
      --prompt "$TEST_PROMPT" \
      --n-predict 50 \
      --temp 0.1 \
      --seed 42  # Deterministic output
    echo "---"
  fi
done
```

Benchmark different quantizations:
```bash
#!/bin/bash
# benchmark_quantizations.sh

MODEL_BASE="llama-2-7b-chat"
QUANTS=("Q2_K" "Q3_K" "Q4_K" "Q5_K" "Q6_K")

echo "Quantization | Memory | Tokens/sec | Quality"
echo "-------------|--------|------------|--------"

for quant in "${QUANTS[@]}"; do
  model_file="${MODEL_BASE}.${quant}.gguf"
  if [ -f "$model_file" ]; then
    # Quick benchmark
    result=$(./llama-bench -m "$model_file" -p 10 -n 64 -t 4 2>/dev/null | \
      grep "tokens/sec" | head -1)
    if [ ! -z "$result" ]; then
      tokens_sec=$(echo $result | grep -o "[0-9.]* tokens/sec")
      memory=$(ls -lh "$model_file" | awk '{print $5}')
      echo "$quant | $memory | $tokens_sec | TBD"
    fi
  fi
done
```

- Match Hardware to Needs: Choose quantization based on your RAM/VRAM
- Test Quality: Always test model outputs for your specific use case
- Monitor Performance: Track tokens/second and memory usage
- Update Models: Newer GGUF conversions often have better quality
- Backup Models: Keep multiple quantizations for different scenarios
- Version Control: Track which model versions work best for your use cases
Model won't load:
```bash
# Check file path and permissions
ls -la model.gguf

# Verify GGUF format
file model.gguf  # Should show "GGUF" format
```
Poor quality output:
```bash
# Try a higher-precision quantization (e.g. Q5_K or Q6_K instead of Q4_K)
# Increase temperature/top_p for creativity
# Use models specifically fine-tuned for your task
```
Slow inference:
```bash
# Use a lower-precision quantization (Q2_K, Q3_K)
# Reduce context size
# Enable GPU acceleration (Chapter 5)
# Use smaller models
```
Next: Master the command-line interface with advanced options and interactive modes.
Most of the difficulty here is not writing more code but making clear decisions up front about GGUF files, model choice, and quantization, so behavior stays predictable as your setup grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling your tooling too tightly to a single model or quantization level
- missing the handoff boundaries between downloading, converting, and validating models
- shipping a model change without a clear rollback plan or performance baseline
After working through this chapter, you should be able to treat model selection in llama.cpp Tutorial: Local LLM Inference as a repeatable process with explicit inputs (hardware limits, use case), decisions (model size, quantization), and outputs (a validated GGUF file).
Use the memory estimates, quantization recommendations, and test scripts above as your checklist when adapting these patterns to your own setup.
Under the hood, working with model formats usually follows a repeatable control path:
- Context bootstrap: install prerequisites and decide where your GGUF files will live.
- Input normalization: convert source checkpoints (PyTorch/SafeTensors) to GGUF so the loader sees a stable contract.
- Core execution: quantize the model and load it with llama.cpp.
- Policy and safety checks: confirm license terms and enforce memory limits before deployment.
- Output composition: verify generation quality on prompts that match your use case.
- Operational telemetry: record tokens/second and memory usage for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- The llama.cpp repository on GitHub: the authoritative reference for GGUF loading, quantization types, and the CLI tools used here.
- The Awesome Code Docs collection on GitHub: companion documentation and further chapters.
Suggested trace strategy:
- search the upstream code for `gguf` and `model` to map concrete implementation paths
- compare the claims in these docs against the actual runtime/config code before reusing patterns in production