---
layout: default
title: "llama.cpp Tutorial - Chapter 4: Server Mode"
nav_order: 4
has_children: false
parent: llama.cpp Tutorial
---
Welcome to Chapter 4: Server Mode. In this part of llama.cpp Tutorial: Local LLM Inference, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Run llama.cpp as an OpenAI-compatible HTTP server for API access and integration with applications.
llama.cpp includes a built-in HTTP server that provides an OpenAI-compatible API. This allows you to use any OpenAI client or library with your local models.
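Before wiring up any client, it helps to see the shape of an OpenAI-style chat request. The helper below is a hypothetical sketch (not part of llama.cpp or the OpenAI SDK) that builds the JSON body the later examples POST to `/v1/chat/completions`:

```python
# Illustrative helper (invented here, not a llama.cpp API): builds the
# JSON body expected by the OpenAI-compatible /v1/chat/completions endpoint.
def make_chat_request(user_message, system_prompt=None,
                      temperature=0.7, max_tokens=100):
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {
        # llama-server serves one loaded model, so the name is not used
        # for routing; any string works here.
        "model": "local-model",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```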
```bash
# Start server with default settings
./llama-server -m model.gguf

# With custom host and port
./llama-server -m model.gguf --host 0.0.0.0 --port 8080

# With verbose logging
./llama-server -m model.gguf --verbose
```

```bash
# Production server configuration
./llama-server -m model.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --threads $(nproc) \
  --ctx-size 4096 \
  --batch-size 512 \
  --ubatch-size 512 \
  --gpu-layers 0 \
  --flash-attn \
  --mlock \
  --parallel 1 \
  --cont-batching \
  --metrics \
  --log-format json
```

The server provides OpenAI-compatible endpoints:
```python
import requests

# POST /v1/chat/completions
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
```

```python
# POST /v1/completions
response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "local-model",
        "prompt": "The capital of France is",
        "max_tokens": 50,
        "temperature": 0.1
    }
)
result = response.json()
print(result["choices"][0]["text"])
```

```python
# Enable streaming
import json

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]  # Remove 'data: ' prefix
            if data == '[DONE]':
                break
            try:
                chunk = json.loads(data)
                content = chunk["choices"][0]["delta"].get("content", "")
                print(content, end="", flush=True)
            except json.JSONDecodeError:
                continue
```

Drop-in replacement for the OpenAI API:
```python
from openai import OpenAI

# Point to your llama.cpp server
client = OpenAI(
    api_key="not-needed",  # Any string works
    base_url="http://localhost:8080/v1"
)

# Use like the normal OpenAI API
response = client.chat.completions.create(
    model="local-model",  # Model name doesn't matter
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=100
)
print(response.choices[0].message.content)
```
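The manual SSE handling in the streaming example earlier can be factored into a small reusable helper. This is an illustrative sketch; `iter_sse_content` is a name invented here, not a llama.cpp or OpenAI API:

```python
import json

def iter_sse_content(lines):
    """Yield content deltas from OpenAI-style SSE lines.

    `lines` is any iterable of decoded strings, e.g. the values from
    response.iter_lines() after .decode(). Hypothetical helper for
    illustration only.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # end-of-stream sentinel
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore malformed lines rather than crash mid-stream
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta
```

With a live server you would feed it `(l.decode("utf-8") for l in response.iter_lines() if l)` and print each yielded delta.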
```bash
# High-performance configuration:
#   --gpu-layers 35  use GPU if available
#   --flash-attn     flash attention for speed
#   --mlock          lock model in RAM
#   --no-mmap        alternative memory management
#   --numa           NUMA-aware memory allocation
# (comments cannot follow the backslash continuations below)
./llama-server -m model.gguf \
  --threads $(nproc) \
  --ctx-size 4096 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --gpu-layers 35 \
  --flash-attn \
  --mlock \
  --no-mmap \
  --numa
```
```bash
# Load multiple models
./llama-server \
  --model models/llama-7b.gguf \
  --model models/mistral-7b.gguf \
  --model models/codellama.gguf \
  --alias llama=llama-7b.gguf \
  --alias mistral=mistral-7b.gguf \
  --alias code=codellama.gguf
```
```bash
# Optimize for different use cases:
#   --ctx-size 8192      large context for analysis
#   --rope-scaling yarn  YaRN RoPE scaling
#   --rope-scale 2.0     scale factor
./llama-server -m model.gguf \
  --ctx-size 8192 \
  --rope-scaling yarn \
  --rope-scale 2.0 \
  --yarn-ext-factor 1.0 \
  --yarn-attn-factor 1.0 \
  --yarn-beta-fast 32 \
  --yarn-beta-slow 1
```
```bash
# Enable authentication
./llama-server -m model.gguf \
  --api-key sk-your-secret-key
```

Clients then pass the key as a bearer token:

```python
headers = {"Authorization": "Bearer sk-your-secret-key"}
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers=headers,
    json={...}
)
```
```bash
# Enable CORS for web applications
./llama-server -m model.gguf \
  --cors \
  --cors-origin http://localhost:3000,http://localhost:5173

# Or allow all origins (development only)
./llama-server -m model.gguf --cors
```
```bash
# Enable Prometheus metrics
./llama-server -m model.gguf --metrics

# Access metrics at /metrics
curl http://localhost:8080/metrics
```
```python
# Health endpoint
import requests

response = requests.get("http://localhost:8080/health")
if response.status_code == 200:
    print("Server is healthy")
```
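Loading a large model can take a while, so callers often poll the health endpoint before sending work. A sketch with an injectable probe so it can be tested offline (`wait_until_ready` is a name invented here, not a llama.cpp API):

```python
import time

def wait_until_ready(probe, timeout=30.0, interval=0.5):
    """Poll `probe` (a zero-argument callable that returns True once the
    server answers its health check) until it succeeds or `timeout`
    seconds elapse. Returns True on success, False on timeout.
    Illustrative helper, not part of llama.cpp."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False
```

With a live server, a suitable probe would be `lambda: requests.get("http://localhost:8080/health", timeout=2).status_code == 200`.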
```bash
# Structured JSON logging
./llama-server -m model.gguf \
  --log-format json \
  --verbose

# Log to file
./llama-server -m model.gguf \
  --log-file server.log \
  --log-verbosity 1
```
```bash
#!/bin/bash
# start_multiple_servers.sh
ports=(8080 8081 8082)
models=("llama-7b.gguf" "mistral-7b.gguf" "codellama.gguf")

for i in "${!ports[@]}"; do
    port=${ports[$i]}
    model=${models[$i]}
    echo "Starting server on port $port with $model"
    ./llama-server -m "models/$model" \
        --port "$port" \
        --host 0.0.0.0 \
        --threads "$(nproc)" &
done
wait
```
```nginx
# nginx.conf
upstream llama_backend {
    server localhost:8080;
    server localhost:8081;
    server localhost:8082;
}

server {
    listen 80;
    server_name api.yourdomain.com;

    location / {
        proxy_pass http://llama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeout settings
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }
}
```
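On the client side, the same backend list can back a simple failover check before nginx is in place. A sketch with an injectable probe (`first_healthy` is an illustrative name, not part of llama.cpp or nginx):

```python
def first_healthy(backends, probe):
    """Return the first backend base URL whose health probe succeeds,
    or None if none respond. `probe` takes a base URL and returns a
    truthy value when that backend is healthy. Illustrative helper."""
    for base in backends:
        try:
            if probe(base):
                return base
        except Exception:
            continue  # treat connection errors as "unhealthy"
    return None
```

Against live servers you might call `first_healthy(["http://localhost:8080", "http://localhost:8081"], lambda b: requests.get(b + "/health", timeout=2).ok)`.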
Using the server with LangChain:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Use llama.cpp server with LangChain
llm = ChatOpenAI(
    model="local-model",
    openai_api_key="dummy",
    openai_api_base="http://localhost:8080/v1",
    temperature=0.7
)

chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["topic"],
        template="Explain {topic} in simple terms for a beginner."
    )
)

result = chain.run(topic="quantum computing")
print(result)
```
Wrapping the server in your own FastAPI service:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    temperature: float = 0.7
    max_tokens: int = 100

class ChatResponse(BaseModel):
    response: str

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        response = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "model": "local-model",
                "messages": [{"role": "user", "content": request.message}],
                "temperature": request.temperature,
                "max_tokens": request.max_tokens
            },
            timeout=60
        )
        if response.status_code == 200:
            result = response.json()
            return ChatResponse(
                response=result["choices"][0]["message"]["content"]
            )
        else:
            raise HTTPException(status_code=500, detail="LLM server error")
    except requests.exceptions.Timeout:
        raise HTTPException(status_code=504, detail="Request timeout")
    except HTTPException:
        raise  # don't swallow the HTTPExceptions raised above
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Health check
@app.get("/health")
async def health():
    try:
        response = requests.get("http://localhost:8080/health", timeout=5)
        return {"status": "healthy" if response.status_code == 200 else "unhealthy"}
    except requests.exceptions.RequestException:
        return {"status": "unhealthy"}
```
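A gateway like the one above usually also retries transient failures before giving up. A sketch of exponential backoff with an injectable `send` callable so the logic is testable without a server (`call_with_retries` is a name invented for illustration):

```python
import time

def call_with_retries(send, attempts=3, base_delay=0.5):
    """Call `send()` (e.g. a function POSTing to the llama.cpp server),
    retrying on any exception with exponential backoff. Re-raises the
    last error once attempts are exhausted. Illustrative helper."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # 0.5s, 1s, 2s, ... between attempts
                time.sleep(base_delay * (2 ** attempt))
    raise last_error
```

In the FastAPI handler you could wrap the `requests.post(...)` call as `call_with_retries(lambda: requests.post(...), attempts=3)`.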
```yaml
# docker-compose.yml
version: '3.8'

services:
  llama-server:
    build:
      context: .
      dockerfile: Dockerfile.server
    ports:
      - "8080:8080"
    environment:
      - LLAMA_MODEL_PATH=/models/model.gguf
    volumes:
      - ./models:/models:ro
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G

  api-gateway:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - llama-server

  monitoring:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
```
```dockerfile
# Dockerfile.server
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    && rm -rf /var/lib/apt/lists/*

# Clone and build llama.cpp
RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    mkdir build && \
    cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release && \
    make -j$(nproc) llama-server

WORKDIR /llama.cpp/build/bin
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
```
```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp Server
After=network.target

[Service]
Type=simple
User=llama
Group=llama
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    -m /opt/models/model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --threads 8 \
    --ctx-size 4096 \
    --mlock
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```
```bash
# Install service
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server

# Monitor service
sudo systemctl status llama-server
sudo journalctl -u llama-server -f
```
```bash
# NVIDIA CUDA
./llama-server -m model.gguf \
  --gpu-layers 35 \
  --main-gpu 0 \
  --tensor-split 0,1  # Multi-GPU

# AMD ROCm (Linux)
./llama-server -m model.gguf \
  --gpu-layers 35 \
  --main-gpu 0

# Apple Metal (enabled automatically in Metal builds)
./llama-server -m model.gguf \
  --gpu-layers 35
```
```bash
# Large model optimization
./llama-server -m model.gguf \
  --ctx-size 4096 \
  --rope-scaling yarn \
  --rope-scale 2.0 \
  --mlock \
  --memory-f32 \
  --flash-attn
```
```bash
# Handle multiple requests:
#   --parallel 4       number of parallel request slots
#   --cont-batching    continuous batching
#   --batch-size 2048  batch size
#   --ubatch-size 512  micro-batch size
./llama-server -m model.gguf \
  --parallel 4 \
  --cont-batching \
  --batch-size 2048 \
  --ubatch-size 512
```
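To sanity-check a `--parallel` setting, fan several requests out concurrently and confirm they all complete. In this sketch the `send` function is injectable so the pattern can be tested without a running server (`fan_out` is a name invented here):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(send, prompts, max_workers=4):
    """Send one request per prompt concurrently, returning results in
    the same order as `prompts`. `send` would normally POST to
    /v1/chat/completions. Illustrative helper, not part of llama.cpp."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though calls overlap
        return list(pool.map(send, prompts))
```

With a live server, `send` could be `lambda p: requests.post(url, json=make_body(p), timeout=60).json()`, with `max_workers` matched to the server's `--parallel` value.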
Server won't start:

```bash
# Check model file exists
ls -la model.gguf

# Check port availability
netstat -tlnp | grep 8080

# Run with verbose logging
./llama-server -m model.gguf --verbose
```
Out of memory:

```bash
# Reduce context size
./llama-server -m model.gguf --ctx-size 2048

# Use lower quantization
# Try Q3_K or Q2_K models
```
Slow responses:

```bash
# Increase threads
./llama-server -m model.gguf --threads $(nproc)

# Enable GPU layers
./llama-server -m model.gguf --gpu-layers 35

# Use flash attention
./llama-server -m model.gguf --flash-attn
```
Connection refused:

```bash
# Check server is running
ps aux | grep llama-server

# Check firewall
sudo ufw status
sudo ufw allow 8080
```

- Resource Planning: Calculate memory requirements before deployment
- Health Checks: Implement proper health checks and monitoring
- Security: Use authentication and restrict network access
- Scaling: Plan for load balancing and horizontal scaling
- Monitoring: Set up comprehensive logging and metrics
- Updates: Keep llama.cpp updated for performance improvements
- Testing: Thoroughly test your API endpoints before production
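For the resource-planning step, total memory is roughly the model file size plus the KV cache. A back-of-envelope sketch (the formula assumes an f16 KV cache and is an approximation; the parameter values in the usage note are illustrative, so check your model's GGUF metadata for real values):

```python
def estimate_memory_gb(model_file_gb, n_layers, ctx_size,
                       n_kv_heads, head_dim, kv_bytes=2):
    """Rough total memory: weights plus KV cache.

    KV cache bytes ~= 2 (K and V) * layers * context length
                      * KV heads * head dimension * bytes per element.
    kv_bytes=2 assumes an f16 cache; runtime overhead (compute buffers,
    etc.) is ignored, so treat the result as a lower bound."""
    kv_cache_bytes = 2 * n_layers * ctx_size * n_kv_heads * head_dim * kv_bytes
    return model_file_gb + kv_cache_bytes / 1024**3
```

For example, a ~4 GB quantized 7B file with 32 layers, 4096 context, and 32 KV heads of dimension 128 comes out to about 6 GB before runtime overhead.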
The server mode makes llama.cpp accessible via standard HTTP APIs, enabling integration with any application that supports OpenAI-compatible endpoints. This is the most practical way to use llama.cpp in production applications.
Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between the llama.cpp runtime, the server layer, and the model configuration, so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 4: Server Mode as an operating subsystem inside llama.cpp Tutorial: Local LLM Inference, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around GGUF models, response handling, and context size as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 4: Server Mode usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the llama.cpp runtime.
- Input normalization: shape incoming data so the server receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the model.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:

- View Repo (github.com) — the authoritative upstream reference.
- Awesome Code Docs (github.com) — the authoritative reference collection.

Suggested trace strategy:

- search upstream code for `llama` and `server` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production