Run Open Source LLMs on remote GPUs with Ollama. Includes a lightweight Web UI and OpenAI/Anthropic-compatible API endpoints.
```bash
cd templates/ollama-models

# (Optional) Edit models.json to configure which models to pre-pull

# Start the pod
gpu run
```

After startup, two endpoints are available on your local machine:
| Endpoint | Port | Description |
|---|---|---|
| Web UI | 8080 | Chat interface at http://localhost:8080 |
| Ollama API | 11434 | REST API at http://localhost:11434 |
- Dual Endpoints: Both Web UI and API forwarded to localhost
- OpenAI-Compatible API: Use existing OpenAI SDK code with local models
- Anthropic Messages API: Supported in Ollama 0.14.0+ (for tools like Claude Code)
- Model Pre-Pull: Configure models in `models.json` to pre-download on startup
- Streaming: Real-time response streaming in both UI and API
Edit `models.json` to specify which models to pre-pull when the pod starts:

```json
{
"models": [
"llama3.2:3b",
"mistral:7b",
"codellama:7b"
],
"default": "llama3.2:3b"
}
```

Recommended models:

| Model | Size | VRAM | Best For |
|---|---|---|---|
| `llama3.2:3b` | 2GB | 4GB | Fast, general purpose |
| `llama3.2:7b` | 4GB | 8GB | Balanced quality/speed |
| `llama3.1:70b` | 40GB | 48GB | High quality, slower |
| `mistral:7b` | 4GB | 8GB | Fast, multilingual |
| `codellama:7b` | 4GB | 8GB | Code generation |
| `codellama:13b` | 8GB | 16GB | Better code quality |
| `deepseek-r1:7b` | 4GB | 8GB | Reasoning tasks |
| `phi3:mini` | 2GB | 4GB | Small, fast |
Browse all models: https://ollama.com/library
```bash
# List available models
curl http://localhost:11434/api/tags
# Chat with a model
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Pull a new model
curl http://localhost:11434/api/pull -d '{"name": "codellama:7b"}'
```

Ollama also exposes an OpenAI-compatible Chat Completions endpoint at `/v1`:

```bash
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```

Existing OpenAI SDK code works against the same endpoint:

```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but ignored
)
response = client.chat.completions.create(
model="llama3.2:3b",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
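
Streaming (mentioned in the features above) works over the same endpoint. A minimal sketch with the OpenAI SDK; the prompt and model name are just examples:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

# Request a streamed response instead of waiting for the full completion
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries the next piece of generated text (may be None on the final chunk)
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```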

The Anthropic Messages API (Ollama 0.14.0+) works with the `anthropic` SDK:

```python
import anthropic

client = anthropic.Anthropic(
base_url="http://localhost:11434",
api_key="ollama" # Required but ignored
)
message = client.messages.create(
model="llama3.2:3b",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello!"}]
)
print(message.content[0].text)
```

This template can serve as a backend for Claude Code via Ollama's Anthropic API compatibility (requires Ollama 0.14.0+).
1. Start the pod:
```bash
gpu run
```

2. Log out of your current Claude Code session (if active):
Claude Code caches authentication. You must log out first to switch to a different backend.
```bash
claude /logout
```

3. Configure environment and run Claude Code:
Option A: Inline (one-time use)
```bash
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude --model llama3.2:3b
```

Option B: Shell config (persistent)
Add to your ~/.zshrc or ~/.bashrc:
```bash
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
```

Then reload and run:

```bash
source ~/.zshrc  # or ~/.bashrc
claude --model llama3.2:3b
```

To switch back to using Claude models via Anthropic's API:
```bash
# Remove or comment out the environment variables from your shell config
# Then unset them in your current session:
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_BASE_URL
# Log out and log back in
claude /logout
claude  # Will prompt for Anthropic authentication
```

Recommended models for Claude Code:

| Model | Size | Context | VRAM | Best For |
|---|---|---|---|---|
| `llama3.2:3b` | 2GB | 128K | 4GB | Fast, good for simple tasks |
| `qwen2.5-coder:7b` | 4.7GB | 32K | 8GB | Code-focused, good quality |
| `qwen2.5-coder:14b` | 9GB | 32K | 16GB | Great balance of speed/quality |
| `glm-4.7-flash` | 19GB | 198K | 24GB | Excellent quality, huge context |
Note: Claude Code benefits from models with large context windows (32K+ recommended).
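
Note also that Ollama may load a model with a shorter context window than the model's maximum. If you need a larger window for Claude Code, one option is to create a derived model with a raised `num_ctx` via a Modelfile. A sketch, run inside the pod (e.g. via `gpu exec bash`); the derived model name is illustrative:

```bash
# Create a variant of an existing model with a 32K context window
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 32768
EOF
ollama create qwen2.5-coder-32k -f Modelfile

# Then point Claude Code at the new name:
# claude --model qwen2.5-coder-32k
```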
While the pod is running, you can pull more models:
```bash
# Via API
curl http://localhost:11434/api/pull -d '{"name": "codellama:7b"}'
# Via the Web UI (click the + button in model selector)
# Via SSH into the pod
gpu exec ollama pull deepseek-r1:7b
```
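
You can also manage models programmatically with the official `ollama` Python client (`pip install ollama`, not bundled with this template). A minimal sketch against the forwarded port; the model name is just an example:

```python
from ollama import Client

# Talk to the Ollama API forwarded to localhost
client = Client(host="http://localhost:11434")

client.pull("codellama:7b")  # equivalent to the curl /api/pull call above
print(client.list())         # show which models are now available on the pod
```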
The template is configured to use GPUs in this priority order:

- RTX 4090 (24GB) - Great for 7B-13B models
- A40 (48GB) - Good for 30B models
- L40S (48GB) - Newer, good availability
- A100 80GB - For 70B+ models
Edit `gpu.jsonc` to change GPU preferences:
| Field | Description |
|---|---|
| `ports` | Ports to forward (default: `[11434, 8080]`) |
| `keep_alive_minutes` | Idle time before pod stops (default: 5) |
| `gpu_types` | Preferred GPUs in priority order |
| `min_vram` | Minimum VRAM in GB (default: 16) |
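
For reference, a `gpu.jsonc` sketch using these fields. The defaults come from the table above, but the exact GPU type strings are illustrative rather than the template's canonical names:

```jsonc
{
  // Ports forwarded to localhost
  "ports": [11434, 8080],
  // Idle minutes before the pod stops
  "keep_alive_minutes": 5,
  // Preferred GPUs in priority order
  "gpu_types": ["RTX 4090", "A40", "L40S", "A100 80GB"],
  // Minimum VRAM in GB
  "min_vram": 16
}
```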
The fields in `models.json` are:

| Field | Description |
|---|---|
| `models` | Array of model names to pre-pull |
| `default` | Default model selected in Web UI |
If a request fails because the model has not been downloaded yet, pull it first:

```bash
curl http://localhost:11434/api/pull -d '{"name": "llama3.2:3b"}'
```

Large models (30B+) can take 30-60 seconds to load. The cooldown is set to 5 minutes to account for this.
Check that Ollama is running:

```bash
curl http://localhost:11434/api/tags
```

If ports 11434 or 8080 are already in use locally, stop other services or modify `gpu.jsonc` to use different ports.
The following features are being developed by other engineers:
- Activity Proxy: HTTP/WebSocket monitoring for smarter cooldown management
- Cooldown Hook: Automatic cooldown extension during model loading
Current workarounds:
- Use a longer `keep_alive_minutes` for large models
- Pre-pull models in `models.json` to reduce first-use latency