Ollama Models - GPU CLI Template

Run open-source LLMs on remote GPUs with Ollama. Includes a lightweight Web UI and OpenAI- and Anthropic-compatible API endpoints.

Quick Start

cd templates/ollama-models

# (Optional) Edit models.json to configure which models to pre-pull

# Start the pod
gpu run

After startup, two endpoints are available on your local machine:

| Endpoint | Port | Description |
| --- | --- | --- |
| Web UI | 8080 | Chat interface at http://localhost:8080 |
| Ollama API | 11434 | REST API at http://localhost:11434 |

Features

  • Dual Endpoints: Both Web UI and API forwarded to localhost
  • OpenAI-Compatible API: Use existing OpenAI SDK code with local models
  • Anthropic Messages API: Supported in Ollama 0.14.0+ (for tools like Claude Code)
  • Model Pre-Pull: Configure models in models.json to pre-download on startup
  • Streaming: Real-time response streaming in both UI and API (see the sketch below this list)
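
Because the API layer is OpenAI-compatible, streaming also works from the standard OpenAI SDK by passing stream=True. A minimal sketch (the model name is just an example; use any model you have pulled):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

# Stream tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()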

Configuring Models

Edit models.json to specify which models to pre-pull when the pod starts:

{
  "models": [
    "llama3.2:3b",
    "mistral:7b",
    "codellama:7b"
  ],
  "default": "llama3.2:3b"
}
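
For a sense of what pre-pull does, the startup step amounts to something like the sketch below (illustrative only; the template's actual startup script may differ). Setting "stream": false makes each pull block until the download finishes:

import json
import requests

with open("models.json") as f:
    config = json.load(f)

for name in config["models"]:
    # Blocks until the model is fully downloaded, then returns a final status
    resp = requests.post(
        "http://localhost:11434/api/pull",
        json={"name": name, "stream": False},
    )
    resp.raise_for_status()
    print(name, "->", resp.json().get("status"))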

Popular Models

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| llama3.2:3b | 2GB | 4GB | Fast, general purpose |
| llama3.1:8b | 4.9GB | 8GB | Balanced quality/speed |
| llama3.1:70b | 40GB | 48GB | High quality, slower |
| mistral:7b | 4GB | 8GB | Fast, multilingual |
| codellama:7b | 4GB | 8GB | Code generation |
| codellama:13b | 8GB | 16GB | Better code quality |
| deepseek-r1:7b | 4GB | 8GB | Reasoning tasks |
| phi3:mini | 2GB | 4GB | Small, fast |

Browse all models: https://ollama.com/library

API Usage

Native Ollama API

# List available models
curl http://localhost:11434/api/tags

# Chat with a model
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

# Pull a new model
curl http://localhost:11434/api/pull -d '{"name": "codellama:7b"}'
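
The same endpoints are straightforward to call from Python; a small sketch using requests. The native API streams NDJSON by default, so this sets "stream": false to get a single JSON reply:

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2:3b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,  # one JSON object instead of NDJSON chunks
    },
)
resp.raise_for_status()
print(resp.json()["message"]["content"])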

OpenAI-Compatible API

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Using with OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Using with Anthropic SDK (Ollama 0.14.0+)

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:11434",
    api_key="ollama"  # Required but ignored
)

message = client.messages.create(
    model="llama3.2:3b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(message.content[0].text)  # content is a list of blocks; take the first text block

Using with Claude Code

This template can serve as a backend for Claude Code via Ollama's Anthropic API compatibility (requires Ollama 0.14.0+).

Setup

1. Start the pod:

gpu run

2. Log out of your current Claude Code session (if active):

Claude Code caches authentication. You must log out first to switch to a different backend.

claude /logout

3. Configure environment and run Claude Code:

Option A: Inline (one-time use)

ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude --model llama3.2:3b

Option B: Shell config (persistent)

Add to your ~/.zshrc or ~/.bashrc:

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434

Then reload and run:

source ~/.zshrc  # or ~/.bashrc
claude --model llama3.2:3b

Switching Back to Anthropic

To switch back to using Claude models via Anthropic's API:

# Remove or comment out the environment variables from your shell config
# Then unset them in your current session:
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_BASE_URL

# Log out and log back in
claude /logout
claude  # Will prompt for Anthropic authentication

Recommended Models for Claude Code

| Model | Size | Context | VRAM | Best For |
| --- | --- | --- | --- | --- |
| llama3.2:3b | 2GB | 128K | 4GB | Fast, good for simple tasks |
| qwen2.5-coder:7b | 4.7GB | 32K | 8GB | Code-focused, good quality |
| qwen2.5-coder:14b | 9GB | 32K | 16GB | Great balance of speed/quality |
| glm-4.7-flash | 19GB | 198K | 24GB | Excellent quality, huge context |

Note: Claude Code benefits from models with large context windows (32K+ recommended).

Pull Additional Models

While the pod is running, you can pull more models:

# Via API
curl http://localhost:11434/api/pull -d '{"name": "codellama:7b"}'

# Via the Web UI (click the + button in model selector)

# Via SSH into the pod
gpu exec ollama pull deepseek-r1:7b
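
For progress feedback when pulling from a script, /api/pull streams NDJSON status lines while downloading; a minimal sketch:

import json
import requests

with requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "deepseek-r1:7b"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        status = json.loads(line)
        # Download lines carry completed/total byte counts
        if "total" in status and "completed" in status:
            pct = 100 * status["completed"] / status["total"]
            print(f"{status['status']}: {pct:.0f}%")
        else:
            print(status.get("status", ""))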

GPU Selection

The template is configured to use GPUs in this priority order:

  1. RTX 4090 (24GB) - Great for 7B-13B models
  2. A40 (48GB) - Good for 30B models
  3. L40S (48GB) - Newer, good availability
  4. A100 (80GB) - For 70B+ models

Edit gpu.jsonc to change GPU preferences:

"gpu_types": [
  { "type": "RTX 4090" },
  { "type": "A40" }
],
"min_vram": 16

Configuration Reference

gpu.jsonc

| Field | Description |
| --- | --- |
| ports | Ports to forward (default: [11434, 8080]) |
| keep_alive_minutes | Idle time before pod stops (default: 5) |
| gpu_types | Preferred GPUs in priority order |
| min_vram | Minimum VRAM in GB (default: 16) |

models.json

| Field | Description |
| --- | --- |
| models | Array of model names to pre-pull |
| default | Default model selected in Web UI |
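
A quick sanity check for models.json (a sketch that only verifies the shape described above):

import json

with open("models.json") as f:
    cfg = json.load(f)

assert isinstance(cfg.get("models"), list) and cfg["models"], "models must be a non-empty list"
assert cfg.get("default") in cfg["models"], "default must be one of the listed models"
print("models.json looks valid")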

Troubleshooting

"No models available"

Pull a model first:

curl http://localhost:11434/api/pull -d '{"name": "llama3.2:3b"}'

Slow model loading

Large models (30B+) can take 30-60 seconds to load. The default keep_alive_minutes of 5 accounts for this.
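
One mitigation is to pre-warm the model right after the pod starts. Sending a generate request with no prompt loads the model into VRAM, and Ollama's keep_alive parameter controls how long it stays resident (this is Ollama's own timer, separate from the pod's keep_alive_minutes). A sketch:

import requests

# A request with no prompt loads the model without generating text;
# keep_alive keeps it resident in VRAM for the given duration.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "keep_alive": "30m"},
)
resp.raise_for_status()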

API returns errors

Check that Ollama is running:

curl http://localhost:11434/api/tags

Port already in use

If ports 11434 or 8080 are already in use locally, stop the conflicting services or edit the ports array in gpu.jsonc (for example, "ports": [11435, 8081]) to forward different ports.

Future Work

The following features are being developed by other engineers:

  • Activity Proxy: HTTP/WebSocket monitoring for smarter cooldown management
  • Cooldown Hook: Automatic cooldown extension during model loading

Current workarounds:

  • Use longer keep_alive_minutes for large models
  • Pre-pull models in models.json to reduce first-use latency