A universal OpenAI-compatible API proxy that bridges standard API requests to multiple backend providers (RunPod, Ollama, OpenAI-compatible APIs, Together AI, etc.). Configure endpoints through a web admin UI and map virtual model names to actual backend models.
Client (OpenAI format) → Serverless Proxy (port 8002) → Configured Backends
- Universal: Connect to any LLM backend (RunPod, Ollama, OpenAI, Together AI, etc.)
- Virtual Models: Map user-facing model names to actual backend models
- Admin UI: Configure endpoints and virtual models via web interface
- Tool-Call Compatibility: Normalize misformatted model tool calls with DB-driven regex patterns
- OpenAI-compatible: Works with any OpenAI client library
This guide walks you through getting the Serverless Proxy up and running in just a few minutes.
Install Docker first if you don't have it:
- Docker Desktop (Windows/Mac): https://www.docker.com/products/docker-desktop
- Docker Engine (Linux): https://docs.docker.com/engine/install/
# Clone the repository
git clone https://github.com/TyRoden/serverless_proxy.git
cd serverless_proxy
# Copy the example environment file
cp .env.example .env

Open the .env file in a text editor and check these settings:
# Required: Set AUTH_ENABLED to false for first-time setup (no auth service needed yet)
AUTH_ENABLED=false
# Optional: If using Ollama locally, it should work out of the box

# Build and start the container
docker compose up -d --build
# Verify FastAPI is served via Uvicorn (required for API routes)
docker compose exec serverless-proxy sh -c "ps aux | grep uvicorn" || echo "WARNING: Uvicorn not running. Ensure serverless-proxy service uses 'uvicorn simple_bridge:app' in docker-compose.yml. See docs for details."

- Open your browser and go to: http://localhost:5001/proxy-dashboard
- You'll see the admin dashboard (no login needed since AUTH_ENABLED=false)
- Click + Add Endpoint under Endpoints
- Fill in:
  - Name: Something like "My Ollama" or "RunPod Production"
  - URL: Your backend URL (e.g., `http://localhost:11434` for local Ollama, or your RunPod endpoint URL)
  - API Key: Your API key if required (leave blank for local Ollama)
  - Type: Select the type (`openwebui`, `openai`, `ollama`, `runpod`, `anthropic`, `deepinfra`, etc.)
- Click Save
- Click + Add Virtual Model under Virtual Models
- Fill in:
  - Name: What you want to call it (e.g., `gpt-4`, `llama-production`)
  - Endpoint: Select the endpoint you just created
  - Actual Model: The actual model name on the backend (e.g., `gpt-4o`, `llama3:70b`)
- Click Save
Your AI tools can now connect to the proxy:
| Service | URL |
|---|---|
| API Endpoint | http://localhost:8002 |
| Admin UI | http://localhost:5001/proxy-dashboard |
Example - Using with OpenWebUI or any OpenAI-compatible client:
Base URL: http://localhost:8002/v1
API Key: any-key-works (or your endpoint's key)
Model: the-virtual-model-name-you-created
Example - Test with curl:
curl http://localhost:8002/v1/models
curl -X POST http://localhost:8002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-virtual-model-name",
"messages": [{"role": "user", "content": "Hello!"}]
}'

# Check if the proxy is running
curl http://localhost:8002/health
# View container logs
docker logs serverless-proxy
# Restart the container
docker restart serverless-proxy

| Variable | Description | Default |
|---|---|---|
| `API_PORT` | OpenAI-compatible API port | 8002 |
| `FLASK_PORT` | Admin UI port | 5001 |
| `DATABASE_PATH` | SQLite database path | /data/proxy.db |
| `TIMEOUT` | Request timeout (seconds) | 300 |
| `AUTH_ENABLED` | Enable admin authentication | true |
| `AIMENU_URL` | Auth service URL | http://localhost:5000 |
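Putting the settings above together, a typical .env for an auth-free local setup might look like this (the values are the defaults, with AUTH_ENABLED switched off as in the quick start):

```env
API_PORT=8002
FLASK_PORT=5001
DATABASE_PATH=/data/proxy.db
TIMEOUT=300
AUTH_ENABLED=false
AIMENU_URL=http://localhost:5000
```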
By default, the admin dashboard requires authentication. See docs/authentication.md for:
- How to disable authentication for fresh installs
- How to implement your own auth service
- Full API specification for the `/session/validate` endpoint
The admin dashboard includes a Patterns tab for fixing model-specific tool call formats without editing code.
- Add/update/delete regex-based extraction patterns
- Control match priority (higher values are tried first)
- Map tool names and parameter keys into schema-compatible names
- Support malformed or non-standard XML/bracket/inline formats
See docs/tool_patterns.md for full details and examples.
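To illustrate the idea (this is a sketch, not the proxy's actual implementation), a DB-driven pattern might pair a regex with tool-name and parameter-key mappings, and the hypothetical field names below are illustrative only:

```python
import json
import re

# Hypothetical pattern row, as a patterns table might store it:
# a regex with named groups, plus mappings to schema-compatible names.
pattern = {
    "regex": r"<tool>(?P<name>\w+)</tool>\s*<args>(?P<args>\{.*?\})</args>",
    "priority": 10,
    "tool_name_map": {"search_web": "web_search"},
    "param_key_map": {"q": "query"},
}

def normalize_tool_call(text: str, pat: dict):
    """Extract a tool call from raw model output and rename it to match the schema."""
    m = re.search(pat["regex"], text, re.DOTALL)
    if not m:
        return None
    name = pat["tool_name_map"].get(m.group("name"), m.group("name"))
    args = json.loads(m.group("args"))
    args = {pat["param_key_map"].get(k, k): v for k, v in args.items()}
    return {"name": name, "arguments": args}

raw = '<tool>search_web</tool><args>{"q": "weather in Oslo"}</args>'
print(normalize_tool_call(raw, pattern))
```

Priority ordering would simply decide which of several stored patterns is tried against the output first.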
| Port | Service |
|---|---|
| 8002 | OpenAI-compatible API |
| 5001 | Admin UI / API |
Access the admin dashboard at /proxy-dashboard. Authentication is handled by the AI Menu System.
- Endpoint Management: Add, edit, delete backend endpoints
- Virtual Model Mapping: Map virtual model names to actual backend models
- Patterns Tab: Manage tool-call translation patterns in the UI
- Model Discovery: Fetch available models from endpoints
- Enable/Disable: Toggle endpoints and virtual models
Configure backend endpoints with:
- Name: Friendly identifier
- URL: Base URL (e.g., `http://localhost:11434`, `https://api.runpod.ai/v2/xxxx`)
- API Key: Authorization token (if required)
- Type: `openwebui`, `openai`, `ollama`, `vllm`, `together`, `runpod`, `anthropic`, `deepinfra`, `queue`
- Priority: Higher-priority endpoints are preferred
- Enabled: Enable/disable the endpoint
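The priority field acts as a sort key: among enabled endpoints, the highest priority wins. A minimal sketch of that selection logic (field names are illustrative, not the proxy's actual code):

```python
# Sketch of priority-based endpoint selection; field names are illustrative.
endpoints = [
    {"name": "My Ollama", "priority": 1, "enabled": True},
    {"name": "RunPod Production", "priority": 10, "enabled": True},
    {"name": "Old Backup", "priority": 20, "enabled": False},
]

def pick_endpoint(candidates):
    """Return the enabled endpoint with the highest priority, or None."""
    enabled = [e for e in candidates if e["enabled"]]
    return max(enabled, key=lambda e: e["priority"], default=None)

print(pick_endpoint(endpoints)["name"])  # → RunPod Production
```

Note that the disabled endpoint is skipped even though it has the highest priority value.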
Map virtual model names to actual backend models:
- Virtual Name: What clients will request (e.g., `gpt-4`, `prod-llama`)
- Endpoint: Which backend to route to
- Actual Model: The model name on the backend (e.g., `gpt-4o`, `llama3:70b`)
- Show Reasoning: Toggle chain-of-thought display (for models like MiniMax that output thinking separately)
- Cost per 1M Input Tokens ($): Price per 1M input tokens you send
- Cost per 1M Output Tokens ($): Price per 1M output tokens you receive
- Cost per 1M Cached Input Tokens ($): Discounted price per 1M cached input tokens (see provider pricing)
- Cost per 1M Cached Output Tokens ($): Discounted price per 1M cached output tokens
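Conceptually, a virtual model is a row tying a client-facing name to a backend and its real model id. A sketch of the lookup (illustrative field names, not the proxy's schema):

```python
# Illustrative virtual-model table: client-facing name -> backend routing info.
virtual_models = {
    "gpt-4": {"endpoint": "RunPod Production", "actual_model": "gpt-4o"},
    "prod-llama": {"endpoint": "My Ollama", "actual_model": "llama3:70b"},
}

def resolve(requested: str):
    """Translate the model name a client sends into (endpoint, actual model)."""
    vm = virtual_models.get(requested)
    if vm is None:
        raise KeyError(f"unknown virtual model: {requested}")
    return vm["endpoint"], vm["actual_model"]

print(resolve("prod-llama"))  # → ('My Ollama', 'llama3:70b')
```

Clients only ever see the virtual name; the backend receives the actual model id.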
The proxy supports tracking and pricing for cached tokens:
- How it works: When you make repeated requests that share a common prompt prefix, the provider can serve those input tokens from its cache
- Pricing: Cached tokens are billed at a significantly discounted rate (typically 10-90% cheaper)
- Configuration: Enter your provider's cached token pricing in the virtual model settings
- Tracking: The Usage page displays cached token counts and costs separately
- Supported Providers: OpenAI, DeepInfra, and Anthropic APIs return cached token information
To configure:
- Look up your provider's pricing (e.g., DeepInfra pricing page shows "$0.26 / $0.13 cached")
- Enter the base price in "Cost per 1M Input Tokens"
- Enter the cached price in "Cost per 1M Cached Input Tokens"
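Using the example rates above ($0.26 per 1M input tokens, $0.13 per 1M cached input tokens), the cost math works out as follows (a sketch of the calculation, not the proxy's code):

```python
def input_cost(total_input_tokens: int, cached_tokens: int,
               rate_per_m: float, cached_rate_per_m: float) -> float:
    """Bill cached input tokens at the discounted rate, the rest at the base rate."""
    fresh = total_input_tokens - cached_tokens
    return (fresh * rate_per_m + cached_tokens * cached_rate_per_m) / 1_000_000

# 100k input tokens, 60k of them served from the provider's cache:
cost = input_cost(100_000, 60_000, rate_per_m=0.26, cached_rate_per_m=0.13)
print(f"${cost:.4f}")  # → $0.0182
```

Without caching, the same 100k tokens would cost $0.0260, so the cache saves about 30% here.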
The proxy provides comprehensive cost tracking per model:
- Per-model pricing: Configure input/output/cached token rates for each virtual model
- Usage dashboard: View token counts, costs, and response times in the admin UI
- Daily breakdown: Track usage patterns over time
- Cost estimation: Automatic calculation based on configured rates
Configure pricing per virtual model:
- Input tokens: Tokens sent in requests (prompt)
- Output tokens: Tokens received in responses (completion)
- Cached tokens: Discounted rate for cached input tokens (when providers support caching)
The Usage page shows:
- Total requests and token counts
- Input vs Output token breakdown
- Cached token counts and costs
- Average response times
- Cost per model and daily trends
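The kind of aggregation behind those views can be sketched like this (the record shape is illustrative, not the proxy's actual schema):

```python
# Illustrative per-request usage records, roughly what a usage log might hold.
records = [
    {"model": "gpt-4", "input": 1200, "output": 300, "ms": 850},
    {"model": "gpt-4", "input": 800, "output": 500, "ms": 1150},
    {"model": "prod-llama", "input": 2000, "output": 700, "ms": 400},
]

def summarize(rows):
    """Aggregate token totals and average latency per model, as a dashboard might."""
    out = {}
    for r in rows:
        s = out.setdefault(r["model"], {"requests": 0, "input": 0, "output": 0, "ms": 0})
        s["requests"] += 1
        s["input"] += r["input"]
        s["output"] += r["output"]
        s["ms"] += r["ms"]
    for s in out.values():
        s["avg_ms"] = s.pop("ms") / s["requests"]
    return out

print(summarize(records)["gpt-4"])
```

Grouping by day instead of by model gives the daily-trend view the same way.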
# List models
curl http://localhost:8002/v1/models
# Chat completions
curl -X POST http://localhost:8002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "my-virtual-model", "messages": [{"role": "user", "content": "Hello!"}]}'

- `GET /v1/models` - List available models (virtual models + default)
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Embeddings
- `GET /health` - Health check
| Endpoint | Method | Description |
|---|---|---|
| `/api/admin/endpoints` | GET, POST | List/create endpoints |
| `/endpoints` | GET, POST | Manage endpoints |
| `/endpoints/<id>` | PUT | Update endpoint |
| `/endpoints/<id>/delete` | GET, DELETE | Delete endpoint |
| `/endpoints/<id>/test` | POST | Test endpoint connection |
| `/endpoints/<id>/models` | GET | Fetch available models |
| `/api/admin/virtual-models` | GET | List virtual models |
| `/virtual-models` | POST | Create virtual model |
| `/virtual-models/<id>` | PUT | Update virtual model |
| `/virtual-models/<id>/delete` | GET, DELETE | Delete virtual model |
| `/api/admin/tool-patterns` | GET, POST | List/create tool patterns |
| `/api/admin/tool-patterns/<id>` | PUT, DELETE | Update/delete tool pattern |
| Type | Description |
|---|---|
| `openwebui` | OpenWebUI API (`/api/chat/completions`, `/api/models`, `/api/v1/embeddings`) |
| `openai` | OpenAI-compatible API (`/v1/chat/completions`, `/v1/models`, `/v1/embeddings`) |
| `ollama` | Ollama API |
| `vllm` | vLLM API |
| `together` | Together AI |
| `runpod` | RunPod Serverless |
| `anthropic` | Anthropic Messages API (`/v1/messages`) |
| `deepinfra` | DeepInfra OpenAI-compatible API (`/v1/openai/chat/completions`) |
| `queue` | AI Queue endpoint (`/v1/chat/completions`, `/v1/embeddings`) |
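The table above implies a per-type path mapping. A minimal sketch of how a proxy might pick the chat-completions path by endpoint type (only the paths documented above are shown; the lookup itself is illustrative, not the proxy's actual routing code):

```python
# Chat-completion paths per backend type, taken from the table above.
CHAT_PATHS = {
    "openwebui": "/api/chat/completions",
    "openai": "/v1/chat/completions",
    "anthropic": "/v1/messages",
    "deepinfra": "/v1/openai/chat/completions",
    "queue": "/v1/chat/completions",
}

def chat_url(base_url: str, endpoint_type: str) -> str:
    """Join the endpoint's base URL with the path its type expects."""
    # Fall back to the OpenAI-style path for types not listed above (assumption).
    path = CHAT_PATHS.get(endpoint_type, "/v1/chat/completions")
    return base_url.rstrip("/") + path

print(chat_url("https://api.deepinfra.com", "deepinfra"))
```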
Route requests through AI Queue Master for priority queuing and request tracking.
USE_AI_QUEUE=true
AI_QUEUE_URL=http://host.docker.internal:8102
AI_QUEUE_API_KEY=your_queue_api_key
AI_QUEUE_PRIORITY=NORMAL

- Tool call parsing — Automatically extracts tool calls from model output
- Chain-of-thought stripping — Removes reasoning prefixes
- Streaming & non-streaming — Full SSE streaming support
- Job polling — Automatically polls for queued job completion
- Session-based auth — Uses AI Menu System for admin authentication
- Claude Code / OpenCode support — Compatible with AI coding assistants
AI coding assistants require specific configurations to work properly. The proxy includes special handling to ensure compatibility:
- Tool call normalization — Automatically fixes malformed tool calls from models
- System prompt preservation — Maintains context across code generation sessions
- Streaming optimization — Real-time tool execution for interactive coding
- Response format conversion — Ensures OpenAI-compatible format for tool results
- Error handling — Graceful fallbacks when models produce unexpected output
- Claude Code compatibility — Claude Code works best with OpenAI-compatible endpoints through the proxy, even when using non-OpenAI models
Use models with strong tool-calling capabilities. Recommended:
- Qwen series (e.g., Qwen3-80B, Qwen3-Coder) - Excellent tool calling
- Claude 3.5+ - Native tool support via Anthropic API
- DeepSeek-V3 - Good tool calling performance
For best results with coding assistants:
- Use OpenAI-compatible or DeepInfra endpoint types
- Enable streaming for real-time tool execution
- Configure adequate max_tokens (8192-128000 for code generation)
When creating virtual models for coding assistants:
- Set an appropriate `max_tokens` to allow long code outputs
- Use models that support tool calls (check provider docs)
- For Anthropic models, ensure the endpoint type is set to `anthropic`
Tools not executing:
- Check model supports tool calls (not all models do)
- Verify streaming is enabled
- Check response format in logs
Code execution errors:
- Verify model output is valid JSON for tool calls
- Check custom headers if required by your setup
# View container logs
docker logs serverless-proxy
# Restart container
docker restart serverless-proxy
# Check health
curl http://localhost:8002/health

.
├── simple_bridge.py # Main proxy application (FastAPI + Flask)
├── docker-compose.yml # Docker Compose configuration
├── Dockerfile # Container image definition
├── requirements.txt # Python dependencies
├── templates/
│ └── admin_dashboard.html # Admin UI (static HTML)
├── .env.example # Environment variable template
├── README.md
└── CHANGELOG.md
MIT License — see LICENSE.md
Based on RunPod serverless API patterns. Extended with virtual model configuration, Anthropic API compatibility, and admin UI capabilities.