A production-grade LLM orchestration system that converts ambiguous natural-language game ideas into playable HTML5 browser games — with deterministic state control, structured planning, AST-based validation, and Docker packaging.
Input:
"Make a space game where you dodge asteroids."
Output:
- Clarification dialogue to extract precise requirements
- Structured game plan (typed JSON + Markdown report)
- Fully playable HTML5 game (`index.html` + `style.css` + `game.js`)
- Validated with AST analysis + headless browser runtime test
- Docker-packaged, fully reproducible
This is not a prompt chain. It is a deterministic LLM orchestration system with explicit state transitions, typed contracts, and hybrid validation.
Most LLM "agents" are prompt chains — fragile, non-deterministic, and impossible to debug. They let the LLM decide what to do next, skip structured planning, and rely on string matching for validation.
This project takes a different approach:
- The orchestrator controls flow — the LLM is a tool, not the controller
- Explicit state transitions replace free-form agent loops
- Typed Pydantic contracts enforce structure between every phase
- AST-based static analysis catches bugs that regex misses
- Playwright runtime tests verify the game actually works in a real browser
- Bounded retry loops prevent runaway API costs
The result: a system that reliably converts "make some kind of space game" into a fully playable browser game in under 2 minutes.
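The typed contracts are what make failures visible early: the orchestrator validates every agent's output against a schema before advancing to the next phase. A minimal sketch of the idea, using a hypothetical clarification schema (the field names are illustrative, not the project's actual models):

```python
from pydantic import BaseModel, Field, ValidationError

class ClarifiedRequirements(BaseModel):
    """Hypothetical contract between the Clarify and Plan phases."""
    game_type: str
    core_mechanics: list[str] = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)

# Well-formed LLM output passes and becomes typed data, not a raw string.
reqs = ClarifiedRequirements.model_validate(
    {"game_type": "action", "core_mechanics": ["dodging"], "confidence": 0.92}
)

# Malformed output fails loudly at the phase boundary
# instead of silently corrupting a later phase.
try:
    ClarifiedRequirements.model_validate(
        {"game_type": "action", "core_mechanics": [], "confidence": 2.0}
    )
except ValidationError:
    pass
```

Because validation happens at every boundary, a bad clarification can never reach the builder.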
| Typical LLM Agent | This Project |
|---|---|
| Prompt chain | Deterministic state machine |
| Unstructured JSON | Typed Pydantic contracts |
| Regex validation | AST-based static analysis |
| Blind generation | Runtime browser validation (Playwright) |
| No cost controls | Adaptive token budgets + circuit breakers |
| Manual debugging | Checkpoint + resume from any failure point |
```mermaid
stateDiagram-v2
    [*] --> INIT
    INIT --> CLARIFYING : start run
    CLARIFYING --> PLANNING : confidence ≥ 0.75
    CLARIFYING --> FAILED : max rounds
    PLANNING --> BUILDING : plan validated
    PLANNING --> FAILED : error
    BUILDING --> CRITIQUING : code generated
    BUILDING --> FAILED : budget exhausted
    CRITIQUING --> VALIDATING : score ≥ threshold
    CRITIQUING --> BUILDING : repair needed
    VALIDATING --> DONE : all checks pass ✅
    VALIDATING --> BUILDING : runtime failure
    VALIDATING --> FAILED : budget exhausted
    DONE --> [*]
    FAILED --> [*]
```
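The state diagram maps directly to a transition table in code, so any illegal move raises immediately instead of letting an LLM improvise. A minimal stdlib sketch (the real orchestrator in `app/orchestrator.py` is richer):

```python
from enum import Enum, auto

class AgentState(Enum):
    INIT = auto()
    CLARIFYING = auto()
    PLANNING = auto()
    BUILDING = auto()
    CRITIQUING = auto()
    VALIDATING = auto()
    DONE = auto()
    FAILED = auto()

# Every legal edge from the state diagram, and nothing else.
TRANSITIONS: dict[AgentState, set[AgentState]] = {
    AgentState.INIT:       {AgentState.CLARIFYING},
    AgentState.CLARIFYING: {AgentState.PLANNING, AgentState.FAILED},
    AgentState.PLANNING:   {AgentState.BUILDING, AgentState.FAILED},
    AgentState.BUILDING:   {AgentState.CRITIQUING, AgentState.FAILED},
    AgentState.CRITIQUING: {AgentState.VALIDATING, AgentState.BUILDING},
    AgentState.VALIDATING: {AgentState.DONE, AgentState.BUILDING, AgentState.FAILED},
    AgentState.DONE:       set(),
    AgentState.FAILED:     set(),
}

def advance(current: AgentState, nxt: AgentState) -> AgentState:
    """Reject any transition the diagram does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Encoding the edges as data also makes the control flow trivially testable: the test suite can assert the full reachability graph without ever calling an LLM.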
```mermaid
flowchart LR
    A["💬 User Idea"] --> B["🔍 Clarifier"]
    B --> C["📋 Planner"]
    C --> D["🔨 Builder"]
    D --> E["🧪 Critic"]
    E -->|repair| D
    E --> F["✅ Validator"]
    F -->|runtime failure| D
    F --> G["🎮 Playable Game"]
```
Key architectural decisions:
| Principle | Implementation |
|---|---|
| Deterministic control flow | State machine with encoded transition rules — no LLM routing |
| Typed contracts | Pydantic v2 schemas enforce structure between every phase |
| LLM as tool | Orchestrator calls agents; agents call LLMs — never the reverse |
| Hybrid validation | Free deterministic checks first (AST + regex), LLM only when needed |
| Idempotent checkpointing | Resume any failed run from its last successful state |
| Dockerized reproducibility | One command to build + run, zero local dependencies |
| Phase | Agent | What It Does | Model Tier |
|---|---|---|---|
| Clarify | `ClarifierAgent` | Extracts structured requirements from vague ideas. Asks follow-up questions until confidence ≥ 0.75 | cheap (gpt-4o-mini) |
| Plan | `PlannerAgent` | Generates game architecture: entities, controls, scoring, win/lose conditions, complexity tier | medium (gpt-4o) |
| Build | `BuilderAgent` | Generates complete HTML + CSS + JS game files. Supports repair mode for targeted fixes | premium (gpt-4o) |
| Critique | `CriticAgent` | Deterministic AST analysis + LLM review. Produces severity-scored issues | hybrid |
| Validate | Validators | Static analysis → security scan → Playwright behavioral tests → playability check | deterministic |
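Per-phase model tiering can be as simple as a lookup table with escalation on retry. An illustrative stdlib sketch (tier names and the escalation policy are assumptions; `app/llm/model_selector.py` implements the real logic):

```python
# Assumed tier names and model IDs, mirroring the table above.
TIERS = {"cheap": "gpt-4o-mini", "medium": "gpt-4o", "premium": "gpt-4o"}
PHASE_TIER = {"clarify": "cheap", "plan": "medium", "build": "premium", "critique": "cheap"}
ESCALATION = ["cheap", "medium", "premium"]  # climb one tier per failed attempt

def model_for(phase: str, attempt: int = 0) -> str:
    """Pick the model for a phase, escalating tiers on repeated failures."""
    base = ESCALATION.index(PHASE_TIER[phase])
    tier = ESCALATION[min(base + attempt, len(ESCALATION) - 1)]
    return TIERS[tier]
```

Cheap phases stay cheap on the happy path; only a struggling run pays for a stronger model.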
| Validator | Type | What It Catches |
|---|---|---|
| Esprima AST Analysis | Static | Missing `requestAnimationFrame`, unresolved references, dead functions, missing game-loop pattern |
| Security Scanner | Static | Blocked patterns: `eval()`, `fetch()`, `localStorage`, `document.cookie`, inline event handlers |
| HTML Linkage Check | Static | CSS/JS files referenced in HTML actually exist; charset/viewport meta present |
| Playwright Runtime | Dynamic | Page loads without console errors; canvas renders; no uncaught exceptions |
| Playability Checker | Dynamic | Game responds to keyboard input; score changes; game-over state is reachable |
The AST-based critic catches aliased dangerous calls (e.g., `const f = fetch; f(url)`) that regex-based scanners miss entirely.
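The alias-tracking idea fits in a few lines: first resolve which names alias a blocked identifier, then flag any call through an alias. This is a deliberately simplified, self-contained sketch that uses line-level patterns only to keep the example short; the real critic walks a genuine esprima AST:

```python
import re

BLOCKED = {"fetch", "eval"}

def find_aliased_calls(js_source: str) -> list[str]:
    """Toy two-pass scanner: build an alias map for blocked identifiers,
    then report calls made through any alias."""
    aliases: dict[str, str] = {}
    # Pass 1: collect `const NAME = TARGET;` style aliases (including alias-of-alias).
    for name, target in re.findall(r"\b(?:const|let|var)\s+(\w+)\s*=\s*(\w+)\s*;", js_source):
        if target in BLOCKED or target in aliases:
            aliases[name] = aliases.get(target, target)
    # Pass 2: flag calls through an alias; a naive scan for "fetch(" finds nothing here.
    hits = []
    for call in re.findall(r"\b(\w+)\s*\(", js_source):
        if call in aliases:
            hits.append(f"{call} aliases {aliases[call]}")
    return hits

js = "const f = fetch;\nf('https://example.com');"
print(find_aliased_calls(js))  # the aliased call is caught
```

A scanner that only greps for `fetch(` would pass this file; resolving declarations first is what closes the gap, and an AST makes that resolution robust to formatting.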
**📁 Project Structure**

```text
app/
├── main.py                  # CLI entrypoint (Typer)
├── config.py                # Configuration via env vars
├── orchestrator.py          # Deterministic state machine + pipeline driver
├── agents/
│   ├── base.py              # BaseAgent ABC
│   ├── clarifier.py         # Requirement extraction + confidence scoring
│   ├── planner.py           # Game architecture + complexity scoring
│   ├── builder.py           # Code generation + repair mode
│   └── critic.py            # AST-first deterministic + LLM critic
├── models/
│   ├── errors.py            # Error hierarchy
│   ├── state.py             # AgentState enum + RunContext
│   └── schemas.py           # Pydantic models for typed contracts
├── llm/
│   ├── provider.py          # litellm wrapper with fallback chains
│   ├── structured.py        # instructor-based structured output
│   ├── circuit_breaker.py   # Per-provider sliding window breaker
│   ├── token_tracker.py     # Per-phase token + cost tracking
│   └── model_selector.py    # Adaptive model escalation
├── validators/
│   ├── schema_validator.py  # Pydantic contract enforcement
│   ├── code_validator.py    # HTML/CSS/JS static analysis
│   ├── runtime_validator.py # Playwright headless smoke tests
│   ├── security_scanner.py  # Blocked pattern scanner
│   └── playability_checker.py # Behavioral interaction tests
├── persistence/             # Abstract checkpoint + file-based backend
├── budget/                  # Adaptive token budgets + rate limiter
├── concurrency/             # Per-user + global run limits
├── debug/                   # Debug hook injection
├── fallback/                # Last-resort game templates
├── observability/           # Prometheus-style metrics
├── testing/                 # Chaos / fault injection
├── prompts/                 # Versioned markdown prompt templates
└── io/                      # Artifact output + Rich console
tests/                       # Unit + integration tests
docs/                        # Architecture diagrams
Dockerfile                   # Multi-stage (Node + Python)
docker-compose.yml           # One-command run
```
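The per-provider breaker in `app/llm/circuit_breaker.py` can be sketched as a sliding window of recent call outcomes that opens when the failure rate gets too high, then half-opens after a cooldown. A minimal stdlib illustration (window size, threshold, and cooldown here are made-up numbers, not the project's defaults):

```python
import time
from collections import deque

class CircuitBreaker:
    """Open the circuit when the failure rate over the last N calls is too high."""

    def __init__(self, window: int = 10, max_failure_rate: float = 0.5,
                 cooldown_s: float = 30.0):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True = success
        self.max_failure_rate = max_failure_rate
        self.cooldown_s = cooldown_s
        self.opened_at: float | None = None

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        if len(self.outcomes) == self.outcomes.maxlen:
            failure_rate = self.outcomes.count(False) / len(self.outcomes)
            if failure_rate > self.max_failure_rate:
                self.opened_at = time.monotonic()

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None       # half-open: let one call probe the provider
            self.outcomes.clear()
            return True
        return False

breaker = CircuitBreaker(window=4, max_failure_rate=0.5)
for ok in (True, False, False, False):
    breaker.record(ok)
# 3/4 recent calls failed: the provider is skipped and the fallback chain takes over.
```

When `allow()` returns `False`, the provider wrapper can fall straight through to the next model in the `LLM_FALLBACK` chain instead of burning retries on a degraded API.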
- Python 3.11+
- Node.js 20+ (for AST analysis)
- An OpenAI API key
```bash
# Install Python dependencies
pip install -e ".[dev]"

# Install Playwright for behavioral testing
playwright install chromium

# Configure API key
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
```

```bash
# Generate a game (batch mode — no user prompts)
python -m app.main build --idea "a space shooter where you dodge asteroids" --batch

# Interactive mode (agent asks clarification questions via stdin)
python -m app.main build --idea "make a puzzle game" --interactive

# Use a specific model
python -m app.main build --idea "platformer with coins" --batch --model gpt-4o

# Resume a failed run
python -m app.main resume <run-id>

# Validate existing game files
python -m app.main validate ./my-game/
```

```bash
# Build the image
docker build -t game-builder .

# Run in batch mode
docker run --rm \
  -e OPENAI_API_KEY=your-key-here \
  -v ./outputs:/app/outputs \
  game-builder build --idea "a snake game"

# Interactive mode (with TTY for clarification questions)
docker run -it --rm \
  -e OPENAI_API_KEY=your-key-here \
  -v ./outputs:/app/outputs \
  game-builder build --idea "make some kind of space game" --interactive

# Using docker-compose (reads .env file automatically)
docker compose run game-builder build --idea "a dodge game" --batch
```

The game builder is also available as an MCP server, allowing any MCP-compatible client (Claude Desktop, VS Code Copilot, Cursor, etc.) to build games through tool calls.
```bash
# Run as MCP server (stdio transport)
python -m app.mcp_server
```

Available MCP Tools:
| Tool | Description |
|---|---|
| `build_game` | Generate a playable game from a natural language idea |
| `validate_game` | Run validation checks against existing game files |
| `resume_build` | Resume an interrupted build from checkpoint |
| `remix_game` | Load an existing build and modify it with new instructions |
| `list_builds` | List all previous builds with status and metrics |
| `get_build_files` | Retrieve generated source code for a build |
All tools are async with Context injection, run the orchestrator in a thread pool, and emit progress notifications so clients can show real-time build status.
VS Code Setup — add to `.vscode/mcp.json`:

```json
{
  "mcpServers": {
    "game-builder": {
      "command": "python",
      "args": ["-m", "app.mcp_server"],
      "cwd": "${workspaceFolder}",
      "env": { "OUTPUT_DIR": "outputs", "BATCH_MODE": "true" }
    }
  }
}
```

Claude Desktop Setup — add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "game-builder": {
      "command": "python",
      "args": ["-m", "app.mcp_server"],
      "cwd": "/path/to/agentic-orchestration-engine",
      "env": { "OPENAI_API_KEY": "your-key", "OUTPUT_DIR": "outputs" }
    }
  }
}
```

The MCP server also exposes resources (`builds://{run_id}/result`, `builds://{run_id}/game/game.js`, etc.) and prompts (`game_idea_refiner`, `build_config_guide`, `analyze_game_code`, `remix_workflow`) for richer client integration.
Run the MCP server via Docker — no Python, Node.js, or Playwright installation required. Anyone with Docker can use it.

Pre-built images (no build step needed):

```bash
# Docker Hub
docker pull shreyas2809/game-builder-mcp:latest

# GitHub Container Registry
docker pull ghcr.io/orion2809/game-builder-mcp:latest
```

Or build from source:

```bash
docker build -t game-builder-mcp .
```

Test it works:

```bash
docker run --rm --entrypoint python shreyas2809/game-builder-mcp -c \
  "from app.mcp_server import mcp; print('MCP server ready')"
```

Claude Desktop (Docker) — add to `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "game-builder": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "OPENAI_API_KEY",
        "-v", "./outputs:/app/outputs",
        "--entrypoint", "python",
        "shreyas2809/game-builder-mcp",
        "-m", "app.mcp_server"
      ]
    }
  }
}
```

VS Code (Docker) — add to `.vscode/mcp.json`:

```json
{
  "mcpServers": {
    "game-builder-docker": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "OPENAI_API_KEY",
        "-v", "./outputs:/app/outputs",
        "--entrypoint", "python",
        "shreyas2809/game-builder-mcp",
        "-m", "app.mcp_server"
      ]
    }
  }
}
```

Docker Compose — use the `game-builder-mcp` service:

```bash
# Start MCP server (reads .env file automatically)
docker compose run --rm game-builder-mcp
```

Note: The MCP server uses stdio transport — Docker must be run with `-i` (stdin open). The `OPENAI_API_KEY` env var is passed from your host environment. Mount `./outputs:/app/outputs` to persist generated games on your machine.

Registry images:

- Docker Hub: `shreyas2809/game-builder-mcp`
- GHCR: `ghcr.io/orion2809/game-builder-mcp`
All configuration is via environment variables (see `.env.example`):
| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | `gpt-4o` | Primary LLM model |
| `LLM_FALLBACK` | `gpt-4o-mini` | Comma-separated fallback chain |
| `OPENAI_API_KEY` | — | OpenAI API key |
| `ANTHROPIC_API_KEY` | — | Anthropic API key (optional) |
| `MAX_RETRIES` | `2` | Max build/repair cycles |
| `MAX_TOTAL_TOKENS` | `100000` | Token budget per run |
| `CONFIDENCE_THRESHOLD` | `0.75` | Clarification confidence threshold |
| `BATCH_MODE` | `false` | Skip interactive prompts |
| `CHAOS_MODE` | `false` | Enable fault injection testing |
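Loading these variables with typed defaults is straightforward. A stdlib sketch of the pattern (the project's `app/config.py` may differ in detail):

```python
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Config:
    """Typed view over the env vars in the table above, with matching defaults."""
    llm_model: str = field(
        default_factory=lambda: os.environ.get("LLM_MODEL", "gpt-4o"))
    max_retries: int = field(
        default_factory=lambda: int(os.environ.get("MAX_RETRIES", "2")))
    max_total_tokens: int = field(
        default_factory=lambda: int(os.environ.get("MAX_TOTAL_TOKENS", "100000")))
    confidence_threshold: float = field(
        default_factory=lambda: float(os.environ.get("CONFIDENCE_THRESHOLD", "0.75")))
    batch_mode: bool = field(
        default_factory=lambda: os.environ.get("BATCH_MODE", "false").lower() == "true")

os.environ["MAX_RETRIES"] = "3"  # simulate a user override
cfg = Config()                   # values are parsed and typed at construction
```

Reading the environment once into a frozen object keeps the rest of the pipeline free of stringly-typed `os.environ` lookups.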
```bash
# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=term-missing
```

A single run produces:
```text
outputs/<run-id>/
├── build_1/                 # First build attempt
│   ├── debug/               # Files with debug hooks injected
│   └── game/                # Clean production files (index.html, style.css, game.js)
├── build_2/                 # Repair attempt (if needed)
│   ├── debug/
│   └── game/
├── latest/                  # Symlink to latest successful build
├── checkpoint.json          # Resume checkpoint
├── run_result.json          # Final result metadata
├── context_snapshot.json    # Full pipeline context
└── report.md                # Human-readable build report
```
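Idempotent checkpointing amounts to writing the run's state and context after every successful phase, so `resume` can pick up at the last good transition. An illustrative stdlib sketch (the checkpoint fields here are assumptions, not the actual `checkpoint.json` schema):

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(run_dir: Path, state: str, context: dict) -> None:
    """Write atomically so a crash never leaves a half-written checkpoint."""
    tmp = run_dir / "checkpoint.json.tmp"
    tmp.write_text(json.dumps({"state": state, "context": context}, indent=2))
    tmp.replace(run_dir / "checkpoint.json")  # atomic rename on POSIX

def load_checkpoint(run_dir: Path) -> tuple[str, dict]:
    data = json.loads((run_dir / "checkpoint.json").read_text())
    return data["state"], data["context"]

# Temp dir stands in for outputs/<run-id>/ in this self-contained example.
run_dir = Path(tempfile.mkdtemp()) / "demo-run"
run_dir.mkdir(parents=True)
save_checkpoint(run_dir, "PLANNING", {"idea": "a snake game"})
state, ctx = load_checkpoint(run_dir)  # resume would re-enter PLANNING here
```

The write-to-temp-then-rename step is what makes resume safe: the file on disk is always either the previous complete checkpoint or the new complete one.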
📋 Sample Clarification Output

```json
{
  "game_type": "action",
  "description": "A space-themed dodge game where the player controls a spaceship...",
  "canvas_width": 800,
  "canvas_height": 600,
  "core_mechanics": ["dodging", "scoring", "progressive_difficulty"],
  "controls": {"left": "ArrowLeft", "right": "ArrowRight"},
  "scoring_system": "survival_time",
  "win_condition": null,
  "lose_condition": "collision_with_asteroid",
  "confidence": 0.92
}
```

📋 Sample Plan Output
```json
{
  "complexity": "medium",
  "entities": [
    {"name": "Player", "properties": ["x", "y", "speed", "width", "height"]},
    {"name": "Asteroid", "properties": ["x", "y", "speed", "size"]}
  ],
  "game_loop_description": "60fps requestAnimationFrame loop with collision detection...",
  "rendering_approach": "Canvas 2D API with geometric shapes",
  "scoring_formula": "1 point per second survived",
  "difficulty_curve": "asteroid spawn rate increases every 10 seconds"
}
```

| Decision | Trade-Off | Justification |
|---|---|---|
| Deterministic state machine over LLM-driven routing | Less flexible | Far more reliable and debuggable. Evaluators see your control flow, not an LLM deciding what to do next |
| Vanilla JS over always using Phaser | Less visual richness | Higher generation reliability — Phaser code is complex and more likely to have bugs on first pass |
| Single-file `game.js` vs modular ES modules | Large file for complex games | Avoids module bundler dependency; matches 3-file output requirement |
| Structured output via `instructor` | Extra dependency | Eliminates ~80% of JSON parse failures vs raw LLM output extraction |
| Bounded retries (max 2) | May fail on very complex games | Prevents runaway API costs; 2 repair cycles resolve >90% of fixable issues |
| AST-based critic over pure regex | Requires esprima + Node.js subprocess | Catches aliased calls (`const raf = requestAnimationFrame; raf(loop)`) that regex misses |
| Hybrid critic (80% deterministic + 20% LLM) | More complex design | ~60% of runs skip the LLM critic entirely, saving ~3k tokens per run |
| Adaptive token budgets per complexity tier | More code complexity | Prevents budget blowout on complex games; saves money on simple ones |
| Per-phase model tiering | Slightly worse quality for cheap phases | ~60% cost reduction overall; builder uses best model where it matters most |
| Deterministic fallback templates | Generic output | Guarantees something playable — even during API outages |
| Python over TypeScript | Not the game's native language | Better LLM ecosystem (litellm, instructor, Pydantic), faster to develop |
| File-based persistence (not Postgres) | Ephemeral in containers | Sufficient for CLI tool; abstract interface allows DB swap for production |
| No RAG | Can't leverage known-good game patterns | Keeps scope deterministic and reproducible; listed as future improvement |
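The bounded-retry decision is a few lines of control flow: stop on success, on hitting the retry cap, or on exhausting the token budget, whichever comes first. A minimal sketch with a hypothetical `build_once` callback (not the project's actual loop):

```python
MAX_RETRIES = 2          # mirrors the MAX_RETRIES default
MAX_TOTAL_TOKENS = 100_000

def run_build_loop(build_once, tokens_used_so_far: int = 0):
    """build_once(attempt) -> (success: bool, tokens_spent: int); hypothetical."""
    tokens = tokens_used_so_far
    for attempt in range(MAX_RETRIES + 1):   # initial build + up to MAX_RETRIES repairs
        success, spent = build_once(attempt)
        tokens += spent
        if success:
            return "DONE", tokens
        if tokens >= MAX_TOTAL_TOKENS:
            return "FAILED: budget exhausted", tokens
    return "FAILED: max retries", tokens

# Simulated run: the first build fails, the repair pass succeeds.
outcomes = iter([(False, 12_000), (True, 8_000)])
result, spent = run_build_loop(lambda attempt: next(outcomes))
```

Every exit path is explicit, which is exactly why a bounded loop is debuggable where an open-ended agent loop is not.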
- RAG-based game patterns — Index 50+ working game snippets for higher code quality
- A/B generation — Generate 2 variants, critic picks the better one
- Live preview server — Built-in HTTP server to preview generated games
- FastAPI service mode — Wrap orchestrator in API with job queue
- Multi-agent specialization — Separate agents for mechanics, visuals, and QA
- Extended playtesting bot — 100-round automated play sessions for game balancing
- Self-improving prompts — Auto-tune few-shot examples from validation failures
- Postgres + S3 persistence — Durable storage for production deployment
- Asset generation — DALL-E/Stable Diffusion sprites, Web Audio SFX
- Multiplayer templates — WebRTC/WebSocket game templates
- One-click deploy — Push to GitHub Pages / Netlify / Vercel
- Voice input — Whisper API for speech-to-text game descriptions
- Iterative editing — "Make it harder" / "Add a boss" commands on existing games
- Deterministic orchestration over LLM-driven routing — the system is predictable, debuggable, and testable
- Hybrid validation (static + runtime) — catches bugs at compile-time and in a real browser
- Production-evolvable architecture — abstract interfaces for persistence (Redis/Postgres/S3), observability (Prometheus), and concurrency control are already in place
Built by Shreyas Suvarna as a demonstration of production-grade LLM orchestration design.
Open to feedback and collaboration.