ORION2809/agentic-orchestration-engine

Deterministic Agentic Game Builder

A production-grade LLM orchestration system that converts ambiguous natural-language game ideas into playable HTML5 browser games — with deterministic state control, structured planning, AST-based validation, and Docker packaging.


🧭 What This Actually Does (30s Overview)

Input:

"Make a space game where you dodge asteroids."

Output:

  • Clarification dialogue to extract precise requirements
  • Structured game plan (typed JSON + Markdown report)
  • Fully playable HTML5 game (index.html + style.css + game.js)
  • Validated with AST analysis + headless browser runtime test
  • Docker-packaged, fully reproducible

This is not a prompt chain. It is a deterministic LLM orchestration system with explicit state transitions, typed contracts, and hybrid validation.


🚀 Why This Project Exists

Most LLM "agents" are prompt chains — fragile, non-deterministic, and impossible to debug. They let the LLM decide what to do next, skip structured planning, and rely on string matching for validation.

This project takes a different approach:

  • The orchestrator controls flow — the LLM is a tool, not the controller
  • Explicit state transitions replace free-form agent loops
  • Typed Pydantic contracts enforce structure between every phase
  • AST-based static analysis catches bugs that regex misses
  • Playwright runtime tests verify the game actually works in a real browser
  • Bounded retry loops prevent runaway API costs

The result: a system that reliably converts "make some kind of space game" into a fully playable browser game in under 2 minutes.

🆚 How This Differs from Typical LLM Agents

| Typical LLM Agent | This Project |
| --- | --- |
| Prompt chain | Deterministic state machine |
| Unstructured JSON | Typed Pydantic contracts |
| Regex validation | AST-based static analysis |
| Blind generation | Runtime browser validation (Playwright) |
| No cost controls | Adaptive token budgets + circuit breakers |
| Manual debugging | Checkpoint + resume from any failure point |

🧠 Architecture Overview

stateDiagram-v2
    [*] --> INIT
    INIT --> CLARIFYING : start run
    CLARIFYING --> PLANNING : confidence ≥ 0.75
    CLARIFYING --> FAILED : max rounds
    PLANNING --> BUILDING : plan validated
    PLANNING --> FAILED : error
    BUILDING --> CRITIQUING : code generated
    BUILDING --> FAILED : budget exhausted
    CRITIQUING --> VALIDATING : score ≥ threshold
    CRITIQUING --> BUILDING : repair needed
    VALIDATING --> DONE : all checks pass ✅
    VALIDATING --> BUILDING : runtime failure
    VALIDATING --> FAILED : budget exhausted
    DONE --> [*]
    FAILED --> [*]
flowchart LR
    A["💬 User Idea"] --> B["🔍 Clarifier"]
    B --> C["📋 Planner"]
    C --> D["🔨 Builder"]
    D --> E["🧪 Critic"]
    E -->|repair| D
    E --> F["✅ Validator"]
    F -->|runtime failure| D
    F --> G["🎮 Playable Game"]

Key architectural decisions:

| Principle | Implementation |
| --- | --- |
| Deterministic control flow | State machine with encoded transition rules; no LLM routing |
| Typed contracts | Pydantic v2 schemas enforce structure between every phase |
| LLM as tool | Orchestrator calls agents and agents call LLMs, never the reverse |
| Hybrid validation | Free deterministic checks first (AST + regex), LLM only when needed |
| Idempotent checkpointing | Resume any failed run from its last successful state |
| Dockerized reproducibility | One command to build + run, zero local dependencies |
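
The "encoded transition rules" idea can be illustrated with a small transition table. This is a minimal sketch that copies the edges from the state diagram; only the `AgentState` name comes from the project layout, while `TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical names chosen for illustration and may not match the real `orchestrator.py`.

```python
from enum import Enum


class AgentState(Enum):
    INIT = "INIT"
    CLARIFYING = "CLARIFYING"
    PLANNING = "PLANNING"
    BUILDING = "BUILDING"
    CRITIQUING = "CRITIQUING"
    VALIDATING = "VALIDATING"
    DONE = "DONE"
    FAILED = "FAILED"


# Allowed moves, transcribed from the state diagram (hypothetical encoding).
TRANSITIONS: dict[AgentState, frozenset[AgentState]] = {
    AgentState.INIT: frozenset({AgentState.CLARIFYING}),
    AgentState.CLARIFYING: frozenset({AgentState.PLANNING, AgentState.FAILED}),
    AgentState.PLANNING: frozenset({AgentState.BUILDING, AgentState.FAILED}),
    AgentState.BUILDING: frozenset({AgentState.CRITIQUING, AgentState.FAILED}),
    AgentState.CRITIQUING: frozenset({AgentState.VALIDATING, AgentState.BUILDING}),
    AgentState.VALIDATING: frozenset(
        {AgentState.DONE, AgentState.BUILDING, AgentState.FAILED}
    ),
    AgentState.DONE: frozenset(),
    AgentState.FAILED: frozenset(),
}


class InvalidTransition(Exception):
    """Raised when code requests a move the table does not allow."""


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return `target` if the move is allowed, otherwise raise."""
    if target not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current.name} -> {target.name} is not allowed")
    return target
```

Because the table is plain data, an illegal move such as `DONE -> BUILDING` fails loudly in a unit test instead of depending on what a model happens to output.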

🔄 Agent Workflow

| Phase | Agent | What It Does | Model Tier |
| --- | --- | --- | --- |
| Clarify | ClarifierAgent | Extracts structured requirements from vague ideas. Asks follow-up questions until confidence ≥ 0.75 | cheap (gpt-4o-mini) |
| Plan | PlannerAgent | Generates game architecture: entities, controls, scoring, win/lose conditions, complexity tier | medium (gpt-4o) |
| Build | BuilderAgent | Generates complete HTML + CSS + JS game files. Supports repair mode for targeted fixes | premium (gpt-4o) |
| Critique | CriticAgent | Deterministic AST analysis + LLM review. Produces severity-scored issues | hybrid |
| Validate | Validators | Static analysis → security scan → Playwright behavioral tests → playability check | deterministic |
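
The hybrid Critique phase can be pictured as "free checks first, paid review second". The sketch below is an assumption about how such a gate might look, not the project's actual policy: `Issue`, `deterministic_checks`, and `critique` are hypothetical names, the substring checks merely stand in for the real AST analysis, and the rule "skip the LLM when a high-severity issue already forces a repair" is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    severity: str  # "low", "medium", or "high" (hypothetical scale)
    message: str


def deterministic_checks(js_source: str) -> list[Issue]:
    """Cheap substring checks standing in for the real AST analysis."""
    issues: list[Issue] = []
    if "requestAnimationFrame" not in js_source:
        issues.append(Issue("high", "no requestAnimationFrame game loop found"))
    if "addEventListener" not in js_source:
        issues.append(Issue("medium", "no input listeners registered"))
    return issues


def critique(
    js_source: str, llm_review: Callable[[str], list[Issue]]
) -> list[Issue]:
    """Run free checks first; call the LLM reviewer only when they find no blocker."""
    issues = deterministic_checks(js_source)
    if any(issue.severity == "high" for issue in issues):
        # Assumed policy: a repair is already required, so skip the paid call.
        return issues
    return issues + llm_review(js_source)
```

Passing the reviewer in as a callable keeps the gate testable without network access: a test can supply a fake reviewer and assert that it was never invoked.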

🧪 Validation Layer

| Validator | Type | What It Catches |
| --- | --- | --- |
| Esprima AST Analysis | Static | Missing requestAnimationFrame, unresolved references, dead functions, missing game-loop pattern |
| Security Scanner | Static | Blocked patterns: eval(), fetch(), localStorage, document.cookie, inline event handlers |
| HTML Linkage Check | Static | CSS/JS files referenced in HTML actually exist; charset/viewport meta present |
| Playwright Runtime | Dynamic | Page loads without console errors; canvas renders; no uncaught exceptions |
| Playability Checker | Dynamic | Game responds to keyboard input; score changes; game-over state is reachable |

The AST-based critic catches aliased dangerous calls (e.g., const f = fetch; f(url)) that regex-based scanners miss entirely.
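
To show why alias tracking needs a syntax tree, here is an analogy in Python rather than the project's actual JavaScript analysis: the project parses `game.js` with esprima, whereas this sketch applies the same idea to Python source with the standard-library `ast` module so that it runs without extra dependencies. `BLOCKED`, `regex_scan`, and `ast_scan` are hypothetical names, and the alias tracking is deliberately simple (plain `name = name` assignments, no scoping).

```python
import ast
import re

BLOCKED = {"eval", "exec"}

# Python analogue of `const f = fetch; f(url)`.
SOURCE = "f = eval\nf('1 + 1')\n"


def regex_scan(source: str) -> bool:
    """Naive scan: only sees a blocked name directly followed by '('."""
    return re.search(r"\b(?:eval|exec)\s*\(", source) is not None


def ast_scan(source: str) -> list[str]:
    """Report calls to blocked names, including simple `alias = name` aliases."""
    tree = ast.parse(source)
    aliases: dict[str, str] = {}
    # Pass 1: remember names that were assigned a blocked (or aliased) name.
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            origin = aliases.get(node.value.id, node.value.id)
            if origin in BLOCKED:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        aliases[target.id] = origin
    # Pass 2: flag every call whose callee resolves to a blocked name.
    findings: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            origin = aliases.get(node.func.id, node.func.id)
            if origin in BLOCKED:
                findings.append(
                    f"line {node.lineno}: call to {origin} via {node.func.id}"
                )
    return findings
```

On `SOURCE`, `regex_scan` returns `False` because `eval` is never followed by a parenthesis, while `ast_scan` reports the call on line 2 through the alias `f`.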


📁 Project Structure

app/
├── main.py                  # CLI entrypoint (Typer)
├── config.py                # Configuration via env vars
├── orchestrator.py          # Deterministic state machine + pipeline driver
├── agents/
│   ├── base.py              # BaseAgent ABC
│   ├── clarifier.py         # Requirement extraction + confidence scoring
│   ├── planner.py           # Game architecture + complexity scoring
│   ├── builder.py           # Code generation + repair mode
│   └── critic.py            # AST-first deterministic + LLM critic
├── models/
│   ├── errors.py            # Error hierarchy
│   ├── state.py             # AgentState enum + RunContext
│   └── schemas.py           # Pydantic models for typed contracts
├── llm/
│   ├── provider.py          # litellm wrapper with fallback chains
│   ├── structured.py        # instructor-based structured output
│   ├── circuit_breaker.py   # Per-provider sliding window breaker
│   ├── token_tracker.py     # Per-phase token + cost tracking
│   └── model_selector.py    # Adaptive model escalation
├── validators/
│   ├── schema_validator.py  # Pydantic contract enforcement
│   ├── code_validator.py    # HTML/CSS/JS static analysis
│   ├── runtime_validator.py # Playwright headless smoke tests
│   ├── security_scanner.py  # Blocked pattern scanner
│   └── playability_checker.py # Behavioral interaction tests
├── persistence/             # Abstract checkpoint + file-based backend
├── budget/                  # Adaptive token budgets + rate limiter
├── concurrency/             # Per-user + global run limits
├── debug/                   # Debug hook injection
├── fallback/                # Last-resort game templates
├── observability/           # Prometheus-style metrics
├── testing/                 # Chaos / fault injection
├── prompts/                 # Versioned markdown prompt templates
└── io/                      # Artifact output + Rich console
tests/                       # Unit + integration tests
docs/                        # Architecture diagrams
Dockerfile                   # Multi-stage (Node + Python)
docker-compose.yml           # One-command run
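
The tree lists a per-provider sliding-window circuit breaker (`llm/circuit_breaker.py`). As a rough illustration of that pattern, and not the project's implementation, the sketch below opens the breaker when too many failures fall inside a recent time window. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowBreaker:
    """Opens when `max_failures` failures land inside the last `window_seconds`."""

    def __init__(
        self,
        max_failures: int = 3,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock
        self._failures: deque[float] = deque()

    def _prune(self) -> None:
        # Drop failures that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()

    def record_failure(self) -> None:
        self._failures.append(self._clock())

    def is_open(self) -> bool:
        """True means: stop calling this provider for now."""
        self._prune()
        return len(self._failures) >= self.max_failures


# One breaker per provider name, created on first use (hypothetical registry).
breakers: defaultdict[str, SlidingWindowBreaker] = defaultdict(SlidingWindowBreaker)
```

A caller would check `breakers["openai"].is_open()` before each request and move to the next model in the fallback chain while the breaker is open.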

⚡ Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 20+ (for AST analysis)
  • An OpenAI API key

Local Installation

# Install Python dependencies
pip install -e ".[dev]"

# Install Playwright for behavioral testing
playwright install chromium

# Configure API key
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

Usage

# Generate a game (batch mode — no user prompts)
python -m app.main build --idea "a space shooter where you dodge asteroids" --batch

# Interactive mode (agent asks clarification questions via stdin)
python -m app.main build --idea "make a puzzle game" --interactive

# Use a specific model
python -m app.main build --idea "platformer with coins" --batch --model gpt-4o

# Resume a failed run
python -m app.main resume <run-id>

# Validate existing game files
python -m app.main validate ./my-game/

🐳 Docker (Recommended)

# Build the image
docker build -t game-builder .

# Run in batch mode
docker run --rm \
  -e OPENAI_API_KEY=your-key-here \
  -v ./outputs:/app/outputs \
  game-builder build --idea "a snake game" --batch

# Interactive mode (with TTY for clarification questions)
docker run -it --rm \
  -e OPENAI_API_KEY=your-key-here \
  -v ./outputs:/app/outputs \
  game-builder build --idea "make some kind of space game" --interactive

# Using docker-compose (reads .env file automatically)
docker compose run game-builder build --idea "a dodge game" --batch

🔌 MCP Server (Model Context Protocol)

The game builder is also available as an MCP server, allowing any MCP-compatible client (Claude Desktop, VS Code Copilot, Cursor, etc.) to build games through tool calls.

# Run as MCP server (stdio transport)
python -m app.mcp_server

Available MCP Tools:

| Tool | Description |
| --- | --- |
| build_game | Generate a playable game from a natural language idea |
| validate_game | Run validation checks against existing game files |
| resume_build | Resume an interrupted build from checkpoint |
| remix_game | Load an existing build and modify it with new instructions |
| list_builds | List all previous builds with status and metrics |
| get_build_files | Retrieve generated source code for a build |

All tools are async with Context injection, run the orchestrator in a thread pool, and emit progress notifications so clients can show real-time build status.
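
The "async tool that runs a blocking orchestrator in a thread pool" pattern can be sketched with the standard library alone. This is not the MCP SDK API: `run_pipeline_blocking` is a stand-in for the synchronous orchestrator, and `report_progress` is a hypothetical callback playing the role of the progress notifications a real MCP context would send.

```python
import asyncio
import time
from typing import Awaitable, Callable

# (steps_done, steps_total, message) -> awaitable; hypothetical signature.
ProgressFn = Callable[[int, int, str], Awaitable[None]]


def run_pipeline_blocking(idea: str) -> dict:
    """Stand-in for the synchronous orchestrator run."""
    time.sleep(0.01)  # simulate blocking work
    return {"idea": idea, "status": "DONE"}


async def build_game(idea: str, report_progress: ProgressFn) -> dict:
    """Async wrapper that keeps the event loop responsive during the build."""
    await report_progress(0, 2, "starting build")
    # asyncio.to_thread runs the blocking call in the default thread pool.
    result = await asyncio.to_thread(run_pipeline_blocking, idea)
    await report_progress(2, 2, "build finished")
    return result
```

Keeping the blocking work off the event loop is what allows a server to keep answering other requests while a build is in flight.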

VS Code Setup — add to .vscode/mcp.json:

{
  "servers": {
    "game-builder": {
      "command": "python",
      "args": ["-m", "app.mcp_server"],
      "cwd": "${workspaceFolder}",
      "env": { "OUTPUT_DIR": "outputs", "BATCH_MODE": "true" }
    }
  }
}

Claude Desktop Setup — add to claude_desktop_config.json:

{
  "mcpServers": {
    "game-builder": {
      "command": "python",
      "args": ["-m", "app.mcp_server"],
      "cwd": "/path/to/agentic-orchestration-engine",
      "env": { "OPENAI_API_KEY": "your-key", "OUTPUT_DIR": "outputs" }
    }
  }
}

The MCP server also exposes resources (builds://{run_id}/result, builds://{run_id}/game/game.js, etc.) and prompts (game_idea_refiner, build_config_guide, analyze_game_code, remix_workflow) for richer client integration.

🐳 Docker MCP (Zero Local Dependencies)

Run the MCP server via Docker — no Python, Node.js, or Playwright installation required. Anyone with Docker can use it:

Pre-built images (no build step needed):

# Docker Hub
docker pull shreyas2809/game-builder-mcp:latest

# GitHub Container Registry
docker pull ghcr.io/orion2809/game-builder-mcp:latest

Or build from source:

docker build -t game-builder-mcp .

Test it works:

docker run --rm --entrypoint python shreyas2809/game-builder-mcp -c \
  "from app.mcp_server import mcp; print('MCP server ready')"

Claude Desktop (Docker) — add to claude_desktop_config.json:

{
  "mcpServers": {
    "game-builder": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "OPENAI_API_KEY",
        "-v", "./outputs:/app/outputs",
        "--entrypoint", "python",
        "shreyas2809/game-builder-mcp",
        "-m", "app.mcp_server"
      ]
    }
  }
}

VS Code (Docker) — add to .vscode/mcp.json:

{
  "servers": {
    "game-builder-docker": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "OPENAI_API_KEY",
        "-v", "./outputs:/app/outputs",
        "--entrypoint", "python",
        "shreyas2809/game-builder-mcp",
        "-m", "app.mcp_server"
      ]
    }
  }
}

Docker Compose — use the game-builder-mcp service:

# Start MCP server (reads .env file automatically)
docker compose run --rm game-builder-mcp

Note: The MCP server uses stdio transport — Docker must be run with -i (stdin open). The OPENAI_API_KEY env var is passed from your host environment. Mount ./outputs:/app/outputs to persist generated games on your machine.


⚙️ Configuration

All configuration is via environment variables (see .env.example):

| Variable | Default | Description |
| --- | --- | --- |
| LLM_MODEL | gpt-4o | Primary LLM model |
| LLM_FALLBACK | gpt-4o-mini | Comma-separated fallback chain |
| OPENAI_API_KEY | | OpenAI API key |
| ANTHROPIC_API_KEY | | Anthropic API key (optional) |
| MAX_RETRIES | 2 | Max build/repair cycles |
| MAX_TOTAL_TOKENS | 100000 | Token budget per run |
| CONFIDENCE_THRESHOLD | 0.75 | Clarification confidence threshold |
| BATCH_MODE | false | Skip interactive prompts |
| CHAOS_MODE | false | Enable fault injection testing |
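
For readers who want to see how these variables might be read, here is a minimal sketch using only the standard library. The variable names and defaults come from the table; the `Settings` dataclass, `load_settings`, and the accepted boolean spellings are assumptions and may differ from the real `app/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llm_model: str
    llm_fallback: tuple[str, ...]
    max_retries: int
    max_total_tokens: int
    confidence_threshold: float
    batch_mode: bool


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the project may accept a different set.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to the table defaults."""
    env = os.environ if env is None else env
    fallback = env.get("LLM_FALLBACK", "gpt-4o-mini")
    return Settings(
        llm_model=env.get("LLM_MODEL", "gpt-4o"),
        # Split the comma-separated chain and drop empty entries.
        llm_fallback=tuple(m.strip() for m in fallback.split(",") if m.strip()),
        max_retries=int(env.get("MAX_RETRIES", "2")),
        max_total_tokens=int(env.get("MAX_TOTAL_TOKENS", "100000")),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.75")),
        batch_mode=_as_bool(env.get("BATCH_MODE", "false")),
    )
```

Accepting an explicit mapping makes the loader easy to test without touching the real process environment.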

🧪 Testing

# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=term-missing

📊 Example Output

A single run produces:

outputs/<run-id>/
├── build_1/              # First build attempt
│   ├── debug/            # Files with debug hooks injected
│   └── game/             # Clean production files (index.html, style.css, game.js)
├── build_2/              # Repair attempt (if needed)
│   ├── debug/
│   └── game/
├── latest/               # Symlink to latest successful build
├── checkpoint.json       # Resume checkpoint
├── run_result.json       # Final result metadata
├── context_snapshot.json # Full pipeline context
└── report.md             # Human-readable build report
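
The `checkpoint.json` file is what makes `resume` possible. A file-based checkpoint might be written and read roughly as follows; this is a sketch under assumptions, since the real checkpoint schema is not shown here. The `state` and `context` fields and both function names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(run_dir: Path, state: str, context: dict) -> Path:
    """Write checkpoint.json atomically so a crash never leaves a partial file."""
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "checkpoint.json"
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=str(run_dir), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"state": state, "context": context}, handle)
    os.replace(tmp_name, target)
    return target


def load_checkpoint(run_dir: Path) -> Optional[dict]:
    """Return the last saved checkpoint, or None if the run never saved one."""
    path = run_dir / "checkpoint.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Saving after every successful phase means a resumed run can restart from the recorded state instead of repeating earlier LLM calls.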

📋 Sample Clarification Output

{
  "game_type": "action",
  "description": "A space-themed dodge game where the player controls a spaceship...",
  "canvas_width": 800,
  "canvas_height": 600,
  "core_mechanics": ["dodging", "scoring", "progressive_difficulty"],
  "controls": {"left": "ArrowLeft", "right": "ArrowRight"},
  "scoring_system": "survival_time",
  "win_condition": null,
  "lose_condition": "collision_with_asteroid",
  "confidence": 0.92
}
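
A typed contract turns this JSON into an object that either validates or fails fast. The project uses Pydantic v2 for this; the sketch below swaps in a standard-library dataclass so it runs without dependencies, and it models only a few of the fields shown. `Clarification`, `parse_clarification`, and `ready_to_plan` are hypothetical names; the 0.75 threshold is the documented default.

```python
import json
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.75  # documented default


@dataclass(frozen=True)
class Clarification:
    game_type: str
    lose_condition: str
    controls: dict
    confidence: float

    def __post_init__(self) -> None:
        # Reject values a downstream phase could not work with.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if not self.controls:
            raise ValueError("at least one control binding is required")


def parse_clarification(raw: str) -> Clarification:
    """Parse clarifier output; a missing key raises KeyError immediately."""
    data = json.loads(raw)
    return Clarification(
        game_type=data["game_type"],
        lose_condition=data["lose_condition"],
        controls=data["controls"],
        confidence=float(data["confidence"]),
    )


def ready_to_plan(clarification: Clarification) -> bool:
    """Gate for the CLARIFYING -> PLANNING transition."""
    return clarification.confidence >= CONFIDENCE_THRESHOLD
```

With the sample output above (confidence 0.92), `ready_to_plan` returns `True`; a confidence of 0.5 would keep the run in the clarification phase.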

📋 Sample Plan Output

{
  "complexity": "medium",
  "entities": [
    {"name": "Player", "properties": ["x", "y", "speed", "width", "height"]},
    {"name": "Asteroid", "properties": ["x", "y", "speed", "size"]}
  ],
  "game_loop_description": "60fps requestAnimationFrame loop with collision detection...",
  "rendering_approach": "Canvas 2D API with geometric shapes",
  "scoring_formula": "1 point per second survived",
  "difficulty_curve": "asteroid spawn rate increases every 10 seconds"
}

⚖️ Trade-Offs

| Decision | Trade-Off | Justification |
| --- | --- | --- |
| Deterministic state machine over LLM-driven routing | Less flexible | Far more reliable and debuggable; the control flow is explicit in code rather than decided by an LLM at run time |
| Vanilla JS over always using Phaser | Less visual richness | Higher generation reliability; Phaser code is complex and more likely to have bugs on first pass |
| Single-file game.js vs modular ES modules | Large file for complex games | Avoids module bundler dependency; matches 3-file output requirement |
| Structured output via instructor | Extra dependency | Eliminates ~80% of JSON parse failures vs raw LLM output extraction |
| Bounded retries (max 2) | May fail on very complex games | Prevents runaway API costs; 2 repair cycles resolve >90% of fixable issues |
| AST-based critic over pure regex | Requires esprima + Node.js subprocess | Catches aliased calls (const raf = requestAnimationFrame; raf(loop)) that regex misses |
| Hybrid critic (80% deterministic + 20% LLM) | More complex design | ~60% of runs skip the LLM critic entirely, saving ~3k tokens per run |
| Adaptive token budgets per complexity tier | More code complexity | Prevents budget blowout on complex games; saves money on simple ones |
| Per-phase model tiering | Slightly worse quality for cheap phases | ~60% cost reduction overall; builder uses best model where it matters most |
| Deterministic fallback templates | Generic output | Guarantees something playable, even during API outages |
| Python over TypeScript | Not the game's native language | Better LLM ecosystem (litellm, instructor, Pydantic), faster to develop |
| File-based persistence (not Postgres) | Ephemeral in containers | Sufficient for a CLI tool; abstract interface allows DB swap for production |
| No RAG | Can't leverage known-good game patterns | Keeps scope deterministic and reproducible; RAG is listed as a future improvement |
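
The bounded-retry decision can be made concrete with a short loop. This sketch assumes one initial build plus up to `MAX_RETRIES` repair passes; whether the project counts the first build toward the limit is not stated, and `build_with_repairs` and its callback signatures are hypothetical.

```python
from typing import Callable

MAX_RETRIES = 2  # documented default for build/repair cycles


def build_with_repairs(
    build: Callable[[list[str]], str],
    validate: Callable[[str], list[str]],
    max_retries: int = MAX_RETRIES,
) -> tuple[str, int]:
    """Return (artifact, attempts), or raise once the repair budget is spent.

    `build` receives the previous attempt's validation errors (empty on the
    first call) so it can make a targeted repair instead of starting over.
    """
    errors: list[str] = []
    # Assumption: 1 initial build + max_retries repair passes.
    for attempt in range(1, max_retries + 2):
        artifact = build(errors)
        errors = validate(artifact)
        if not errors:
            return artifact, attempt
    raise RuntimeError(
        f"still failing after {max_retries} repair cycles: {errors}"
    )
```

The hard upper bound on iterations is what caps the API cost of a run, regardless of how the model behaves.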

🚀 Future Improvements

Immediate (Days)

  • RAG-based game patterns — Index 50+ working game snippets for higher code quality
  • A/B generation — Generate 2 variants, critic picks the better one
  • Live preview server — Built-in HTTP server to preview generated games
  • FastAPI service mode — Wrap orchestrator in API with job queue

Medium Term (Weeks)

  • Multi-agent specialization — Separate agents for mechanics, visuals, and QA
  • Extended playtesting bot — 100-round automated play sessions for game balancing
  • Self-improving prompts — Auto-tune few-shot examples from validation failures
  • Postgres + S3 persistence — Durable storage for production deployment

Long Term (Months)

  • Asset generation — DALL-E/Stable Diffusion sprites, Web Audio SFX
  • Multiplayer templates — WebRTC/WebSocket game templates
  • One-click deploy — Push to GitHub Pages / Netlify / Vercel
  • Voice input — Whisper API for speech-to-text game descriptions
  • Iterative editing — "Make it harder" / "Add a boss" commands on existing games

💡 What This Demonstrates

  1. Deterministic orchestration over LLM-driven routing: the system is predictable, debuggable, and testable
  2. Hybrid validation (static + runtime): catches bugs through static analysis before the code runs, then again in a real browser
  3. Production-evolvable architecture: abstract interfaces for persistence (Redis/Postgres/S3), observability (Prometheus), and concurrency control are already in place

👤 Author

Built by Shreyas Suvarna as a demonstration of production-grade LLM orchestration design.

Open to feedback and collaboration.


📜 License

MIT
