Describe what you want. The Factory builds it, tests it, and keeps improving it — autonomously. You give it a spec file, a rough idea, or an existing codebase. The Factory researches best practices, scaffolds the project, sets up evaluation, and runs a continuous improvement loop — measuring every change and keeping only what makes things better.
Prerequisites: Python 3.11+, uv, and Claude Code (installed and authenticated).
```bash
git clone https://github.com/akashgit/remote-factory.git
cd remote-factory
uv sync
```

That's it. You're ready to go. Every command runs from the factory repo directory — you pass the target project as an argument.
```bash
# Option A: run directly (no install needed)
uv run python -m factory ceo "Build a personal homepage with a blog"

# Option B: install as a CLI tool, then use `factory` anywhere
uv tool install -e .
factory ceo "Build a personal homepage with a blog"
```

Both forms are equivalent. This README uses `uv run python -m factory` throughout so you can copy-paste without installing first. If you've installed the CLI, just replace `uv run python -m factory` with `factory`.
The factory stores all state locally — no external services required beyond Claude Code. Per-project state lives in `.factory/` (add it to `.gitignore`). Global state (project registry, evolved playbooks) lives in `~/.factory/`.
See the full setup guide for authentication, environment variables, and tmux configuration.
A CEO agent orchestrates eight specialists — Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Distiller, and Failure Analyst — each running as an independent Claude Code subprocess.
The Python CLI provides measurement and storage tools (Layer 1). The CEO agent makes all workflow decisions (Layer 2). Specialist agents execute narrow tasks (Layer 3). See Architecture for the full technical deep-dive.
The CEO detects your project's state and routes to the right mode automatically:
Here's a complete workflow — from a one-line idea to a continuously improving project.
```bash
uv run python -m factory ceo "Build a personal homepage with a blog and projects section"
```

The Factory creates a project directory at `~/factory-projects/build-a-personal-homepage-...`, initializes a git repo, and launches Build mode. Here's what happens:
- The Researcher surveys similar projects, tech stacks, and architecture patterns
- The Strategist creates a phased implementation plan (scaffold first, then features)
- The Builder implements each phase, committing code and opening PRs
- An E2E verification gate confirms the project actually runs end-to-end
The input is flexible — a raw string, a spec file (`~/ideas/spec.md`), or a GitHub URL (`https://github.com/user/repo`) all work.
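Conceptually, the routing above amounts to dispatching on the shape of a single argument. Here's an illustrative sketch (this is not the factory's actual routing code, and the function name is made up for this example):

```python
from pathlib import Path

def classify_input(arg: str) -> str:
    """Illustrative sketch of dispatching one positional argument as a
    URL, a spec file, an existing project directory, or a raw idea."""
    if arg.startswith(("http://", "https://")):
        return "url"      # e.g. a GitHub repo to clone and improve
    p = Path(arg).expanduser()
    if p.is_dir():
        return "project"  # existing project directory
    if p.is_file():
        return "spec"     # a spec file like ~/ideas/spec.md
    return "idea"         # anything else is treated as a raw idea
```

So `factory ceo` can accept all three input forms through one argument.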
After the first build, the Factory creates a backlog — `.factory/strategy/backlog.md`. This is a work queue that feeds all future improvement. It contains:
- Features deferred during build (things that need API keys, external services, or manual setup)

By the end of the first build, everything that could be built has been built — only human-blocked items remain in the backlog.
```bash
# Check what's in the backlog
uv run python -m factory backlog-list ~/factory-projects/build-a-personal-homepage-...
```

Point the factory at the project again. It detects the existing `.factory/` directory and enters Improve mode:
```bash
uv run python -m factory ceo ~/factory-projects/build-a-personal-homepage-...
```

Each improvement cycle:
- Observe — the Researcher analyzes the codebase and searches for best practices
- Hypothesize — the Strategist picks items from the backlog using FEEC priority (Fix > Exploit > Explore > Combine)
- Build — the Builder implements one hypothesis on an experiment branch
- Guard — the Reviewer checks for violations and code quality
- Measure — the Evaluator scores before and after using the three-tier eval
- Decide — the CEO runs a hard precheck gate, then keeps (score went up) or reverts (score went down)
- Record — the Archivist records the outcome to `.factory/archive/` for future learning
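The FEEC ordering in the Hypothesize step can be sketched as a priority sort. This is illustrative only: the real Strategist is an LLM agent and the backlog is a markdown file, and the item schema below is invented for the example:

```python
# Lower number = higher priority: Fix > Exploit > Explore > Combine
FEEC_PRIORITY = {"fix": 0, "exploit": 1, "explore": 2, "combine": 3}

def pick_next(backlog: list[dict]) -> dict:
    """Return the highest-priority backlog item under FEEC ordering.
    Item shape ({"category": ..., "title": ...}) is hypothetical."""
    return min(backlog, key=lambda item: FEEC_PRIORITY[item["category"]])
```

A fix always preempts an exploit, which preempts exploratory or combinational work.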
When you know exactly what you want, `--focus` pins a single target — one hypothesis, one experiment, done:

```bash
uv run python -m factory ceo ~/factory-projects/build-a-personal-homepage-... --focus "add dark mode toggle"
```

The entire pipeline scopes to that target: the Researcher focuses its research, the Strategist generates exactly one hypothesis, and after the keep/revert decision the cycle ends.
Other ways to steer:
```bash
# Add an item to the backlog manually
uv run python -m factory backlog-add ~/factory-projects/... "add RSS feed for the blog"

# File a GitHub issue — the Strategist reads open issues
gh issue create --title "Add contact form" --body "Simple form with email notification"

# Pass a spec file to guide the next build phase
uv run python -m factory ceo ~/factory-projects/... --prompt ~/ideas/performance-spec.md
```

For unattended operation, wrap the CEO in a heartbeat loop:
```bash
# Continuous improvement — one cycle every 30 min (default)
uv run python -m factory run ~/factory-projects/... --loop

# Custom interval
uv run python -m factory run ~/factory-projects/... --loop --interval 900

# In a detached tmux session — come back later
uv run python -m factory tmux ~/factory-projects/... --loop
uv run python -m factory tmux-ls                                  # list active sessions
uv run python -m factory tmux-stop --path ~/factory-projects/...  # stop a session
```

Each cycle is a full observe → hypothesize → build → measure → decide pass. Failed experiments don't stop the loop — the factory learns from them and moves on.
Research mode is for projects where the goal is to improve a measurable metric against a dataset — benchmarks, model tuning, prompt optimization, solver agents.
The factory isn't just a software builder — it's a harness creator. Give it a research objective and it builds the evaluation harness, then iteratively improves the system under test.
```bash
uv run python -m factory ceo "SWE-bench solver agent" --mode research
```

Research ideation works like interactive mode but collects additional configuration:
- Research Target — the metric to improve, the command to run, where results are written
- Mutable Surfaces — files the Builder is allowed to modify (e.g., `src/agent.py`, `prompts/*.md`)
- Fixed Surfaces — ground truth and eval infrastructure that must never be touched (e.g., `eval/`, `data/ground_truth.json`)
- Constraints — additional rules (e.g., "do not use GPT-4 for cost reasons")
You review and approve the spec, then the Factory builds the project and transitions to the research loop.
Each cycle runs seven phases:
| Phase | Agent | What happens |
|---|---|---|
| R0 | Evaluator | Run `run_command`, record baseline metric |
| R1 | Failure Analyst | Classify failures by root cause, aggregate into categories |
| R1.5 | Researcher | Search for targeted solutions to dominant failure patterns |
| R2 | Strategist | Generate 1-3 hypotheses targeting dominant failure modes |
| R3 | Builder | Implement hypothesis, modifying only mutable surfaces |
| R4 | Evaluator | Re-run `run_command`, extract new metric |
| R5 | CEO | Keep if metric improved monotonically; revert otherwise |
Here's what progression looks like on a real project:
| Cycle | Metric | Failure Mode Targeted | Verdict | Best |
|---|---|---|---|---|
| 000 | 0.18 | — (baseline) | — | 0.18 |
| 001 | 0.22 | FILE_NOT_FOUND — searched wrong directories | KEEP | 0.22 |
| 002 | 0.24 | SYNTAX_ERROR — indentation bugs in patches | KEEP | 0.24 |
| 003 | 0.21 | TIMEOUT — overly broad search strategy | REVERT | 0.24 |
| 004 | 0.27 | INCOMPLETE_EDIT — partial file modifications | KEEP | 0.27 |
Cycle 003 regressed below the previous best (0.24), so it was automatically reverted. The metric ratchets forward — it can never go below the previous best.
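The R5 keep/revert rule above is a pure ratchet, which can be stated as a two-line decision function (a sketch of the rule, not the factory's code):

```python
def decide(new_metric: float, best_so_far: float) -> tuple[str, float]:
    """Keep only if the new metric beats the previous best; otherwise
    revert. `best_so_far` can therefore never decrease."""
    if new_metric > best_so_far:
        return "KEEP", new_metric
    return "REVERT", best_so_far
```

Applied to the table: cycle 003 scored 0.21 against a best of 0.24, so `decide(0.21, 0.24)` yields `("REVERT", 0.24)`, while cycle 004's 0.27 yields `("KEEP", 0.27)`.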
Research mode enforces three layers of ground truth protection to prevent the Builder from "cheating":
| Guard | What it detects |
|---|---|
| Token overlap | Distinctive tokens from fixed surfaces appearing in hypothesis/diff text |
| Negation hints | Patterns like "do NOT use X" that encode answers by exclusion |
| Specific values | Numeric literals or quoted strings from ground truth appearing in code |
These checks run at three hard gates: strategy review, builder review, and precheck. Detected leakage triggers an automatic redirect or revert.
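To make the first guard concrete, here is a minimal sketch of a token-overlap check: collect distinctive identifiers from fixed surfaces and measure how many reappear in a hypothesis or diff. The factory's actual implementation, token selection, and thresholds are not shown here; this only illustrates the idea:

```python
import re

def distinctive_tokens(fixed_text: str, min_len: int = 6) -> set[str]:
    """Collect longer identifiers from fixed-surface files (ground truth,
    eval code); short common words are ignored."""
    return {t for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", fixed_text)
            if len(t) >= min_len}

def overlap_score(diff_text: str, fixed_text: str) -> float:
    """Fraction of distinctive fixed-surface tokens appearing in the diff.
    A high score suggests the Builder is copying answers out of ground
    truth rather than solving the task."""
    tokens = distinctive_tokens(fixed_text)
    if not tokens:
        return 0.0
    hits = {t for t in tokens if t in diff_text}
    return len(hits) / len(tokens)
```

A guard would compare this score against a threshold and flag the experiment for redirect or revert.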
The factory itself is an outer optimization loop. It creates inner harnesses for any research objective:
| Project | Metric | What the factory optimizes |
|---|---|---|
| SWE-bench solver | resolve rate | Agent logic, prompts, localization strategies |
| Math reasoning | solve rate | Chain-of-thought templates, tool call patterns |
| CAD query optimization | query accuracy | Query builder, schema mapping, entity resolution |
| Prompt engineering | task accuracy | System prompts, few-shot examples, output parsing |
```bash
# Start a new research project from an idea
uv run python -m factory ceo "prompt optimizer for code review" --mode research

# Run the research loop on an existing project
uv run python -m factory ceo ~/my-solver --mode research

# Focus on a specific hypothesis
uv run python -m factory ceo ~/my-solver --focus "try chain-of-thought prompting"

# Continuous research loop
uv run python -m factory run ~/my-solver --mode research --loop
```

See Getting Started — Research Mode for the full picture.
Every change is measured by a three-tier composite score:
| Tier | What it measures | Examples |
|---|---|---|
| Hygiene (6 dimensions) | Code quality basics | Tests, lint, type checking, coverage |
| Growth (5 dimensions) | Capability evolution | API surface area, experiment diversity, observability |
| Project (user-defined) | Domain-specific metrics | Benchmark accuracy, latency, win rate |
Default weight split is 50/50 hygiene/growth. When you define project-specific evals, it shifts to 30/20/50 across hygiene/growth/project. Fully configurable via `factory.md`. See Eval System.
The factory doesn't just improve your project — it improves itself. Every keep/revert decision becomes training data for the next cycle.
This is powered by ACE (Autonomous Context Engineering) — a Reflect → Curate → Inject loop that evolves agent playbooks from real experiment outcomes. Each agent accumulates behavioral rules (DOs and DON'Ts) with evidence counters. Rules that correlate with kept experiments get reinforced; rules from reverts get pruned.
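The evidence-counter idea can be sketched in a few lines. The field names and threshold below are illustrative, not the factory's actual playbook schema:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """One playbook rule with evidence counters (illustrative schema)."""
    text: str
    kept: int = 0      # experiments this rule was active in that were kept
    reverted: int = 0  # ... that were reverted

def curate(rules: list[Rule], min_evidence: int = 3) -> list[Rule]:
    """Keep rules whose evidence skews toward kept experiments; prune
    rules that mostly correlate with reverts once enough evidence has
    accumulated."""
    survivors = []
    for r in rules:
        total = r.kept + r.reverted
        if total >= min_evidence and r.reverted > r.kept:
            continue  # pruned: correlated with failed experiments
        survivors.append(r)
    return survivors
```

Rules with too little evidence survive by default, so a single bad experiment doesn't erase a plausible behavior.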
```bash
# Run a full improvement cycle, then evolve all agent playbooks
uv run python -m factory ceo ~/my-project --mode meta
```

Run meta mode on a regular cadence — weekly is a good default. It needs at least 5 experiments across projects to produce meaningful playbook updates. See ACE Self-Improvement for details.
The factory has shipped something every day for the last 30 days — products, research experiments, production features, papers. Here are a few examples:
| Project | What it does | Mode |
|---|---|---|
| SWE-bench solver | Autonomous agent that resolves GitHub issues from the SWE-bench dataset, iteratively improved via failure analysis | Research |
| HMMT math solver | Multi-agent team (Explorer, Theorist, Computationalist, Critic, Synthesizer) that solved HMMT Feb 2025 Combinatorics Problem 7 | Research |
| Text/Sketch → CAD | Converts natural language descriptions and hand-drawn sketches into executable CadQuery Python code for 3D model generation | Research |
| HLS design space explorer | Per-function AI agents explore HLS pragma/code variants in parallel, an ILP solver finds the optimal combination under area constraints, then global expert agents apply cross-function optimizations (dataflow, inlining) — achieving up to 92% execution time reduction on cryptographic benchmarks | Build |
| Pluck | iOS app that extracts structured data from screenshots, links, and shared content using on-device AI | Build + Improve |
| Group chat digest | Turns iMessage group chats into weekly family newsletters with AI-curated highlights and photo selection | Build + Improve |
| Production enterprise features | Complete UI components and backend features shipped into a large-scale production codebase | Focus + Improve |
| The Factory itself | The factory runs on itself in meta mode — its own agent playbooks are evolved from its own experiment outcomes | Meta |
Built something with the Factory? Open a PR to add it here — it helps others see what's possible.
The factory supports multiple CLI backends. By default it uses Claude Code (the `claude` CLI). Bob Shell (the `bob` CLI) is available as an alternative:
```bash
# Via environment variable
export FACTORY_RUNNER=bob

# Via CLI flag
uv run python -m factory ceo ~/my-project --runner bob
```

Bob Shell requires `BOBSHELL_API_KEY` and enforces per-cycle invocation ceilings (default: 8). Set `FACTORY_BOB_DRY_RUN=1` to test without API calls.
```bash
# Core workflow
factory ceo <path|url|idea>            # Launch the CEO agent
factory run <path> --loop              # Continuous heartbeat mode
factory tmux <path> --loop             # In detached tmux session

# Agents
factory agent <role> --task "..." --project <path>

# Evaluation & guards
factory eval <path>                    # Run evals, print composite score
factory precheck <path>                # Hard precheck gate (4 checks)
factory guard <path>                   # Check guard rules

# Experiments
factory begin <path> --hypothesis "..."
factory finalize <path> --id N --verdict keep
factory history <path>
factory diff <path> --exp1 N --exp2 M
factory explain <path> --exp N

# Analysis
factory study <path>                   # Analyze code + write observations
factory insights <path>                # Cross-project patterns
factory ace <path>                     # ACE playbook evolution
factory report-update <path>           # Regenerate performance report
factory registry-list                  # List registered projects

# Backlog
factory backlog-list <path>
factory backlog-add <path> "..."
factory backlog-remove <path> "..."

# Operations
factory dashboard                      # Live web dashboard on :8420
factory detect <path>                  # Print project state
factory discover <path>                # Introspect + generate eval profile
factory export <path>                  # Full project snapshot as JSON
factory checkpoint <path>              # Save CEO state for crash recovery
factory resume <path>                  # Resume from checkpoint
```

See `factory --help` for the complete list.
| Doc | What's in it |
|---|---|
| Setup Guide | Installation, authentication, environment variables, what you don't need |
| Getting Started | Lifecycle walkthrough, research mode details, factory.md configuration |
| Architecture | Three-layer system, agent roles, state machine, data flow diagrams |
| Eval System | Hygiene/growth/project tiers, scoring, guards, precheck |
| Configuration | factory.md reference — all sections and options |
| ACE Self-Improvement | How the factory evolves its own agent playbooks |
| Contributing | Dev setup, code style, testing, PR workflow |
```bash
uv sync --all-groups   # Install all deps including dev
uv run pytest -v       # Full test suite
uv run ruff check .    # Lint
uv run mypy factory/   # Type check
```

MIT — Akash Srivastava