Describe what you want. The Factory builds it, tests it, and keeps improving it — autonomously. You give it a spec file, a rough idea, or an existing codebase. The Factory researches best practices, scaffolds the project, sets up evaluation, and runs a continuous improvement loop — measuring every change and keeping only what makes things better.
Prerequisites: Python 3.11+, uv, and Claude Code (installed and authenticated).
```bash
git clone https://github.com/akashgit/remote-factory.git
cd remote-factory
uv sync
```

That's it. You're ready to go. Every command runs from the factory repo directory — you pass the target project as an argument.
```bash
# Option A: run directly (no install needed)
uv run python -m factory ceo "Build a personal homepage with a blog"

# Option B: install as a CLI tool, then use `factory` anywhere
uv tool install -e .
factory ceo "Build a personal homepage with a blog"
```

Both forms are equivalent. This README uses `uv run python -m factory` throughout so you can copy-paste without installing first. If you've installed the CLI, just replace `uv run python -m factory` with `factory`.
The factory stores all state locally — no external services required beyond Claude Code. Per-project state lives in `.factory/` (add it to `.gitignore`). Global state (project registry, evolved playbooks) lives in `~/.factory/`.
See the full setup guide for authentication, environment variables, and tmux configuration.
A CEO agent orchestrates eight specialists — Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Distiller, and Failure Analyst — each running as an independent Claude Code subprocess.
The Python CLI provides measurement and storage tools (Layer 1). The CEO agent makes all workflow decisions (Layer 2). Specialist agents execute narrow tasks (Layer 3). See Architecture for the full technical deep-dive.
The CEO detects your project's state and routes to the right mode automatically:
Here's a complete workflow — from a one-line idea to a continuously improving project.
```bash
uv run python -m factory ceo "Build a personal homepage with a blog and projects section"
```

The Factory creates a project directory at `~/factory-projects/build-a-personal-homepage-...`, initializes a git repo, and launches Build mode. Here's what happens:
- The Researcher surveys similar projects, tech stacks, and architecture patterns
- The Strategist creates a phased implementation plan (scaffold first, then features)
- The Builder implements each phase, committing code and opening PRs
- An E2E verification gate confirms the project actually runs end-to-end
The input is flexible — a raw string, a spec file (`~/ideas/spec.md`), or a GitHub URL (`https://github.com/user/repo`) all work.
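Conceptually, the routing above amounts to dispatching on the shape of a single argument. Here's an illustrative sketch (this is not the factory's actual routing code, and the function name is made up for this example):

```python
from pathlib import Path

def classify_input(arg: str) -> str:
    """Illustrative sketch of dispatching one positional argument as a
    URL, a spec file, an existing project directory, or a raw idea."""
    if arg.startswith(("http://", "https://")):
        return "url"      # e.g. a GitHub repo to clone and improve
    p = Path(arg).expanduser()
    if p.is_dir():
        return "project"  # existing project directory
    if p.is_file():
        return "spec"     # a spec file like ~/ideas/spec.md
    return "idea"         # anything else is treated as a raw idea
```

So `factory ceo` can accept all three input forms through one argument.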
After the first build, the Factory creates a backlog — `.factory/strategy/backlog.md`. This is a work queue that feeds all future improvement. It contains:
- Features deferred during build (things that need API keys, external services, or manual setup)

By the end of the first build, everything that could be built has been built — only human-blocked items remain in the backlog.
```bash
# Check what's in the backlog
uv run python -m factory backlog-list ~/factory-projects/build-a-personal-homepage-...
```

Point the factory at the project again. It detects the existing `.factory/` directory and enters Improve mode:
```bash
uv run python -m factory ceo ~/factory-projects/build-a-personal-homepage-...
```

Each improvement cycle:
- Observe — the Researcher analyzes the codebase and searches for best practices
- Hypothesize — the Strategist picks items from the backlog using FEEC priority (Fix > Exploit > Explore > Combine)
- Build — the Builder implements one hypothesis on an experiment branch
- Guard — the Reviewer checks for violations and code quality
- Measure — the Evaluator scores before and after using the three-tier eval
- Decide — the CEO runs a hard precheck gate, then keeps (score went up) or reverts (score went down)
- Record — the Archivist records the outcome to `.factory/archive/` for future learning
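The FEEC ordering in the Hypothesize step can be sketched as a priority sort. This is illustrative only: the real Strategist is an LLM agent and the backlog is a markdown file, and the item schema below is invented for the example:

```python
# Lower number = higher priority: Fix > Exploit > Explore > Combine
FEEC_PRIORITY = {"fix": 0, "exploit": 1, "explore": 2, "combine": 3}

def pick_next(backlog: list[dict]) -> dict:
    """Return the highest-priority backlog item under FEEC ordering.
    Item shape ({"category": ..., "title": ...}) is hypothetical."""
    return min(backlog, key=lambda item: FEEC_PRIORITY[item["category"]])
```

A fix always preempts an exploit, which preempts exploratory or combinational work.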
When you know exactly what you want, `--focus` pins a single target — one hypothesis, one experiment, done:

```bash
uv run python -m factory ceo ~/factory-projects/build-a-personal-homepage-... --focus "add dark mode toggle"
```

The entire pipeline scopes to that target: the Researcher focuses its research, the Strategist generates exactly one hypothesis, and after the keep/revert decision the cycle ends.
Other ways to steer:
```bash
# Add an item to the backlog manually
uv run python -m factory backlog-add ~/factory-projects/... "add RSS feed for the blog"

# File a GitHub issue — the Strategist reads open issues
gh issue create --title "Add contact form" --body "Simple form with email notification"

# Pass a spec file to guide the next build phase
uv run python -m factory ceo ~/factory-projects/... --prompt ~/ideas/performance-spec.md
```

For unattended operation, wrap the CEO in a heartbeat loop:
```bash
# Continuous improvement — one cycle every 30 min (default)
uv run python -m factory run ~/factory-projects/... --loop

# Custom interval
uv run python -m factory run ~/factory-projects/... --loop --interval 900

# In a detached tmux session — come back later
uv run python -m factory tmux ~/factory-projects/... --loop
uv run python -m factory tmux-ls                                  # list active sessions
uv run python -m factory tmux-stop --path ~/factory-projects/...  # stop a session
```

Each cycle is a full observe → hypothesize → build → measure → decide pass. Failed experiments don't stop the loop — the factory learns from them and moves on.
Research mode is for projects where the goal is to improve a measurable metric against a dataset — benchmarks, model tuning, prompt optimization, solver agents.
The factory isn't just a software builder — it's a harness creator. Give it a research objective and it builds the evaluation harness, then iteratively improves the system under test.
```bash
uv run python -m factory ceo "SWE-bench solver agent" --mode research
```

Research ideation works like interactive mode but collects additional configuration:
- Research Target — the metric to improve, the command to run, where results are written
- Mutable Surfaces — files the Builder is allowed to modify (e.g., `src/agent.py`, `prompts/*.md`)
- Fixed Surfaces — ground truth and eval infrastructure that must never be touched (e.g., `eval/`, `data/ground_truth.json`)
- Constraints — additional rules (e.g., "do not use GPT-4 for cost reasons")
You review and approve the spec, then the Factory builds the project and transitions to the research loop.
Each cycle runs seven phases:
| Phase | Agent | What happens |
|---|---|---|
| R0 | Evaluator | Run `run_command`, record baseline metric |
| R1 | Failure Analyst | Classify failures by root cause, aggregate into categories |
| R1.5 | Researcher | Search for targeted solutions to dominant failure patterns |
| R2 | Strategist | Generate 1-3 hypotheses targeting dominant failure modes |
| R3 | Builder | Implement hypothesis, modifying only mutable surfaces |
| R4 | Evaluator | Re-run `run_command`, extract new metric |
| R5 | CEO | Keep if metric improved monotonically; revert otherwise |
Here's what progression looks like on a real project:
| Cycle | Metric | Failure Mode Targeted | Verdict | Best |
|---|---|---|---|---|
| 000 | 0.18 | — (baseline) | — | 0.18 |
| 001 | 0.22 | FILE_NOT_FOUND — searched wrong directories | KEEP | 0.22 |
| 002 | 0.24 | SYNTAX_ERROR — indentation bugs in patches | KEEP | 0.24 |
| 003 | 0.21 | TIMEOUT — overly broad search strategy | REVERT | 0.24 |
| 004 | 0.27 | INCOMPLETE_EDIT — partial file modifications | KEEP | 0.27 |
Cycle 003 regressed below the previous best (0.24), so it was automatically reverted. The metric ratchets forward — it can never go below the previous best.
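The R5 keep/revert rule above is a pure ratchet, which can be stated as a two-line decision function (a sketch of the rule, not the factory's code):

```python
def decide(new_metric: float, best_so_far: float) -> tuple[str, float]:
    """Keep only if the new metric beats the previous best; otherwise
    revert. `best_so_far` can therefore never decrease."""
    if new_metric > best_so_far:
        return "KEEP", new_metric
    return "REVERT", best_so_far
```

Applied to the table: cycle 003 scored 0.21 against a best of 0.24, so `decide(0.21, 0.24)` yields `("REVERT", 0.24)`, while cycle 004's 0.27 yields `("KEEP", 0.27)`.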
Research mode enforces three layers of ground truth protection to prevent the Builder from "cheating":
| Guard | What it detects |
|---|---|
| Token overlap | Distinctive tokens from fixed surfaces appearing in hypothesis/diff text |
| Negation hints | Patterns like "do NOT use X" that encode answers by exclusion |
| Specific values | Numeric literals or quoted strings from ground truth appearing in code |
These checks run at three hard gates: strategy review, builder review, and precheck. Detected leakage triggers an automatic redirect or revert.
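To make the first guard concrete, here is a minimal sketch of a token-overlap check: collect distinctive identifiers from fixed surfaces and measure how many reappear in a hypothesis or diff. The factory's actual implementation, token selection, and thresholds are not shown here; this only illustrates the idea:

```python
import re

def distinctive_tokens(fixed_text: str, min_len: int = 6) -> set[str]:
    """Collect longer identifiers from fixed-surface files (ground truth,
    eval code); short common words are ignored."""
    return {t for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", fixed_text)
            if len(t) >= min_len}

def overlap_score(diff_text: str, fixed_text: str) -> float:
    """Fraction of distinctive fixed-surface tokens appearing in the diff.
    A high score suggests the Builder is copying answers out of ground
    truth rather than solving the task."""
    tokens = distinctive_tokens(fixed_text)
    if not tokens:
        return 0.0
    hits = {t for t in tokens if t in diff_text}
    return len(hits) / len(tokens)
```

A guard would compare this score against a threshold and flag the experiment for redirect or revert.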
The factory itself is an outer optimization loop. It creates inner harnesses for any research objective:
| Project | Metric | What the factory optimizes |
|---|---|---|
| SWE-bench solver | resolve rate | Agent logic, prompts, localization strategies |
| Math reasoning | solve rate | Chain-of-thought templates, tool call patterns |
| CAD query optimization | query accuracy | Query builder, schema mapping, entity resolution |
| Prompt engineering | task accuracy | System prompts, few-shot examples, output parsing |
```bash
# Start a new research project from an idea
uv run python -m factory ceo "prompt optimizer for code review" --mode research

# Run the research loop on an existing project
uv run python -m factory ceo ~/my-solver --mode research

# Focus on a specific hypothesis
uv run python -m factory ceo ~/my-solver --focus "try chain-of-thought prompting"

# Continuous research loop
uv run python -m factory run ~/my-solver --mode research --loop
```

See Getting Started — Research Mode for the full picture.
Every change is measured by a three-tier composite score:
| Tier | What it measures | Examples |
|---|---|---|
| Hygiene (6 dimensions) | Code quality basics | Tests, lint, type checking, coverage |
| Growth (5 dimensions) | Capability evolution | API surface area, experiment diversity, observability |
| Project (user-defined) | Domain-specific metrics | Benchmark accuracy, latency, win rate |
Default weight split is 50/50 hygiene/growth. When you define project-specific evals, it shifts to 30/20/50 across hygiene/growth/project. Fully configurable via `factory.md`. See Eval System.
The factory doesn't just improve your project — it improves itself. Every keep/revert decision becomes training data for the next cycle.
This is powered by ACE (Autonomous Context Engineering) — a Reflect → Curate → Inject loop that evolves agent playbooks from real experiment outcomes. Each agent accumulates behavioral rules (DOs and DON'Ts) with evidence counters. Rules that correlate with kept experiments get reinforced; rules from reverts get pruned.
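The evidence-counter idea can be sketched in a few lines. The field names and threshold below are illustrative, not the factory's actual playbook schema:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """One playbook rule with evidence counters (illustrative schema)."""
    text: str
    kept: int = 0      # experiments this rule was active in that were kept
    reverted: int = 0  # ... that were reverted

def curate(rules: list[Rule], min_evidence: int = 3) -> list[Rule]:
    """Keep rules whose evidence skews toward kept experiments; prune
    rules that mostly correlate with reverts once enough evidence has
    accumulated."""
    survivors = []
    for r in rules:
        total = r.kept + r.reverted
        if total >= min_evidence and r.reverted > r.kept:
            continue  # pruned: correlated with failed experiments
        survivors.append(r)
    return survivors
```

Rules with too little evidence survive by default, so a single bad experiment doesn't erase a plausible behavior.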
```bash
# Run a full improvement cycle, then evolve all agent playbooks
uv run python -m factory ceo ~/my-project --mode meta
```

Run meta mode on a regular cadence — weekly is a good default. It needs at least 5 experiments across projects to produce meaningful playbook updates. See ACE Self-Improvement for details.
The factory has shipped something every day for the last 30 days — products, research experiments, production features, papers. Here are a few examples:
| Project | What it does | Mode |
|---|---|---|
| SWE-bench solver | Autonomous agent that resolves GitHub issues from the SWE-bench dataset, iteratively improved via failure analysis | Research |
| HMMT math solver | Multi-agent team (Explorer, Theorist, Computationalist, Critic, Synthesizer) that solved HMMT Feb 2025 Combinatorics Problem 7 | Research |
| Text/Sketch → CAD | Converts natural language descriptions and hand-drawn sketches into executable CadQuery Python code for 3D model generation | Research |
| HLS design space explorer | Per-function AI agents explore HLS pragma/code variants in parallel, an ILP solver finds the optimal combination under area constraints, then global expert agents apply cross-function optimizations (dataflow, inlining) — achieving up to 92% execution time reduction on cryptographic benchmarks | Build |
| Pluck | iOS app that extracts structured data from screenshots, links, and shared content using on-device AI | Build + Improve |
| Group chat digest | Turns iMessage group chats into weekly family newsletters with AI-curated highlights and photo selection | Build + Improve |
| Production enterprise features | Complete UI components and backend features shipped into a large-scale production codebase | Focus + Improve |
| The Factory itself | The factory runs on itself in meta mode — its own agent playbooks are evolved from its own experiment outcomes | Meta |
Built something with the Factory? Open a PR to add it here — it helps others see what's possible.
The factory supports multiple CLI backends. By default it uses Claude Code (the `claude` CLI). Bob Shell (the `bob` CLI) is available as an alternative:
```bash
# Via environment variable
export FACTORY_RUNNER=bob

# Via CLI flag
uv run python -m factory ceo ~/my-project --runner bob
```

Bob Shell requires `BOBSHELL_API_KEY` and enforces per-cycle invocation ceilings (default: 8). Set `FACTORY_BOB_DRY_RUN=1` to test without API calls.
```bash
# Core workflow
factory ceo <path|url|idea>            # Launch the CEO agent
factory run <path> --loop              # Continuous heartbeat mode
factory tmux <path> --loop             # In detached tmux session

# Agents
factory agent <role> --task "..." --project <path>

# Evaluation & guards
factory eval <path>                    # Run evals, print composite score
factory precheck <path>                # Hard precheck gate (4 checks)
factory guard <path>                   # Check guard rules

# Experiments
factory begin <path> --hypothesis "..."
factory finalize <path> --id N --verdict keep
factory history <path>
factory diff <path> --exp1 N --exp2 M
factory explain <path> --exp N

# Analysis
factory study <path>                   # Analyze code + write observations
factory insights <path>                # Cross-project patterns
factory ace <path>                     # ACE playbook evolution
factory report-update <path>           # Regenerate performance report
factory registry-list                  # List registered projects

# Backlog
factory backlog-list <path>
factory backlog-add <path> "..."
factory backlog-remove <path> "..."

# Operations
factory dashboard                      # Live web dashboard on :8420
factory detect <path>                  # Print project state
factory discover <path>                # Introspect + generate eval profile
factory export <path>                  # Full project snapshot as JSON
factory checkpoint <path>              # Save CEO state for crash recovery
factory resume <path>                  # Resume from checkpoint
```

See `factory --help` for the complete list.
| Doc | What's in it |
|---|---|
| Setup Guide | Installation, authentication, environment variables, what you don't need |
| Getting Started | Lifecycle walkthrough, research mode details, factory.md configuration |
| Architecture | Three-layer system, agent roles, state machine, data flow diagrams |
| Eval System | Hygiene/growth/project tiers, scoring, guards, precheck |
| Configuration | factory.md reference — all sections and options |
| ACE Self-Improvement | How the factory evolves its own agent playbooks |
| Contributing | Dev setup, code style, testing, PR workflow |
```bash
uv sync --all-groups   # Install all deps including dev
uv run pytest -v       # Full test suite
uv run ruff check .    # Lint
uv run mypy factory/   # Type check
```

MIT — Akash Srivastava