MicroGrowAgents (MGA)

Agent-based system for AI-driven microbial cultivation and growth media design

Part of the CultureBotAI initiative led by Dr. Marcin Joachimiak at Lawrence Berkeley National Laboratory.

Last updated: 2026-05-18

Recent updates

✅ DBTL round 2 complete (May 2026): full analysis pipeline executed (3-way Pareto, MC stability, abiotic correction, t2-paired biology, kinetic fits, precipitation risk).
📍 t2 (6 h) adopted as canonical analysis endpoint (2026-05-15) — only timepoint with paired abiotic controls. Round-2 scripts default to t2; pass --endpoint-timepoint t3 for legacy behaviour.
📑 Round 3 (v16) plan published — 8-factor design adding Nd³⁺ and citrate; LanM-fluorescence primary Nd assay + cell-pellet ICP-MS confirmatory subset.
🛂 ResearchAuditor framework added — file / data / provenance auditors orchestrated by src/microgrowagents/agents/analysis/research_auditor.py.
🛠️ Two new Claude Code slash commands: /plot (natural-language plots via scripts/microgrow-plot.py) and /validate-linkml (schema + example validator).

Overview
Documentation Quick Links
DBTL Campaign Status
Installation
Quick Start
Agents & Skills
Experimental Analysis Pipeline
Cofactor & Chemistry Reference
Data Integrity, Provenance & Audit
Core Capabilities
Advanced Usage
Repository Structure
Development
Tools, APIs & Datasets
Historical Appendix
Contributing / License / Citation / Contact

Overview

MicroGrowAgents bridges the microbial cultivation gap through AI-powered multi-agent systems that integrate knowledge graphs, machine learning, and experimental automation. The platform combines specialized agents (LiteratureAgent, AnalogyReasoningAgent, GenomeFunctionAgent, MediaFormulationAgent, …) operating on KG-Microbe (864,000+ validated species) to design optimized growth media for previously uncultured microorganisms, and now drives a multi-round Design-Build-Test-Learn (DBTL) campaign for Methylorubrum extorquens AM1 ΔmxaF lanthanide-dependent growth.

Documentation Quick Links

Project state: docs/STATUS.md
Round-3 plans: outputs/round2_recommendations/v16_design_recommendation.md, outputs/round3_recommendations/nd_assay_alternatives_report.md
Agents / skills: docs/AGENTS_SKILLS_TOOLS.md
Audit: docs/RESEARCH_AUDITOR.md, docs/AUDIT_REPORT_BBOP_SKILLS.md
Cofactors / chemistry: docs/COFACTOR_REFERENCE.md, docs/LANTHANIDE_BIOAVAILABILITY_COMPLETE.md
Optimization pipeline: docs/OPTIMIZATION_GUIDE.md
Architecture diagrams: docs/architecture/README.md
Dev guidance: CLAUDE.md

Key Features

🧪 DBTL campaign infrastructure — multi-round analysis pipeline for M. extorquens AM1 lanthanide-dependent growth, from raw plate-reader CSVs through MaxPro+OptBlock design generation to Bayesian-optimisation seeds.
🔬 Advanced chemistry — osmolarity / water activity, redox potential, C:N:P ratios, Gibbs free energy via eQuilibrator, NdPO₄ precipitation, citrate/malate chelation.
🤖 Multi-agent media formulation — literature mining (245+ papers), analogy reasoning (208K+ chemical embeddings), genome-guided design (57 Bakta-annotated genomes), toxicity flagging.
🗺️ Response surface modelling — Gaussian Process fits, Pareto frontiers, expected-improvement Bayesian optimisation, Sobol sensitivity.
🛂 ResearchAuditor — file / data / provenance / report auditors for scientific reproducibility across the analysis pipeline.
📚 Sheet Query System — entity lookup, cross-reference, publication search, evidence-rich reports.

DBTL Campaign Status

Two rounds of DBTL executed on the v10 MaxPro+OptBlock design (69 designed conditions + ctrl_media baseline, 4 replicates) for M. extorquens AM1 ΔmxaF under lanthanide stress. Round 3 (v16) is in planning. Full narrative in docs/STATUS.md §DBTL Campaign Status.

Round	Date	Growth assay	Nd assay	Status
1	Feb–Mar 2026	OD600 @ 600 nm, 3 timepoints	Arsenazo III @ 660 nm	analysed
2	May 2026	Biolog PM08, 740/590 nm, 144 timepoints	Arsenazo III @ 660 nm (15 µM Nd dose)	analysed
3	planned	TBD	LanM-fluorescence (proposed)	planning

Round 1 (Feb–Mar 2026)

First execution of the v10 design, with two parallel measurement modalities: OD600 plate-reader and arsenazo III Nd-depletion assay. Key findings are preserved in the Historical Appendix. In summary, round 1 identified the peak-biomass condition (MPOB_040, max OD600 0.95) and the most stable growth condition (MPOB_053, mixed C1+C2 metabolism).

Round 2 (May 2026)

Repeat of v10 with minor recipe adjustments. Round 2 switched the growth modality to Biolog PM08 (740 nm biomass + 590 nm redox in parallel), raised the Nd³⁺ dose to 15 µM (from 5.5 µM in round 1), and added 48 row-B Nd calibration standards across 4 plates.

Canonical analysis endpoint: t2 (6 h) since 2026-05-15 — t2 is the only timepoint with a paired abiotic control, making it the only timepoint at which chemistry-vs-biology attribution is empirically possible. All round-2 scripts default to t2; pass --endpoint-timepoint t3 for legacy.

At t2 the 3-way Pareto frontier is 3 conditions:

Condition	OD600 t2	Abs590 t2	Nd remaining t2 (µM)	MC freq
MPOB_058	0.241	0.300	2.58	0.99 (stable)
MPOB_008	0.231	0.293	2.07	borderline
MPOB_019	0.208	0.251	1.70	borderline

The five conditions that were t3-only Pareto winners (MPOB_022/_066/_020/_035/_024) fell off the frontier at t2; their late depletion happened between t2 and t3, outside the paired-control window and so cannot be attributed to biology without further measurement (cell-pellet ICP-MS recommended in round 3).
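
A minimal sketch of the non-dominated filter behind a 3-way frontier, using the three t2 rows from the table above plus one made-up dominated point. It assumes OD600 and Abs590 are maximised and Nd remaining is minimised; Condition, dominates, pareto_front, and HYPOTHETICAL_X are illustrative names, not the API of scripts/three_way_pareto_round2.py.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    name: str
    od600: float   # biomass, higher is better (assumption)
    abs590: float  # redox signal, higher is better (assumption)
    nd_um: float   # Nd remaining in supernatant, lower is better (assumption)


def dominates(a: Condition, b: Condition) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a.od600 >= b.od600 and a.abs590 >= b.abs590 and a.nd_um <= b.nd_um
    better = a.od600 > b.od600 or a.abs590 > b.abs590 or a.nd_um < b.nd_um
    return no_worse and better


def pareto_front(conditions: list[Condition]) -> list[Condition]:
    """Keep every condition that no other condition dominates."""
    return [c for c in conditions if not any(dominates(o, c) for o in conditions)]


conditions = [
    Condition("MPOB_058", 0.241, 0.300, 2.58),
    Condition("MPOB_008", 0.231, 0.293, 2.07),
    Condition("MPOB_019", 0.208, 0.251, 1.70),
    Condition("HYPOTHETICAL_X", 0.200, 0.240, 2.90),  # made-up dominated point
]
front_names = [c.name for c in pareto_front(conditions)]
print(front_names)  # ['MPOB_058', 'MPOB_008', 'MPOB_019']
```

Each of the three reported conditions survives because it trades growth against Nd remaining: none beats another on all three axes at once.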

Round-2 analysis outputs live in 8 subdirectories under outputs/ (seven round2_* directories plus round1_vs_round2/):

outputs/round2_3way_pareto/ — joint OD600 × Abs590 × Nd Pareto.
outputs/round2_mc_pareto/ — Monte-Carlo Pareto stability under replicate σ; emits a 2-panel composite figure plus standalone histogram and scatter PDFs/PNGs.
outputs/round2_double_winners/ — cross-cluster join (only MPOB_008 is a majority growth ∩ Nd-uptake double winner).
outputs/round2_abiotic_correction/ — empirical t1→t2 abiotic drift per condition.
outputs/round2_t2_paired_biology/ — (biotic − abiotic) at t2.
outputs/round2_precipitation_risk/ — Q/Ksp NdPO₄ model (refuted by the abiotic data).
outputs/round2_kinetic_fits/ — per-condition kinetic fits.
outputs/round1_vs_round2/ — reproducibility report (Spearman ρ ≈ 0: measurement-modality drift, not biology drift).
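
The Monte-Carlo stability idea can be sketched as follows: perturb each condition's objectives with Gaussian noise scaled by its replicate σ, recompute the frontier, and report how often each condition stays on it. This is an illustrative sketch under stated assumptions, not the logic of scripts/mc_pareto_round2.py; the function names and the σ values in the usage lines are hypothetical.

```python
import random


def dominates(a, b):
    """a, b are (od600, abs590, nd_um): maximise the first two, minimise the third."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better


def front_frequency(means, sigmas, n_draws=1000, seed=0):
    """Fraction of Monte-Carlo draws in which each condition is non-dominated."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    for _ in range(n_draws):
        # Perturb every objective of every condition with Gaussian noise.
        draw = [
            tuple(rng.gauss(m, s) for m, s in zip(mu, sd))
            for mu, sd in zip(means, sigmas)
        ]
        for i, point in enumerate(draw):
            if not any(dominates(other, point) for other in draw):
                counts[i] += 1
    return [c / n_draws for c in counts]


# t2 means from the Pareto table; the sigmas are placeholders, not measured values.
means = [(0.241, 0.300, 2.58), (0.231, 0.293, 2.07), (0.208, 0.251, 1.70)]
sigmas = [(0.01, 0.01, 0.2)] * 3
frequencies = front_frequency(means, sigmas)
```

A frequency near 1 corresponds to a "stable" frontier member; values well below 1 correspond to "borderline" ones.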

Abs590 caveat: r(OD600, Abs590) = 0.982 across all 70 conditions — the redox channel adds no independent information beyond biomass in round 2 and is a candidate to drop in v16 unless the chemistry changes (e.g., a tetrazolium dye that decouples respiration from growth).

Round 3 (v16, planned)

The v16 design upgrades from 6 factors to 8 factors, adding Nd³⁺ (0–30 µM, 5-point grid) and citrate (0–300 µM) as first-class variables so that the next round can separate MxaF-MDH from XoxF-MDH contributions and resolve chemistry-vs-biology attribution. The full proposal:

v16_design_recommendation.md — factor ranges, t2-canonical 13-well anchor allocation (MPOB_058 × 4, MPOB_008 × 3, MPOB_019 × 2, plus 4 t3-only references × 1).
v16_bo_seeds.md — 10 Gaussian-Process + Expected-Improvement seed candidates (top predicted OD600 = 0.268 vs round-2 best 0.241).
nd_assay_alternatives_report.md + nd_assay_alternatives_1pager.md — recommends lanmodulin (LanM) fluorescence as primary HT readout (picomolar Nd affinity, 10⁸× Ca²⁺ selectivity, no per-plate calibration), cell-pellet ICP-MS on the 3 t2 Pareto winners + 5 BO seeds (≈ 32 samples) as the confirmatory subset, and an optional 1-plate arsenazo III bridge for cross-round comparability.

Installation

New collaborator? See the full Getting Started Guide for detailed setup including data downloads, database build, and troubleshooting.

Prerequisites

Python 3.10 or higher
uv package manager
just command runner

Quick Install

# Clone the repository
git clone --recurse-submodules https://github.com/CultureBotAI/MicroGrowAgents.git
cd MicroGrowAgents

# Install dependencies using uv
uv sync --group dev

# Download framework data (KG-Microbe, MediaDive, embeddings)
just download

# Download BER-CMM-AM1 project data (optional, for AM1 work)
just download-project

# Build database from downloaded sources
just build-db

# Verify installation
just test

Quick Start

DBTL: round-2 analysis (current campaign)

# Growth (Biolog 740 nm), Nd uptake (arsenazo III), and redox (Biolog 590 nm)
just analyze-experimental-round2     data/experimental/plate_designs_v10_maxprooptblock_long__round2_results
just analyze-experimental-round2-nd  data/experimental/plate_designs_v10_maxprooptblock_long__round2_results_asezuran
just analyze-experimental-round2-redox data/experimental/plate_designs_v10_maxprooptblock_long__round2_results

# Joint OD600 × Abs590 × Nd Pareto at the canonical t2 endpoint
uv run python scripts/three_way_pareto_round2.py
uv run python scripts/mc_pareto_round2.py           # MC stability + 2-panel figure

# Round-3 deliverables (already committed under outputs/round{2,3}_recommendations/)

Media concentration prediction

# Get MP medium concentrations
uv run python run.py gen-media-conc "MP medium"

# Custom ingredients with PubChem enrichment
uv run python run.py gen-media-conc "glucose,NaCl,KH2PO4" --mode ingredients --enrich pubchem

Sensitivity analysis

# Basic pH + salinity sweep
uv run python run.py sensitivity "MP medium"

# With all advanced properties
uv run python run.py sensitivity "MP medium" \
    --calculate-osmotic --calculate-redox --calculate-nutrients --plot

Agents & Skills

MicroGrowAgents provides 29 specialized agent classes, 52 user-facing Python skills, and 22 Claude Code slash commands for microbial cultivation and media design. Complete reference in docs/AGENTS_SKILLS_TOOLS.md.

Specialized Agents (29)

Knowledge & Reasoning:

KGReasoningAgent — query KG-Microbe (1.5M nodes, 5.1M edges)
LiteratureAgent — literature mining and evidence extraction
AnalogyReasoningAgent — chemical similarity search (208K+ embeddings)
SheetQueryAgent — query extended information sheets

Genome Analysis:

GenomeFunctionAgent — genome-guided media design (57 genomes, 667K features)
LanthanideGenesAgent — lanthanide-dependent gene analysis
TransporterAgent — nutrient transporter annotation and analysis

Media Design & Optimization:

MediaFormulationAgent — multi-source media recommendation
GenMediaConcAgent — ML-based concentration prediction
CofactorMediaAgent — cofactor requirement analysis
AlternateIngredientAgent — alternative ingredient suggestions
MediaRoleAgent — ingredient metabolic role classification
MaxProOptBlockAgent — MaxPro optimal blocking design generation
ReconcileAgent — experimental vs prediction reconciliation
EnsembleOptimizationAgent — response surface modelling and BO
DesignRecommendationAgent — interpret results to recommend next design
ExperimentalInterpretationAgent — evidence-based biological interpretations with inline citations

Metabolic Modeling:

MetabolicSourceAgent — metabolic source identification
GapMindAgent — GapMind pathway gap analysis
GEMsemblerAgent — genome-scale metabolic model reconstruction
GrowthCodonAgent — codon usage bias-based growth prediction
MediaMatchAgent — MediaDive database integration

Chemistry & Properties:

ChemistryAgent — osmotic, redox, nutrient-ratio calculations
SensitivityAnalysisAgent — parameter sweep and sensitivity analysis

Audit & Provenance: (new)

ResearchAuditor — orchestrates file / data / provenance / report auditors against the analysis pipeline (src/microgrowagents/agents/analysis/research_auditor.py).
SchemaReviewAgent — LinkML schema review helper.

Data Management:

SQLAgent — database queries
IngredientCooccurrenceAgent, IngredientEffectsEnrichmentAgent, CSVAllDOIsEnrichmentAgent
PDFEvidenceExtractor, EvidenceExtractionOrchestrator — multi-source evidence orchestration

Python Skills (52)

The src/microgrowagents/skills/ tree holds 64 skill modules, 52 of which are user-facing after the recent additions. Categories: Analysis (19+), Prediction & Design (12), Query & Search (5), Chemistry & Validation (5), Workflows (6), Utilities (3), Meta (2, new; includes validate_linkml). See docs/AGENTS_SKILLS_TOOLS.md §Skills for the complete categorical listing.

Claude Code Slash Commands (22)

Slash commands under .claude/skills/ callable from Claude Code:

Command	Purpose
`/plot` (new)	Publication-quality plots from data files via natural language. Backed by `scripts/microgrow-plot.py`; 11 plot types, 4 journal style presets (nature/science/minimal/dark), PNG/PDF/SVG output.
`/validate-linkml` (new)	Validate a LinkML schema + example pair. Backed by `src/microgrowagents/skills/meta/validate_linkml.py` and `scripts/validate_linkml_cli.py`.
`/recommend-media`	Multi-agent media formulation recommendation.
`/design-maxpro-optblock`	MaxPro+OptBlock experimental design generation.
`/lhs-design-generation`	Latin-hypercube design generation.
`/predict-concentration`	Predict ingredient concentration ranges.
`/predict-growth-cub`, `/predict-growth-hybrid`	Codon-usage-bias and hybrid growth predictors.
`/analyze-gaps`, `/analyze-limitations`, `/analyze-lanthanide-genes`, `/analyze-ingredient-cooccurrence`	Analysis utilities.
`/check-carbon-sources`, `/compare-gap-fba`	Carbon-source and gap-vs-FBA comparators.
`/fba-gene-knockout-lanthanophore`	FBA-based gene-knockout analysis for lanthanophore biosynthesis.
`/search-ingredients-hierarchical`, `/search-mediadive`	Search utilities.
`/ingredient-report`	Per-ingredient evidence report.
`/validate-media`, `/review-schema`, `/file-naming-conventions`	Validation + standards helpers.

Experimental Analysis Pipeline

Comprehensive dual-pipeline for analysing experimental growth data with both absolute (raw OD600) and relative (vs baseline) analysis modes, plus response surface modelling and Bayesian optimisation.

Features:

📊 Dual-mode analysis (absolute + relative).
🔬 Hierarchical clustering (276 replicates, 6 clusters).
🗺️ Gaussian Process response surfaces with multi-objective Pareto.
🤖 Ensemble optimisation (GP + polynomial + Random Forest).
🎯 Bayesian optimisation with Expected Improvement acquisition.
📈 ANOVA, main effects, Sobol sensitivity indices.
✅ Schema-driven validation of all outputs.
🔍 Evidence-based interpretation with inline citations.
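
For reference, Expected Improvement has a closed form when the surrogate's prediction at a candidate point is Gaussian with mean μ and standard deviation σ. The sketch below implements the standard formula for a maximisation objective; the function name and the xi exploration margin are illustrative choices, not the pipeline's own interface.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """Closed-form EI for maximisation given a Gaussian prediction (mu, sigma)."""
    gain = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf


# Predicted OD600 0.268 vs round-2 best 0.241; sigma = 0.02 is a placeholder.
ei = expected_improvement(mu=0.268, sigma=0.02, best=0.241)
```

EI rewards both a high predicted mean and high uncertainty, which is why it balances exploitation against exploration.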

Round-2 recipes

The round-2 data ships as Biolog raw CSVs, per-condition rollups, and per-well Nd predictions (a different layout from round-1's flat plate{1,2,3}.tsv). The adapter at scripts/build_round2_replicate_statistics.py converts these inputs into the round-1 schema so all downstream recipes run unchanged:

# Growth (Biolog 740 nm) — builds outputs/..._round2_results_experimental_analysis_absolute/
just analyze-experimental-round2 data/experimental/plate_designs_v10_maxprooptblock_long__round2_results

# Nd uptake (arsenazo III)
just analyze-experimental-round2-nd  data/experimental/plate_designs_v10_maxprooptblock_long__round2_results_asezuran

# Redox channel (Biolog 590 nm) — note the r=0.982 with OD600 (see DBTL §Round 2)
just analyze-experimental-round2-redox data/experimental/plate_designs_v10_maxprooptblock_long__round2_results

The following per-analysis scripts accept --endpoint-timepoint {t1,t2,t3}; default is t2 since 2026-05-15:

scripts/three_way_pareto_round2.py
scripts/mc_pareto_round2.py
scripts/compare_round1_vs_round2.py
scripts/analyze_round2_precipitation_risk.py
scripts/plot_pairwise_response_surfaces.py

The joint Pareto/BO driver (scripts/analyze_response_surfaces.py) auto-detects: tries t2 first, falls back to t3, then to max_* columns — no flag needed.
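
The fallback order described above (t2, then t3, then max_*) can be sketched as a simple column lookup. The column naming scheme here (OD600_t2, OD600_t3, max_OD600) is an assumption for illustration; the real names used by scripts/analyze_response_surfaces.py may differ.

```python
def pick_endpoint_column(columns, measurement="OD600"):
    """Return the first available endpoint column: t2, then t3, then max_*."""
    candidates = [
        f"{measurement}_t2",   # canonical endpoint (hypothetical column name)
        f"{measurement}_t3",   # legacy endpoint (hypothetical column name)
        f"max_{measurement}",  # last-resort summary column (hypothetical name)
    ]
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"no endpoint column found for {measurement!r}")


chosen = pick_endpoint_column(["condition", "OD600_t3", "max_OD600"])
print(chosen)  # OD600_t3
```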

Round-1 dual-mode analysis

The original (round-1) flat-file pipeline:

# Run BOTH absolute and relative analyses (recommended)
just analyze-experimental data/experimental/plate_designs_v10_maxprooptblock_long__results

# Either mode alone
just analyze-experimental-absolute data/experimental/plate_designs_v10_maxprooptblock_long__results
just analyze-experimental-relative data/experimental/plate_designs_v10_maxprooptblock_long__results

# Cluster
just cluster-experimental outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_absolute/v10_maxprooptblock_long__results_replicate_statistics_absolute.tsv outputs/plate_designs_v10_maxprooptblock_long__results_experimental_analysis_clustering_absolute absolute

# Validate
just validate-experimental plate_designs_v10_maxprooptblock_long__results

Modes:

Absolute (raw OD600): "Which conditions grew best overall?"
Relative (fold-change vs control): "Which variations improved over baseline media?"

Output directories (per source data ID, e.g., plate_designs_v10_maxprooptblock_long__results):

outputs/{source_data_id}_experimental_analysis_{mode}/
outputs/{source_data_id}_experimental_analysis_clustering_{mode}/

Every output file is labelled with the source data ID for full traceability (the auto-generated prefix removes the plate_designs_ portion of the directory name and adds a trailing underscore).
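
The prefix rule can be illustrated with a two-line helper. This is a sketch of the rule as stated, with a hypothetical function name, not the pipeline's own code.

```python
def output_prefix(source_dir_name: str) -> str:
    """Drop a leading 'plate_designs_' and append a trailing underscore."""
    return source_dir_name.removeprefix("plate_designs_") + "_"


prefix = output_prefix("plate_designs_v10_maxprooptblock_long__results")
print(prefix + "replicate_statistics_absolute.tsv")
# v10_maxprooptblock_long__results_replicate_statistics_absolute.tsv
```

The printed name matches the replicate-statistics file passed to cluster-experimental in the round-1 recipe.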

Response Surface Modeling

Optional response surface modelling using Gaussian Processes for ingredient-measurement relationships and multi-objective optimisation:

# Runs automatically with analyze-experimental (enabled by default)
just analyze-experimental data/experimental/plate_designs_v13_latinhypercube_long__results

# Faster analysis without surfaces
uv run python scripts/run_dual_analysis.py data/experimental/plate_designs_v10_maxprooptblock_long__results --disable-response-surfaces

# Standalone
uv run python scripts/analyze_response_surfaces.py \
    outputs/plate_designs_v13_latinhypercube_long__results_experimental_analysis_absolute/ \
    --mode absolute --measurements OD600 Nd_uM

Capabilities: 3D surface plots, Pareto frontiers, predictions over design space, contour maps.

Measurement interpretation:

OD600 — biomass; absolute = raw, relative = fold-change vs control.
Nd_uM — Nd remaining in supernatant. In round-2, the absolute value at t2 is what the canonical pipeline reports; in the round-1 relative framing, negative = more consumption than control baseline. Initial Nd dose: 5.5 µM (round-1), 15 µM (round-2).

Outputs (per mode):

response_surfaces/surface_predictions_{measurement}_{mode}.csv
response_surfaces/surface_3d_{measurement}_{mode}.pdf/png
response_surfaces/pareto_frontier_{mode}.csv/pdf/png
response_surfaces/optimization_report_{mode}.txt

Optimization Workflow

uv run python -m microgrowagents.skills.simple.optimize_growth_conditions \
    --data outputs/experimental_analysis \
    --source-data-id plate_designs_v10_maxprooptblock_long__results \
    --output-dir outputs/optimization \
    --strategy hybrid \
    --n-suggestions 69

Trains ensemble models (GP + Polynomial + Random Forest), analyses ingredient effects + interactions, and uses Bayesian optimisation to suggest next experiments. Strategies: bayesian, local, uncertainty, or hybrid (70% local + 15% uncertainty + 15% space-filling).
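
To make the hybrid split concrete, the sketch below divides a suggestion budget 70/15/15 using largest-remainder rounding. The rounding rule and all names are assumptions; the source states only the percentages.

```python
HYBRID_WEIGHTS = {"local": 70, "uncertainty": 15, "space_filling": 15}  # percent


def split_suggestions(n: int, weights: dict = HYBRID_WEIGHTS) -> dict:
    """Allocate n suggestions by weight, giving leftovers to the largest remainders."""
    total = sum(weights.values())
    counts = {k: n * w // total for k, w in weights.items()}
    remainders = {k: n * w % total for k, w in weights.items()}
    leftover = n - sum(counts.values())
    # Stable sort: ties keep the order in which strategies are listed.
    for k in sorted(weights, key=lambda k: -remainders[k])[:leftover]:
        counts[k] += 1
    return counts


allocation = split_suggestions(69)
print(allocation)  # {'local': 48, 'uncertainty': 11, 'space_filling': 10}
```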

Evidence-Based Interpretation

Generate publication-ready biological interpretations with inline citations and bibliography via ExperimentalInterpretationAgent:

from microgrowagents.agents.analysis import ExperimentalInterpretationAgent

agent = ExperimentalInterpretationAgent(source_version="v10")
result = agent.run()

Produces four artifacts:

INTERPRETATION_REPORT.md — clean biological interpretation (executive summary, factor-by-factor analysis, metabolic insights, testable hypotheses, recommendations for next design iteration).
INTERPRETATION_EVIDENCE.md — evidence companion file with data evidence E1–E# (cited file + section + snippet) and literature evidence L1–L# (DOIs).
INTERPRETATION_REPORT_evidence.md — citation-based version with inline [E1], [L2] markers and a complete bibliography.
interpretation_metadata.json — execution metadata.

See docs/EXPERIMENTAL_INTERPRETATION_AGENT.md for the complete documentation.

Cofactor & Chemistry Reference

Cofactor reference (60 cofactors)

The CofactorMediaAgent and the generate_cofactor_reference script integrate six sources (five databases: ChEBI, KEGG, BRENDA, ExplorEnz, and KG-Microbe, plus the literature) to produce two reference TSVs:

just generate-cofactor-reference
# emits:
#   data/references/cofactors_complete.tsv  (60 cofactors with CHEBI IDs, EC associations, usage tracking)
#   data/references/cofactors_metals.tsv    (19-cofactor metal/REE subset including 5 lanthanides)

Primary data sources:

ChEBI — chemical identifiers (DOI: 10.1093/nar/gkv1031)
KEGG — biosynthesis pathways (DOI: 10.1093/nar/gkac963)
BRENDA — EC ↔ cofactor relationships (DOI: 10.1093/nar/gky1048)
ExplorEnz — Enzyme Commission nomenclature (DOI: 10.1093/nar/gkn582)
KG-Microbe — enzyme-substrate relationships, pathway context.

Reference files (in-repo):

src/microgrowagents/data/cofactor_hierarchy.yaml — 44 cofactors across 5 categories (curated)
src/microgrowagents/data/ec_to_cofactor_map.yaml — 68 EC pattern mappings
data/references/cofactors_complete.tsv — 60 cofactors (generated)
data/references/cofactors_metals.tsv — 19 metals / REEs (generated)
data/processed/ingredient_cofactor_mapping.csv — 13 MP medium cofactor providers

Docs:

docs/COFACTOR_REFERENCE.md — data dictionary
docs/COFACTOR_REFERENCE_V3_USAGE.md — enrichment workflow
docs/cofactor_data_sources.md — methodology and citations

Chemistry modules

Module: src/microgrowagents/chemistry/

Osmotic Properties (osmotic_properties.py):

calculate_osmolarity(ingredients, temperature=25.0)
calculate_water_activity(ingredients, temperature=25.0, method="raoult")
estimate_van_hoff_factor(formula, charge, name)
Methods: Raoult's law (dilute), Robinson-Stokes (concentrated), Bromley (high ionic strength).
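
As a rough illustration of the dilute-limit case, ideal osmolarity is the sum of concentration times van 't Hoff factor, and Raoult's law approximates water activity as the mole fraction of water. The helper names, the (concentration, factor) input shape, and the 55.51 mol/L water constant are assumptions of this sketch; the signatures in osmotic_properties.py take an ingredients argument whose structure is not shown here.

```python
WATER_MOL_PER_L = 55.51  # approximate moles of water per litre (dilute solution)


def ideal_osmolarity_mosm(solutes):
    """solutes: iterable of (concentration_mM, van_t_hoff_factor) pairs."""
    return sum(conc * factor for conc, factor in solutes)


def raoult_water_activity(solutes):
    """Mole fraction of water, treating every osmole as one solute particle."""
    osmoles_per_l = ideal_osmolarity_mosm(solutes) / 1000.0
    return WATER_MOL_PER_L / (WATER_MOL_PER_L + osmoles_per_l)


# 100 mM NaCl (assumed fully dissociated, i = 2) plus 10 mM glucose (i = 1).
example = [(100.0, 2.0), (10.0, 1.0)]
osmolarity = ideal_osmolarity_mosm(example)  # 210.0 mOsm/L
water_activity = raoult_water_activity(example)
```

At higher ionic strength the ideal assumption breaks down, which is where the Robinson-Stokes and Bromley methods listed above apply.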

Redox Properties (redox_properties.py):

calculate_redox_potential(ingredients, ph, temperature=25.0) — Eh and pE via Nernst.
calculate_electron_balance(ingredients).
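
The relationship between Eh and pE follows from the Nernst equation: pE = Eh · F / (R · T · ln 10), roughly Eh / 0.0592 V at 25 °C. A small sketch with hypothetical helper names (not the functions in redox_properties.py):

```python
import math

R = 8.314462618   # gas constant, J / (mol K)
F = 96485.33212   # Faraday constant, C / mol


def nernst_potential(e0_volts, n_electrons, reaction_quotient, temperature_c=25.0):
    """E = E0 - (R T / (n F)) * ln(Q)."""
    t_kelvin = temperature_c + 273.15
    return e0_volts - (R * t_kelvin / (n_electrons * F)) * math.log(reaction_quotient)


def eh_to_pe(eh_volts, temperature_c=25.0):
    """Convert a redox potential Eh (V) to the dimensionless electron activity pE."""
    t_kelvin = temperature_c + 273.15
    return eh_volts * F / (R * t_kelvin * math.log(10.0))


pe_at_25c = eh_to_pe(0.05916)  # close to 1.0
```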

Nutrient Ratios (nutrient_ratios.py):

calculate_cnp_ratios(ingredients) — C:N:P, limiting-nutrient classification.
calculate_trace_metal_ratios(ingredients) — Fe:P, Mn:P, Zn:P with deficiency / excess flags.
Redfield ratio comparison (marine: 106:16:1, terrestrial: ~60:7:1).
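
A simple way to read a C:N:P ratio against the Redfield reference is to normalise to phosphorus and then ask which element is scarcest relative to the reference demand. The heuristic and function names below are illustrative assumptions and may not match the classification used in nutrient_ratios.py; the input concentrations are made up.

```python
REDFIELD_MARINE = (106.0, 16.0, 1.0)  # C:N:P


def cnp_ratio(c_mm, n_mm, p_mm):
    """Molar C:N:P normalised so that P = 1."""
    return (c_mm / p_mm, n_mm / p_mm, 1.0)


def limiting_nutrient(c_mm, n_mm, p_mm, reference=REDFIELD_MARINE):
    """Element with the smallest supply relative to the reference ratio."""
    supply = {
        "C": c_mm / reference[0],
        "N": n_mm / reference[1],
        "P": p_mm / reference[2],
    }
    return min(supply, key=supply.get)


ratio = cnp_ratio(120.0, 10.0, 2.0)            # (60.0, 5.0, 1.0)
limiting = limiting_nutrient(120.0, 10.0, 2.0)  # 'N'
```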

Thermodynamic Properties (thermodynamic_properties.py):

calculate_gibbs_free_energy(reactants, products, ph=7.0) — ΔG via eQuilibrator + Component Contribution.
calculate_formation_energy(compound) — ΔGf°.

Lanthanide chemistry

Specialised modules and docs for Nd³⁺ bioavailability — Ksp-based NdPO₄ precipitation, citrate / malate chelation, bioavailable-fraction calculation:

src/microgrowagents/chemistry/precipitation.py — NdPO₄ Ksp + activity-coefficient model.
src/microgrowagents/chemistry/chelation.py — citrate / malate Nd chelation.
src/microgrowagents/chemistry/bioavailability.py — bioavailable-fraction calculation.
docs/LANTHANIDE_BIOAVAILABILITY_COMPLETE.md — consolidated chemistry reference.
docs/LANTHANIDE_PRECIPITATION_IMPLEMENTATION_STATUS.md — Ksp values, activity-coefficient model, validated condition ranges, known limitations.

Cited from the round-3 Nd-assay recommendation (nd_assay_alternatives_report.md) as the chemistry rationale for the LanM + cell-pellet ICP-MS protocol.
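
The Q/Ksp precipitation check reduces to a saturation index, SI = log10(Q / Ksp), where Q is the ion activity product of free Nd³⁺ and PO₄³⁻; SI > 0 indicates supersaturation. The sketch below takes Ksp as an input rather than hard-coding a value, and the numbers in the usage lines are placeholders, not literature or project values. Function names are hypothetical and do not mirror precipitation.py.

```python
import math


def saturation_index(nd_activity: float, po4_activity: float, ksp: float) -> float:
    """SI = log10(Q / Ksp) with Q = a(Nd3+) * a(PO4 3-)."""
    ion_activity_product = nd_activity * po4_activity
    return math.log10(ion_activity_product / ksp)


def is_supersaturated(nd_activity: float, po4_activity: float, ksp: float) -> bool:
    """True when the ion activity product exceeds Ksp."""
    return saturation_index(nd_activity, po4_activity, ksp) > 0.0


# Placeholder activities and Ksp, chosen only to exercise the arithmetic.
si = saturation_index(nd_activity=1e-10, po4_activity=1e-14, ksp=1e-26)
```

Chelation by citrate or malate lowers the free Nd³⁺ activity, which lowers Q and therefore SI.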

Data Integrity, Provenance & Audit

Input Data Checksums

All input data files are protected with SHA256 checksums for cryptographic reproducibility (bbop-skills Criterion 4):

just verify-data-integrity   # check `data/checksums.txt` against current data files

Stored at data/checksums.txt (global) and outputs/*/input_data_checksums.json (per-analysis). Every analysis records checksums of its input files. See docs/ARTIFACT_CLEANUP_POLICY.md for the generation procedure.
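
A minimal sketch of what a checksum verification involves, assuming a sha256sum-style manifest with one "<hex digest>  <relative path>" entry per line. The actual format of data/checksums.txt and the logic behind just verify-data-integrity are not shown in this README, so treat the manifest layout and function names as assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return the relative paths whose current digest differs from the manifest."""
    root = Path(manifest_path).parent
    mismatches = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(root / rel_path) != expected:
            mismatches.append(rel_path)
    return mismatches


# Self-contained demo on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example.tsv"
    data_file.write_bytes(b"abc")
    manifest = Path(tmp) / "checksums.txt"
    manifest.write_text(f"{sha256_of(data_file)}  example.tsv\n")
    clean_result = verify_manifest(manifest)     # nothing changed
    data_file.write_bytes(b"abd")                # simulate silent data drift
    drifted_result = verify_manifest(manifest)   # now flags example.tsv
```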

ResearchAuditor

End-to-end auditing of the analysis pipeline, scored against the bbop-skills criteria for local-first agentic systems. Components:

Agent: src/microgrowagents/agents/analysis/research_auditor.py — orchestrator.
Provenance auditor: src/microgrowagents/provenance/auditor.py — replays a session's action log against actual file/directory state.
Utility auditors: src/microgrowagents/utils/ — data_auditor (input checksums), file_auditor (output presence + size + checksums), audit_report_generator (markdown rendering), audit_structures (shared dataclasses).
Schema: src/microgrowagents/schema/audit_outputs_schema.yaml — LinkML definitions for report objects.
Runner: scripts/run_research_audit.py (production), scripts/demo_research_audit.py (example).
Docs: docs/RESEARCH_AUDITOR.md, docs/RESEARCH_AUDITOR_IMPLEMENTATION.md.

uv run python scripts/run_research_audit.py \
    --session-id <session-uuid> \
    --output outputs/research_audit_<date>/

Artifact Cleanup Policy

Three-tier retention (docs/ARTIFACT_CLEANUP_POLICY.md):

Archival (keep): published designs (v10, v13, …), validated analysis with interpretations, response surface models.
Temporary (30 days): per-run analysis outputs, clustering, intermediate optimisation runs.
Ephemeral (7 days): test outputs, debug artifacts, scratch visualisations.

just archive-outputs      # move to archive/
just clean-old-outputs    # >30 days
just clean-ephemeral      # >7 days

Steady-state ~185 MB (with cleanup) vs ~4 GB/year unmanaged (96% reduction).

Audit Compliance

Overall: 78% (7/9 PASS) — full breakdown in docs/AUDIT_REPORT_BBOP_SKILLS.md:

✅ PASS (7): provenance tracking, model tracking, reasoning/code separation, validation (LinkML schemas + validators), error-correction (DOI validation + corrections), RAG (KG-Microbe + literature + genomes), artifact cleanup.
⚠️ PARTIAL (1): documentation/automation.
❌ FAIL (1): MCP integration (under consideration).

Action checklist with implementation steps + target dates: docs/AUDIT_ACTIONS_CHECKLIST.md.

Citation Coverage

DOI validation: 90.5% (143/158 DOIs) — 92 PDFs, 44 abstracts, 15 missing.

uv run python scripts/doi_validation/validate_failed_dois.py
uv run python scripts/doi_corrections/apply_doi_corrections.py
uv run python scripts/pdf_downloads/download_all_pdfs_automated.py

History: notes/DOI_CORRECTIONS_FINAL_UPDATED.md.

Core Capabilities

Media Concentration Generation (`gen-media-conc`)

Predicts LOW, DEFAULT, and HIGH concentration ranges for media ingredients:

uv run python run.py gen-media-conc "MP medium"
uv run python run.py gen-media-conc "PIPES,NaCl,glucose" --mode ingredients
uv run python run.py gen-media-conc "MP medium" --enrich pubchem

Output: predicted concentration ranges (mM), molecular weights, chemical formulas, confidence scores.

Sensitivity Analysis (`sensitivity`)

uv run python run.py sensitivity "MP medium"
uv run python run.py sensitivity "MP medium" --calculate-osmotic --calculate-nutrients
uv run python run.py sensitivity "MP medium" --plot --plot-output analysis.png

Calculates pH, salinity (TDS + NaCl-equivalent), ionic strength; optionally osmotic / redox / nutrient ratios.

Media Comparison (`compare-media`)

uv run python run.py compare-media "MP medium" "LB medium"

Common vs unique ingredients, concentration differences.

Media Formulation Recommendation (`recommend-media`)

from microgrowagents.skills.workflows import RecommendMediaWorkflow

workflow = RecommendMediaWorkflow()
result = workflow.run(
    query="Recommend medium for methanotrophic bacteria",
    organism="Methylococcus capsulatus",
    temperature=42.0, pH=6.8,
    carbon_source="methane", oxygen="aerobic",
    goals="defined,selective",
    output_format="markdown",
)

Multi-source evidence integration (KG-Microbe + literature + MP database). Complete formulation with ingredient list, concentrations, roles, alternatives, confidence scores. Goal presets: minimal, defined, complex, cost_effective, high_yield, selective. Full skill docs in .claude/skills/recommend-media.md.

Genome Function Interpretation

Organism-specific media design using 57 Bakta-annotated genomes (667,502 features). EC-number queries with wildcard support (1.1.*.*), auxotrophy detection, cofactor analysis, transporter analysis. Automatically integrated into MediaFormulationAgent, GenMediaConcAgent, and KGReasoningAgent. See docs/GENOME_FUNCTION.md for examples.
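
Wildcard EC queries such as 1.1.*.* can be illustrated with shell-style pattern matching from the standard library. This is only a sketch of the matching idea with a hypothetical helper; how GenomeFunctionAgent implements wildcards internally is not described here.

```python
from fnmatch import fnmatchcase


def match_ec(pattern, ec_numbers):
    """Return the EC numbers matching a shell-style pattern such as '1.1.*.*'."""
    return [ec for ec in ec_numbers if fnmatchcase(ec, pattern)]


annotated = ["1.1.2.7", "1.1.1.1", "2.7.1.1", "1.10.3.1"]
hits = match_ec("1.1.*.*", annotated)
print(hits)  # ['1.1.2.7', '1.1.1.1']
```

Note that 1.10.3.1 is correctly excluded: the literal dot after the second field prevents 1.1 from matching 1.10.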

Advanced Usage

Combining Multiple Property Calculations

uv run python run.py sensitivity "MP medium" \
    --calculate-osmotic --calculate-redox --calculate-nutrients \
    --ph 7.0 --temperature 30 \
    --format json --output complete_analysis.json

Pipeline Mode

uv run python run.py gen-media-conc "MP medium" --format json > predictions.json
uv run python run.py sensitivity --input-file predictions.json --calculate-osmotic

Python API

from microgrowagents.agents.sensitivity_analysis_agent import SensitivityAnalysisAgent

agent = SensitivityAnalysisAgent(db_path="data/microgrowdb.db")
result = agent.run(
    query="MP medium",
    mode="medium",
    calculate_osmotic=True, calculate_redox=True, calculate_nutrients=True,
    temperature=37.0,
)
print(f"pH: {result['baseline']['ph']}")
print(f"Limiting nutrient: {result['baseline']['nutrient_ratios']['limiting_nutrient']}")

Repository Structure

MicroGrowAgents/
├── CLAUDE.md                        # Project instructions for Claude Code
├── README.md                        # this file
├── justfile + project.justfile      # task recipes
├── pyproject.toml + uv.lock         # dependencies (uv)
│
├── src/microgrowagents/
│   ├── agents/                      # 29 specialized agent classes
│   │   ├── analysis/                # research_auditor, schema_review, interpretation, design_recommendation
│   │   └── …
│   ├── skills/                      # 52 user-facing skills (64 .py modules)
│   │   ├── analysis/{experimental,statistical,visualization}/
│   │   ├── core/{chemistry,genome,knowledge,modeling}/
│   │   ├── design/{doe,media,validation}/
│   │   ├── meta/                    # validate_linkml (new)
│   │   └── workflows/, utilities/, formatters/, executors/, simple/, development/
│   ├── chemistry/                   # osmotic, redox, nutrient_ratios, thermodynamic, precipitation, chelation, bioavailability
│   ├── provenance/                  # auditor (ResearchAuditor backend)
│   ├── utils/                       # audit_{report_generator,structures}, data_auditor, file_auditor, checksums, …
│   └── schema/                      # LinkML schemas including audit_outputs_schema.yaml
│
├── scripts/                         # Integration / analysis scripts
│   ├── build_round2_replicate_statistics.py  # DBTL2 adapter
│   ├── mc_pareto_round2.py, three_way_pareto_round2.py, …
│   ├── microgrow-plot.py            # /plot skill backend
│   ├── validate_linkml_cli.py       # /validate-linkml skill CLI
│   ├── run_research_audit.py, demo_research_audit.py
│   ├── generate_cofactor_reference.py + enhance_cofactor_references_v3.py + validate_enhance_cofactors_metals.py
│   ├── generate_ko_to_{ec,go_map_{bakta,uniprot}}.py
│   ├── generate_architecture_diagrams_{simplified,abstract,vivid}.py
│   ├── generate_explanatory_heatmap.py + regenerate_*heatmap.py + generate_provenance_heatmap.py
│   ├── compute_toxicity_report_bioavailability.py, extract_presentation_data.py
│   └── doi_validation/, doi_corrections/, pdf_downloads/, enrichment/, schema/
│
├── tests/                           # 1900+ pytest tests across modules
├── data/
│   ├── raw/                         # source data with checksums
│   ├── experimental/
│   │   ├── plate_designs_v10_maxprooptblock_long__results/                       # round 1 OD600
│   │   ├── plate_designs_v10_maxprooptblock_long__results_asezuran/              # round 1 arsenazo III
│   │   ├── plate_designs_v10_maxprooptblock_long__round2_results/                # round 2 Biolog
│   │   └── plate_designs_v10_maxprooptblock_long__round2_results_asezuran/       # round 2 arsenazo III
│   ├── references/                  # cofactors_{complete,metals}.tsv
│   ├── corrections/, results/, sheets_cmm/, pdfs/, designs/
│   └── checksums.txt
│
├── outputs/
│   ├── round1_vs_round2/, round2_3way_pareto/, round2_mc_pareto/, round2_double_winners/,
│   ├── round2_abiotic_correction/, round2_t2_paired_biology/, round2_kinetic_fits/,
│   ├── round2_precipitation_risk/   # round-2 analyses (the dirs above are reproducible from scripts/)
│   ├── round2_recommendations/      # v16_design_recommendation.md, v16_bo_seeds.md
│   ├── round3_recommendations/      # nd_assay_alternatives_report.md (+ _1pager.md)
│   ├── cofactor_analysis/, lanthanide_genes/, optimization/, media/
│   └── plate_designs_v10_*_experimental_analysis_{absolute,relative,clustering_*}/
│
├── docs/                            # MkDocs documentation
│   ├── STATUS.md, AGENTS_SKILLS_TOOLS.md, RESEARCH_AUDITOR.md, …
│   ├── COFACTOR_REFERENCE.md, LANTHANIDE_BIOAVAILABILITY_COMPLETE.md, …
│   ├── architecture/                # simplified / abstract / vivid diagram variants
│   └── figures/                     # explanatory heatmaps, provenance heatmap
│
├── notes/                           # research notes, DOI corrections, session summaries
└── .claude/
    ├── provenance/                  # session manifests + action logs
    └── skills/                      # 22 slash commands (/plot, /validate-linkml, …)
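
The tree above lists data/checksums.txt next to data/raw/. As a rough illustration of how such a manifest could be checked, here is a minimal sketch. It assumes sha256sum-style lines (`<hex digest>  <relative path>`); the actual format of checksums.txt is not shown in this README, and `verify_manifest` is a hypothetical helper, not a function from the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Path) -> dict[str, str]:
    """Return {relative_path: status}; status is 'ok', 'mismatch' or 'missing'.

    Assumption: each manifest line looks like '<hex digest>  <relative path>'.
    """
    results: dict[str, str] = {}
    for line in manifest.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / rel_path
        if not target.is_file():
            results[rel_path] = "missing"
        elif sha256_of(target) == expected.lower():
            results[rel_path] = "ok"
        else:
            results[rel_path] = "mismatch"
    return results
```

Under those assumptions, a call such as `verify_manifest(Path("data/checksums.txt"), Path("data"))` would report one status per listed file; whether manifest paths are relative to data/ or to the repository root is also an assumption.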

Development

# All tests + type checking + formatting
just test

# Targeted
uv run pytest tests/test_chemistry/test_osmotic_properties.py -v
uv run pytest --cov=microgrowagents --cov-report=html

# Type / format
just mypy
just format

# Documentation
just _serve          # local mkdocs
mkdocs build

The full test suite comprises roughly 2,000 tests across the agents, skills, chemistry, KG, validators, and scripts trees. Pass --cov-report=html to pytest (as in the targeted example above) to generate an HTML coverage report.

Documentation Website

https://CultureBotAI.github.io/MicroGrowAgents

Tools, APIs & Datasets

External tools, APIs, and datasets integrated with MicroGrowAgents.

External APIs

Chemical:

PubChem — chemical structures, properties, identifiers.
ChEBI — Chemical Entities of Biological Interest (DOI: 10.1093/nar/gkv1031).
eQuilibrator — biochemical thermodynamics, ΔG calculations.

Biological:

KEGG — pathway definitions (DOI: 10.1093/nar/gkac963).
BRENDA — enzyme info, EC ↔ cofactor (DOI: 10.1093/nar/gky1048).
ExplorEnz — EC nomenclature (DOI: 10.1093/nar/gkn582).
UniProt — protein sequences and annotations.
NCBI — genome sequences, taxonomy, literature.

Planned:

NIST WebBook — inorganic thermodynamic data.

Knowledge Graphs & Datasets

KG-Microbe — 1.5M nodes, 5.1M edges; 864,363 species (GTDB + LPSN + NCBI).
Genome annotations — 57 Bakta-annotated genomes, 667,502 features (incl. M. extorquens AM1, M. capsulatus).
Chemical embeddings — 208K+ Morgan fingerprints / descriptors for analogy-based reasoning.
MP Medium database — 158 ingredients × 68 columns, 158 unique DOIs, 90.5% citation coverage.
Literature corpus — 245+ papers with extracted excerpts.
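
Fingerprint-based analogy is commonly scored with the Tanimoto coefficient (shared on-bits divided by the union of on-bits). The sketch below shows that idea on plain sets of bit indices. How MicroGrowAgents actually compares its chemical embeddings is not stated here, so treat this as an assumption; the ingredient names and bit sets are made up, and real Morgan fingerprints would come from a cheminformatics library such as rdkit.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def nearest_analogs(
    query: set[int], library: dict[str, set[int]], top_k: int = 3
) -> list[tuple[str, float]]:
    """Rank library entries by similarity to the query, most similar first."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy fingerprints (hypothetical, for illustration only).
library = {
    "ingredient_A": {1, 5, 9, 12},
    "ingredient_B": {1, 5, 20},
    "ingredient_C": {40, 41},
}
print(nearest_analogs({1, 5, 9}, library, top_k=2))
# [('ingredient_A', 0.75), ('ingredient_B', 0.5)]
```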

External Software

Metabolic modelling: GapMind, GEMsembler, COBRApy.
Genome annotation: Bakta, NCBI BLAST.
Experimental design: MaxPro+OptBlock (custom), Latin Hypercube Sampling.
Growth prediction: GrowthCodon (codon usage bias), MediaDive.

Python Libraries

Scientific: numpy, pandas, scipy, scikit-learn.
Chemistry: rdkit, equilibrator-api.
Visualization: matplotlib, seaborn, plotly.
Database / KG: duckdb, sqlalchemy, linkml.
Optimization: scikit-optimize, pymoo, SALib, statsmodels.
PDF / PPTX: pypdf, python-pptx.
Dev: pytest, mypy, ruff, uv.

Data Provenance

data/raw/mp_medium_ingredient_properties.csv — ingredient data with DOIs.
docs/STATUS.md — citation coverage metrics.
notes/DOI_CORRECTIONS_FINAL_UPDATED.md — DOI validation history.
docs/cofactor_data_sources.md — cofactor source methodology.

Historical Appendix

Round 1 key findings (Feb–Mar 2026, v10 design, OD600)

Top performer: MPOB_040

Max OD600: 0.95 (highest overall).
Strategy: pure C1 methylotrophy (67.9 mM methanol, low succinate).
Challenge: 98% crash at 48 h due to methanol depletion.
Crash analysis: outputs/optimization/MPOB_040_CRASH_ANALYSIS.md.

Most stable: MPOB_053

Max OD600: 0.66 (sustained growth across all timepoints).
Strategy: mixed C1+C4 metabolism (19.9 mM methanol, 58.7 mM succinate).
Key finding: 40–60 mM succinate provides metabolic backup when methanol depletes, preventing culture crash while maintaining high peak growth.

These v10-era results motivated the round-2 design (Biolog dual-channel readout, arsenazo III at a higher Nd dose of 15 µM). Round 2 supersedes round 1 for current decision-making; see DBTL Campaign Status §Round 2.

v13 lanthanide-dependent growth design

v13 varied neodymium from 0 to 5 µM to distinguish the MxaF-MDH and XoxF-MDH pathways:

High OD600 at low Nd → lanthanide-independent (MxaF-MDH).
High OD600 at high Nd → lanthanide-dependent (XoxF-MDH).
Response surface modelling identifies Pareto-optimal conditions.
Multi-objective optimisation balances growth AND Nd utilisation.
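
The Pareto idea in the list above can be sketched as a non-dominated filter over two maximised objectives, growth and Nd utilisation. This is a generic illustration, not the repository's Pareto scripts; the `Condition` type, the condition names, and the numbers are hypothetical and are not experimental results.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    name: str
    growth: float     # e.g. peak OD600 (higher is better)
    nd_uptake: float  # e.g. fraction of Nd removed (higher is better)


def dominates(a: Condition, b: Condition) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a.growth >= b.growth and a.nd_uptake >= b.nd_uptake
    strictly_better = a.growth > b.growth or a.nd_uptake > b.nd_uptake
    return no_worse and strictly_better


def pareto_front(conditions: list[Condition]) -> list[Condition]:
    """Keep every condition that no other condition dominates."""
    return [
        c for c in conditions
        if not any(dominates(other, c) for other in conditions)
    ]


# Toy data (made up for illustration).
toy = [
    Condition("cond_A", 0.9, 0.2),
    Condition("cond_B", 0.6, 0.7),
    Condition("cond_C", 0.5, 0.5),  # dominated by cond_B on both objectives
    Condition("cond_D", 0.3, 0.9),
]
print([c.name for c in pareto_front(toy)])
# ['cond_A', 'cond_B', 'cond_D']
```

The pairwise check is quadratic in the number of conditions, which is fine for plate-scale designs; dedicated libraries such as pymoo provide faster non-dominated sorting.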

v16 extends the Nd³⁺ axis to 0–30 µM (5-point grid) and adds citrate (0–300 µM) so the round-3 experiment can directly probe chemistry-vs-biology attribution.

Reproducibility: round-1 vs round-2

Across 69 matched conditions, Spearman ρ is ≈ 0 for OD600 and ≈ −0.17 for Nd; 26 / 69 growth conditions and 42 / 69 Nd conditions disagree at |z| > 2. Between rounds, both the instrument (600 nm → Biolog 740 nm) and the Nd calibration (raw abs660 → Miller §5.10 inverse fit) changed, so this is measurement-modality drift, not biology drift. v16 therefore anchors on round-2 winners alone. Full analysis: outputs/round1_vs_round2/REPRODUCIBILITY_REPORT.md.
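
To make the two statistics concrete, here is a standard-library sketch of a rank correlation and a per-condition disagreement count. The z definition used below, the round-to-round difference divided by the combined standard error, is an assumption and may differ from the one in REPRODUCIBILITY_REPORT.md. The function names are hypothetical; in practice scipy.stats.spearmanr covers the correlation step.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def count_disagreements(
    round1: list[float], se1: list[float],
    round2: list[float], se2: list[float],
    threshold: float = 2.0,
) -> int:
    """Count matched conditions whose |z| exceeds the threshold.

    Assumption: z = (round2 - round1) / sqrt(se1^2 + se2^2).
    """
    count = 0
    for a, sa, b, sb in zip(round1, se1, round2, se2):
        z = (b - a) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(z) > threshold:
            count += 1
    return count
```

With matched per-condition means and standard errors from the two rounds, `spearman_rho` gives the rank agreement and `count_disagreements` gives the numerator of a "k / 69" style summary.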

Contributing

Contributions are welcome:

Fork the repository.
Create a feature branch.
Write tests for new functionality.
Ensure all tests pass (just test).
Submit a pull request.

License

BSD 3-Clause License. See LICENSE for details.

Credits

This project is based on the monarch-project-copier template.

Citation

If you use MicroGrowAgents in your research, please cite this repository.

Contact

Principal Investigator: Dr. Marcin P. Joachimiak

Institution: Lawrence Berkeley National Laboratory
Project: CultureBotAI Initiative
GitHub: CultureBotAI

For questions or issues:

Open an issue on GitHub Issues
See CLAUDE.md for development guidance
