A public benchmark for evaluating AI/ML and Foundation Models on NASA OSDR spaceflight transcriptomics data.
Version: v1.0-alpha (Dataset freeze: 2026-03-01)

Status: Phase 1 complete. 4 tissues GO (A2 Gastrocnemius, A4 Thymus, A5 Skin, A6 Eye pathway-level)
GeneLab Benchmark provides standardized tasks for evaluating how well machine learning models — from classical baselines to gene expression foundation models (Geneformer, scGPT) and text-based LLMs (GPT-4o, Claude) — generalize spaceflight transcriptomic signatures across missions.
Core challenge: Train a model on one spaceflight mission's RNA-seq data. Can it classify samples from a different mission it has never seen?
Data source: NASA Open Science Data Repository (OSDR) — mouse multi-tissue bulk RNA-seq from ISS and ground control missions.
- 6 tissues: Liver, Gastrocnemius, Kidney, Thymus, Skin, Eye
- 17 ISS missions: RR-1 through RR-9, MHU-1, MHU-2, and more
- 24 verified OSD studies, ~450 samples (binary Flight/Ground)
- 25+ evaluation tasks across 7 categories (A–D, J, NC, Validation)
- Leave-One-Mission-Out (LOMO) cross-validation — mission = independence unit, preventing cross-mission data leakage
- Category A: Spaceflight detection per tissue (binary: Flight vs. Ground)
- Category B: Cross-mission transfer matrix (train on mission i, test on mission j) for all 6 tissues
- Category C: Cross-tissue transfer (3 methods: gene, DEG, pathway)
- Category D: Condition/confounder prediction (mission, strain, hardware, gravity)
- 3-tier model evaluation: Classical ML → Gene Expression Foundation Models → Text LLMs
- Standardized submission format with automatic AUROC/CI/p-value evaluation
- Biological validation: NES pathway conservation, Cell 2020 concordance, negative controls
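The Leave-One-Mission-Out scheme listed above can be sketched as follows. This is a minimal illustration, not the code in `scripts/generate_tasks.py`; the helper name `lomo_folds` and the toy mission labels are assumptions made for the example.

```python
import numpy as np

def lomo_folds(missions):
    """Yield (held_out_mission, train_idx, test_idx), holding out one mission per fold."""
    missions = np.asarray(missions)
    for held_out in sorted(set(missions.tolist())):
        test_mask = missions == held_out
        # Every sample from the held-out mission goes to test, so no mission
        # contributes samples to both sides of a fold.
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy mission labels for 8 samples (hypothetical, not benchmark data).
sample_missions = ["RR-6", "RR-6", "RR-6", "RR-7", "RR-7", "MHU-2", "MHU-2", "MHU-2"]
folds = list(lomo_folds(sample_missions))
for held_out, train_idx, test_idx in folds:
    print(f"fold_{held_out}_test: train={len(train_idx)} test={len(test_idx)}")
```

Splitting by mission rather than by sample is what keeps mission-specific batch structure out of the training data for each fold.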
Gene-level (primary):
| Task | Tissue | Missions | Method | Mean AUROC | 95% CI lower | perm_p | Decision |
|---|---|---|---|---|---|---|---|
| A4 | Thymus | 4† | PCA-LR | 0.923 | 0.878 | 0.037 | ✓ GO |
| A2 | Gastrocnemius | 3 | LR | 0.907 | 0.717 | 0.026 | ✓ GO |
| A5 | Skin | 3§ | LR | 0.821 | 0.637 | 0.0023 | ✓ GO |
| A6 | Eye | 3 | LR | 0.811 | 0.470 | 0.063 | ✗ NO-GO‡ |
| A1 | Liver | 6 | LR | 0.653 | 0.457 | 0.091 | ✗ NO-GO |
| A3 | Kidney | 3 | LR | 0.593 | 0.431 | 0.281 | ✗ NO-GO |
Pathway-level (GSVA Hallmark, secondary):
| Task | Tissue | Method | Mean AUROC | 95% CI lower | perm_p | Decision |
|---|---|---|---|---|---|---|
| A6 | Eye | PCA-LR | 0.915 | 0.745 | 0.014 | ✓ GO⊕ |
| A3 | Kidney | LR | 0.755 | 0.481 | 0.071 | ✗ NO-GO |
Footnotes
- †A4 includes MHU-1 (Track 2b, GC/FLT strain mismatch — see PHASE1_RESULTS.md)
- §A5: MHU-2 = dorsal (OSD-238) + femoral (OSD-239) merged; RR-7 = OSD-254 C57BL/6J non-BSL subset (n=30)
- ‡A6 gene-level: AUROC passes but CI lower fails (n=9–16 per fold)
- ⊕A6 pathway-level: GSVA Hallmark 50-pathway scores rescue CI lower (0.470→0.745). Oxidative phosphorylation dominant.
| Task | Tissue | N pairs | Mean AUROC | 95% CI | AUROC≥0.70 | Tier |
|---|---|---|---|---|---|---|
| B4 | Thymus | 12 | 0.860 | [0.763, 0.953] | 9/12 | 1 |
| B2 | Gastrocnemius | 6 | 0.801 | [0.653, 0.944] | 4/6 | 1 |
| B5 | Skin | 6 | 0.772 | [0.691, 0.834] | 5/6 | 2 |
| B6 | Eye | 6 | 0.754 | [0.688, 0.838] | 5/6 | 2 |
| B1 | Liver | 30 | 0.577 | [0.492, 0.666] | 13/30 | 3 |
| B3 | Kidney | 6 | 0.555 | [0.397, 0.681] | 2/6 | 3 |
| Pair | Method A (Gene) | Method B (DEG) | Method C (Pathway) | Best |
|---|---|---|---|---|
| C1: liver→kidney | 0.730 | 0.441 NS | 0.483 NS | A |
| C2: liver→gastro | 0.563 NS | 0.676 | 0.867 | C |
| C3: liver→thymus | 0.350 NS | 0.621 | 0.184 (anti) | B |
| C4: thymus→kidney | 0.585 NS | 0.539 NS | 0.690 | C |
| Task | Tissue | N | Gene F1 | Pathway F1 | Gene p | Interpretation |
|---|---|---|---|---|---|---|
| D3 Mission ID (6-class) | Liver | 264 | 1.000 | 0.056 NS | <0.001 | Perfect batch separation; pathways batch-invariant |
| D4 Strain (2-class) | Thymus | 34 | 0.892 | 0.817 | 0.004 | Strain detectable from GC-only. EXPLORATORY (n_minority=3) |
| D5 Hardware (RR vs MHU) | Liver | 264 | 1.000 | 0.386 NS | <0.001 | Perfect gene separation; collinear with D3 |
| D5 Hardware (RR vs MHU) | Thymus | 92 | 1.000 | 0.352 NS | <0.001 | Perfect gene separation; collinear with D3 |
| D6 Gravity (3-class) | Liver | 9 | 0.886 | 0.413 NS | 0.002 | Microgravity separable from artificial gravity |
| D6 Gravity (3-class) | Thymus | 9 | 0.657 | 0.641 | 0.037 | Gene ≈ Pathway for gravity detection |
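Confounder hierarchy: D3 (mission F1=1.0) ≥ D5 (hardware F1=1.0, collinear) ≥ D4 (strain F1=0.89, exploratory). Pathway F1 for mission and hardware (D3, D5) is 0.06–0.39 (all NS), so pathway scores resist batch-confounder detection; strain (D4, pathway F1=0.82) and thymus gravity (D6, pathway F1=0.64) remain detectable at the pathway level.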
| Category | N | Gene wins | Pathway wins | Mean diff |
|---|---|---|---|---|
| A (Detection) | 5 | 3 | 2 | +0.032 |
| C (Cross-tissue) | 4 | 2 | 2 | -0.001 |
| D (Condition, D3–D6) | 6 | 6 | 0 | -0.462 |
| Total | 15 | 11 | 4 | -0.174 |
Notable finding — "Kidney Rescue": gene-level AUROC=0.43 (fail) → pathway-level AUROC=0.74 (success, +0.31). Eye shows similar rescue (0.79→0.92, +0.13).
See PHASE1_RESULTS.md for full results including per-fold tables, SHAP analysis, and pathway analysis.
| Hypothesis | Statement | Verdict | Key Evidence |
|---|---|---|---|
| H1 | Liver has the most consistent cross-mission transcriptome | REFUTED | Thymus (0.860) >> Liver (0.577). Thymus and Gastrocnemius = Tier 1. |
| H2 | Transfer failure from biological diversity, not batch effects | SUPPORTED | NES conservation r=0.9 (5 tissues). D3 pathway F1=0.06 (batch-invariant). limma_rbe mean delta=0.01. |
| H3 | Pathway-level preserves spaceflight response better than gene-level | CONDITIONALLY SUPPORTED | Kidney rescue (0.43→0.74), Eye (0.79→0.92). But tissue-pair dependent. |
Normalized Enrichment Score (NES) correlation between mission pairs predicts cross-mission transfer performance:
| Tissue | NES Mean r | Transfer AUROC | Spearman |
|---|---|---|---|
| Thymus | 0.619 | 0.860 | |
| Eye | 0.335 | 0.754 | |
| Skin | 0.147 | 0.772 | |
| Liver | 0.059 | 0.577 | |
| Kidney | 0.048 | 0.555 | |
5-tissue Spearman r = 0.9 (excluding gastrocnemius, which has incomplete fGSEA data). Original 4-tissue r = 1.0.
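As a worked check of the 5-tissue figure, the rank correlation can be recomputed from the table values above. This is a sketch using the tie-free rank-difference formula; the helper names `ranks` and `spearman` are illustrative and not part of the repository.

```python
def ranks(values):
    """Return 1-based ranks (assumes no ties, which holds for this table)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Values copied from the NES conservation table (gastrocnemius excluded).
nes_mean_r = {"Thymus": 0.619, "Eye": 0.335, "Skin": 0.147, "Liver": 0.059, "Kidney": 0.048}
transfer_auroc = {"Thymus": 0.860, "Eye": 0.754, "Skin": 0.772, "Liver": 0.577, "Kidney": 0.555}

tissues = list(nes_mean_r)
rho = spearman([nes_mean_r[t] for t in tissues], [transfer_auroc[t] for t in tissues])
print(round(rho, 3))  # 0.9
```

Only Eye and Skin swap ranks between the two columns, which gives a squared rank difference of 2 and hence 1 - 12/120 = 0.9.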
Validated against Beheshti et al. (Cell 2020, PMID 33242417) multi-omics consensus:
- Pathway direction concordance: 71.7% across 5 tissues (STRONG agreement)
- Gene SHAP top-50 overlap: 10.7% (47× above random chance)
- Tissue-specific: Thymus/Gastrocnemius 100%, Liver/Eye 67%, Kidney 25%
| Control | Method | Expected | Result |
|---|---|---|---|
| NC1 | Permutation test (28 entries) | AUROC ≈ 0.50 | 0.50 ± 0.03 |
| NC2 | Housekeeping genes only (50 genes) | AUROC ≈ 0.50 | 0.49–0.55 |
| Tissue | Top Enriched Pathways | Consistency |
|---|---|---|
| Liver | OXIDATIVE_PHOSPHORYLATION, FATTY_ACID_METABOLISM | Literature-concordant |
| Thymus | E2F_TARGETS, G2M_CHECKPOINT, IFN-gamma | Thymocyte proliferation |
| Gastrocnemius | OXIDATIVE_PHOSPHORYLATION, MYOGENESIS | Muscle metabolism |
| Kidney | MTORC1_SIGNALING, CHOLESTEROL_HOMEOSTASIS | Renal metabolism |
| Eye | OXIDATIVE_PHOSPHORYLATION (dominant 3/3 missions) | Retina metabolic demand |
| Skin | E2F_TARGETS, G2M_CHECKPOINT, EPITHELIAL_MESENCHYMAL_TRANSITION | Cell proliferation + ECM remodeling |
```
GeneLab_benchmark/
├── README.md  ← This file
├── PLAN.md  ← Benchmark design specification (v0.6)
├── DESIGN_DECISIONS.md  ← Architecture decisions log (DD-01 to DD-17)
├── DATA_CATALOG.md  ← Auto-generated OSDR inventory (24 studies)
├── CITATION.cff  ← Citation metadata
│
├── tasks/  ← Public task inputs (17 directories)
│   ├── A1_liver_lomo/  ← 6 folds + 3 variants (standard, ComBat, ISS-only)
│   ├── A2_gastrocnemius_lomo/  ← 3 folds
│   ├── A3_kidney_lomo/  ← 3 folds
│   ├── A4_thymus_lomo/  ← 4 folds + holdout
│   ├── A5_skin_lomo/  ← 3 folds
│   ├── A6_eye_lomo/  ← 3 folds
│   └── B1–B6_*_cross_mission/  ← N×(N-1) mission pairs per tissue
│
├── scripts/  ← Pipeline scripts (31 Python/R/shell, ~11K LOC)
│   ├── run_baselines.py  ← Classical ML baseline runner (LR, RF, XGBoost, PCA-LR)
│   ├── evaluate_submission.py  ← Submission evaluator (AUROC, CI, perm_p)
│   ├── generate_tasks.py  ← LOMO split generator
│   ├── cross_mission_transfer.py  ← Category B matrix generator
│   ├── cross_tissue_transfer.py  ← Category C: 3 methods
│   ├── condition_prediction.py  ← Category D: mission/strain/hardware/gravity
│   ├── gene_vs_pathway_comparison.py  ← J5: feature representation
│   ├── shap_analysis.py  ← SHAP feature importance
│   ├── run_fgsea.R  ← Group-level fGSEA enrichment
│   ├── compute_pathway_scores.R  ← Sample-level GSVA scores
│   ├── batch_correction_eval.py  ← J3: ComBat-seq, limma, RUVseq
│   ├── housekeeping_control.py  ← NC2: housekeeping gene baseline
│   ├── cell2020_validation.py  ← External validation vs Cell 2020
│   ├── compute_nes_conservation.py  ← NES pathway conservation
│   ├── geneformer_tokenize.py  ← Gene rank tokenization
│   ├── geneformer_finetune.py  ← BERT fine-tuning
│   └── utils.py  ← Shared utilities
│
├── docs/
│   ├── BIOLOGICAL_GROUND_TRUTH.md  ← Validation reference (Cell 2020, SOMA 2024)
│   ├── submission_format.md  ← JSON submission specification
│   ├── text_llm_format.md  ← Text LLM evaluation format (DD-16)
│   ├── hf_dataset_card.md  ← HuggingFace dataset documentation
│   └── development_history/
│       └── PHASE1_RESULTS.md  ← Full Phase 1 analysis
│
├── evaluation/  ← ~50 result JSON files
│   ├── A*_baseline_results.json  ← Per-tissue baseline results
│   ├── A*_shap_rf.json  ← SHAP rankings
│   ├── B_cross_mission_summary.json
│   ├── C_cross_tissue_summary.json
│   ├── D_condition_summary.json
│   ├── J3_batch_correction_comparison.json
│   ├── J5_gene_vs_pathway.json
│   ├── NC1_permutation_summary.json
│   ├── NC2_housekeeping_summary.json
│   ├── cell2020_validation.json
│   ├── NES_conservation_vs_transfer.json
│   ├── RESULTS_SUMMARY.md  ← Comprehensive results table
│   └── submission_*.json  ← Baseline submission files
│
└── processed/  ← Intermediate analysis outputs
    ├── A_detection/  ← Per-tissue LOMO data
    ├── B_cross_mission/  ← Transfer matrices + CI
    ├── C_cross_tissue/  ← 4 pairs × 3 methods
    ├── D_condition/  ← Condition prediction
    ├── fgsea/  ← 60 fGSEA results (6 tissues × missions × 3 DBs)
    ├── pathway_scores/  ← 54 GSVA files (5 tissues × missions × 3 DBs)
    └── qc_reports/
```
Feature matrices (train_X.csv, test_X.csv) are hosted on HuggingFace due to size (~2 GB). Labels, metadata, and fold structure are in this repository.
```bash
pip install -r requirements.txt huggingface_hub
```

```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Feature matrix comes from HuggingFace; labels come from this repository.
train_X = pd.read_csv(
    hf_hub_download(
        repo_id="jang1563/genelab-benchmark",
        filename="A5_skin_lomo/fold_RR-7_test/train_X.csv",
        repo_type="dataset",
    ),
    index_col=0,
)
train_y = pd.read_csv("tasks/A5_skin_lomo/fold_RR-7_test/train_y.csv", index_col=0)
print(f"Train: {train_X.shape}")  # (72, 20110)
```

Or download a full task at once:

```bash
python scripts/upload_to_hf.py --task A5 --dry-run  # preview
# After cloning HF data locally, run baselines:
python scripts/run_baselines.py --task A5 --model lr
```

Requires R 4.2+ with Bioconductor. See docs/r_dependencies.md.
```bash
# 1. Download raw data from NASA OSDR
python scripts/fetch_osdr.py --osd OSD-238 OSD-239 OSD-243 OSD-254
# 2. Normalize (DESeq2)
Rscript scripts/normalize_rr7_skin.R  # example
# 3. Quality filter + build all_missions
python scripts/quality_filter.py --tissue skin
# 4. Generate LOMO folds
python scripts/generate_tasks.py --task A5
```

```python
import pandas as pd

# Load A5 Skin, fold RR-7 test
train_X = pd.read_csv("tasks/A5_skin_lomo/fold_RR-7_test/train_X.csv", index_col=0)
train_y = pd.read_csv("tasks/A5_skin_lomo/fold_RR-7_test/train_y.csv", index_col=0)
test_X = pd.read_csv("tasks/A5_skin_lomo/fold_RR-7_test/test_X.csv", index_col=0)
test_y = pd.read_csv("tasks/A5_skin_lomo/fold_RR-7_test/test_y.csv", index_col=0)
print(f"Train: {train_X.shape}, Test: {test_X.shape}")
print(f"Train labels: {train_y.iloc[:,0].value_counts().to_dict()}")
# Features: Ensembl mouse gene IDs (e.g., ENSMUSG00000021969)
# Labels: 1.0 = Flight, 0.0 = Ground/Vivarium Control
```

```bash
python scripts/run_baselines.py --task A5 --model lr
python scripts/run_baselines.py --task A4 --model pca_lr
# A1 has multiple variants; select one explicitly
python scripts/run_baselines.py --task A1 --task-dir A1_liver_lomo --model lr
```

Prepare a JSON file (see docs/submission_format.md):
```json
{
  "task_id": "A5",
  "model_name": "MyModel_v1",
  "predictions": {
    "fold_MHU-2_test": {"sample_id_1": 0.92, "sample_id_2": 0.07},
    "fold_RR-6_test": {"...": "..."},
    "fold_RR-7_test": {"...": "..."}
  }
}
```

Evaluate:
```bash
python scripts/evaluate_submission.py \
  --submission my_submission.json \
  --task A5
# A1 example (variant must be explicit)
python scripts/evaluate_submission.py \
  --submission my_submission.json \
  --task A1 \
  --task-dir A1_liver_lomo
```

Goal: Binary classification (Flight vs. Ground) using Leave-One-Mission-Out CV.
| Task | Tissue | Missions | Samples (binary) | Folds | Status |
|---|---|---|---|---|---|
| A2 | Gastrocnemius | RR-1, RR-5, RR-9 | 32 | 3 | ✓ GO |
| A4 | Thymus | MHU-1†, MHU-2, RR-6, RR-9 | 67 | 4 | ✓ GO |
| A5 | Skin | MHU-2§, RR-6, RR-7 | 102 | 3 | ✓ GO |
| A6 | Eye | RR-1, RR-3, TBD | 37 | 3 | ✓ GO (pathway) |
| A1 | Liver | MHU-2, RR-1, RR-3, RR-6, RR-8, RR-9 | 193 | 6 | ✗ NO-GO |
| A3 | Kidney | RR-1, RR-3, RR-7 | 118 | 3 | ✗ NO-GO |
- †MHU-1 = Track 2b (GC strain = C57BL/6CR, FLT = C57BL/6J mismatch; see PHASE1_RESULTS.md)
- §MHU-2 = OSD-238 (dorsal) + OSD-239 (femoral) merged as single mission; RR-7 = OSD-254 C57BL/6J non-BSL subset
Input: Log2-normalized expression values for ~20,000 mouse genes (Ensembl IDs, e.g., ENSMUSG00000021969).
Label: 1.0 = Flight, 0.0 = Ground/Vivarium Control. Basal Control (BC) samples excluded.
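Given these input and label conventions, per-fold Flight scores could be assembled into the submission structure shown earlier roughly as follows. This is a sketch only: `build_submission` is a hypothetical helper, the sample IDs are placeholders, and the range check assumes scores are probabilities in [0, 1], which docs/submission_format.md may or may not require.

```python
import json

def build_submission(task_id, model_name, fold_predictions):
    """Assemble a submission dict: fold name -> {sample_id: Flight score}."""
    for fold, preds in fold_predictions.items():
        for sample_id, score in preds.items():
            # Assumption: scores are Flight probabilities, so they must lie in [0, 1].
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{fold}/{sample_id}: score {score} outside [0, 1]")
    return {"task_id": task_id, "model_name": model_name, "predictions": fold_predictions}

submission = build_submission(
    "A5",
    "MyModel_v1",
    # Placeholder sample IDs; a complete A5 submission would cover all three folds.
    {"fold_RR-7_test": {"sample_id_1": 0.92, "sample_id_2": 0.07}},
)
text = json.dumps(submission, indent=2)
print(text)
```

Keying predictions by fold and sample ID lets an evaluator score each held-out mission separately before averaging.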
Goal: Train on one mission, evaluate generalization to another (all N×(N-1) ordered pairs).
See processed/B_cross_mission/{tissue}/ for per-tissue AUROC matrices and evaluation/B_cross_mission_summary.json for aggregated results.
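The ordered-pair enumeration behind the transfer matrix can be sketched as below. `transfer_matrix` and `placeholder_score` are illustrative names, not functions from `scripts/cross_mission_transfer.py`; a real scorer would fit a model on the training mission and return the AUROC on the test mission.

```python
from itertools import permutations

def transfer_matrix(missions, score_fn):
    """Score every ordered (train, test) mission pair; same-mission pairs are skipped."""
    return {(train, test): score_fn(train, test) for train, test in permutations(missions, 2)}

def placeholder_score(train_mission, test_mission):
    # Hypothetical stand-in for "fit on train_mission, report AUROC on test_mission".
    return 0.5

thymus_missions = ["MHU-1", "MHU-2", "RR-6", "RR-9"]
matrix = transfer_matrix(thymus_missions, placeholder_score)
print(len(matrix))  # 12 ordered pairs for 4 missions
```

The pair counts follow directly from N×(N-1): 4 thymus missions give 12 pairs, 6 liver missions give 30, and 3-mission tissues give 6, matching the N pairs column in the Category B table.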
Goal: Train on tissue X, predict spaceflight status on tissue Y. Evaluates whether spaceflight signatures are shared across tissues.
Three transfer methods:
- Method A (Gene): Direct gene intersection transfer
- Method B (DEG): Differentially expressed gene overlap
- Method C (Pathway): GSVA Hallmark pathway score transfer
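For Method A, the gene-intersection step might look like the sketch below, assuming both tissues are stored as sample × gene tables as in the task CSVs. The helper name and the toy gene IDs are hypothetical; the actual logic lives in `scripts/cross_tissue_transfer.py` and may differ.

```python
import pandas as pd

def align_on_shared_genes(source_X, target_X):
    """Restrict two sample x gene matrices to their common genes, in the same column order."""
    shared = source_X.columns.intersection(target_X.columns)
    return source_X[shared], target_X[shared]

# Toy matrices with made-up gene IDs (not real Ensembl accessions).
liver = pd.DataFrame([[5.1, 2.0, 7.3]], index=["liver_s1"], columns=["geneA", "geneB", "geneC"])
kidney = pd.DataFrame([[4.8, 6.9, 1.2]], index=["kidney_s1"], columns=["geneB", "geneC", "geneD"])

liver_shared, kidney_shared = align_on_shared_genes(liver, kidney)
print(list(liver_shared.columns), list(kidney_shared.columns))
```

After alignment, a model trained on the source tissue sees exactly the same feature columns when applied to the target tissue.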
Goal: Predict confounding variables (mission identity, strain, hardware, gravity level) to quantify batch effects and biological confounders.
- D3: Mission ID (liver, 6-class) — batch effect quantification
- D4: Strain (thymus GC, C57BL/6J vs C57BL/6CR) — exploratory (n=3)
- D5: Hardware (RR vs MHU, liver + thymus) — collinear with D3
- D6: Gravity (MHU-2, uG/AG/GC, liver + thymus) — biological signal
Key finding: D3 gene F1=1.0 (perfect mission separation) vs pathway F1=0.06 (not significant) indicates that pathway scores are largely insensitive to mission batch effects.
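The F1 values in Category D are multi-class scores. The README does not state which averaging is used, so the sketch below assumes macro averaging (unweighted mean of per-class F1); `macro_f1` and the toy labels are illustrative only.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy 3-class gravity labels (uG / AG / GC) with one AG sample misclassified as GC.
y_true = ["uG", "uG", "uG", "AG", "AG", "AG", "GC", "GC", "GC"]
y_pred = ["uG", "uG", "uG", "AG", "AG", "GC", "GC", "GC", "GC"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Macro averaging weights every class equally, which matters for the imbalanced tasks such as D4 where the minority class has only three samples.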
Pre-computed baseline predictions are available in evaluation/ for reference and reproducibility.
Category A (LOMO)
| File | Task | Model | Mean AUROC | Go/No-Go |
|---|---|---|---|---|
| submission_PCALR_baseline_A4.json | A4 Thymus | PCA-LR (L2, lbfgs) | 0.923 | ✓ GO |
| submission_LR_baseline_A2.json | A2 Gastrocnemius | LR-ElasticNet (SAGA) | 0.917 | ✓ GO |
| submission_LR_baseline_A5.json | A5 Skin | LR (ElasticNet) | 0.821 | ✓ GO |
| submission_PCALR_baseline_A6.json | A6 Eye | PCA-LR (pathway) | 0.915 | ✓ GO |
Category B (Cross-Mission Transfer)
| Task | Tissue | N pairs | PCA-LR Mean AUROC | LFC Mean AUROC |
|---|---|---|---|---|
| B4 | Thymus | 12 | 0.860 | 0.868 |
| B2 | Gastrocnemius | 6 | 0.801 | 0.655 |
| B5 | Skin | 6 | 0.772 | 0.750 |
| B6 | Eye | 6 | 0.754 | 0.696 |
| B1 | Liver | 30 | 0.577 | 0.534 |
| B3 | Kidney | 6 | 0.555 | 0.465 |
Category B does not report a single GO/NO-GO — see DD-17 for evaluation criteria.
Evaluate a baseline submission:
```bash
# Category A
python scripts/evaluate_submission.py \
  --submission evaluation/submission_LR_baseline_A5.json \
  --task A5
# Category B (summary across all tissues)
python scripts/cross_mission_transfer.py --tissue skin
```

Reproducibility note: The official `A2_baseline_results.json` was computed with `max_iter=2000` (SAGA not fully converged for 15k genes). The baseline submission above uses `max_iter=10000` (converged); A2 mean AUROC improves from 0.907 → 0.917. GO/No-Go conclusion unchanged. See `PHASE1_RESULTS.md §B3` for details.
All submissions are evaluated with:
| Metric | Description | Go threshold |
|---|---|---|
| Mean AUROC | Average AUROC across folds | > 0.700 |
| 95% CI lower | Bootstrap CI (N=2000) lower bound | > 0.500 |
| perm_p | Permutation p-value (N=1000, pseudocount) | < 0.050 |
All three conditions must pass for a GO decision.
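The three criteria can be illustrated with a NumPy-only sketch. It is not the code in `scripts/evaluate_submission.py`: the function names are hypothetical, and resampling pooled samples for the bootstrap is an assumption, since the README does not say whether the CI is computed per fold or across folds.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic, using average ranks for tied scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score, kind="mergesort")] = np.arange(1, len(y_score) + 1)
    for s in np.unique(y_score):
        tied = y_score == s
        ranks[tied] = ranks[tied].mean()
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci_lower(y_true, y_score, n_boot=2000, seed=0):
    """2.5th percentile of AUROC over resamples that contain both classes."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if 0 < y_true[idx].sum() < len(idx):
            stats.append(auroc(y_true[idx], y_score[idx]))
    return float(np.percentile(stats, 2.5))

def permutation_p(y_true, y_score, n_perm=1000, seed=0):
    """Label-permutation p-value with a +1 pseudocount, so p is never exactly 0."""
    rng = np.random.default_rng(seed)
    observed = auroc(y_true, y_score)
    hits = sum(auroc(rng.permutation(y_true), y_score) >= observed for _ in range(n_perm))
    return (1 + int(hits)) / (1 + n_perm)

def go_decision(mean_auroc, ci_lower, perm_p):
    """GO only when all three thresholds from the table are met."""
    return bool(mean_auroc > 0.700 and ci_lower > 0.500 and perm_p < 0.050)

# Toy labels and scores (not benchmark data).
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9, 0.95])
observed = auroc(y, scores)
ci_low = bootstrap_ci_lower(y, scores)
p_value = permutation_p(y, scores)
print(observed, ci_low, p_value, go_decision(observed, ci_low, p_value))
```

Applied to the reported values, `go_decision(0.811, 0.470, 0.063)` returns `False` for A6 gene-level, because the CI lower bound and the permutation p-value both miss their thresholds even though the mean AUROC passes.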
| Track | Examples | Input Format |
|---|---|---|
| Tier 1 — Classical ML | LR, RF, XGBoost, PCA-LR | Tabular gene × sample |
| Tier 2 — Foundation Models | Geneformer (Mouse-GF) | Gene rank order (tokenized) |
| Tier 3 — Text LLMs | GPT-4o, Claude, Llama 3 | Natural language gene list (see DD-16) |
| Tissue | Geneformer | Baseline | Delta | Winner |
|---|---|---|---|---|
| Liver | 0.486 | 0.588 | -0.102 | Baseline |
| Gastrocnemius | 0.382 | 0.907 | -0.525 | Baseline |
| Kidney | 0.452 | 0.521 | -0.069 | Baseline |
| Thymus | 0.495 | 0.923 | -0.428 | Baseline |
| Skin | 0.557 | 0.821 | -0.265 | Baseline |
| Eye | 0.484 | 0.789 | -0.305 | Baseline |
| Mean | 0.476 | 0.758 | -0.283 | Baseline |
Mouse-Geneformer (6-layer BERT, 56K gene vocab, pretrained on 30M mouse scRNA-seq cells) underperforms classical ML across all 6 tissues. Consistent with literature: foundation models pretrained on single-cell data do not automatically transfer to small-sample (n=30-100) bulk transcriptomics.
For Tier 3 (Text LLM) input format specification, see DESIGN_DECISIONS.md (DD-16).
All data is derived from publicly available NASA OSDR datasets (24 studies, 6 tissues).
| Tissue | OSD Accession | Mission | n samples | Note |
|---|---|---|---|---|
| Liver | OSD-48 | RR-1 | 18 | Track 2a |
| Liver | OSD-137 | RR-3 | 20 | Track 2a |
| Liver | OSD-245 | RR-6 | 48 | Track 2a |
| Liver | OSD-379 | RR-8 | 40 | Track 2a |
| Liver | OSD-242 | RR-9 | 39 | Track 2a |
| Liver | OSD-686 | MHU-2 | 28 | Track 2a (uG/GC/AG 3-group) |
| Gastrocnemius | OSD-101 | RR-1 | 12 | Track 2a |
| Gastrocnemius | OSD-401 | RR-5 | 12 | Track 2a |
| Gastrocnemius | OSD-326 | RR-9 | 8 | Track 2a |
| Kidney | OSD-102 | RR-1 | 47 | Track 2a |
| Kidney | OSD-163 | RR-3 | 32 | Track 2a |
| Kidney | OSD-253 | RR-7 | 39 | Track 2a |
| Thymus | OSD-289 | MHU-1 | 6 | Track 2b (GC = C57BL/6CR) |
| Thymus | OSD-289 | MHU-2 | 6 | Track 2a |
| Thymus | OSD-244 | RR-6 | 35 | Track 2a |
| Thymus | OSD-421 | RR-9 | 20 | Track 2a |
| Skin | OSD-238 | MHU-2 (dorsal) | 18 | merged as "MHU-2" (6F+6GC+6VC; AG excluded) |
| Skin | OSD-239 | MHU-2 (femoral) | 17 | merged as "MHU-2" (5F+12GC; AG excluded) |
| Skin | OSD-243 | RR-6 | 37 | Track 2a |
| Skin | OSD-254 | RR-7 | 30 | C57BL/6J non-BSL subset only |
| Eye | OSD-100 | RR-1 | 12 | Track 2a |
| Eye | OSD-194 | RR-3 | 9 | Track 2a |
| Eye | OSD-397 | TBD | 16 | Track 2a |
Preprocessing: DESeq2 normalization (per-mission), log2(counts + 1), global low-expression filter (≥20% samples with count>1), top 75th percentile variance gene selection per fold (train missions only — DD-03).
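The filtering steps can be sketched on synthetic counts as follows. DESeq2 normalization is omitted, the function names are hypothetical, and "top 75th percentile variance" is read here as keeping genes whose train-only variance is at or above the 75th percentile; the repository's exact rule (DD-03) may differ.

```python
import numpy as np

def low_expression_mask(counts, min_fraction=0.20, min_count=1):
    """Keep genes with count > min_count in at least min_fraction of all samples."""
    return (counts > min_count).mean(axis=0) >= min_fraction

def train_variance_mask(train_log, percentile=75):
    """Keep genes whose variance, computed on training rows only, reaches the percentile."""
    variances = train_log.var(axis=0)
    return variances >= np.percentile(variances, percentile)

# Synthetic stand-in for normalized counts: 30 samples x 200 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=rng.uniform(0.1, 50, size=200), size=(30, 200)).astype(float)
is_train = np.arange(30) < 20  # hypothetical split: first 20 samples are training missions

expressed = low_expression_mask(counts)             # global filter over all samples
log_expr = np.log2(counts[:, expressed] + 1)        # log2(counts + 1)
selected = train_variance_mask(log_expr[is_train])  # test rows never influence selection

train_X = log_expr[is_train][:, selected]
test_X = log_expr[~is_train][:, selected]
print(train_X.shape, test_X.shape)
```

Computing the variance threshold from training rows only is the step that prevents the held-out mission from leaking into feature selection.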
- `run_baselines.py` and `shap_analysis.py` exclude `fold_*_holdout` by default.
- `shap_analysis.py` includes holdout only with `--include-holdout`.
- `evaluate_submission.py` accepts holdout predictions if provided, but does not require them.
- If one task ID matches multiple directories (for example A1), scripts now raise an ambiguity error unless `--task-dir` is provided.
- Geneformer `mouse_gf` path is configurable:
  - Tokenize: `--mouse-gf-base` or env `MOUSE_GF_BASE`
  - Finetune: `--mouse-gf-model-dir` or env `MOUSE_GF_MODEL_DIR`
Key methodological choices are documented in DESIGN_DECISIONS.md:
- DD-01: Feature = log2(DESeq2 normalized counts) — LFC forbidden in Category A (label leakage)
- DD-03: LOMO-aware variance filter (train missions only — no test leakage)
- DD-04: Mission = independence unit for LOMO (not sample)
- DD-06: Track 2a = C57BL/6J only; Track 2b = all strains
- DD-08: Evaluation metrics (AUROC + bootstrap CI + permutation p)
- DD-11: Go/No-Go decision criteria (3 AND conditions)
- DD-12: Negative controls (NC1 permutation, NC2 housekeeping, NC3 cross-species)
- DD-13: Baseline model set (LR, RF, XGBoost, PCA-LR)
- DD-15: Pathway analysis (fGSEA group-level + GSVA sample-level)
- DD-16: Text LLM evaluation track specification
- DD-17: Category B evaluation criteria (Transfer Pattern Summary, perm_p floor)
| Version | Date | Changes |
|---|---|---|
| v1.0-alpha | 2026-03-01 | Phase 1 complete. 4 tissues GO (A2+A4+A5 gene-level, A6 pathway-level). Category B–D all 6 tissues. J5 gene-vs-pathway (12 comparisons). NES conservation analysis. Cell 2020 external validation (71.7% concordance). Negative controls (NC1/NC2) pass. fGSEA 60 files, GSVA 54 files. Submission format + evaluator. Dataset freeze. |
| v1.0.1 | 2026-03-03 | Category D expanded: D4 strain + D5 hardware (liver/thymus) integrated into summary. J5 expanded to 15 comparisons. Confounder hierarchy documented. condition_prediction.py merge-on-write fix. |
| v1.1 | 2026-03-07 | Tier 2 Geneformer complete: Mouse-GF fine-tuned on 6 tissues (22 LOMO folds, Cayuga A40 GPU). Mean AUROC=0.476 vs Baseline 0.758; classical ML wins 6/6 tissues. |
| Version | Scope | Location |
|---|---|---|
| v1.0 | Mouse bulk RNA-seq, 6 tissues, 25 tasks, Tier 1 + Geneformer | Project root (scripts/, evaluation/, tasks/) |
| v2.0 | Cross-species, single-cell, spatial, microbiome | v2/ directory |
v1.0 is frozen at git tag v1.0. See v2/README.md for v2.0 scope and prerequisites.
(Manuscript in preparation)
```bibtex
@dataset{kang2026genelab,
  title = {GeneLab Benchmark: A Multi-Tissue Spaceflight Transcriptomics Benchmark for AI/ML Models},
  author = {Kang, Jaeyoung},
  year = {2026},
  url = {https://huggingface.co/datasets/jang1563/genelab-benchmark},
  note = {v1.0-alpha}
}
```

Data source: NASA Open Science Data Repository (OSDR), osdr.nasa.gov

Code: MIT License

Data: NASA OSDR public data (see individual dataset licenses at OSDR)