Negative Results Database & Dual ML/LLM Benchmark for Biomedical Sciences
Approximately 90% of scientific experiments produce null or inconclusive results, yet the vast majority remain unpublished. NegBioDB systematically collects experimentally confirmed negative results across four biomedical domains and provides dual-track ML + LLM benchmarks to quantify the impact of this publication bias on AI models.
- Four domains: Drug-Target Interaction (DTI), Clinical Trial Failure (CT), Protein-Protein Interaction (PPI), Gene Essentiality (GE/DepMap)
- ~61.6M negative results across 4 SQLite databases (30.5M DTI + 133K CT + 2.2M PPI + 28.8M GE)
- Dual benchmark: ML track (traditional prediction) + LLM track (biomedical NLP tasks)
- 242 ML runs + 321 LLM runs completed across all domains
- Multiple split strategies: random, cold-entity, temporal, scaffold, degree-balanced
- Reproducible pipeline: SQLite databases, config-driven ETL, SLURM/HPC support
- Standardized evaluation: 7 ML metrics (LogAUC, BEDROC, EF@1%, EF@5%, AUROC, AUPRC, MCC) + LLM rubrics
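For illustration, the cold-entity split strategy above can be sketched in a few lines. The function and argument names here are hypothetical, not the repo's actual export API:

```python
import random

def cold_entity_split(pairs, entity_idx=1, test_frac=0.2, seed=42):
    """Hold out whole entities so no test-set entity is seen in training.

    pairs: (compound, target, label) tuples; entity_idx=1 holds out targets,
    i.e. a cold-target split. Illustrative sketch, not the repo's exporter.
    """
    rng = random.Random(seed)
    entities = sorted({p[entity_idx] for p in pairs})
    rng.shuffle(entities)
    n_test = max(1, int(len(entities) * test_frac))
    held_out = set(entities[:n_test])
    train = [p for p in pairs if p[entity_idx] not in held_out]
    test = [p for p in pairs if p[entity_idx] in held_out]
    return train, test

# Toy data: 100 pairs over 10 targets; a 20% cold-target split holds out 2 targets
pairs = [(f"c{i}", f"t{i % 10}", i % 2) for i in range(100)]
train, test = cold_entity_split(pairs, entity_idx=1, test_frac=0.2)
assert {t for _, t, _ in train}.isdisjoint({t for _, t, _ in test})
```

The same idea yields cold-target, cold-gene, or cold-both splits by changing which entity (or pair of entities) is held out.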
| Domain | Negative Results | Key Entities | Sources | DB Size |
|---|---|---|---|---|
| DTI | 30,459,583 | 919K compounds, 3.7K targets | ChEMBL, PubChem, BindingDB, DAVIS | ~21 GB |
| CT | 132,925 | 177K interventions, 56K conditions | AACT, CTO, Open Targets, Shi & Du | ~500 MB |
| PPI | 2,229,670 | 18.4K proteins | IntAct, HuRI, hu.MAP, STRING | 849 MB |
| GE | 28,759,256 | 19,554 genes, 2,132 cell lines | DepMap (CRISPR, RNAi) | ~16 GB |
| Total | ~61.6M | | 14 sources | ~38 GB |
PPI DB total: 2,229,670; export rows after split filtering: 2,220,786.
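Per-source totals like those in the table can be audited with a simple aggregate query. The schema below is an assumption for illustration only; the real schemas live in the `migrations*/` directories:

```python
import sqlite3

# Illustrative only: table and column names here are assumptions, not the
# actual NegBioDB schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE negative_pairs (
    compound_id TEXT, target_id TEXT, source TEXT)""")
con.executemany(
    "INSERT INTO negative_pairs VALUES (?, ?, ?)",
    [("CHEMBL25", "P00533", "chembl"),
     ("CID2244", "P00533", "pubchem"),
     ("CID2244", "P35968", "bindingdb")],
)

# Audit per-source negative counts, mirroring the totals reported above
for source, n in con.execute(
        "SELECT source, COUNT(*) FROM negative_pairs "
        "GROUP BY source ORDER BY source"):
    print(f"{source}: {n}")
```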
| Domain | ETL | ML Benchmark | LLM Benchmark | Status |
|---|---|---|---|---|
| DTI | 4 sources | 24/24 runs | 81/81 runs | Complete |
| CT | 4 sources | 108/108 runs | 80/80 runs | Complete |
| PPI | 4 sources | 54/54 runs | 80/80 runs | Complete |
| GE | 2 sources | 14/14 runs (seed 42) | 64/80 runs* | Seed 42 ML complete, LLM 4/5 models |
*Llama 3.1-8B results pending HPC GPU availability; seeds 43/44 in progress.
DTI — Degree-matched negatives inflate LogAUC by +0.112 on average. Cold-target splits cause catastrophic failure (LogAUC 0.15–0.33), while AUROC misleadingly stays at 0.76–0.89.
| DTI Model (LogAUC) | Random (NegBioDB) | Random (Degree-Matched) | Cold-Target |
|---|---|---|---|
| DeepDTA | 0.833 | 0.919 | 0.325 |
| GraphDTA | 0.843 | 0.967 | 0.241 |
| DrugBAN | 0.830 | 0.955 | 0.151 |
PPI — PIPR cold_both AUROC drops to 0.409 (below random); MLPFeatures remains robust at 0.950 due to hand-crafted features.
CT — NegBioDB negatives are trivially separable (AUROC ~1.0); M2 7-way classification is challenging (best macro-F1 = 0.51).
GE — Cold-gene splits reveal severe generalization gaps; degree-balanced negatives modestly improve ranking metrics over random negatives.
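The degree-matched negative control referenced above can be sketched as follows; `degree_matched_negatives` is a hypothetical helper for illustration, not the repo's actual sampler:

```python
import random

def degree_matched_negatives(positives, n_neg, seed=0):
    """Sample negatives whose node frequencies mirror the positive set's.

    Hypothetical sketch: nodes are drawn with probability proportional to
    their degree among the positives, so hub entities appear in negatives
    as often as in positives (the easy setting flagged in the DTI results).
    """
    rng = random.Random(seed)
    pos = set(positives)
    left = [u for u, _ in positives]    # repetition encodes degree
    right = [v for _, v in positives]
    negs = set()
    while len(negs) < n_neg:
        pair = (rng.choice(left), rng.choice(right))
        if pair not in pos:
            negs.add(pair)
    return sorted(negs)

positives = [(f"d{i % 5}", f"t{i % 7}") for i in range(20)]
negatives = degree_matched_negatives(positives, n_neg=10)
assert len(negatives) == 10 and not set(negatives) & set(positives)
```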
| Domain | L4 MCC Range | Interpretation | Contamination |
|---|---|---|---|
| DTI | ≤ 0.18 | Near random | Not detected |
| PPI | 0.33–0.44 | Moderate | Yes (temporal gap) |
| CT | 0.48–0.56 | Meaningful | Not detected |
| GE | Pending full run | — | — |
PPI L4 reveals temporal contamination: pre-2015 interaction data is identified at 59–79% accuracy, while post-2020 data drops to 7–25%. LLMs rely on memorized training data, not biological reasoning.
Requires Python 3.11+ and uv.
```bash
git clone https://github.com/jang1563/NegBioDB.git
cd NegBioDB
make setup     # Create venv and install dependencies
make db        # Initialize SQLite database
make download  # Download all 4 sources (ChEMBL, PubChem, BindingDB, DAVIS)
make load-all  # Run all ETL loaders
uv run python scripts/export_ml_dataset.py  # Export ML datasets
```
```bash
# Download sources (AACT URL changes monthly)
uv run python scripts_ct/download_aact.py --url <AACT_URL>
uv run python scripts_ct/download_cto.py
uv run python scripts_ct/download_opentargets.py
uv run python scripts_ct/download_shi_du.py

# Load and process
uv run python scripts_ct/load_aact.py
uv run python scripts_ct/classify_failures.py
uv run python scripts_ct/resolve_drugs.py
uv run python scripts_ct/load_outcomes.py
uv run python scripts_ct/export_ct_ml_dataset.py
```
```bash
# Download sources
uv run python scripts_ppi/download_intact.py
uv run python scripts_ppi/download_huri.py
uv run python scripts_ppi/download_humap.py
uv run python scripts_ppi/download_string.py

# Load and process
uv run python scripts_ppi/load_intact.py
uv run python scripts_ppi/load_huri.py
uv run python scripts_ppi/load_humap.py
uv run python scripts_ppi/load_string.py
uv run python scripts_ppi/fetch_sequences.py
uv run python scripts_ppi/export_ppi_ml_dataset.py
```
```bash
# Download DepMap CRISPR and RNAi screens
uv run python scripts_depmap/download_depmap.py

# Load and process
uv run python scripts_depmap/load_depmap.py
uv run python scripts_depmap/load_rnai.py
uv run python scripts_depmap/fetch_gene_descriptions.py
uv run python scripts_depmap/export_ge_ml_dataset.py
```
```bash
# DTI training (local or SLURM)
uv run python scripts/train_baseline.py --model deepdta --split random --negative negbiodb --dataset balanced
bash slurm/submit_all.sh

# CT training
uv run python scripts_ct/train_ct_baseline.py --model xgboost --task m1 --split random --negative negbiodb
bash slurm/submit_ct_all.sh

# PPI training
uv run python scripts_ppi/train_baseline.py --model siamese_cnn --split random --negative negbiodb --dataset balanced
bash slurm/submit_ppi_all.sh

# GE training
uv run python scripts_depmap/train_ge_baseline.py --model xgboost --split random --negative negbiodb
bash slurm/submit_ge_ml_all.sh

# Results collection (all domains support --aggregate-seeds)
uv run python scripts/collect_results.py --dataset balanced --aggregate-seeds
uv run python scripts_ct/collect_ct_results.py --aggregate-seeds
uv run python scripts_ppi/collect_results.py --dataset balanced --aggregate-seeds
uv run python scripts_depmap/collect_ge_results.py --aggregate-seeds
```
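Seed aggregation reports a mean and standard deviation across seeds; a minimal sketch of that convention (the per-seed values below are made up for illustration):

```python
import statistics

# Hypothetical per-seed LogAUC values for one (model, split) configuration
runs = {42: 0.833, 43: 0.829, 44: 0.841}

# Aggregate across seeds: mean and sample standard deviation
mean = statistics.mean(runs.values())
std = statistics.stdev(runs.values())
print(f"LogAUC = {mean:.3f} ± {std:.3f} over seeds {sorted(runs)}")
```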
```bash
# Build LLM datasets (example: DTI)
uv run python scripts/build_l1_dataset.py
uv run python scripts/build_l2_dataset.py
uv run python scripts/build_l3_dataset.py
uv run python scripts/build_l4_dataset.py

# Run LLM inference
uv run python scripts/run_llm_benchmark.py --model gemini --level l1 --config zeroshot

# GE-specific LLM datasets and inference
uv run python scripts_depmap/build_ge_l1_dataset.py
uv run python scripts_depmap/run_ge_llm_benchmark.py --model gemini --level l1 --config zeroshot

# Collect results
uv run python scripts/collect_llm_results.py
uv run python scripts_ct/collect_ct_llm_results.py
uv run python scripts_ppi/collect_ppi_llm_results.py
uv run python scripts_depmap/collect_ge_results.py --llm
```
```bash
# All tests (~1,000 total across 4 domains)
PYTHONPATH=src uv run pytest tests/ -v

# By domain
PYTHONPATH=src uv run pytest tests/test_db.py tests/test_etl_*.py tests/test_export.py -v  # DTI
PYTHONPATH=src uv run pytest tests/test_ct_*.py tests/test_etl_aact.py -v    # CT
PYTHONPATH=src uv run pytest tests/test_ppi_*.py tests/test_etl_intact.py -v  # PPI
PYTHONPATH=src uv run pytest tests/test_ge_*.py tests/test_etl_depmap.py -v   # GE

# Skip network-dependent tests
PYTHONPATH=src uv run pytest tests/ -v -m "not integration"
```

```
NegBioDB/
├── src/
│   ├── negbiodb/                  # DTI core library
│   │   ├── db.py                  # Database creation & migrations
│   │   ├── download.py            # Download utilities (resume, checksum)
│   │   ├── standardize.py         # Compound/target standardization (RDKit)
│   │   ├── etl_davis.py           # DAVIS ETL pipeline
│   │   ├── etl_chembl.py          # ChEMBL ETL pipeline
│   │   ├── etl_pubchem.py         # PubChem ETL (streaming, 29M rows)
│   │   ├── etl_bindingdb.py       # BindingDB ETL pipeline
│   │   ├── export.py              # ML dataset export (Parquet, 5 splits)
│   │   ├── metrics.py             # ML evaluation metrics (7 metrics)
│   │   ├── llm_client.py          # LLM API client (vLLM, Gemini, OpenAI, Anthropic)
│   │   ├── llm_prompts.py         # LLM prompt templates (L1-L4)
│   │   ├── llm_eval.py            # LLM evaluation functions
│   │   └── models/                # ML baseline models
│   │       ├── deepdta.py         # DeepDTA (sequence CNN)
│   │       ├── graphdta.py        # GraphDTA (graph neural network)
│   │       └── drugban.py         # DrugBAN (bilinear attention)
│   ├── negbiodb_ct/               # Clinical Trial domain
│   │   ├── ct_db.py               # CT database & migrations
│   │   ├── etl_aact.py            # AACT ETL (13 tables)
│   │   ├── etl_classify.py        # 3-tier failure classification
│   │   ├── drug_resolver.py       # 4-step drug name resolution
│   │   ├── etl_outcomes.py        # Outcome enrichment (p-values, SAE)
│   │   ├── ct_export.py           # ML export (M1/M2, 6 splits)
│   │   ├── ct_features.py         # Feature encoding (1044/1066-dim)
│   │   ├── ct_models.py           # CT_MLP, CT_GNN_Tab models
│   │   ├── llm_prompts.py         # CT LLM prompts (L1-L4)
│   │   ├── llm_eval.py            # CT LLM evaluation
│   │   └── llm_dataset.py         # CT LLM dataset construction
│   ├── negbiodb_ppi/              # PPI domain
│   │   ├── ppi_db.py              # PPI database & migrations
│   │   ├── etl_intact.py          # IntAct PSI-MI TAB 2.7
│   │   ├── etl_huri.py            # HuRI Y2H screen negatives
│   │   ├── etl_humap.py           # hu.MAP ML-derived negatives
│   │   ├── etl_string.py          # STRING zero-score pairs
│   │   ├── protein_mapper.py      # UniProt validation, ENSG mapping
│   │   ├── export.py              # ML export (4 splits, controls)
│   │   ├── llm_prompts.py         # PPI LLM prompts (L1-L4)
│   │   ├── llm_eval.py            # PPI LLM evaluation
│   │   ├── llm_dataset.py         # PPI LLM dataset construction
│   │   └── models/                # PPI ML models
│   │       ├── siamese_cnn.py     # Shared CNN encoder
│   │       ├── pipr.py            # Cross-attention PPI model
│   │       └── mlp_features.py    # Hand-crafted feature MLP
│   └── negbiodb_depmap/           # Gene Essentiality (DepMap) domain
│       ├── depmap_db.py           # GE database & migrations
│       ├── etl_depmap.py          # DepMap CRISPR ETL
│       ├── etl_rnai.py            # RNAi screen ETL
│       ├── etl_prism.py           # PRISM drug screen ETL (optional)
│       ├── export.py              # ML export (5 splits, 770 MB parquet)
│       ├── ge_features.py         # Gene/cell-line feature encoding
│       ├── llm_prompts.py         # GE LLM prompts (L1-L4)
│       ├── llm_eval.py            # GE LLM evaluation
│       └── llm_dataset.py         # GE LLM dataset construction
├── scripts/                       # DTI CLI entry points
├── scripts_ct/                    # CT CLI entry points
├── scripts_ppi/                   # PPI CLI entry points
├── scripts_depmap/                # GE CLI entry points
├── slurm/                         # SLURM job scripts (HPC-ready, path-agnostic)
├── migrations/                    # DTI SQL schema migrations
├── migrations_ct/                 # CT SQL schema migrations
├── migrations_ppi/                # PPI SQL schema migrations
├── migrations_depmap/             # GE SQL schema migrations
├── tests/                         # Test suite (~1,000 tests across 4 domains)
├── docs/                          # Methodology notes and prompt appendices
├── paper/                         # LaTeX source (NeurIPS 2026 submission)
├── data/                          # SQLite databases (not in repo, ~38 GB)
├── exports/                       # ML/LLM export files (Parquet, not in repo)
├── results/                       # Experiment results (not in repo)
├── config.yaml                    # Pipeline configuration
├── Makefile                       # Build/pipeline commands
├── pyproject.toml                 # Python project metadata
├── experiment_results.md          # ML/LLM result tables (all 4 domains)
├── PROJECT_OVERVIEW.md            # Detailed project overview
└── ROADMAP.md                     # Execution roadmap
```
| File | Description |
|---|---|
| `negbiodb_dti_pairs.parquet` | 1.7M compound-target pairs with 5 split columns |
| `negbiodb_m1_balanced.parquet` | M1: 1.73M rows (1:1 active:inactive) |
| `negbiodb_m1_realistic.parquet` | M1: 9.49M rows (1:10 ratio) |
| `negbiodb_m1_balanced_ddb.parquet` | Exp 4: degree-balanced split |
| `negbiodb_m1_uniform_random.parquet` | Exp 1: uniform random negatives |
| `negbiodb_m1_degree_matched.parquet` | Exp 1: degree-matched negatives |
| `chembl_positives_pchembl6.parquet` | 863K ChEMBL actives (pChEMBL >= 6) |
| `compound_names.parquet` | 144K compound names (for LLM tasks) |
| File | Description |
|---|---|
| `negbiodb_ct_pairs.parquet` | 102,850 failure pairs, 6 splits, all tiers |
| `negbiodb_ct_m1_balanced.parquet` | Binary: 11,222 rows (5,611 pos + 5,611 neg) |
| `negbiodb_ct_m1_realistic.parquet` | Binary: 36,957 rows (1:~6 ratio) |
| `negbiodb_ct_m1_smiles_only.parquet` | Binary: 3,878 rows (SMILES-resolved only) |
| `negbiodb_ct_m2.parquet` | 7-way category: 112,298 rows (non-copper) |
| File | Description |
|---|---|
| `negbiodb_ppi_pairs.parquet` | 2,220,786 negative pairs with split columns |
| `ppi_m1_balanced.parquet` | M1: 123,456 rows (1:1 pos:neg) |
| `ppi_m1_realistic.parquet` | M1: 679,008 rows (1:10 ratio) |
| `ppi_m1_balanced_ddb.parquet` | Exp 4: degree-balanced split |
| `ppi_m1_uniform_random.parquet` | Exp 1: uniform random negatives |
| `ppi_m1_degree_matched.parquet` | Exp 1: degree-matched negatives |
| File | Description |
|---|---|
| `negbiodb_ge_pairs.parquet` | 770 MB; 22.5M gene-cell-line pairs with 5 split columns |
| `ge_m1_random.parquet` | Random split (train/val/test) |
| `ge_m1_cold_gene.parquet` | Cold-gene generalization split |
| `ge_m1_cold_cell_line.parquet` | Cold-cell-line generalization split |
| `ge_m1_cold_both.parquet` | Cold-both (hardest) split |
| `ge_m1_degree_balanced.parquet` | Degree-balanced negative control |
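Each export can be filtered by its split columns. The column names below are assumptions based on the split strategies listed earlier; the real files are Parquet on disk (readable e.g. via pandas):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the GE export schema; split column names
# (split_random, split_cold_gene) are assumptions for this sketch.
df = pd.DataFrame({
    "gene": ["TP53", "KRAS", "BRCA1", "MYC"],
    "cell_line": ["ACH-000001"] * 4,
    "label": [1, 0, 1, 0],
    "split_random": ["train", "train", "test", "val"],
    "split_cold_gene": ["train", "test", "test", "train"],
})
# In practice: df = pd.read_parquet("exports/ge_m1_cold_gene.parquet")

train = df[df["split_cold_gene"] == "train"]
test = df[df["split_cold_gene"] == "test"]
assert set(train["gene"]).isdisjoint(set(test["gene"]))  # cold-gene guarantee
```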
| Source | Records | License |
|---|---|---|
| ChEMBL v36 | 371K | CC BY-SA 3.0 |
| PubChem BioAssay | 29.6M | Public Domain |
| BindingDB | 404K | CC BY |
| DAVIS | 20K | Public |
| Source | Records | License |
|---|---|---|
| AACT (ClinicalTrials.gov) | 216,987 trials | Public Domain |
| CTO | 20,627 | MIT |
| Open Targets | 32,782 targets | Apache 2.0 |
| Shi & Du 2024 | 119K + 803K rows | CC BY 4.0 |
| Source | Records | License |
|---|---|---|
| IntAct | 779 pairs | CC BY 4.0 |
| HuRI | 500,000 pairs | CC BY 4.0 |
| hu.MAP 3.0 | 1,228,891 pairs | MIT |
| STRING v12.0 | 500,000 pairs | CC BY 4.0 |
| Source | Records | License |
|---|---|---|
| DepMap CRISPR (Chronos) | 28.7M gene-cell pairs | CC BY 4.0 |
| DepMap RNAi (DEMETER2) | Integrated | CC BY 4.0 |
| Metric | Role |
|---|---|
| LogAUC[0.001,0.1] | Primary ranking metric (early enrichment) |
| BEDROC (alpha=20) | Early enrichment (exponentially weighted) |
| EF@1%, EF@5% | Enrichment factor at top 1%/5% |
| AUPRC | Secondary ranking metric |
| MCC | Balanced classification |
| AUROC | Backward compatibility |
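A minimal sketch of two of these metrics (EF@k and MCC); the repo's reference implementations live in metrics.py and may differ in detail:

```python
def enrichment_factor(labels, scores, frac=0.01):
    """EF@frac: active rate in the top-scored fraction / overall active rate."""
    n = len(labels)
    k = max(1, int(n * frac))
    order = sorted(range(n), key=lambda i: -scores[i])
    top_rate = sum(labels[i] for i in order[:k]) / k
    overall_rate = sum(labels) / n
    return top_rate / overall_rate

def mcc(labels, preds):
    """Matthews correlation coefficient for binary labels and predictions."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, preds))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, preds))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# 10 actives among 1,000 compounds, all ranked first: maximal EF@1% of 100
labels = [1] * 10 + [0] * 990
scores = [1.0 - i / 1000 for i in range(1000)]
ef = enrichment_factor(labels, scores, frac=0.01)
```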
If you use NegBioDB in your research, please cite:

```bibtex
@misc{negbiodb2026,
  title={NegBioDB: A Negative Results Database and Dual ML/LLM Benchmark for Biomedical Sciences},
  author={Jang, James},
  year={2026},
  url={https://github.com/jang1563/NegBioDB}
}
```

CC BY-SA 4.0 — see LICENSE for details. This license is required by the viral (share-alike) clause in ChEMBL's CC BY-SA 3.0 license.