Negative Results Database & Dual ML/LLM Benchmark for Biomedical Sciences
Approximately 90% of scientific experiments produce null or inconclusive results, yet the vast majority remain unpublished. NegBioDB systematically collects experimentally confirmed negative results across four biomedical domains and provides dual-track ML + LLM benchmarks to quantify the impact of this publication bias on AI models.
- Four domains: Drug-Target Interaction (DTI), Clinical Trial Failure (CT), Protein-Protein Interaction (PPI), Gene Essentiality (GE/DepMap)
- ~61.6M negative results across 4 SQLite databases (30.5M DTI + 133K CT + 2.2M PPI + 28.8M GE)
- Dual benchmark: ML track (traditional prediction) + LLM track (biomedical NLP tasks)
- 242 ML runs + 321 LLM runs completed across all domains
- Multiple split strategies: random, cold-entity, temporal, scaffold, degree-balanced
- Reproducible pipeline: SQLite databases, config-driven ETL, SLURM/HPC support
- Standardized evaluation: 7 ML metrics (LogAUC, BEDROC, EF@1%, EF@5%, AUROC, AUPRC, MCC) + LLM rubrics
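For illustration, the cold-entity split strategy above can be sketched in a few lines. The function and argument names here are hypothetical, not the repo's actual export API:

```python
import random

def cold_entity_split(pairs, entity_idx=1, test_frac=0.2, seed=42):
    """Hold out whole entities so no test-set entity is seen in training.

    pairs: (compound, target, label) tuples; entity_idx=1 holds out targets,
    i.e. a cold-target split. Illustrative sketch, not the repo's exporter.
    """
    rng = random.Random(seed)
    entities = sorted({p[entity_idx] for p in pairs})
    rng.shuffle(entities)
    n_test = max(1, int(len(entities) * test_frac))
    held_out = set(entities[:n_test])
    train = [p for p in pairs if p[entity_idx] not in held_out]
    test = [p for p in pairs if p[entity_idx] in held_out]
    return train, test

# Toy data: 100 pairs over 10 targets; a 20% cold-target split holds out 2 targets
pairs = [(f"c{i}", f"t{i % 10}", i % 2) for i in range(100)]
train, test = cold_entity_split(pairs, entity_idx=1, test_frac=0.2)
assert {t for _, t, _ in train}.isdisjoint({t for _, t, _ in test})
```

The same idea yields cold-target, cold-gene, or cold-both splits by changing which entity (or pair of entities) is held out.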
| Domain | Negative Results | Key Entities | Sources | DB Size |
|---|---|---|---|---|
| DTI | 30,459,583 | 919K compounds, 3.7K targets | ChEMBL, PubChem, BindingDB, DAVIS | ~21 GB |
| CT | 132,925 | 177K interventions, 56K conditions | AACT, CTO, Open Targets, Shi & Du | ~500 MB |
| PPI | 2,229,670 | 18.4K proteins | IntAct, HuRI, hu.MAP, STRING | 849 MB |
| GE | 28,759,256 | 19,554 genes, 2,132 cell lines | DepMap (CRISPR, RNAi) | ~16 GB |
| Total | ~61.6M | | 14 sources | ~38 GB |
PPI DB total: 2,229,670; export rows after split filtering: 2,220,786.
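Per-source totals like those in the table can be audited with a simple aggregate query. The schema below is an assumption for illustration only; the real schemas live in the `migrations*/` directories:

```python
import sqlite3

# Illustrative only: table and column names here are assumptions, not the
# actual NegBioDB schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE negative_pairs (
    compound_id TEXT, target_id TEXT, source TEXT)""")
con.executemany(
    "INSERT INTO negative_pairs VALUES (?, ?, ?)",
    [("CHEMBL25", "P00533", "chembl"),
     ("CID2244", "P00533", "pubchem"),
     ("CID2244", "P35968", "bindingdb")],
)

# Audit per-source negative counts, mirroring the totals reported above
for source, n in con.execute(
        "SELECT source, COUNT(*) FROM negative_pairs "
        "GROUP BY source ORDER BY source"):
    print(f"{source}: {n}")
```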
| Domain | ETL | ML Benchmark | LLM Benchmark | Status |
|---|---|---|---|---|
| DTI | 4 sources | 24/24 runs | 81/81 runs | Complete |
| CT | 4 sources | 108/108 runs | 80/80 runs | Complete |
| PPI | 4 sources | 54/54 runs | 80/80 runs | Complete |
| GE | 2 sources | 14/14 runs (seed 42) | 64/80 runs* | Seed 42 ML complete, LLM 4/5 models |
*Llama 3.1-8B results pending HPC GPU availability; seeds 43/44 in progress.
DTI — Degree-matched negatives inflate LogAUC by +0.112 on average. Cold-target splits cause catastrophic failure (LogAUC 0.15–0.33), while AUROC misleadingly stays at 0.76–0.89.
| DTI Model (LogAUC) | Random (NegBioDB) | Random (Degree-Matched) | Cold-Target |
|---|---|---|---|
| DeepDTA | 0.833 | 0.919 | 0.325 |
| GraphDTA | 0.843 | 0.967 | 0.241 |
| DrugBAN | 0.830 | 0.955 | 0.151 |
PPI — PIPR cold_both AUROC drops to 0.409 (below random); MLPFeatures remains robust at 0.950 due to hand-crafted features.
CT — NegBioDB negatives are trivially separable (AUROC ~1.0); M2 7-way classification is challenging (best macro-F1 = 0.51).
GE — Cold-gene splits reveal severe generalization gaps; degree-balanced negatives modestly improve ranking metrics over random negatives.
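The degree-matched negative control referenced above can be sketched as follows; `degree_matched_negatives` is a hypothetical helper for illustration, not the repo's actual sampler:

```python
import random

def degree_matched_negatives(positives, n_neg, seed=0):
    """Sample negatives whose node frequencies mirror the positive set's.

    Hypothetical sketch: nodes are drawn with probability proportional to
    their degree among the positives, so hub entities appear in negatives
    as often as in positives (the easy setting flagged in the DTI results).
    """
    rng = random.Random(seed)
    pos = set(positives)
    left = [u for u, _ in positives]    # repetition encodes degree
    right = [v for _, v in positives]
    negs = set()
    while len(negs) < n_neg:
        pair = (rng.choice(left), rng.choice(right))
        if pair not in pos:
            negs.add(pair)
    return sorted(negs)

positives = [(f"d{i % 5}", f"t{i % 7}") for i in range(20)]
negatives = degree_matched_negatives(positives, n_neg=10)
assert len(negatives) == 10 and not set(negatives) & set(positives)
```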
| Domain | L4 MCC Range | Interpretation | Contamination |
|---|---|---|---|
| DTI | ≤ 0.18 | Near random | Not detected |
| PPI | 0.33–0.44 | Moderate | Yes (temporal gap) |
| CT | 0.48–0.56 | Meaningful | Not detected |
| GE | Pending full run | — | — |
PPI L4 reveals temporal contamination: pre-2015 interaction data is identified at 59–79% accuracy, while post-2020 data drops to 7–25%. LLMs rely on memorized training data, not biological reasoning.
Requires Python 3.11+ and uv.
```bash
git clone https://github.com/jang1563/NegBioDB.git
cd NegBioDB
make setup     # Create venv and install dependencies
make db        # Initialize SQLite database
make download  # Download all 4 sources (ChEMBL, PubChem, BindingDB, DAVIS)
make load-all  # Run all ETL loaders
uv run python scripts/export_ml_dataset.py  # Export ML datasets
```
```bash
# Download sources (AACT URL changes monthly)
uv run python scripts_ct/download_aact.py --url <AACT_URL>
uv run python scripts_ct/download_cto.py
uv run python scripts_ct/download_opentargets.py
uv run python scripts_ct/download_shi_du.py

# Load and process
uv run python scripts_ct/load_aact.py
uv run python scripts_ct/classify_failures.py
uv run python scripts_ct/resolve_drugs.py
uv run python scripts_ct/load_outcomes.py
uv run python scripts_ct/export_ct_ml_dataset.py
```
```bash
# Download sources
uv run python scripts_ppi/download_intact.py
uv run python scripts_ppi/download_huri.py
uv run python scripts_ppi/download_humap.py
uv run python scripts_ppi/download_string.py

# Load and process
uv run python scripts_ppi/load_intact.py
uv run python scripts_ppi/load_huri.py
uv run python scripts_ppi/load_humap.py
uv run python scripts_ppi/load_string.py
uv run python scripts_ppi/fetch_sequences.py
uv run python scripts_ppi/export_ppi_ml_dataset.py
```
```bash
# Download DepMap CRISPR and RNAi screens
uv run python scripts_depmap/download_depmap.py

# Load and process
uv run python scripts_depmap/load_depmap.py
uv run python scripts_depmap/load_rnai.py
uv run python scripts_depmap/fetch_gene_descriptions.py
uv run python scripts_depmap/export_ge_ml_dataset.py
```
```bash
# DTI training (local or SLURM)
uv run python scripts/train_baseline.py --model deepdta --split random --negative negbiodb --dataset balanced
bash slurm/submit_all.sh

# CT training
uv run python scripts_ct/train_ct_baseline.py --model xgboost --task m1 --split random --negative negbiodb
bash slurm/submit_ct_all.sh

# PPI training
uv run python scripts_ppi/train_baseline.py --model siamese_cnn --split random --negative negbiodb --dataset balanced
bash slurm/submit_ppi_all.sh

# GE training
uv run python scripts_depmap/train_ge_baseline.py --model xgboost --split random --negative negbiodb
bash slurm/submit_ge_ml_all.sh

# Results collection (all domains support --aggregate-seeds)
uv run python scripts/collect_results.py --dataset balanced --aggregate-seeds
uv run python scripts_ct/collect_ct_results.py --aggregate-seeds
uv run python scripts_ppi/collect_results.py --dataset balanced --aggregate-seeds
uv run python scripts_depmap/collect_ge_results.py --aggregate-seeds
```
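Seed aggregation reports a mean and standard deviation across seeds; a minimal sketch of that convention (the per-seed values below are made up for illustration):

```python
import statistics

# Hypothetical per-seed LogAUC values for one (model, split) configuration
runs = {42: 0.833, 43: 0.829, 44: 0.841}

# Aggregate across seeds: mean and sample standard deviation
mean = statistics.mean(runs.values())
std = statistics.stdev(runs.values())
print(f"LogAUC = {mean:.3f} ± {std:.3f} over seeds {sorted(runs)}")
```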
```bash
# Build LLM datasets (example: DTI)
uv run python scripts/build_l1_dataset.py
uv run python scripts/build_l2_dataset.py
uv run python scripts/build_l3_dataset.py
uv run python scripts/build_l4_dataset.py

# Run LLM inference
uv run python scripts/run_llm_benchmark.py --model gemini --level l1 --config zeroshot

# GE-specific LLM datasets and inference
uv run python scripts_depmap/build_ge_l1_dataset.py
uv run python scripts_depmap/run_ge_llm_benchmark.py --model gemini --level l1 --config zeroshot

# Collect results
uv run python scripts/collect_llm_results.py
uv run python scripts_ct/collect_ct_llm_results.py
uv run python scripts_ppi/collect_ppi_llm_results.py
uv run python scripts_depmap/collect_ge_results.py --llm
```
```bash
# All tests (~1,000 total across 4 domains)
PYTHONPATH=src uv run pytest tests/ -v

# By domain
PYTHONPATH=src uv run pytest tests/test_db.py tests/test_etl_*.py tests/test_export.py -v  # DTI
PYTHONPATH=src uv run pytest tests/test_ct_*.py tests/test_etl_aact.py -v    # CT
PYTHONPATH=src uv run pytest tests/test_ppi_*.py tests/test_etl_intact.py -v  # PPI
PYTHONPATH=src uv run pytest tests/test_ge_*.py tests/test_etl_depmap.py -v   # GE

# Skip network-dependent tests
PYTHONPATH=src uv run pytest tests/ -v -m "not integration"
```

```
NegBioDB/
├── src/
│   ├── negbiodb/                  # DTI core library
│   │   ├── db.py                  # Database creation & migrations
│   │   ├── download.py            # Download utilities (resume, checksum)
│   │   ├── standardize.py         # Compound/target standardization (RDKit)
│   │   ├── etl_davis.py           # DAVIS ETL pipeline
│   │   ├── etl_chembl.py          # ChEMBL ETL pipeline
│   │   ├── etl_pubchem.py         # PubChem ETL (streaming, 29M rows)
│   │   ├── etl_bindingdb.py       # BindingDB ETL pipeline
│   │   ├── export.py              # ML dataset export (Parquet, 5 splits)
│   │   ├── metrics.py             # ML evaluation metrics (7 metrics)
│   │   ├── llm_client.py          # LLM API client (vLLM, Gemini, OpenAI, Anthropic)
│   │   ├── llm_prompts.py         # LLM prompt templates (L1-L4)
│   │   ├── llm_eval.py            # LLM evaluation functions
│   │   └── models/                # ML baseline models
│   │       ├── deepdta.py         # DeepDTA (sequence CNN)
│   │       ├── graphdta.py        # GraphDTA (graph neural network)
│   │       └── drugban.py         # DrugBAN (bilinear attention)
│   ├── negbiodb_ct/               # Clinical Trial domain
│   │   ├── ct_db.py               # CT database & migrations
│   │   ├── etl_aact.py            # AACT ETL (13 tables)
│   │   ├── etl_classify.py        # 3-tier failure classification
│   │   ├── drug_resolver.py       # 4-step drug name resolution
│   │   ├── etl_outcomes.py        # Outcome enrichment (p-values, SAE)
│   │   ├── ct_export.py           # ML export (M1/M2, 6 splits)
│   │   ├── ct_features.py         # Feature encoding (1044/1066-dim)
│   │   ├── ct_models.py           # CT_MLP, CT_GNN_Tab models
│   │   ├── llm_prompts.py         # CT LLM prompts (L1-L4)
│   │   ├── llm_eval.py            # CT LLM evaluation
│   │   └── llm_dataset.py         # CT LLM dataset construction
│   ├── negbiodb_ppi/              # PPI domain
│   │   ├── ppi_db.py              # PPI database & migrations
│   │   ├── etl_intact.py          # IntAct PSI-MI TAB 2.7
│   │   ├── etl_huri.py            # HuRI Y2H screen negatives
│   │   ├── etl_humap.py           # hu.MAP ML-derived negatives
│   │   ├── etl_string.py          # STRING zero-score pairs
│   │   ├── protein_mapper.py      # UniProt validation, ENSG mapping
│   │   ├── export.py              # ML export (4 splits, controls)
│   │   ├── llm_prompts.py         # PPI LLM prompts (L1-L4)
│   │   ├── llm_eval.py            # PPI LLM evaluation
│   │   ├── llm_dataset.py         # PPI LLM dataset construction
│   │   └── models/                # PPI ML models
│   │       ├── siamese_cnn.py     # Shared CNN encoder
│   │       ├── pipr.py            # Cross-attention PPI model
│   │       └── mlp_features.py    # Hand-crafted feature MLP
│   └── negbiodb_depmap/           # Gene Essentiality (DepMap) domain
│       ├── depmap_db.py           # GE database & migrations
│       ├── etl_depmap.py          # DepMap CRISPR ETL
│       ├── etl_rnai.py            # RNAi screen ETL
│       ├── etl_prism.py           # PRISM drug screen ETL (optional)
│       ├── export.py              # ML export (5 splits, 770 MB parquet)
│       ├── ge_features.py         # Gene/cell-line feature encoding
│       ├── llm_prompts.py         # GE LLM prompts (L1-L4)
│       ├── llm_eval.py            # GE LLM evaluation
│       └── llm_dataset.py         # GE LLM dataset construction
├── scripts/                       # DTI CLI entry points
├── scripts_ct/                    # CT CLI entry points
├── scripts_ppi/                   # PPI CLI entry points
├── scripts_depmap/                # GE CLI entry points
├── slurm/                         # SLURM job scripts (HPC-ready, path-agnostic)
├── migrations/                    # DTI SQL schema migrations
├── migrations_ct/                 # CT SQL schema migrations
├── migrations_ppi/                # PPI SQL schema migrations
├── migrations_depmap/             # GE SQL schema migrations
├── tests/                         # Test suite (~1,000 tests across 4 domains)
├── docs/                          # Methodology notes and prompt appendices
├── paper/                         # LaTeX source (NeurIPS 2026 submission)
├── data/                          # SQLite databases (not in repo, ~38 GB)
├── exports/                       # ML/LLM export files (Parquet, not in repo)
├── results/                       # Experiment results (not in repo)
├── config.yaml                    # Pipeline configuration
├── Makefile                       # Build/pipeline commands
├── pyproject.toml                 # Python project metadata
├── experiment_results.md          # ML/LLM result tables (all 4 domains)
├── PROJECT_OVERVIEW.md            # Detailed project overview
└── ROADMAP.md                     # Execution roadmap
```
| File | Description |
|---|---|
| `negbiodb_dti_pairs.parquet` | 1.7M compound-target pairs with 5 split columns |
| `negbiodb_m1_balanced.parquet` | M1: 1.73M rows (1:1 active:inactive) |
| `negbiodb_m1_realistic.parquet` | M1: 9.49M rows (1:10 ratio) |
| `negbiodb_m1_balanced_ddb.parquet` | Exp 4: degree-balanced split |
| `negbiodb_m1_uniform_random.parquet` | Exp 1: uniform random negatives |
| `negbiodb_m1_degree_matched.parquet` | Exp 1: degree-matched negatives |
| `chembl_positives_pchembl6.parquet` | 863K ChEMBL actives (pChEMBL >= 6) |
| `compound_names.parquet` | 144K compound names (for LLM tasks) |
| File | Description |
|---|---|
| `negbiodb_ct_pairs.parquet` | 102,850 failure pairs, 6 splits, all tiers |
| `negbiodb_ct_m1_balanced.parquet` | Binary: 11,222 rows (5,611 pos + 5,611 neg) |
| `negbiodb_ct_m1_realistic.parquet` | Binary: 36,957 rows (1:~6 ratio) |
| `negbiodb_ct_m1_smiles_only.parquet` | Binary: 3,878 rows (SMILES-resolved only) |
| `negbiodb_ct_m2.parquet` | 7-way category: 112,298 rows (non-copper) |
| File | Description |
|---|---|
| `negbiodb_ppi_pairs.parquet` | 2,220,786 negative pairs with split columns |
| `ppi_m1_balanced.parquet` | M1: 123,456 rows (1:1 pos:neg) |
| `ppi_m1_realistic.parquet` | M1: 679,008 rows (1:10 ratio) |
| `ppi_m1_balanced_ddb.parquet` | Exp 4: degree-balanced split |
| `ppi_m1_uniform_random.parquet` | Exp 1: uniform random negatives |
| `ppi_m1_degree_matched.parquet` | Exp 1: degree-matched negatives |
| File | Description |
|---|---|
| `negbiodb_ge_pairs.parquet` | 770 MB; 22.5M gene-cell-line pairs with 5 split columns |
| `ge_m1_random.parquet` | Random split (train/val/test) |
| `ge_m1_cold_gene.parquet` | Cold-gene generalization split |
| `ge_m1_cold_cell_line.parquet` | Cold-cell-line generalization split |
| `ge_m1_cold_both.parquet` | Cold-both (hardest) split |
| `ge_m1_degree_balanced.parquet` | Degree-balanced negative control |
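Each export can be filtered by its split columns. The column names below are assumptions based on the split strategies listed earlier; the real files are Parquet on disk (readable e.g. via pandas):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the GE export schema; split column names
# (split_random, split_cold_gene) are assumptions for this sketch.
df = pd.DataFrame({
    "gene": ["TP53", "KRAS", "BRCA1", "MYC"],
    "cell_line": ["ACH-000001"] * 4,
    "label": [1, 0, 1, 0],
    "split_random": ["train", "train", "test", "val"],
    "split_cold_gene": ["train", "test", "test", "train"],
})
# In practice: df = pd.read_parquet("exports/ge_m1_cold_gene.parquet")

train = df[df["split_cold_gene"] == "train"]
test = df[df["split_cold_gene"] == "test"]
assert set(train["gene"]).isdisjoint(set(test["gene"]))  # cold-gene guarantee
```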
| Source | Records | License |
|---|---|---|
| ChEMBL v36 | 371K | CC BY-SA 3.0 |
| PubChem BioAssay | 29.6M | Public Domain |
| BindingDB | 404K | CC BY |
| DAVIS | 20K | Public |
| Source | Records | License |
|---|---|---|
| AACT (ClinicalTrials.gov) | 216,987 trials | Public Domain |
| CTO | 20,627 | MIT |
| Open Targets | 32,782 targets | Apache 2.0 |
| Shi & Du 2024 | 119K + 803K rows | CC BY 4.0 |
| Source | Records | License |
|---|---|---|
| IntAct | 779 pairs | CC BY 4.0 |
| HuRI | 500,000 pairs | CC BY 4.0 |
| hu.MAP 3.0 | 1,228,891 pairs | MIT |
| STRING v12.0 | 500,000 pairs | CC BY 4.0 |
| Source | Records | License |
|---|---|---|
| DepMap CRISPR (Chronos) | 28.7M gene-cell pairs | CC BY 4.0 |
| DepMap RNAi (DEMETER2) | Integrated | CC BY 4.0 |
| Metric | Role |
|---|---|
| LogAUC[0.001,0.1] | Primary ranking metric (early enrichment) |
| BEDROC (alpha=20) | Early enrichment (exponentially weighted) |
| EF@1%, EF@5% | Enrichment factor at top 1%/5% |
| AUPRC | Secondary ranking metric |
| MCC | Balanced classification |
| AUROC | Backward compatibility |
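A minimal sketch of two of these metrics (EF@k and MCC); the repo's reference implementations live in metrics.py and may differ in detail:

```python
def enrichment_factor(labels, scores, frac=0.01):
    """EF@frac: active rate in the top-scored fraction / overall active rate."""
    n = len(labels)
    k = max(1, int(n * frac))
    order = sorted(range(n), key=lambda i: -scores[i])
    top_rate = sum(labels[i] for i in order[:k]) / k
    overall_rate = sum(labels) / n
    return top_rate / overall_rate

def mcc(labels, preds):
    """Matthews correlation coefficient for binary labels and predictions."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, preds))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, preds))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# 10 actives among 1,000 compounds, all ranked first: maximal EF@1% of 100
labels = [1] * 10 + [0] * 990
scores = [1.0 - i / 1000 for i in range(1000)]
ef = enrichment_factor(labels, scores, frac=0.01)
```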
If you use NegBioDB in your research, please cite:

```bibtex
@misc{negbiodb2026,
  title={NegBioDB: A Negative Results Database and Dual ML/LLM Benchmark for Biomedical Sciences},
  author={Jang, James},
  year={2026},
  url={https://github.com/jang1563/NegBioDB}
}
```

CC BY-SA 4.0 — see LICENSE for details. This license is required by the viral (share-alike) clause in ChEMBL's CC BY-SA 3.0 license.