A Nextflow pipeline for downloading and preprocessing single-cell RNA-seq datasets from the GEMMA database. The pipeline downloads expression matrices, cell type assignments, and sample metadata, then processes them into standardized AnnData (H5AD) format.
- Downloads single-cell expression data in MEX format from GEMMA
- Retrieves cell type assignments and sample metadata via GEMMA API
- Standardizes gene names using ENSEMBL ID mapping
- Supports two processing modes:
  - Combined mode: All samples merged into a single H5AD per study
  - Sample mode: Each sample processed into a separate H5AD file
- SLURM cluster support with conda environment management
The pipeline requires a conda environment with:
- scanpy
- pandas
- anndata
- scipy
- gemmapy
GEMMA API credentials must be set as environment variables:
export GEMMA_USERNAME="your_username"
export GEMMA_PASSWORD="your_password"

The study names file is a text file with one GEMMA study ID per line:
GSE237718
GSE180670
Velmeshev-2019.1
Example files are provided:
- study_names_human.txt: Human studies
- study_names_mouse.txt: Mouse studies
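Since --study_names accepts either a file of study IDs or a single ID, the resolution logic can be sketched as below. This is an illustration only, not the pipeline's own code; the function name `resolve_study_names` is hypothetical, and treating any non-file value as a study ID is an assumption.

```python
from pathlib import Path


def resolve_study_names(value):
    """Return study IDs from a file path, or treat the value as a single ID.

    Hypothetical helper: the pipeline's actual handling may differ.
    """
    path = Path(value)
    if path.is_file():
        # One study ID per line; skip blank lines and surrounding whitespace.
        lines = path.read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    # Assumption: anything that is not an existing file is itself a study ID.
    return [value.strip()]
```

Calling `resolve_study_names("GSE237718")` when no file of that name exists returns a one-element list, while passing a path such as study_names_human.txt returns every non-empty line of that file.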
The gene mapping file is located at meta/gemma_genes.tsv and contains ENSEMBL_ID to OFFICIAL_SYMBOL mappings.
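A mapping file of this kind can be read into a dictionary with the standard library. The sketch below assumes the file is tab-separated with a header row whose column names are exactly ENSEMBL_ID and OFFICIAL_SYMBOL; the function name `load_gene_mapping` is hypothetical.

```python
import csv


def load_gene_mapping(path):
    """Read a tab-separated mapping file into {ENSEMBL_ID: OFFICIAL_SYMBOL}.

    Assumes a header row with ENSEMBL_ID and OFFICIAL_SYMBOL columns.
    """
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            mapping[row["ENSEMBL_ID"]] = row["OFFICIAL_SYMBOL"]
    return mapping
```

If the real file uses different header names or has no header, the column lookups would need to be adjusted accordingly.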
# Using a study names file
nextflow run main.nf -profile conda \
    --study_names study_names_human.txt

# Using a direct study ID
nextflow run main.nf -profile conda \
    --study_names GSE237718

# Using an existing studies directory
nextflow run main.nf -profile conda \
    --study_paths /path/to/existing/studies

# Combined mode (default) - one H5AD per study
nextflow run main.nf -profile conda \
    --study_names study_names_human.txt \
    --process_samples false

# Sample mode - one H5AD per sample
nextflow run main.nf -profile conda \
    --study_names study_names_human.txt \
    --process_samples true

# Author-submitted cell types, resuming a previous run
nextflow run main.nf -profile conda \
    --study_names study_names_human.txt \
    --author_submitted true \
    -resume

| Parameter | Description | Default |
|---|---|---|
| --study_names | Path to file with study IDs (one per line), or a single study ID | null |
| --study_file | Comma- or space-separated study IDs (inline) | null |
| --study_paths | Path to pre-downloaded studies directory | null |
| --process_samples | Process each sample separately (true) or combined (false) | false |
| --author_submitted | Use author-submitted cell types | false |
| --gene_mapping | Path to gene mapping TSV | meta/gemma_genes.tsv |
| --outdir | Output directory | Auto-generated: {study_names}_author_{author_submitted}_process_samples_{process_samples} |

Note: You must provide exactly one of --study_names, --study_file, or --study_paths.
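The "exactly one input" rule and the inline ID format of --study_file can be sketched as follows. Both helper names (`select_input`, `split_inline_ids`) are hypothetical and the error wording is illustrative; the pipeline itself may enforce this differently, for example inside main.nf.

```python
import re


def select_input(study_names=None, study_file=None, study_paths=None):
    """Return (option_name, value) for the single input option that is set."""
    options = {
        "study_names": study_names,
        "study_file": study_file,
        "study_paths": study_paths,
    }
    provided = [(name, value) for name, value in options.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            "Provide exactly one of --study_names, --study_file, or --study_paths "
            f"(got {len(provided)})"
        )
    return provided[0]


def split_inline_ids(value):
    """Split comma- or space-separated study IDs, dropping empty pieces."""
    return [part for part in re.split(r"[,\s]+", value.strip()) if part]
```

For example, `split_inline_ids("GSE237718, GSE180670")` yields the two study IDs as separate list entries, and calling `select_input()` with no arguments or with two options set raises `ValueError`.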
{outdir}/
├── mex/                        # Raw MEX format data
│   └── {study}/
│       └── {sample}/
│           ├── matrix.mtx.gz
│           ├── features.tsv.gz
│           └── barcodes.tsv.gz
├── cell_type_assignments/      # Cell type annotations
│   └── {study}.celltypes.tsv
├── metadata/                   # Sample metadata (raw)
│   └── {study}/
│       └── {organism}/
│           └── {study}_sample_meta.tsv
├── metadata_standardized/      # Column-renamed metadata
│   └── {study}_sample_meta.tsv
├── unique_cells/               # Cell type count summaries
│   └── {study}/
│       └── {study}_unique_cells.tsv
├── h5ad/                       # Processed AnnData files
│   └── {study}/
│       └── {study}.h5ad        # or {study}_{sample}.h5ad
└── small_samples/              # Samples with <50 cells
    └── {study}/
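The H5AD naming in the layout above ({study}.h5ad in combined mode, {study}_{sample}.h5ad in sample mode) can be expressed as a small path helper. This is a sketch for locating outputs downstream; `h5ad_path` is a hypothetical name and not part of the pipeline.

```python
from pathlib import PurePosixPath


def h5ad_path(outdir, study, sample=None):
    """Build the expected H5AD location under {outdir}/h5ad/{study}/.

    sample=None corresponds to combined mode; otherwise sample mode.
    """
    name = f"{study}.h5ad" if sample is None else f"{study}_{sample}.h5ad"
    return PurePosixPath(outdir) / "h5ad" / study / name
```

Here `h5ad_path("results", "GSE237718")` points at the combined file for that study, and passing a sample identifier as the third argument points at the per-sample file instead.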
Each H5AD file is in AnnData format and contains:
- X: Sparse expression matrix (CSR format, raw counts)
- obs: Cell metadata (sample_id, cell_type, cell_type_uri, region, sex, dev_stage, organism, assay, donor_id, plus study-specific columns)
- var: index = ENSEMBL_ID, feature_name = OFFICIAL_SYMBOL
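A quick sanity check on a loaded file is to confirm that the documented obs columns are present. The sketch below works on any iterable of column names, so it does not depend on anndata itself; the helper name `missing_obs_columns` is hypothetical.

```python
# Column names taken from the obs description in this README.
EXPECTED_OBS_COLUMNS = (
    "sample_id", "cell_type", "cell_type_uri", "region", "sex",
    "dev_stage", "organism", "assay", "donor_id",
)


def missing_obs_columns(columns):
    """Return the documented obs columns that are absent, in documented order."""
    present = set(columns)
    return [name for name in EXPECTED_OBS_COLUMNS if name not in present]
```

With anndata installed, this could be applied as `missing_obs_columns(adata.obs.columns)` after `adata = anndata.read_h5ad(path)`; an empty list means every documented column is there, and study-specific extras are ignored.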
Each cell type assignment file ({study}.celltypes.tsv) has the following columns:

| Column | Description |
|---|---|
| sample_id | Sample identifier |
| cell_id | Cell barcode |
| cell_type | Assigned cell type |
| cell_type_uri | Ontology URI |
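
A table with these columns can be tallied into per-sample cell type counts with the standard library. This is an illustration only: it assumes a tab-separated file with a header row using the column names above, and it does not claim to reproduce the exact format of the unique_cells/ summaries. The function name `count_cell_types` is hypothetical.

```python
import csv
from collections import Counter


def count_cell_types(path):
    """Count cells per (sample_id, cell_type) pair in a cell type assignment TSV."""
    counts = Counter()
    with open(path, newline="") as handle:
        # Each row is one cell; the header supplies the column names.
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[(row["sample_id"], row["cell_type"])] += 1
    return counts
```

The result behaves like a dictionary keyed by (sample_id, cell_type), so `counts.most_common()` lists the largest groups first.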

Sample metadata files contain sample characteristics and factor values, including:
- organism_part / region
- sex
- developmental_stage
- disease
- assay
1. Input Handling
   └── Read study names OR discover from existing directory
2. Data Download
   ├── Download MEX expression matrices (gemma-cli-staging)
   ├── Download cell type assignments (GEMMA API)
   └── Extract sample metadata (gemmapy)
3. Processing
   ├── Combined mode: Merge all samples per study
   └── Sample mode: Process each sample separately
4. Output Generation
   ├── Standardize gene names
   ├── Integrate cell/sample metadata
   └── Export to H5AD format
The pipeline is configured for SLURM execution:
- Queue size: 90 concurrent jobs
- CPUs per task: 10
- Cluster options: -C thrd64 --cpus-per-task=10
To modify, edit nextflow.config:
process {
    executor = 'slurm'
    clusterOptions = '-C thrd64 --cpus-per-task=10'
}

// queueSize is an executor-scope setting in Nextflow, not a process directive
executor {
    queueSize = 90
}

Use the -resume flag to continue from the last successful checkpoint:
nextflow run main.nf -profile conda --study_names study_names.txt -resume

Ensure the GEMMA_USERNAME and GEMMA_PASSWORD environment variables are set correctly before running the pipeline.
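One way to catch a missing credential before launching a run is a small pre-flight check. The sketch below only tests that the two variables are set and non-empty; it does not contact GEMMA or verify that the values are valid. The function name `missing_gemma_credentials` is hypothetical.

```python
import os


def missing_gemma_credentials(environ=None):
    """Return the names of required GEMMA variables that are unset or empty."""
    env = os.environ if environ is None else environ
    required = ("GEMMA_USERNAME", "GEMMA_PASSWORD")
    return [name for name in required if not env.get(name)]
```

Calling it with no arguments inspects the current process environment; an empty list means both variables are present.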
Samples with fewer than 50 cells are automatically moved to the small_samples/ directory.
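The 50-cell threshold can be illustrated with a simple partition over per-sample cell counts. This is a sketch of the rule as stated, not the pipeline's implementation; `split_small_samples` and the shape of its input are assumptions.

```python
MIN_CELLS = 50  # threshold stated in this README


def split_small_samples(cell_counts, min_cells=MIN_CELLS):
    """Partition {sample_id: n_cells} into (kept, small) dictionaries.

    Samples with fewer than min_cells cells go to `small`; a sample with
    exactly min_cells cells is kept.
    """
    kept = {s: n for s, n in cell_counts.items() if n >= min_cells}
    small = {s: n for s, n in cell_counts.items() if n < min_cells}
    return kept, small
```

The `small` dictionary then corresponds to what would end up under small_samples/, while `kept` holds the samples that continue through processing.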
Genes without ENSEMBL ID mappings will retain their original identifiers.
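The fallback behaviour amounts to a dictionary lookup that defaults to the input value. The sketch below is deliberately agnostic about the direction of the mapping, since the exact lookup the pipeline performs is not specified here; `standardize_ids` is a hypothetical name.

```python
def standardize_ids(gene_ids, mapping):
    """Map each identifier through `mapping`, keeping unmapped ones unchanged."""
    return [mapping.get(gene_id, gene_id) for gene_id in gene_ids]
```

Because unmapped identifiers pass through untouched, the output always has the same length and order as the input, which keeps it aligned with the rows of a features table.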
This project is developed for research purposes at UBC.
For questions about GEMMA data, visit: https://gemma.msl.ubc.ca/