
Add python-chebi-utils library: downloader, OBO/SDF extractors, stratified splits, tests, CI #1

Merged
sfluegel05 merged 6 commits into main from copilot/add-chebi-data-generator
Feb 25, 2026

Conversation

Contributor

Copilot AI commented Feb 25, 2026

Implements a Python library for generating data products from ChEBI — downloading source files, extracting ontology classes/relations and molecules, and producing stratified dataset splits.

Package: chebi_utils/

  • downloader.py — download_chebi_obo(version, ...) / download_chebi_sdf(version, ...) fetch versioned releases from the EBI FTP archive via urllib. Versions below 245 resolve to the legacy archive path (chebi_legacy/archive/rel{version}/...); versions 245 and above use the modern path (archive/rel{version}/...).
  • obo_extractor.py — Uses the fastobo library to parse ChEBI OBO files and exposes build_chebi_graph(), which returns a networkx.DiGraph. xref: lines are stripped before parsing. Only non-obsolete CHEBI-prefixed terms are included. Nodes are string CHEBI IDs (e.g. "123" for CHEBI:123) with attributes name, smiles (extracted from PropertyValueClause using both the legacy chebi/smiles and modern chemrof:smiles_string properties, with fallback to SMILES synonym entries), and subset. Edges carry a relation attribute and represent is_a (child → parent) and has_part (whole → part) relationships.
  • sdf_extractor.py — extract_molecules() reads plain .sdf or gzipped .sdf.gz and returns a DataFrame with chebi_id, name, smiles, inchi, inchikey, formula, charge, mass, and mol columns. Each molecule's connection table is parsed into an RDKit Mol object stored in the mol column. Parsing uses MolFromMolBlock(sanitize=False, removeHs=False), followed by _update_mol_valences() (sets NoImplicit=True on all atoms) and Chem.SanitizeMol with flags FINDRADICALS | KEKULIZE | SETAROMATICITY | SETCONJUGATION | SETHYBRIDIZATION | SYMMRINGS. Molecules that fail to parse yield None with a warning.
  • splitter.py — create_splits() produces reproducible random or stratified train/val/test splits from any DataFrame.
from chebi_utils import download_chebi_obo, build_chebi_graph, extract_molecules, create_splits

obo = download_chebi_obo(version=250, dest_dir="data/")
graph = build_chebi_graph(obo)
# graph.nodes["1"] -> {"name": "...", "smiles": "...", "subset": "..."}
# graph.edges[("1", "2")] -> {"relation": "is_a"}

sdf = download_chebi_sdf(version=250, dest_dir="data/")
molecules = extract_molecules(sdf)      # chebi_id, name, smiles, inchi, …, mol (RDKit Mol)

splits = create_splits(molecules, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1,
                       stratify_col="formula", seed=42)
# splits["train"], splits["val"], splits["test"]
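As orientation, the behaviour of create_splits can be sketched by a hypothetical re-implementation (the function name `create_splits_sketch`, the rounding of group sizes, and the groupby-based stratification are assumptions; the actual splitter.py may differ):

```python
import numpy as np
import pandas as pd

def create_splits_sketch(df, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1,
                         stratify_col=None, seed=42):
    """Illustrative splitter: shuffle each stratum with a seeded RNG and
    slice it by the requested ratios, so runs are reproducible and the
    per-stratum proportions are preserved."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    rng = np.random.default_rng(seed)
    groups = df.groupby(stratify_col, sort=False) if stratify_col else [(None, df)]
    parts = {"train": [], "val": [], "test": []}
    for _, group in groups:
        order = rng.permutation(len(group))
        n_train = int(round(train_ratio * len(group)))
        n_val = int(round(val_ratio * len(group)))
        parts["train"].append(group.iloc[order[:n_train]])
        parts["val"].append(group.iloc[order[n_train:n_train + n_val]])
        parts["test"].append(group.iloc[order[n_train + n_val:]])
    return {name: pd.concat(chunks).reset_index(drop=True)
            for name, chunks in parts.items()}
```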

Tests (tests/, 58 tests)

  • Fixture files tests/fixtures/sample.obo and tests/fixtures/sample.sdf used throughout
  • Covers: versioned URL generation for both legacy and modern releases, URL routing and directory creation (via urlretrieve mock), OBO graph construction (node/edge count, string IDs, node attributes, obsolete term exclusion, xref robustness), SDF plain+gzip parsing, RDKit Mol object presence and atom counts, None-on-failure behaviour, split ratio validation, reproducibility, no-overlap guarantees, and stratification proportions

CI/CD (.github/workflows/ci.yml)

Two jobs on every push/PR:

  • lint — ruff format --check + ruff check
  • test — pytest matrix across Python 3.10, 3.11, 3.12

Configuration (pyproject.toml)

Hatchling build backend; runtime deps: fastobo>=0.14, networkx>=3.0, numpy>=1.24, pandas>=2.0, rdkit>=2022.09; dev extras add pytest and ruff.
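For reference, a pyproject.toml matching that description might look like the following sketch (package name, version number, and exact table layout are assumptions, not the repository's actual file):

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "python-chebi-utils"   # assumed name, taken from the PR title
version = "0.1.0"             # assumed
requires-python = ">=3.10"
dependencies = [
    "fastobo>=0.14",
    "networkx>=3.0",
    "numpy>=1.24",
    "pandas>=2.0",
    "rdkit>=2022.09",
]

[project.optional-dependencies]
dev = ["pytest", "ruff"]
```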

Original prompt

I need a python library that generates different data products from ChEBI. It should be able to download chebi.obo and chebi.sdf files from the ChEBI FTP server. Extract class relations and molecules, generate stratified train/val/test splits. Also include appropriate unittests and a CI/CD workflow that runs ruff and the unittests automatically



Copilot AI changed the title from "[WIP] Add CheBI data product generation library" to "Add python-chebi-utils library: downloader, OBO/SDF extractors, stratified splits, tests, CI" Feb 25, 2026
Copilot AI requested a review from sfluegel05 February 25, 2026 08:53
@sfluegel05
Contributor

@copilot Use the following download locations for chebi.obo and chebi.sdf. Accept a version parameter as input:

chebi.obo

if version < 245:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/chebi_legacy/archive/rel{version}/ontology/chebi.obo"
else:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel{version}/ontology/chebi.obo"

chebi.sdf

if version < 245:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/chebi_legacy/archive/rel{version}/SDF/chebi.sdf.gz"
else:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel{version}/SDF/chebi.sdf.gz"

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot Use the following download locations for chebi.obo and chebi.sdf. Accept a version parameter as input:

chebi.obo
...

Done in eb6a7fe. Both download_chebi_obo and download_chebi_sdf now require a version: int parameter. The URL is resolved via helper functions _chebi_obo_url / _chebi_sdf_url using the exact logic you specified (legacy archive path for versions < 245, modern path for ≥ 245). Tests cover both branches for each function.

@sfluegel05
Contributor

@copilot for processing the chebi ontology, use the fastobo library. Generate a networkx directed graph as a result.

Here are some code blocks working with fastobo as orientation:

with open(data_path, encoding="utf-8") as chebi:
    chebi = "\n".join(line for line in chebi if not line.startswith("xref:"))

elements = []
for term_doc in fastobo.loads(chebi):
    if (
        term_doc
        and isinstance(term_doc.id, fastobo.id.PrefixedIdent)
        and term_doc.id.prefix == "CHEBI"
    ):
        term_dict = term_callback(term_doc)
        if term_dict and (not self.subset or term_dict["subset"] == self.subset):
            elements.append(term_dict)
def term_callback(doc: "fastobo.term.TermFrame") -> Union[Dict, bool]:
    """
    Extracts information from a ChEBI term document.
    This function takes a ChEBI term document as input and extracts relevant information such as the term ID, parents,
    parts, name, and SMILES string. It returns a dictionary containing the extracted information.

    Args:
    - doc: A ChEBI term document.

    Returns:
    A dictionary containing the following keys:
    - "id": The ID of the ChEBI term.
    - "parents": A list of parent term IDs.
    - "has_part": A set of term IDs representing the parts of the ChEBI term.
    - "name": The name of the ChEBI term.
    - "smiles": The SMILES string associated with the ChEBI term, if available.
    """
    import fastobo

    parts = set()
    parents = []
    name = None
    smiles = None
    subset = None
    for clause in doc:
        if isinstance(clause, fastobo.term.PropertyValueClause):
            t = clause.property_value
            # chemrof:smiles_string is the new annotation property, chebi/smiles is the old one (see https://chembl.blogspot.com/2025/07/chebi-20-data-products.html)
            if (
                str(t.relation) == "chemrof:smiles_string"
                or str(t.relation) == "http://purl.obolibrary.org/obo/chebi/smiles"
            ):
                assert smiles is None
                smiles = t.value
        # in older chebi versions, smiles strings are synonyms
        # e.g. synonym: "[F-].[Na+]" RELATED SMILES [ChEBI]
        elif isinstance(clause, fastobo.term.SynonymClause):
            if "SMILES" in clause.raw_value():
                assert smiles is None
                smiles = clause.raw_value().split('"')[1]
        elif isinstance(clause, fastobo.term.RelationshipClause):
            if str(clause.typedef) == "has_part":
                parts.add(chebi_to_int(str(clause.term)))
        elif isinstance(clause, fastobo.term.IsAClause):
            parents.append(chebi_to_int(str(clause.term)))
        elif isinstance(clause, fastobo.term.NameClause):
            name = str(clause.name)
        elif isinstance(clause, fastobo.term.SubsetClause):
            subset = str(clause.subset)

        if isinstance(clause, fastobo.term.IsObsoleteClause):
            if clause.obsolete:
                # if the term document contains clause as obsolete as true, skips this document.
                return False

    return {
        "id": chebi_id_to_short_id(str(doc.id)), # "CHEBI:123" -> "123"
        "parents": parents,
        "has_part": parts,
        "name": name,
        "smiles": smiles,
        "subset": subset,
    }

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot for processing the chebi ontology, use the fastobo library. Generate a networkx directed graph as a result.

Here are some code blocks working with fastobo as orientation:
...

Done in 1df6eee. obo_extractor.py is now rewritten around fastobo:

  • xref: lines are stripped before loading to avoid parse errors
  • Each CHEBI term frame is processed by _term_data() using the same clause-dispatch pattern you shared; obsolete terms are skipped
  • SMILES is extracted from PropertyValueClause (both chemrof:smiles_string and the legacy chebi/smiles URI), with a fallback to SynonymClause entries containing "SMILES"
  • build_chebi_graph() returns a nx.DiGraph with integer CHEBI IDs as nodes (attributes: name, smiles, subset) and directed edges for is_a (child → parent) and has_part (whole → part)
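The graph assembly step can be sketched as follows (hypothetical helper name; it consumes term dictionaries shaped like term_callback's output, with IDs as strings per the final revision of the PR):

```python
import networkx as nx

def build_graph_sketch(term_dicts):
    """Illustrative assembly of the described DiGraph from term dicts
    with keys id, parents, has_part, name, smiles, subset."""
    g = nx.DiGraph()
    for t in term_dicts:
        g.add_node(t["id"], name=t["name"], smiles=t["smiles"], subset=t["subset"])
    for t in term_dicts:
        for parent in t["parents"]:
            g.add_edge(t["id"], parent, relation="is_a")      # child -> parent
        for part in t["has_part"]:
            g.add_edge(t["id"], part, relation="has_part")    # whole -> part
    # note: a plain nx.DiGraph keeps one edge per ordered node pair, so an
    # is_a and a has_part edge between the same pair would collide
    return g
```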

@sfluegel05
Contributor

@copilot When reading the sdf file, also parse the molecule into an rdkit Mol object.

Take this as an example:

mol = parse_molblock(molfile)
if mol is None:
    print(f"Failed to parse molfile for CHEBI:{ident}")
    continue
mol = update_mol_valences(mol)
Chem.SanitizeMol(
    mol,
    sanitizeOps=Chem.SanitizeFlags.SANITIZE_FINDRADICALS
    | Chem.SanitizeFlags.SANITIZE_KEKULIZE
    | Chem.SanitizeFlags.SANITIZE_SETAROMATICITY
    | Chem.SanitizeFlags.SANITIZE_SETCONJUGATION
    | Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION
    | Chem.SanitizeFlags.SANITIZE_SYMMRINGS,
    catchErrors=True,
)

@sfluegel05
Contributor

@copilot Make sure that chebi IDs are always stored as strings (even though the id is a number). E.g., "123" should still have the string datatype

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot When reading the sdf file, also parse the molecule into an rdkit Mol object.

Take this as an example:
...

Done in b2e747a. sdf_extractor.py now includes:

  • _update_mol_valences() — sets NoImplicit=True on all atoms to preserve explicit valences from the molfile
  • _parse_molblock() — calls MolFromMolBlock(sanitize=False, removeHs=False), then _update_mol_valences, then Chem.SanitizeMol with the exact flags from your example; emits a warnings.warn and returns None when parsing fails

Each row in the returned DataFrame now has a mol column containing the RDKit Mol object (or None for molecules that couldn't be parsed).
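The valence-freezing helper described above amounts to something like this sketch (illustrative only; the real _update_mol_valences may do more):

```python
from rdkit import Chem

def update_mol_valences_sketch(mol: Chem.Mol) -> Chem.Mol:
    # Work on a copy and disable implicit hydrogens on every atom, so the
    # valences written in the molfile are preserved as-is before sanitization.
    mol = Chem.Mol(mol)
    for atom in mol.GetAtoms():
        atom.SetNoImplicit(True)
    return mol
```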

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot Make sure that chebi IDs are always stored as strings (even though the id is a number). E.g., "123" should still have the string datatype

Done in 361ebd7. Renamed _chebi_id_to_int to _chebi_id_to_str, which now returns the numeric part as a str (e.g. "CHEBI:123" → "123"). All node IDs, edge endpoints, and type annotations in obo_extractor.py are updated accordingly.
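The renamed helper reduces to something like this sketch (public name chosen here for illustration):

```python
def chebi_id_to_str(chebi_id: str) -> str:
    # "CHEBI:123" -> "123"; the numeric part stays a str, never an int.
    return chebi_id.split(":", 1)[1]
```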

@sfluegel05 sfluegel05 marked this pull request as ready for review February 25, 2026 09:32
@sfluegel05 sfluegel05 merged commit 887e5d5 into main Feb 25, 2026
@sfluegel05 sfluegel05 deleted the copilot/add-chebi-data-generator branch February 25, 2026 09:33
