
Add python-chebi-utils library: downloader, OBO/SDF extractors, stratified splits, tests, CI #1

Merged
sfluegel05 merged 6 commits into main from copilot/add-chebi-data-generator
Feb 25, 2026

Conversation

Contributor

Copilot AI commented Feb 25, 2026

Implements a Python library for generating data products from ChEBI — downloading source files, extracting ontology classes/relations and molecules, and producing stratified dataset splits.

Package: chebi_utils/

  • downloader.py — download_chebi_obo(version, ...) / download_chebi_sdf(version, ...) fetch versioned releases from the EBI FTP archive via urllib. Versions below 245 resolve to the legacy archive path (chebi_legacy/archive/rel{version}/...); versions 245 and above use the modern path (archive/rel{version}/...).
  • obo_extractor.py — Uses the fastobo library to parse ChEBI OBO files and exposes build_chebi_graph(), which returns a networkx.DiGraph. xref: lines are stripped before parsing. Only non-obsolete CHEBI-prefixed terms are included. Nodes are string CHEBI IDs (e.g. "123" for CHEBI:123) with attributes name, smiles (extracted from PropertyValueClause using both the legacy chebi/smiles and modern chemrof:smiles_string properties, with fallback to SMILES synonym entries), and subset. Edges carry a relation attribute and represent is_a (child → parent) and has_part (whole → part) relationships.
  • sdf_extractor.py — extract_molecules() reads plain .sdf or gzipped .sdf.gz and returns a DataFrame with chebi_id, name, smiles, inchi, inchikey, formula, charge, mass, and mol columns. Each molecule's connection table is parsed into an RDKit Mol object stored in the mol column. Parsing uses MolFromMolBlock(sanitize=False, removeHs=False), followed by _update_mol_valences() (sets NoImplicit=True on all atoms) and Chem.SanitizeMol with flags FINDRADICALS | KEKULIZE | SETAROMATICITY | SETCONJUGATION | SETHYBRIDIZATION | SYMMRINGS. Molecules that fail to parse yield None with a warning.
  • splitter.py — create_splits() produces reproducible random or stratified train/val/test splits from any DataFrame.
from chebi_utils import download_chebi_obo, build_chebi_graph, extract_molecules, create_splits

obo = download_chebi_obo(version=250, dest_dir="data/")
graph = build_chebi_graph(obo)
# graph.nodes["1"] -> {"name": "...", "smiles": "...", "subset": "..."}
# graph.edges[("1", "2")] -> {"relation": "is_a"}

sdf = download_chebi_sdf(version=250, dest_dir="data/")
molecules = extract_molecules(sdf)      # chebi_id, name, smiles, inchi, …, mol (RDKit Mol)

splits = create_splits(molecules, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1,
                       stratify_col="formula", seed=42)
# splits["train"], splits["val"], splits["test"]
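As orientation, the behaviour of create_splits can be sketched by a hypothetical re-implementation (the function name `create_splits_sketch`, the rounding of group sizes, and the groupby-based stratification are assumptions; the actual splitter.py may differ):

```python
import numpy as np
import pandas as pd

def create_splits_sketch(df, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1,
                         stratify_col=None, seed=42):
    """Illustrative splitter: shuffle each stratum with a seeded RNG and
    slice it by the requested ratios, so runs are reproducible and the
    per-stratum proportions are preserved."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    rng = np.random.default_rng(seed)
    groups = df.groupby(stratify_col, sort=False) if stratify_col else [(None, df)]
    parts = {"train": [], "val": [], "test": []}
    for _, group in groups:
        order = rng.permutation(len(group))
        n_train = int(round(train_ratio * len(group)))
        n_val = int(round(val_ratio * len(group)))
        parts["train"].append(group.iloc[order[:n_train]])
        parts["val"].append(group.iloc[order[n_train:n_train + n_val]])
        parts["test"].append(group.iloc[order[n_train + n_val:]])
    return {name: pd.concat(chunks).reset_index(drop=True)
            for name, chunks in parts.items()}
```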

Tests (tests/, 58 tests)

  • Fixture files tests/fixtures/sample.obo and tests/fixtures/sample.sdf used throughout
  • Covers: versioned URL generation for both legacy and modern releases, URL routing and directory creation (via urlretrieve mock), OBO graph construction (node/edge count, string IDs, node attributes, obsolete term exclusion, xref robustness), SDF plain+gzip parsing, RDKit Mol object presence and atom counts, None-on-failure behaviour, split ratio validation, reproducibility, no-overlap guarantees, and stratification proportions

CI/CD (.github/workflows/ci.yml)

Two jobs on every push/PR:

  • lint — ruff format --check + ruff check
  • test — pytest matrix across Python 3.10, 3.11, 3.12

Configuration (pyproject.toml)

Hatchling build backend; runtime deps: fastobo>=0.14, networkx>=3.0, numpy>=1.24, pandas>=2.0, rdkit>=2022.09; dev extras add pytest and ruff.
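For reference, a pyproject.toml matching that description might look like the following sketch (package name, version number, and exact table layout are assumptions, not the repository's actual file):

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "python-chebi-utils"   # assumed name, taken from the PR title
version = "0.1.0"             # assumed
requires-python = ">=3.10"
dependencies = [
    "fastobo>=0.14",
    "networkx>=3.0",
    "numpy>=1.24",
    "pandas>=2.0",
    "rdkit>=2022.09",
]

[project.optional-dependencies]
dev = ["pytest", "ruff"]
```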

Original prompt

I need a python library that generates different data products from ChEBI. It should be able to download chebi.obo and chebi.sdf files from the ChEBI FTP server. Extract class relations and molecules, generate stratified train/val/test splits. Also include appropriate unittests and a CI/CD workflow that runs ruff and the unittests automatically



Copilot AI changed the title from "[WIP] Add CheBI data product generation library" to "Add python-chebi-utils library: downloader, OBO/SDF extractors, stratified splits, tests, CI" Feb 25, 2026
Copilot AI requested a review from sfluegel05 February 25, 2026 08:53
@sfluegel05
Contributor

@copilot Use the following download locations for chebi.obo and chebi.sdf. Accept a version parameter as input:

chebi.obo

if version < 245:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/chebi_legacy/archive/rel{version}/ontology/chebi.obo"
else:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel{version}/ontology/chebi.obo"

chebi.sdf

if version < 245:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/chebi_legacy/archive/rel{version}/SDF/chebi.sdf.gz"
else:
    url = f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel{version}/SDF/chebi.sdf.gz"

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot Use the following download locations for chebi.obo and chebi.sdf. Accept a version parameter as input:

chebi.obo
...

Done in eb6a7fe. Both download_chebi_obo and download_chebi_sdf now require a version: int parameter. The URL is resolved via helper functions _chebi_obo_url / _chebi_sdf_url using the exact logic you specified (legacy archive path for versions < 245, modern path for ≥ 245). Tests cover both branches for each function.

@sfluegel05
Contributor

@copilot for processing the chebi ontology, use the fastobo library. Generate a networkx directed graph as a result.

Here are some code blocks working with fastobo as orientation:

with open(data_path, encoding="utf-8") as chebi:
    chebi = "\n".join(line for line in chebi if not line.startswith("xref:"))

elements = []
for term_doc in fastobo.loads(chebi):
    if (
        term_doc
        and isinstance(term_doc.id, fastobo.id.PrefixedIdent)
        and term_doc.id.prefix == "CHEBI"
    ):
        term_dict = term_callback(term_doc)
        if term_dict and (not self.subset or term_dict["subset"] == self.subset):
            elements.append(term_dict)
def term_callback(doc: "fastobo.term.TermFrame") -> Union[Dict, bool]:
    """
    Extracts information from a ChEBI term document.
    This function takes a ChEBI term document as input and extracts relevant information such as the term ID, parents,
    parts, name, and SMILES string. It returns a dictionary containing the extracted information.

    Args:
    - doc: A ChEBI term document.

    Returns:
    A dictionary containing the following keys:
    - "id": The ID of the ChEBI term.
    - "parents": A list of parent term IDs.
    - "has_part": A set of term IDs representing the parts of the ChEBI term.
    - "name": The name of the ChEBI term.
    - "smiles": The SMILES string associated with the ChEBI term, if available.
    """
    import fastobo

    parts = set()
    parents = []
    name = None
    smiles = None
    subset = None
    for clause in doc:
        if isinstance(clause, fastobo.term.PropertyValueClause):
            t = clause.property_value
            # chemrof:smiles_string is the new annotation property, chebi/smiles is the old one (see https://chembl.blogspot.com/2025/07/chebi-20-data-products.html)
            if (
                str(t.relation) == "chemrof:smiles_string"
                or str(t.relation) == "http://purl.obolibrary.org/obo/chebi/smiles"
            ):
                assert smiles is None
                smiles = t.value
        # in older chebi versions, smiles strings are synonyms
        # e.g. synonym: "[F-].[Na+]" RELATED SMILES [ChEBI]
        elif isinstance(clause, fastobo.term.SynonymClause):
            if "SMILES" in clause.raw_value():
                assert smiles is None
                smiles = clause.raw_value().split('"')[1]
        elif isinstance(clause, fastobo.term.RelationshipClause):
            if str(clause.typedef) == "has_part":
                parts.add(chebi_to_int(str(clause.term)))
        elif isinstance(clause, fastobo.term.IsAClause):
            parents.append(chebi_to_int(str(clause.term)))
        elif isinstance(clause, fastobo.term.NameClause):
            name = str(clause.name)
        elif isinstance(clause, fastobo.term.SubsetClause):
            subset = str(clause.subset)

        if isinstance(clause, fastobo.term.IsObsoleteClause):
            if clause.obsolete:
                # if the term document contains clause as obsolete as true, skips this document.
                return False

    return {
        "id": chebi_id_to_short_id(str(doc.id)), # "CHEBI:123" -> "123"
        "parents": parents,
        "has_part": parts,
        "name": name,
        "smiles": smiles,
        "subset": subset,
    }

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot for processing the chebi ontology, use the fastobo library. Generate a networkx directed graph as a result.

Here are some code blocks working with fastobo as orientation:
...

Done in 1df6eee. obo_extractor.py is now rewritten around fastobo:

  • xref: lines are stripped before loading to avoid parse errors
  • Each CHEBI term frame is processed by _term_data() using the same clause-dispatch pattern you shared; obsolete terms are skipped
  • SMILES is extracted from PropertyValueClause (both chemrof:smiles_string and the legacy chebi/smiles URI), with a fallback to SynonymClause entries containing "SMILES"
  • build_chebi_graph() returns a nx.DiGraph with integer CHEBI IDs as nodes (attributes: name, smiles, subset) and directed edges for is_a (child → parent) and has_part (whole → part)
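The graph assembly step can be sketched as follows (hypothetical helper name; it consumes term dictionaries shaped like term_callback's output, with IDs as strings per the final revision of the PR):

```python
import networkx as nx

def build_graph_sketch(term_dicts):
    """Illustrative assembly of the described DiGraph from term dicts
    with keys id, parents, has_part, name, smiles, subset."""
    g = nx.DiGraph()
    for t in term_dicts:
        g.add_node(t["id"], name=t["name"], smiles=t["smiles"], subset=t["subset"])
    for t in term_dicts:
        for parent in t["parents"]:
            g.add_edge(t["id"], parent, relation="is_a")      # child -> parent
        for part in t["has_part"]:
            g.add_edge(t["id"], part, relation="has_part")    # whole -> part
    # note: a plain nx.DiGraph keeps one edge per ordered node pair, so an
    # is_a and a has_part edge between the same pair would collide
    return g
```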

@sfluegel05
Contributor

@copilot When reading the sdf file, also parse the molecule into an rdkit Mol object.

Take this as an example:

mol = parse_molblock(molfile)
if mol is None:
    print(f"Failed to parse molfile for CHEBI:{ident}")
    continue
mol = update_mol_valences(mol)
Chem.SanitizeMol(
    mol,
    sanitizeOps=Chem.SanitizeFlags.SANITIZE_FINDRADICALS
    | Chem.SanitizeFlags.SANITIZE_KEKULIZE
    | Chem.SanitizeFlags.SANITIZE_SETAROMATICITY
    | Chem.SanitizeFlags.SANITIZE_SETCONJUGATION
    | Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION
    | Chem.SanitizeFlags.SANITIZE_SYMMRINGS,
    catchErrors=True,
)

@sfluegel05
Contributor

@copilot Make sure that chebi IDs are always stored as strings (even though the id is a number). E.g., "123" should still have the string datatype

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot When reading the sdf file, also parse the molecule into an rdkit Mol object.

Take this as an example:
...

Done in b2e747a. sdf_extractor.py now includes:

  • _update_mol_valences() — sets NoImplicit=True on all atoms to preserve explicit valences from the molfile
  • _parse_molblock() — calls MolFromMolBlock(sanitize=False, removeHs=False), then _update_mol_valences, then Chem.SanitizeMol with the exact flags from your example; emits a warnings.warn and returns None when parsing fails

Each row in the returned DataFrame now has a mol column containing the RDKit Mol object (or None for molecules that couldn't be parsed).
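The valence-freezing helper described above amounts to something like this sketch (illustrative only; the real _update_mol_valences may do more):

```python
from rdkit import Chem

def update_mol_valences_sketch(mol: Chem.Mol) -> Chem.Mol:
    # Work on a copy and disable implicit hydrogens on every atom, so the
    # valences written in the molfile are preserved as-is before sanitization.
    mol = Chem.Mol(mol)
    for atom in mol.GetAtoms():
        atom.SetNoImplicit(True)
    return mol
```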

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot Make sure that chebi IDs are always stored as strings (even though the id is a number). E.g., "123" should still have the string datatype

Done in 361ebd7. Renamed _chebi_id_to_int to _chebi_id_to_str, which now returns the numeric part as a str (e.g. "CHEBI:123" → "123"). All node IDs, edge endpoints, and type annotations in obo_extractor.py are updated accordingly.
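The renamed helper reduces to something like this sketch (public name chosen here for illustration):

```python
def chebi_id_to_str(chebi_id: str) -> str:
    # "CHEBI:123" -> "123"; the numeric part stays a str, never an int.
    return chebi_id.split(":", 1)[1]
```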

@sfluegel05 sfluegel05 marked this pull request as ready for review February 25, 2026 09:32
@sfluegel05 sfluegel05 merged commit 887e5d5 into main Feb 25, 2026
@sfluegel05 sfluegel05 deleted the copilot/add-chebi-data-generator branch February 25, 2026 09:33
