
SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow

This repository provides an open-source implementation of Schema-Miner Pro.

📋 Schema-Miner Pro Overview

Schema-Miner is a novel framework that leverages Large Language Models (LLMs) and continuous human feedback to automate and enhance the schema mining task. Through an iterative process, the framework uses LLMs to extract and organize properties from unstructured text and refines schemas with expert input (see the ESWC 2025 proceedings paper cited below). Schema-Miner Pro extends Schema-Miner with an ontology grounding component powered by agentic AI: it performs multi-step reasoning using lexical heuristics and semantic similarity search, and grounds schema elements in formal ontologies (e.g., QUDT). Comprehensive documentation for Schema-Miner Pro, including detailed guides and examples, is available at schema-miner.readthedocs.io.

Note

Schema-Miner implements a three-stage pipeline for schema discovery and refinement without ontology grounding (see Figure 1). Schema-Miner Pro extends this pipeline by grounding the discovered schemas to formal ontologies.

Figure 1: Overview of the LLMs4SchemaDiscovery workflow implemented in the SCHEMA-MINER tool. Stage 1 generates an initial process schema using domain specifications, while Stage 2 refines this schema using a small, curated scientific corpus. In Stage 3, the schema is further enriched using a larger, non-curated corpus. The final stage involves grounding the properties in formal ontologies.

🧪 Installation

Install the package directly from PyPI using pip:

pip install schema-miner

If you are working with the source code directly, install dependencies from requirements.txt:

git clone https://github.com/sciknoworg/schema-miner.git
cd schema-miner
pip install -r requirements.txt

Important

Before running schema-miner for the first time, configure your environment by copying .env.example to .env and filling in your values. See the Configuration section below.

βš™οΈ Configuration

Schema-Miner is configured entirely through a .env file in the project root. Copy the provided template and fill in your values:

cp .env.example .env

🤖 Model Configuration

Select your LLM provider and model, then fill in only the credentials block for that provider:

# Active provider — options: OPENAI | SAIA | OLLAMA | HUGGINGFACE
# Use SAIA for any OpenAI-compatible endpoint (GWDG/SAIA, OpenRouter, etc.)
LLM_PROVIDER = '<Your LLM provider here>'
LLM_MODEL = '<Your model here>'                          # e.g. mistral-large-3-675b-instruct-2512, gemma-3-27b-it

# OpenAI
OPENAI_API_KEY = '<your-openai-api-key>'
OPENAI_ORGANIZATION_ID = '<your-openai-organization-id>' # Optional, only needed if you have multiple organizations in OpenAI

# SAIA / Any OpenAI-compatible endpoint
# Schema-Miner supports any service exposing an OpenAI-compatible API.
# Provide your API key and the base URL for your preferred provider.
SAIA_API_KEY = '<your-api-key>'                         # SAIA key  OR  OpenRouter key 
SAIA_BASE_URL = 'https://chat-ai.academiccloud.de/v1'   # GWDG/SAIA (Germany) | or use https://openrouter.ai/api/v1 for OpenRouter

# Ollama  (leave blank if running locally on the same machine)
OLLAMA_BASE_URL = '<OLLAMA Server Base URL>'

# HuggingFace
HuggingFace_Access_Token = '<your-huggingface-access-token>'
HUGGINGFACE_USE_LOCAL = False                            # True = load model locally (GPU recommended) | False = use Inference API

Supported LLM Providers

Schema-Miner supports any service that exposes an OpenAI-compatible API via the SAIA provider type: just supply your API key and the service's base URL.

| Provider | LLM_PROVIDER value | Example models | Notes |
|---|---|---|---|
| OpenAI | OPENAI | gpt-4o, o3-mini | Requires OPENAI_API_KEY |
| GWDG / SAIA | SAIA | gemma-3-27b-it, qwen3-30b-a3b-instruct-2507 | OpenAI-compatible; set SAIA_BASE_URL = https://chat-ai.academiccloud.de/v1 |
| OpenRouter | SAIA | qwen/qwen3-235b-a22b-2507, anthropic/claude-sonnet-4.6 | OpenAI-compatible; set SAIA_BASE_URL = https://openrouter.ai/api/v1; see openrouter.ai/docs |
| Ollama | OLLAMA | llama3.2:3b, ministral-3:3b | Local or remote server; no API key needed |
| HuggingFace | HUGGINGFACE | Qwen/Qwen3-4B-Instruct-2507 | Local GPU mode or serverless Inference API; requires HuggingFace_Access_Token |
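To make "OpenAI-compatible" concrete: every provider in the table accepts the same POST /chat/completions request shape, so only the base URL and API key change between providers. The sketch below assembles such a request; the function name is illustrative and is not part of the Schema-Miner API.

```python
# Illustrative sketch (not Schema-Miner code): the common OpenAI-style
# chat request that all providers in the table above accept.
import json

def build_chat_request(base_url: str, api_key: str, model: str, user_msg: str):
    """Assemble the URL, headers, and JSON body for an OpenAI-style chat call."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    })
    return url, headers, body

# Swapping providers only changes the first two arguments.
url, headers, body = build_chat_request(
    "https://chat-ai.academiccloud.de/v1", "<your-api-key>",
    "gemma-3-27b-it", "Hello",
)
print(url)
```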

Note

HuggingFace local mode (HUGGINGFACE_USE_LOCAL = True) downloads and runs the model on your machine. A CUDA-compatible GPU is strongly recommended for models larger than 1B parameters. For CPU-only machines, use the Inference API (HUGGINGFACE_USE_LOCAL = False) instead.

🔬 Process Configuration

Define the scientific process whose schema you want to discover:

PROCESS_NAME = '<your-process-name>'
PROCESS_DESCRIPTION = '<a brief description of the process>'

These values are injected into every LLM prompt as scientific context.
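As a rough illustration of what "injected as scientific context" means, the snippet below prefixes a prompt with the two environment values. The build_context helper and the prompt layout are hypothetical, not Schema-Miner's actual prompt template.

```python
# Illustrative sketch (not Schema-Miner's actual prompting code): how
# PROCESS_NAME and PROCESS_DESCRIPTION could be woven into an LLM prompt.
import os

def build_context(prompt: str) -> str:
    """Prefix a prompt with the scientific process context from the environment."""
    name = os.getenv("PROCESS_NAME", "")
    description = os.getenv("PROCESS_DESCRIPTION", "")
    return f"Process: {name}\nDescription: {description}\n\n{prompt}"

# Example values (normally these come from your .env file).
os.environ["PROCESS_NAME"] = "Atomic Layer Deposition"
os.environ["PROCESS_DESCRIPTION"] = "A thin-film deposition technique."
print(build_context("List the key process parameters."))
```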

📂 Data Paths

Point schema-miner to your input documents. Only set the path for the stage you intend to run:

# Stage 1 — path to the process specification document (PDF or plain text)
STAGE1_SPECS_PATH = 'data/stage-1/my-process/specification.pdf'

# Stage 2 — directory containing curated research papers (PDF or plain text)
STAGE2_PAPERS_PATH = 'data/stage-2/my-process/papers/'

# Stage 3 — directory containing the broader paper corpus (PDF or plain text)
STAGE3_PAPERS_PATH = 'data/stage-3/my-process/papers/'
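A mistyped path is a common cause of a failed run, so a quick pre-flight check can save time. The helper below is an optional, illustrative snippet (check_stage_path is not part of Schema-Miner):

```python
# Illustrative pre-flight check: verify that the data paths configured in the
# environment actually exist on disk before starting a run.
import os
from pathlib import Path

def check_stage_path(env_var: str) -> bool:
    """Return True if the path stored in the given env variable exists."""
    value = os.getenv(env_var, "")
    return bool(value) and Path(value).exists()

for var in ("STAGE1_SPECS_PATH", "STAGE2_PAPERS_PATH", "STAGE3_PAPERS_PATH"):
    print(f"{var}: {'ok' if check_stage_path(var) else 'missing'}")
```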

📤 Output Configuration

Set the directory where extracted schemas will be saved:

RESULTS_PATH = 'results/my-run/'

🚀 Usage

Schema-Miner Pro supports two usage modes:

  1. Python SDK — programmatic access via function calls, ideal for notebooks and custom workflows
  2. CLI — command-line interface for direct stage execution without writing any Python

📓 Python SDK

For a quick start with the Python SDK, see the example notebooks provided in the repository.


🖥️ CLI Reference

Schema-Miner exposes a schema-miner command after installation. All configuration is read from the .env file — no Python code required.

schema-miner [OPTIONS]

Options

| Option | Values | Required when | Description |
|---|---|---|---|
| --stage | 1, 2, 3 | Mutually exclusive with --ontology-grounding | Run a schema extraction stage |
| --ontology-grounding | prompt, agentic | Mutually exclusive with --stage | Run ontology grounding |
| --schema | <path> | Stages 2, 3, and ontology grounding | Path to the input JSON schema file |
| --expert-feedback | <text or path> | Optional (stages 2 & 3) | Inline review text, or path to a .txt / .md file |
| --papers | N or all | Optional (stages 2 & 3) | Number of papers to process per batch (default: 1) |
| --version | – | – | Display the installed version and exit |
| --help | – | – | Show available options and exit |

🧩 Stage 1 — Initial Schema Mining

Generates an initial JSON schema from a process specification document using the configured LLM.

Prerequisite: set STAGE1_SPECS_PATH in .env (PDF or plain text file).

schema-miner --stage 1

Schema-Miner reads the specification document, queries the LLM, and saves the resulting JSON schema to RESULTS_PATH.
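Assuming the output follows JSON Schema conventions (the exact layout depends on your process and model), you can inspect the discovered properties with the standard library. The sample schema below is invented purely for illustration:

```python
# Inspect a Stage 1 result with the standard library. The sample JSON here
# is illustrative; in practice you would load the file from RESULTS_PATH.
import json

sample = '''
{
  "title": "MyProcess",
  "type": "object",
  "properties": {
    "temperature": {"type": "number", "description": "Process temperature"},
    "precursor":   {"type": "string", "description": "Precursor material"}
  }
}
'''
schema = json.loads(sample)          # in practice: json.load(open(path))
print(sorted(schema["properties"]))  # top-level properties discovered by the LLM
```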


🔄 Stage 2 — Preliminary Schema Refinement

Refines the Stage 1 schema using domain-expert feedback and a curated corpus of scientific papers.

Prerequisites: set STAGE2_PAPERS_PATH in .env; have a Stage 1 schema file available.

# Basic — process papers one by one (default) with no initial expert feedback; prompts for feedback after each paper
schema-miner --stage 2 --schema results/stage-1/mistral-large-3-675b-instruct-2512.json

# With inline expert feedback for the first batch; prompts for feedback after each subsequent paper
schema-miner --stage 2 --schema results/stage-1/mistral-large-3-675b-instruct-2512.json \
    --expert-feedback "Please add units for all temperature and pressure fields."

# With expert feedback from a file for the first batch; prompts for feedback after each subsequent paper
schema-miner --stage 2 --schema results/stage-1/mistral-large-3-675b-instruct-2512.json \
    --expert-feedback data/stage-2/reviews/mistral-large-3-675b-instruct-2512.txt

# Process papers in batches of 3 with no initial expert feedback; prompts for feedback after each batch
schema-miner --stage 2 --schema results/stage-1/mistral-large-3-675b-instruct-2512.json --papers 3

# Process all papers in a single batch with no expert feedback
schema-miner --stage 2 --schema results/stage-1/mistral-large-3-675b-instruct-2512.json --papers all

# Process papers in batches of 3 with inline expert feedback for the first batch; prompts for feedback after each subsequent batch
schema-miner --stage 2 --schema results/stage-1/mistral-large-3-675b-instruct-2512.json --papers 3 \
    --expert-feedback "Please add units for all temperature and pressure fields."

Schema-Miner iteratively processes the papers, updating the schema after each paper and optionally incorporating expert feedback. The intermediate schemas after each iteration and the final refined schema are saved to RESULTS_PATH.
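A file passed via --expert-feedback is ordinary free-form review text. The comments below are purely illustrative of the kind of feedback an expert might provide:

```text
- Merge "temp" and "temperature" into a single property and record SI units.
- Model "precursor" as an array; many processes use more than one precursor.
- Remove "lab_location": it is experiment metadata, not a process property.
```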


🏁 Stage 3 — Final Schema Refinement

Validates and finalises the schema using a larger, non-curated paper corpus and expert review, ensuring generalisability and semantic robustness.

Prerequisites: set STAGE3_PAPERS_PATH in .env; have a Stage 2 schema file available.

# Basic — process papers one by one (default) with no initial expert feedback; prompts for feedback after each paper
schema-miner --stage 3 --schema results/stage-2/mistral-large-3-675b-instruct-2512.json

# With inline expert feedback for the first batch; prompts for feedback after each subsequent paper
schema-miner --stage 3 --schema results/stage-2/mistral-large-3-675b-instruct-2512.json \
    --expert-feedback "Ensure all quantities reference standard SI units."

# With expert feedback from a file for the first batch; prompts for feedback after each subsequent paper
schema-miner --stage 3 --schema results/stage-2/mistral-large-3-675b-instruct-2512.json \
    --expert-feedback data/stage-3/reviews/mistral-large-3-675b-instruct-2512.txt

# Process papers in batches of 5 with no initial expert feedback; prompts for feedback after each batch
schema-miner --stage 3 --schema results/stage-2/mistral-large-3-675b-instruct-2512.json --papers 5

# Process all papers in a single batch with no expert feedback
schema-miner --stage 3 --schema results/stage-2/mistral-large-3-675b-instruct-2512.json --papers all

# Process papers in batches of 5 with inline expert feedback for the first batch; prompts for feedback after each subsequent batch
schema-miner --stage 3 --schema results/stage-2/mistral-large-3-675b-instruct-2512.json --papers 5 \
    --expert-feedback "Ensure all quantities reference standard SI units."

Schema-Miner processes the papers iteratively, updating the schema after each paper and optionally incorporating expert feedback. The intermediate schemas after each iteration and the final refined schema are saved to RESULTS_PATH.


🌐 Ontology Grounding with QUDT

Semantically grounds the discovered schema against the QUDT (Quantities, Units, Dimensions, and Data Types) ontology.

Two grounding methods are available:

| Method | --ontology-grounding value | Description |
|---|---|---|
| Prompt-based | prompt | Single LLM call per schema field; fast and lightweight |
| Agentic | agentic | Multi-step reasoning with lexical heuristics and semantic similarity search; higher accuracy |

# Prompt-based grounding
schema-miner --ontology-grounding prompt --schema results/stage-3/mistral-large-3-675b-instruct-2512.json

# Agentic grounding (recommended)
schema-miner --ontology-grounding agentic --schema results/stage-3/mistral-large-3-675b-instruct-2512.json

The grounded schema is saved to RESULTS_PATH.
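Conceptually, the agentic method combines the two signals named above: a cheap lexical heuristic tried first, with semantic similarity as a fallback. The toy sketch below illustrates that control flow only; the label list is a tiny made-up sample, string similarity stands in for embedding-based search, and none of this is Schema-Miner's actual implementation.

```python
# Toy illustration of lexical-then-semantic grounding. QUDT_LABELS is a tiny
# invented sample; real grounding searches the full QUDT ontology.
from difflib import SequenceMatcher

QUDT_LABELS = ["Temperature", "Pressure", "Thickness", "MassFlowRate"]

def ground_field(field_name: str) -> str:
    """Match a schema field name to the closest candidate ontology label."""
    normalized = field_name.lower().replace("_", "")
    # Step 1: lexical heuristic (exact match after normalisation).
    for label in QUDT_LABELS:
        if label.lower() == normalized:
            return label
    # Step 2: fallback using string similarity as a stand-in for
    # embedding-based semantic similarity search over ontology labels.
    def score(label: str) -> float:
        return SequenceMatcher(None, field_name.lower(), label.lower()).ratio()
    return max(QUDT_LABELS, key=score)

print(ground_field("temperature"))     # matched by the lexical heuristic
print(ground_field("film_thickness"))  # matched by the similarity fallback
```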

📚 Citing this Work

If you use this repository in your research or applications, please cite the following paper(s):

  • LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models:

    Sameer Sadruddin, Jennifer D’Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, SΓΆren Auer, Adrie Mackus, and Erwin Kessels. LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models. In The Semantic Web – ESWC 2025, Springer, Cham, pp. 244–261. https://doi.org/10.1007/978-3-031-94578-6_14

    📌 BibTeX

    @InProceedings{10.1007/978-3-031-94578-6_14,
      author    = {Sadruddin, Sameer and D'Souza, Jennifer and Poupaki, Eleni and Watkins, Alex and Babaei Giglou, Hamed and Rula, Anisa and Karasulu, Bora and Auer, S{\"o}ren and Mackus, Adrie and Kessels, Erwin},
      editor    = {Curry, Edward and Acosta, Maribel and Poveda-Villal{\'o}n, Maria and van Erp, Marieke and Ojo, Adegboyega and Hose, Katja and Shimizu, Cogan and Lisena, Pasquale},
      title     = {LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models},
      booktitle = {The Semantic Web},
      year      = {2025},
      publisher = {Springer Nature Switzerland},
      address   = {Cham},
      pages     = {244--261},
      isbn      = {978-3-031-94578-6},
    }
  • SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow

    Sameer Sadruddin, Jennifer D'Souza, Eleni Poupaki, Alex Watkins, Bora Karasulu, Sören Auer, Adrie Mackus, and Erwin Kessels. SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow. In Semantic Web Journal. https://www.semantic-web-journal.net/system/files/swj3871.pdf

    📌 BibTeX

    @Article{sadruddin2025schemaminerpro,
      author  = {Sadruddin, Sameer and D'Souza, Jennifer and Poupaki, Eleni and Watkins, Alex and Karasulu, Bora and Auer, S{\"o}ren and Mackus, Adrie and Kessels, Erwin},
      title   = {SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow},
      journal = {Semantic Web Journal},
      year    = {2025},
    }

👥 Contact & Contributions

We'd love to hear from you! Whether you're interested in collaborating on Schema-Miner Pro or have ideas to extend its capabilities, feel free to reach out:

  • Collaboration inquiries: Contact Jennifer D'Souza at jennifer.dsouza [at] tib.eu

  • Development questions or bug reports: Please open an issue right here in the repository or get in touch with the lead developer Sameer Sadruddin at sameer.sadruddin [at] tib.eu

Let's build better schema-mining tools, together!

📃 License

This work is licensed under the MIT License.