A production-grade, modular framework for building and querying knowledge graphs from heterogeneous documents with advanced RAG capabilities.
GraphBuilder-RAG extracts structured knowledge from documents, validates facts, builds versioned knowledge graphs, and provides hybrid retrieval with hallucination detection.
- Multi-format ingestion: HTML, PDF, CSV, JSON APIs
- Intelligent extraction: Rule-based + LLM-based triple extraction
- Fact validation: Ontology rules + external verification
- Versioned knowledge graph: Neo4j with full provenance tracking
- Hybrid retrieval: FAISS semantic search + Neo4j graph traversal
- Hallucination detection: GraphVerify for claim validation
- Self-healing agents: Auto-verification, conflict resolution, schema evolution
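
To make the rule-based extraction and ontology validation features more concrete, here is a minimal sketch. It is not the project's implementation: the `Triple` class, the two regex patterns, and the `ALLOWED_PREDICATES` set are all hypothetical names chosen for illustration, and a real rule set and ontology would be much larger.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: the document the triple was extracted from


# Hypothetical surface patterns (assumption, not the project's rule set).
PATTERNS = [
    (re.compile(r"(?P<s>[A-Z][\w ]*?) is an? (?P<o>[\w ]+?)[.,]"), "is_a"),
    (re.compile(r"(?P<s>[A-Z][\w ]*?) was founded by (?P<o>[A-Z][\w ]+?)[.,]"),
     "founded_by"),
]

# Hypothetical ontology: the only predicates the graph accepts.
ALLOWED_PREDICATES = {"is_a", "founded_by"}


def extract_triples(text, source):
    """Apply every pattern to the text and collect candidate triples."""
    triples = []
    for pattern, predicate in PATTERNS:
        for match in pattern.finditer(text):
            triples.append(
                Triple(match.group("s").strip(), predicate,
                       match.group("o").strip(), source)
            )
    return triples


def validate(triple):
    """Accept a triple only if it uses a known predicate and is not a self-loop."""
    return (
        triple.predicate in ALLOWED_PREDICATES
        and bool(triple.subject)
        and bool(triple.obj)
        and triple.subject != triple.obj
    )


candidates = extract_triples(
    "Aspirin is a medication. Acme was founded by Jane Doe.", source="demo.txt"
)
validated = [t for t in candidates if validate(t)]
```

Keeping `source` on every triple is what lets later stages trace a graph edge back to the document it came from.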
┌─────────────────┐
│    Ingestion    │ → MongoDB GridFS (raw docs)
└────────┬────────┘
         ↓
┌─────────────────┐
│  Normalization  │ → MongoDB (normalized_docs)
└────────┬────────┘
         ↓
┌─────────────────┐
│   Extraction    │ → MongoDB (candidate_triples)
│  DeepSeek 1.5B  │
└────────┬────────┘
         ↓
┌─────────────────┐
│   Validation    │ → MongoDB (validated_triples)
└────────┬────────┘
         ↓
┌─────────────────┐
│     Fusion      │ → Neo4j (knowledge graph)
└────────┬────────┘
         ↓
┌─────────────────────────────────┐
│         Query Pipeline          │
│  ┌──────────┐  ┌─────────────┐  │
│  │  FAISS   │  │    Neo4j    │  │
│  │ Semantic │  │    Graph    │  │
│  └────┬─────┘  └──────┬──────┘  │
│       └────────┬──────┘         │
│                ↓                │
│          ┌────────────┐         │
│          │   Prompt   │         │
│          │  Builder   │         │
│          └─────┬──────┘         │
│                ↓                │
│       ┌────────────────┐        │
│       │ Groq Llama 70B │        │
│       │   Reasoning    │        │
│       └────────┬───────┘        │
│                ↓                │
│       ┌────────────────┐        │
│       │  GraphVerify   │        │
│       └────────────────┘        │
└─────────────────────────────────┘
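
The query pipeline in the diagram merges semantic chunks and graph facts into one prompt, then checks the answer against the graph. The sketch below illustrates that flow under stated assumptions: `build_prompt` and `unsupported_claims` are hypothetical helpers, the prompt wording is invented, and the exact-match check is only a stand-in, since this README does not describe how GraphVerify actually validates claims.

```python
def build_prompt(question, chunks, triples, max_chunks=5):
    """Combine retrieved text chunks and graph triples into a single prompt."""
    lines = ["Answer using only the context below.", "", "Text passages:"]
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        lines.append(f"[{i}] {chunk}")
    lines += ["", "Graph facts:"]
    for subject, predicate, obj in triples:
        lines.append(f"- ({subject}) -[{predicate}]-> ({obj})")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def unsupported_claims(claimed_triples, graph_triples):
    """Return the claimed triples that have no exact match in the graph.

    Stand-in for claim verification: a real verifier would likely need
    entity resolution and fuzzy matching, not set membership.
    """
    known = set(graph_triples)
    return [claim for claim in claimed_triples if claim not in known]


graph = [("Aspirin", "is_a", "medication")]
prompt = build_prompt(
    "What is aspirin?",
    chunks=["Aspirin is a common medication."],
    triples=graph,
)
flagged = unsupported_claims(
    [("Aspirin", "is_a", "medication"), ("Aspirin", "is_a", "vitamin")], graph
)
```

Here `flagged` holds only the claim that the graph cannot support, which is the kind of signal a hallucination check would surface to the caller.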
- Extraction: DeepSeek-R1-Distill-Qwen-1.5B (`deepseek-r1:1.5b`) via Ollama (local)
- Reasoning/QA: Llama-3.3-70B-Versatile via Groq Cloud API (fast inference)
- Embeddings: BGE-small (`BAAI/bge-small-en-v1.5`)
- MongoDB: Document storage, triples, metadata, audit logs
- Neo4j: Canonical knowledge graph with versioning
- FAISS: Vector similarity search (CPU-based)
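
The configuration later in this README selects `IndexFlatIP`, a flat index that scores every stored vector by inner product. When embeddings are L2-normalized, inner product equals cosine similarity. The sketch below shows that scoring rule in plain Python with toy 3-dimensional vectors; it is a stand-in for illustration and does not call FAISS, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product_search(query, corpus, k):
    """Exhaustively score every corpus vector and return the top-k (index, score)."""
    scores = [
        (i, sum(q * d for q, d in zip(query, doc)))
        for i, doc in enumerate(corpus)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy vectors; real BGE-small embeddings have 384 dimensions.
corpus = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0])]
query = l2_normalize([1.0, 0.1, 0.0])
top = inner_product_search(query, corpus, k=2)
top_indices = [i for i, _ in top]  # indices 0 and 2 are closest to the query
```

Exhaustive scoring is exact but linear in corpus size, which is the usual trade-off of a flat index.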
graphbuilder-rag/
├── services/
│   ├── ingestion/          # Document ingestion
│   ├── normalization/      # Text extraction & cleaning
│   ├── extraction/         # Triple extraction (rules + LLM)
│   ├── embedding/          # BGE embeddings + FAISS
│   ├── entity_resolution/  # Entity linking & deduplication
│   ├── validation/         # Fact validation engine
│   ├── fusion/             # Neo4j graph fusion
│   ├── retrieval/          # Hybrid retrieval
│   ├── query/              # QA service with GraphVerify
│   └── agents/             # Self-healing agents
├── shared/
│   ├── config/             # Configuration management
│   ├── database/           # DB connectors
│   ├── models/             # Pydantic schemas
│   ├── prompts/            # LLM prompt templates
│   └── utils/              # Shared utilities
├── workers/                # Celery task workers
├── api/                    # FastAPI endpoints
├── tests/                  # Unit & integration tests
├── docker/                 # Docker configs
└── deployment/             # K8s/compose configs
macOS:
brew install mongodb-community neo4j redis ollama tesseract poppler

Linux:
# See SETUP.md for detailed Linux installation

# macOS
brew services start mongodb-community
brew services start neo4j
brew services start redis
ollama serve &
# Pull Ollama model (for extraction only)
ollama pull deepseek-r1:1.5b
# Get Groq API key for Q&A (free tier available)
# Visit: https://console.groq.com/keys

# Clone and setup
git clone <repository-url>
cd graphbuilder-rag
chmod +x setup.sh
./setup.sh

Option A: Separate terminals
# Terminal 1: API
python -m api.main
# Terminal 2: Worker
celery -A workers.tasks worker --loglevel=info --concurrency=4
# Terminal 3: Beat
celery -A workers.tasks beat --loglevel=info
# Terminal 4: Agents (optional)
python -m agents.agents

Option B: Tmux (all-in-one)
chmod +x run.sh
./run.sh

Ingest a document:
curl -X POST http://localhost:8000/api/v1/ingest \
-H "Content-Type: application/json" \
-d '{
"source": "https://en.wikipedia.org/wiki/Artificial_intelligence",
"source_type": "HTML",
"metadata": {"topic": "AI"}
}'

Query the system:
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"question": "What are the side effects of aspirin?",
"max_chunks": 5,
"graph_depth": 2
}'
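
The two curl calls above can also be issued from Python. This is a minimal sketch using only the standard library; the endpoint paths and payload fields are taken from the curl examples, while `build_request` and `post_json` are hypothetical helper names. The response schema is not documented in this README, so the sketch only decodes the JSON body and makes no assumption about its fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # same host and prefix as the curl examples


def build_request(endpoint, payload):
    """Build a JSON POST request for the given endpoint without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(endpoint, payload, timeout=30.0):
    """Send the request and decode the JSON response body."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


query_payload = {
    "question": "What are the side effects of aspirin?",
    "max_chunks": 5,
    "graph_depth": 2,
}

# With the API running locally, the call would be:
# result = post_json("query", query_payload)
# print(result)
```

Separating `build_request` from `post_json` makes it possible to inspect or unit-test the request without a running server.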
## Configuration
Edit `config/config.yaml`:
```yaml
mongodb:
  uri: mongodb://localhost:27017
  database: graphbuilder_rag
neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  password: password
ollama:
  base_url: http://localhost:11434
  extraction_model: deepseek-r1:1.5b  # For entity/relationship extraction
groq:
  api_key: your-groq-api-key-here  # Get from https://console.groq.com/keys
  model: llama-3.3-70b-versatile  # For fast Q&A reasoning
faiss:
  index_type: IndexFlatIP
  embedding_dim: 384
agents:
  reverify_interval: 86400  # 24 hours
  conflict_check_interval: 3600  # 1 hour
```

Access metrics at:
- API Health: http://localhost:8000/health
- Metrics: http://localhost:8000/metrics
- Neo4j Browser: http://localhost:7474
- MongoDB Compass: mongodb://localhost:27017

# Run all tests
pytest tests/
# Run specific service tests
pytest tests/services/extraction/
# Run integration tests
pytest tests/integration/

- Setup Guide - Complete installation and configuration
- Installation Checklist - Step-by-step setup verification
- Quick Installation - Fast setup for all platforms
- System Architecture - Complete system overview
- Framework Guide - Customization and extension guide
- Celery & Agents - Background tasks and autonomous agents
- Quick Start - Get started in 5 minutes
- Testing Guide - Test workflows and examples
- External Verification - Third-party fact checking
See CONTRIBUTING.md
MIT License - see LICENSE