Sagar-S-R/GraphBuilder-RAG

GraphBuilder-RAG: Graph-Enhanced Retrieval Augmented Generation System

A production-grade, modular framework for building and querying knowledge graphs from heterogeneous documents with advanced RAG capabilities.

🎯 System Overview

GraphBuilder-RAG extracts structured knowledge from documents, validates facts, builds versioned knowledge graphs, and provides hybrid retrieval with hallucination detection.

Key Features

  • Multi-format ingestion: HTML, PDF, CSV, JSON APIs
  • Intelligent extraction: Rule-based + LLM-based triple extraction
  • Fact validation: Ontology rules + external verification
  • Versioned knowledge graph: Neo4j with full provenance tracking
  • Hybrid retrieval: FAISS semantic search + Neo4j graph traversal
  • Hallucination detection: GraphVerify for claim validation
  • Self-healing agents: Auto-verification, conflict resolution, schema evolution
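The hallucination-detection idea behind GraphVerify can be sketched in a few lines: extract (subject, predicate, object) claims from a generated answer and accept only those that correspond to triples in the knowledge graph. The function and relation names below are illustrative, not the project's actual API.

```python
# Minimal sketch of GraphVerify-style claim checking (illustrative names,
# not the project's actual API): an answer is trusted only if every claim
# it makes corresponds to a known triple in the knowledge graph.

def verify_claims(claims, known_triples):
    """Split claims into (supported, unsupported) lists.

    claims        -- iterable of (subject, predicate, object) tuples
    known_triples -- set of (subject, predicate, object) tuples from the graph
    """
    supported, unsupported = [], []
    for claim in claims:
        (supported if claim in known_triples else unsupported).append(claim)
    return supported, unsupported


graph = {
    ("aspirin", "TREATS", "headache"),
    ("aspirin", "HAS_SIDE_EFFECT", "stomach irritation"),
}
answer_claims = [
    ("aspirin", "HAS_SIDE_EFFECT", "stomach irritation"),
    ("aspirin", "CURES", "cancer"),  # unsupported: flagged as hallucination
]
ok, flagged = verify_claims(answer_claims, graph)
```

Real claim extraction from free-text answers is the hard part; the check itself stays this simple.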

πŸ—οΈ Architecture

┌─────────────────┐
│   Ingestion     │ → MongoDB GridFS (raw docs)
└────────┬────────┘
         ↓
┌─────────────────┐
│ Normalization   │ → MongoDB (normalized_docs)
└────────┬────────┘
         ↓
┌─────────────────┐
│   Extraction    │ → MongoDB (candidate_triples)
│  DeepSeek 1.5B  │
└────────┬────────┘
         ↓
┌─────────────────┐
│   Validation    │ → MongoDB (validated_triples)
└────────┬────────┘
         ↓
┌─────────────────┐
│     Fusion      │ → Neo4j (knowledge graph)
└────────┬────────┘
         ↓
┌─────────────────────────────────┐
│         Query Pipeline          │
│  ┌──────────┐  ┌─────────────┐  │
│  │  FAISS   │  │    Neo4j    │  │
│  │ Semantic │  │    Graph    │  │
│  └────┬─────┘  └──────┬──────┘  │
│       └───────┬───────┘         │
│               ↓                 │
│        ┌────────────┐           │
│        │   Prompt   │           │
│        │  Builder   │           │
│        └─────┬──────┘           │
│              ↓                  │
│      ┌────────────────┐         │
│      │ Groq Llama 70B │         │
│      │   Reasoning    │         │
│      └───────┬────────┘         │
│              ↓                  │
│      ┌────────────────┐         │
│      │  GraphVerify   │         │
│      └────────────────┘         │
└─────────────────────────────────┘
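Inside the query pipeline, the FAISS and Neo4j branches each produce a ranked candidate list that must be merged before prompt building. A simple weighted-sum fusion is sketched below as an assumption about how such merging can work; the actual ranking logic in the retrieval service may differ.

```python
# Illustrative hybrid-score fusion (an assumption, not the project's exact
# ranking): blend FAISS semantic similarity with a graph-proximity score
# derived from Neo4j traversal depth.

def fuse_scores(semantic, graph_hops, alpha=0.7, max_depth=2):
    """Rank chunk ids by a blended semantic + graph score.

    semantic   -- {chunk_id: cosine similarity in [0, 1]}
    graph_hops -- {chunk_id: hop distance from the query's entities}
    """
    fused = {}
    for cid in set(semantic) | set(graph_hops):
        sem = semantic.get(cid, 0.0)
        hops = graph_hops.get(cid)
        # Closer graph nodes score higher; unreachable nodes contribute 0.
        prox = 0.0 if hops is None or hops > max_depth else 1.0 - hops / (max_depth + 1)
        fused[cid] = alpha * sem + (1 - alpha) * prox
    return sorted(fused, key=fused.get, reverse=True)


ranking = fuse_scores(
    semantic={"c1": 0.9, "c2": 0.4},
    graph_hops={"c2": 0, "c3": 1},
)
```

Here `c1` wins on pure similarity, while `c2` is boosted by sitting directly on a matched graph entity.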

🧠 Models Used

  • Extraction: DeepSeek-R1-Distill-Qwen-1.5B (deepseek-r1:1.5b) via Ollama (local)
  • Reasoning/QA: Llama-3.3-70B-Versatile via Groq Cloud API (fast inference)
  • Embeddings: BGE-small (BAAI/bge-small-en-v1.5)
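BGE-small maps text to 384-dimensional vectors. With an inner-product FAISS index (see the `faiss` block in Configuration), embeddings are typically L2-normalized first so that the inner product equals cosine similarity. The sketch below shows just that normalization step, with plain Python lists standing in for the real 384-dimensional model output.

```python
import math

# With IndexFlatIP, embeddings are usually L2-normalized so that the inner
# product equals cosine similarity. Plain Python stands in for BGE-small
# here; real vectors would be 384-dimensional.

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

q = l2_normalize([3.0, 4.0])
d = l2_normalize([3.0, 4.0])
score = inner_product(q, d)  # identical direction -> similarity ~= 1.0
```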

💾 Data Stores

  • MongoDB: Document storage, triples, metadata, audit logs
  • Neo4j: Canonical knowledge graph with versioning
  • FAISS: Vector similarity search (CPU-based)
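Writing a validated triple into Neo4j can be done idempotently with `MERGE`, tagging the relationship with provenance fields. The query-building sketch below assumes node labels and property names that are illustrative, not the project's actual schema; note that Cypher cannot parameterize relationship types, so the predicate must be whitelisted before interpolation.

```python
# Illustrative fusion write for one validated triple (assumed schema, not
# the project's actual one). MERGE keeps the graph idempotent; provenance
# properties record where and when the fact entered the graph.

def fusion_query(predicate):
    """Build an idempotent MERGE statement for a validated triple.

    Relationship types cannot be Cypher parameters, so the predicate is
    interpolated into the query text; real code must whitelist it first.
    """
    if not predicate.replace("_", "").isalnum():
        raise ValueError("unsafe relationship type")
    return (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[r:{predicate}]->(o) "
        "SET r.source_doc = $doc_id, r.version = $version"
    )


query = fusion_query("HAS_SIDE_EFFECT")
# A Neo4j driver session would then run:
#   session.run(query, subject="aspirin", object="stomach irritation",
#               doc_id="doc_42", version=1)
```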

πŸ“ Project Structure

graphbuilder-rag/
β”œβ”€β”€ services/
β”‚   β”œβ”€β”€ ingestion/          # Document ingestion
β”‚   β”œβ”€β”€ normalization/      # Text extraction & cleaning
β”‚   β”œβ”€β”€ extraction/         # Triple extraction (rules + LLM)
β”‚   β”œβ”€β”€ embedding/          # BGE embeddings + FAISS
β”‚   β”œβ”€β”€ entity_resolution/  # Entity linking & deduplication
β”‚   β”œβ”€β”€ validation/         # Fact validation engine
β”‚   β”œβ”€β”€ fusion/             # Neo4j graph fusion
β”‚   β”œβ”€β”€ retrieval/          # Hybrid retrieval
β”‚   β”œβ”€β”€ query/              # QA service with GraphVerify
β”‚   └── agents/             # Self-healing agents
β”œβ”€β”€ shared/
β”‚   β”œβ”€β”€ config/             # Configuration management
β”‚   β”œβ”€β”€ database/           # DB connectors
β”‚   β”œβ”€β”€ models/             # Pydantic schemas
β”‚   β”œβ”€β”€ prompts/            # LLM prompt templates
β”‚   └── utils/              # Shared utilities
β”œβ”€β”€ workers/                # Celery task workers
β”œβ”€β”€ api/                    # FastAPI endpoints
β”œβ”€β”€ tests/                  # Unit & integration tests
β”œβ”€β”€ docker/                 # Docker configs
└── deployment/             # K8s/compose configs

🚀 Quick Start

1. Install Services

macOS:

brew install mongodb-community neo4j redis ollama tesseract poppler

Linux:

# See SETUP.md for detailed Linux installation

2. Start Services

# macOS
brew services start mongodb-community
brew services start neo4j
brew services start redis
ollama serve &

# Pull Ollama model (for extraction only)
ollama pull deepseek-r1:1.5b

# Get Groq API key for Q&A (free tier available)
# Visit: https://console.groq.com/keys

3. Setup Project

# Clone and setup
git clone <repository-url>
cd graphbuilder-rag
chmod +x setup.sh
./setup.sh

4. Run Application

Option A: Separate terminals

# Terminal 1: API
python -m api.main

# Terminal 2: Worker
celery -A workers.tasks worker --loglevel=info --concurrency=4

# Terminal 3: Beat
celery -A workers.tasks beat --loglevel=info

# Terminal 4: Agents (optional)
python -m agents.agents

Option B: Tmux (all-in-one)

chmod +x run.sh
./run.sh

5. Test the API

Ingest a document:

curl -X POST http://localhost:8000/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "source_type": "HTML",
    "metadata": {"topic": "AI"}
  }'

Query the system:

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the side effects of aspirin?",
    "max_chunks": 5,
    "graph_depth": 2
  }'
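The same query can be issued from Python. The endpoint path and field names below mirror the curl example; only the helper names are invented for illustration.

```python
import json
import urllib.request

# Python equivalent of the curl query call (helper names are illustrative).
# send_query() performs a real POST, so it needs the API running locally.

API_URL = "http://localhost:8000/api/v1/query"

def build_payload(question, max_chunks=5, graph_depth=2):
    """Assemble the JSON body expected by /api/v1/query."""
    return {
        "question": question,
        "max_chunks": max_chunks,
        "graph_depth": graph_depth,
    }

def send_query(question):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_payload("What are the side effects of aspirin?")
```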


🔧 Configuration

Edit config/config.yaml:

mongodb:
  uri: mongodb://localhost:27017
  database: graphbuilder_rag

neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  password: password

ollama:
  base_url: http://localhost:11434
  extraction_model: deepseek-r1:1.5b  # For entity/relationship extraction

groq:
  api_key: your-groq-api-key-here  # Get from https://console.groq.com/keys
  model: llama-3.3-70b-versatile  # For fast Q&A reasoning

faiss:
  index_type: IndexFlatIP
  embedding_dim: 384

agents:
  reverify_interval: 86400  # 24 hours
  conflict_check_interval: 3600  # 1 hour
📊 Monitoring

Access metrics at:

  • API Health: http://localhost:8000/health
  • Metrics: http://localhost:8000/metrics
  • Neo4j Browser: http://localhost:7474
  • MongoDB Compass: mongodb://localhost:27017

🧪 Testing

# Run all tests
pytest tests/

# Run specific service tests
pytest tests/services/extraction/

# Run integration tests
pytest tests/integration/

📖 Documentation

Setup & Installation

Architecture & Design

Usage & Testing

Advanced Topics

🤝 Contributing

See CONTRIBUTING.md

📄 License

MIT License - see LICENSE
