hugopalma17/RAG_experiments

What This Project Is

A RAG (Retrieval-Augmented Generation) pipeline and embedding model benchmark over ~5,559 job listing markdown files. The pipeline loads job .md files, chunks them, embeds them via multiple embedding providers, and answers questions using cosine similarity retrieval + LLM generation.

The benchmark evaluates 5 embedding models across 18 queries designed to test different retrieval capabilities, with progressive data cleaning iterations to measure how data quality affects retrieval.

Key Commands

# Activate the venv (Python 3.14)
source venv/bin/activate

# Run RAG pipeline steps individually
python rag_pipeline.py --step load                          # Load .md files from jobs/
python rag_pipeline.py --step chunk                         # Chunk documents
python rag_pipeline.py --step embed                         # Embed chunks
python rag_pipeline.py --step query --query "your question" # Full retrieval + generation
python rag_pipeline.py --step all --query "your question"   # All steps end-to-end

# Useful flags
--limit N              # Process first N files (default: 5, 0 = all ~5559)
--chunk-size N         # Chunk size in chars (default: 1000)
--overlap N            # Chunk overlap in chars (default: 200)
--embed-model <name>   # Embedding model: gemini, openai_small, openai_large, e5, bge

# Run benchmark (18 queries against a saved pickle run)
python benchmark.py --run "<model>_baseline" --embed-model <name>

# Generate benchmark report (aggregate stats + per-query breakdown)
python benchmark_report.py --json benchmark_results/<model>/<run>.json

# Clean boilerplate from job markdown files
python clean_boilerplate.py                                        # Clean jobs/jobs-no-boilerplate/
python clean_boilerplate.py --target-dir jobs/jobs-structured      # Clean a specific directory
python clean_boilerplate.py --dry-run                              # Preview without writing

# Export jobs from Postgres to markdown
node generate-md.js              # Full export (requires Postgres env vars)
node generate-structured-md.js   # ID-filtered export to jobs/jobs-structured/

Architecture

Data flow: Postgres DB → generate-md.js → jobs/*.md → rag_pipeline.py (load → chunk → embed → query)

  • generate-md.js — Node script that connects to a Postgres jobs table and writes one .md file per job into jobs/. Each file has structured sections: Job Details, AI Analysis, Interview Insights.
  • generate-structured-md.js — ID-filtered export that only fetches jobs matching existing file IDs in jobs/, outputs to jobs/jobs-structured/ without AI Analysis or Interview Insights sections.
  • rag_pipeline.py — Main pipeline. Four sequential steps, each runnable independently:
    1. load — Glob jobs/*.md, read into memory
    2. chunk — Two-pass splitting: MarkdownHeaderTextSplitter (by #, ##, ###) then RecursiveCharacterTextSplitter
    3. embed — Supports 5 embedding providers (see Models below). Computes and displays cosine similarity matrix.
    4. query — Embeds the query, cosine-similarity ranks all chunks, takes top-5, builds augmented prompt, calls claude-opus-4-6 for generation.
  • benchmark.py — Runs all 18 queries against a saved pickle, records similarity scores + LLM responses, outputs JSON.
  • benchmark_report.py — Generates aggregate stats (top-1, floor, spread, tokens) and per-query breakdowns from benchmark JSON files.
  • clean_boilerplate.py — Strips noise from job markdown files (AI Analysis sections, Interview Insights, EEO text, benefits boilerplate, structural artifacts).
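The query step (cosine-rank all chunks, keep the top 5) can be sketched with NumPy. The vectors below are random stand-ins for real embeddings; only the ranking logic is the point:

```python
import numpy as np

# Step 4 (query) in miniature: rank chunks by cosine similarity, take top 5.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 8))   # 100 chunks, 8-dim stand-in embeddings
query_vec = rng.normal(size=8)

# Normalize so a dot product equals cosine similarity.
chunk_norm = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
query_norm = query_vec / np.linalg.norm(query_vec)

sims = chunk_norm @ query_norm           # one cosine score per chunk
top5 = np.argsort(sims)[::-1][:5]        # indices of the 5 best chunks
print(top5, sims[top5])
```

In the real pipeline these top-5 chunks are concatenated into the augmented prompt passed to the LLM.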

Data Directories

| Directory | Contents |
|---|---|
| jobs/ | Original raw export from Postgres (~5,559 files). Baseline. Do not edit. |
| jobs/jobs-no-boilerplate/ | Cleaned files — AI Analysis, Interview Insights, EEO, benefits boilerplate stripped. ~27% size reduction. |
| jobs/jobs-structured/ | Re-exported from Postgres with only job content fields (no AI Analysis/Interview Insights). Nearly identical to no-boilerplate after cleaning. |

Environment Variables

  • GOOGLE_API_KEY — Required for Gemini embeddings + LLM generation
  • OPENAI_API_KEY — Required for OpenAI embeddings
  • HF_TOKEN — Required for HuggingFace Inference API (e5, bge)
  • GROQ_API_KEY — Required for groqmachine.py
  • POSTGRES_HOST, POSTGRES_PORT, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB — Required for generate-md.js

Key Dependencies

  • Python: langchain, langchain-google-genai, langchain-text-splitters, openai, huggingface_hub, groq
  • Node: pg (PostgreSQL client)

Embedding Model Benchmark

Objective

Evaluate how different embedding models affect retrieval quality and answer usefulness across identical data, chunking, and queries. Each model embeds the same ~5,559 job files using the same chunking parameters (chunk_size=1000, overlap=200). All 18 benchmark queries run against every model, and results are compared across two data iterations: baseline (raw) and no-boilerplate (cleaned).

Models Under Test

| Model | Dims | Cost | Provider | Notes |
|---|---|---|---|---|
| e5-large-instruct | 1024 | free (HF Inference) | intfloat/HuggingFace | Highest baseline accuracy (0.879 top-1) |
| gemini-embedding-001 | 3072 | $0.15/1M tok | Google | Structural matching problem on analytical queries |
| bge-base-en-v1.5 | 768 | free (HF Inference) | BAAI/HuggingFace | Reliable, good coverage |
| text-embedding-3-small | 1536 | $0.02/1M tok | OpenAI | Low similarity scores but surprisingly good answers |
| text-embedding-3-large | 3072 | $0.13/1M tok | OpenAI | No clear advantage over small in response quality |

18 Benchmark Queries

Each query tests a specific retrieval + reasoning capability. 3 queries per category, same queries across every model and data iteration.

| Q | Category | Query | Tests |
|---|---|---|---|
| 1 | synthesis | What does a typical senior ML engineer role look like in terms of day-to-day responsibilities? | Synthesize patterns across multiple job descriptions into a coherent picture |
| 2 | synthesis | What tech stack do companies building LLM-powered products typically require? | Extract and combine technical requirements from LLM-related roles |
| 3 | synthesis | What does the interview process look like for AI engineering roles based on these listings? | Pull and synthesize interview details scattered across descriptions |
| 4 | comparison | How do junior versus senior AI roles differ in what they expect candidates to know? | Compare and contrast requirements across seniority levels |
| 5 | comparison | What is the difference between what startups and large companies look for in machine learning engineers? | Distinguish company-stage signals and compare expectations |
| 6 | comparison | How do roles focused on building AI products from scratch differ from those integrating existing models or APIs? | Semantic depth, build vs integrate distinction across descriptions |
| 7 | inference | Which roles seem to expect someone who can work independently with minimal supervision? | Infer autonomy expectations from indirect language cues |
| 8 | inference | Based on the job descriptions, which roles are more research-oriented versus production engineering? | Classify roles by inferred focus without explicit labels |
| 9 | inference | Which jobs sound like they want a full-stack engineer who also does ML, rather than a pure ML researcher? | Infer hybrid role expectations from combined skill signals |
| 10 | pattern | What soft skills keep appearing across AI and ML engineering job descriptions? | Identify recurring non-technical requirements across retrieved chunks |
| 11 | pattern | What tools and frameworks are most commonly mentioned alongside LLM or RAG work? | Extract co-occurring technical terms in a specific subdomain |
| 12 | pattern | What benefits beyond salary do AI companies highlight to attract engineering candidates? | Identify perks and cultural signals across multiple listings |
| 13 | nuanced-retrieval | Find roles where the focus is on data quality and pipeline reliability rather than model building | Retrieve based on semantic intent, not keyword overlap with ML terms |
| 14 | nuanced-retrieval | Jobs that emphasize mentorship, career growth, or a strong engineering culture | Retrieve on soft cultural signals buried in descriptions |
| 15 | nuanced-retrieval | Roles that involve deploying models to production and managing inference at scale | Distinguish MLOps/deployment focus from training/research focus |
| 16 | analysis | Based on these job listings, what skills would you recommend someone learn to be competitive for AI engineering roles? | LLM must reason about market signals and form a recommendation |
| 17 | analysis | Which job descriptions seem the most well-written and informative versus vague and generic? | LLM judges content quality, requires meta-reasoning about the text itself |
| 18 | analysis | Based on the requirements listed, which roles seem the hardest to fill and why? | Infer hiring difficulty from requirement complexity and specificity |

Metrics

Retrieval Quality (per query)

  • Top-1 similarity score
  • Top-5 similarity scores with source file references
  • Boilerplate flag per retrieved chunk

Aggregate (per model run)

  • Top-1 avg — mean of best similarity across all 18 queries
  • Floor — lowest top-1 score (worst-case retrieval)
  • Spread — top-1 avg minus floor (consistency measure; tighter = better)
  • Total tokens — total context tokens across all queries
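The aggregate stats take only a few lines to compute. The scores below are illustrative stand-ins, not real benchmark output; spread is computed as top-1 avg minus floor, which matches the results tables (e.g. 0.879 − 0.853 ≈ 0.025 for e5):

```python
# Aggregate stats over the per-query top-1 similarity scores
# (illustrative numbers, not real benchmark output).
top1_scores = [0.88, 0.86, 0.87, 0.85, 0.89, 0.86]  # one per query

top1_avg = sum(top1_scores) / len(top1_scores)  # mean best-similarity
floor = min(top1_scores)                        # worst-case retrieval
spread = top1_avg - floor                       # consistency: tighter = better

print(f"top-1 avg: {top1_avg:.3f}, floor: {floor:.3f}, spread: {spread:.3f}")
```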

Answer Quality (per query)

  • LLM response generated by claude-opus-4-6 using the top-5 retrieved chunks
  • Response character count
  • Qualitative evaluation in llm_response_evaluation.md

Data Iterations

| Round | Directory | Description |
|---|---|---|
| Baseline | jobs/ | Raw export including AI Analysis, Interview Insights, EEO, benefits boilerplate |
| No-Boilerplate | jobs/jobs-no-boilerplate/ | Cleaned — AI sections removed, boilerplate patterns stripped, ~27% size reduction |
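The no-boilerplate pass can be sketched as a regex section-stripper that drops whole headed sections. This is a hypothetical sketch: the section names come from the descriptions above, but the actual rules in clean_boilerplate.py may differ:

```python
import re

# Drop "## AI Analysis" and "## Interview Insights" sections wholesale:
# from the heading through the next "## " heading (or end of file).
SECTION_RE = re.compile(
    r"^## (AI Analysis|Interview Insights)\n.*?(?=^## |\Z)",
    re.MULTILINE | re.DOTALL,
)

def strip_boilerplate(md: str) -> str:
    return SECTION_RE.sub("", md)

doc = "## Job Details\nGreat role.\n## AI Analysis\nGenerated text.\n## Benefits\n401k.\n"
print(strip_boilerplate(doc))  # AI Analysis section is gone, the rest survives
```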

Baseline Results

| Model | Top-1 | Floor | Spread | Tokens |
|---|---|---|---|---|
| e5-large-instruct | 0.879 | 0.853 | 0.025 | 25,992 |
| gemini-embedding-001 | 0.848 | 0.808 | 0.040 | 24,885 |
| bge-base-en-v1.5 | 0.742 | 0.669 | 0.073 | 33,936 |
| text-embedding-3-small | 0.598 | 0.501 | 0.097 | 38,496 |
| text-embedding-3-large | 0.571 | 0.475 | 0.096 | 37,895 |

Note: Absolute similarity scores are not comparable across models due to different embedding space geometries. What matters is the ranking of retrieved chunks, not the raw number.

No-Boilerplate Results

After stripping AI Analysis sections, Interview Insights, EEO text, benefits boilerplate, and structural artifacts. ~35% reduction in chunk count (51,545 to 33,409 vectors for HF models, 36,523 for Gemini).

| Model | Top-1 | Floor | Spread | Tokens |
|---|---|---|---|---|
| e5-large-instruct | 0.872 | 0.838 | 0.034 | 36,099 |
| gemini-embedding-001 | 0.850 | 0.805 | 0.045 | 27,770 |
| bge-base-en-v1.5 | 0.724 | 0.655 | 0.069 | 44,365 |
| text-embedding-3-small | 0.582 | 0.478 | 0.104 | 44,475 |
| text-embedding-3-large | 0.561 | 0.462 | 0.099 | 43,583 |

Key finding: Top-1 scores barely changed, but spreads widened for most models (E5: 0.025 to 0.034, +36%). Fewer chunks but denser content per retrieval window, resulting in higher token counts per query.

Run Protocol

# 1. Embed (same data, same chunking, different model)
python rag_pipeline.py --step embed --limit 0 --embed-model <name>

# 2. Run benchmark (18 queries)
python benchmark.py --run "<model>_baseline" --embed-model <name>

# 3. Generate report
python benchmark_report.py --json benchmark_results/<model>/<run>.json
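The three-step protocol can be swept across all five models with a small loop. Sketch only: the commands are collected and printed rather than executed, and the run name `<model>_baseline` plus the report JSON path are assumptions extrapolated from the examples above:

```shell
# Sweep all five embedding models through the run protocol.
# Remove the final printf and execute each line to run for real.
cmds=""
for model in gemini openai_small openai_large e5 bge; do
  cmds="$cmds
python rag_pipeline.py --step embed --limit 0 --embed-model $model
python benchmark.py --run ${model}_baseline --embed-model $model
python benchmark_report.py --json benchmark_results/$model/${model}_baseline.json"
done
printf '%s\n' "$cmds"
```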

Key Files

  • benchmark.py — Runs 18 queries against saved pickle runs, records metrics + LLM responses
  • benchmark_report.py — Computes aggregate stats and per-query breakdowns from benchmark JSONs
  • benchmark_results/ — One subdirectory per model, containing benchmark JSON output
  • llm_response_evaluation.md — Qualitative evaluation of LLM response quality across all 5 models and 18 queries
  • clean_boilerplate.py — Data cleaning script for progressive iteration testing
  • runs/*.pkl — One pickle per embedding run
  • raw.log — Raw commands for each embedding round

About

Learning and doing A/B statistical analysis on RAG pipelines
