A RAG (Retrieval-Augmented Generation) pipeline and embedding model benchmark over ~5,559 job listing markdown files. The pipeline loads job .md files, chunks them, embeds them via multiple embedding providers, and answers questions using cosine similarity retrieval + LLM generation.
The benchmark evaluates 5 embedding models across 18 queries designed to test different retrieval capabilities, with progressive data cleaning iterations to measure how data quality affects retrieval.
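Retrieval itself is plain cosine similarity between the query embedding and every chunk embedding. A minimal sketch of that idea (illustrative only; `top_k_chunks` is not a function in this repo):

```python
# Minimal sketch: rank all chunk embeddings by cosine similarity to the query embedding.
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    q = np.asarray(query_vec) / np.linalg.norm(query_vec)
    m = np.asarray(chunk_vecs)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q                        # cosine similarity of every chunk to the query
    best = np.argsort(sims)[::-1][:k]   # indices of the k most similar chunks
    return [(chunks[i], float(sims[i])) for i in best]
```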
# Activate the venv (Python 3.14)
source venv/bin/activate
# Run RAG pipeline steps individually
python rag_pipeline.py --step load # Load .md files from jobs/
python rag_pipeline.py --step chunk # Chunk documents
python rag_pipeline.py --step embed # Embed chunks
python rag_pipeline.py --step query --query "your question" # Full retrieval + generation
python rag_pipeline.py --step all --query "your question" # All steps end-to-end
# Useful flags
--limit N # Process first N files (default: 5, 0 = all ~5559)
--chunk-size N # Chunk size in chars (default: 1000)
--overlap N # Chunk overlap in chars (default: 200)
--embed-model <name> # Embedding model: gemini, openai_small, openai_large, e5, bge
# Run benchmark (18 queries against a saved pickle run)
python benchmark.py --run "<model>_baseline" --embed-model <name>
# Generate benchmark report (aggregate stats + per-query breakdown)
python benchmark_report.py --json benchmark_results/<model>/<run>.json
# Clean boilerplate from job markdown files
python clean_boilerplate.py # Clean jobs/jobs-no-boilerplate/
python clean_boilerplate.py --target-dir jobs/jobs-structured # Clean a specific directory
python clean_boilerplate.py --dry-run # Preview without writing
# Export jobs from Postgres to markdown
node generate-md.js # Full export (requires Postgres env vars)
node generate-structured-md.js   # ID-filtered export to jobs/jobs-structured/

Data flow: Postgres DB → generate-md.js → jobs/*.md → rag_pipeline.py (load → chunk → embed → query)
- `generate-md.js` — Node script that connects to a Postgres `jobs` table and writes one `.md` file per job into `jobs/`. Each file has structured sections: Job Details, AI Analysis, Interview Insights.
- `generate-structured-md.js` — ID-filtered export that only fetches jobs matching existing file IDs in `jobs/`, outputs to `jobs/jobs-structured/` without AI Analysis or Interview Insights sections.
- `rag_pipeline.py` — Main pipeline. Four sequential steps, each runnable independently:
  - load — Glob `jobs/*.md`, read into memory
  - chunk — Two-pass splitting: `MarkdownHeaderTextSplitter` (by `#`, `##`, `###`) then `RecursiveCharacterTextSplitter` (sketched below, after this list)
  - embed — Supports 5 embedding providers (see Models below). Computes and displays cosine similarity matrix.
  - query — Embeds the query, cosine-similarity ranks all chunks, takes top-5, builds augmented prompt, calls `claude-opus-4-6` for generation.
- `benchmark.py` — Runs all 18 queries against a saved pickle, records similarity scores + LLM responses, outputs JSON.
- `benchmark_report.py` — Generates aggregate stats (top-1, floor, spread, tokens) and per-query breakdowns from benchmark JSON files.
- `clean_boilerplate.py` — Strips noise from job markdown files (AI Analysis sections, Interview Insights, EEO text, benefits boilerplate, structural artifacts).
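For concreteness, a minimal sketch of the load and chunk steps described above, using the two splitters and the default CLI parameters (chunk_size=1000, overlap=200). This is an illustration, not the pipeline's exact code:

```python
# Hedged sketch of load + chunk: glob jobs/*.md, split on markdown headers, then by size.
from glob import glob

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = []
for path in glob("jobs/*.md"):  # load step: read every job file into memory
    with open(path, encoding="utf-8") as f:
        text = f.read()
    sections = header_splitter.split_text(text)              # pass 1: split by #, ##, ###
    chunks.extend(char_splitter.split_documents(sections))   # pass 2: size-bounded chunks with overlap
```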
| Directory | Contents |
|---|---|
| `jobs/` | Original raw export from Postgres (~5,559 files). Baseline. Do not edit. |
| `jobs/jobs-no-boilerplate/` | Cleaned files — AI Analysis, Interview Insights, EEO, benefits boilerplate stripped. ~27% size reduction. |
| `jobs/jobs-structured/` | Re-exported from Postgres with only job content fields (no AI Analysis/Interview Insights). Nearly identical to no-boilerplate after cleaning. |
- `GOOGLE_API_KEY` — Required for Gemini embeddings + LLM generation
- `OPENAI_API_KEY` — Required for OpenAI embeddings
- `HF_TOKEN` — Required for HuggingFace Inference API (e5, bge)
- `GROQ_API_KEY` — Required for `groqmachine.py`
- `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` — Required for `generate-md.js`
- Python: `langchain`, `langchain-google-genai`, `langchain-text-splitters`, `openai`, `huggingface_hub`, `groq`
- Node: `pg` (PostgreSQL client)
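A hedged sketch of how the `--embed-model` names might map onto those dependencies and keys. The HF model repo IDs are assumptions, and the Gemini path via `langchain-google-genai` is omitted for brevity:

```python
# Hedged sketch: dispatch --embed-model names to an embedding provider.
import os

from huggingface_hub import InferenceClient
from openai import OpenAI

def embed_texts(texts, embed_model="openai_small"):
    """Return one embedding vector per input text."""
    if embed_model in ("openai_small", "openai_large"):
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        model = "text-embedding-3-small" if embed_model == "openai_small" else "text-embedding-3-large"
        resp = client.embeddings.create(model=model, input=texts)
        return [d.embedding for d in resp.data]
    if embed_model in ("e5", "bge"):
        client = InferenceClient(token=os.environ["HF_TOKEN"])
        # Assumed repo IDs; the project may pin different checkpoints.
        repo = {"e5": "intfloat/multilingual-e5-large-instruct", "bge": "BAAI/bge-base-en-v1.5"}[embed_model]
        return [client.feature_extraction(t, model=repo) for t in texts]
    raise ValueError(f"unknown embed model: {embed_model}")
```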
Evaluate how different embedding models affect retrieval quality and answer usefulness across identical data, chunking, and queries. Each model embeds the same ~5,559 job files using the same chunking parameters (chunk_size=1000, overlap=200). All 18 benchmark queries run against every model, and results are compared across two data iterations: baseline (raw) and no-boilerplate (cleaned).
| Model | Dims | Cost | Provider | Notes |
|---|---|---|---|---|
| e5-large-instruct | 1024 | free (HF Inference) | BAAI/HuggingFace | Highest baseline accuracy (0.879 top-1) |
| gemini-embedding-001 | 3072 | $0.15/1M tok | Google | Structural matching problem on analytical queries |
| bge-base-en-v1.5 | 768 | free (HF Inference) | BAAI/HuggingFace | Reliable, good coverage |
| text-embedding-3-small | 1536 | $0.02/1M tok | OpenAI | Low similarity scores but surprisingly good answers |
| text-embedding-3-large | 3072 | $0.13/1M tok | OpenAI | No clear advantage over small in response quality |
Each query tests a specific retrieval + reasoning capability. 3 queries per category, same queries across every model and data iteration.
| Q | Category | Query | Tests |
|---|---|---|---|
| 1 | synthesis | What does a typical senior ML engineer role look like in terms of day-to-day responsibilities? | Synthesize patterns across multiple job descriptions into a coherent picture |
| 2 | synthesis | What tech stack do companies building LLM-powered products typically require? | Extract and combine technical requirements from LLM-related roles |
| 3 | synthesis | What does the interview process look like for AI engineering roles based on these listings? | Pull and synthesize interview details scattered across descriptions |
| 4 | comparison | How do junior versus senior AI roles differ in what they expect candidates to know? | Compare and contrast requirements across seniority levels |
| 5 | comparison | What is the difference between what startups and large companies look for in machine learning engineers? | Distinguish company-stage signals and compare expectations |
| 6 | comparison | How do roles focused on building AI products from scratch differ from those integrating existing models or APIs? | Semantic depth, build vs integrate distinction across descriptions |
| 7 | inference | Which roles seem to expect someone who can work independently with minimal supervision? | Infer autonomy expectations from indirect language cues |
| 8 | inference | Based on the job descriptions, which roles are more research-oriented versus production engineering? | Classify roles by inferred focus without explicit labels |
| 9 | inference | Which jobs sound like they want a full-stack engineer who also does ML, rather than a pure ML researcher? | Infer hybrid role expectations from combined skill signals |
| 10 | pattern | What soft skills keep appearing across AI and ML engineering job descriptions? | Identify recurring non-technical requirements across retrieved chunks |
| 11 | pattern | What tools and frameworks are most commonly mentioned alongside LLM or RAG work? | Extract co-occurring technical terms in a specific subdomain |
| 12 | pattern | What benefits beyond salary do AI companies highlight to attract engineering candidates? | Identify perks and cultural signals across multiple listings |
| 13 | nuanced-retrieval | Find roles where the focus is on data quality and pipeline reliability rather than model building | Retrieve based on semantic intent, not keyword overlap with ML terms |
| 14 | nuanced-retrieval | Jobs that emphasize mentorship, career growth, or a strong engineering culture | Retrieve on soft cultural signals buried in descriptions |
| 15 | nuanced-retrieval | Roles that involve deploying models to production and managing inference at scale | Distinguish MLOps/deployment focus from training/research focus |
| 16 | analysis | Based on these job listings, what skills would you recommend someone learn to be competitive for AI engineering roles? | LLM must reason about market signals and form a recommendation |
| 17 | analysis | Which job descriptions seem the most well-written and informative versus vague and generic? | LLM judges content quality, requires meta-reasoning about the text itself |
| 18 | analysis | Based on the requirements listed, which roles seem the hardest to fill and why? | Infer hiring difficulty from requirement complexity and specificity |
Retrieval Quality (per query)
- Top-1 similarity score
- Top-5 similarity scores with source file references
- Boilerplate flag per retrieved chunk
Aggregate (per model run)
- Top-1 avg — mean of best similarity across all 18 queries
- Floor — lowest top-1 score (worst-case retrieval)
- Spread — top-1 avg minus floor (consistency measure; tighter = better)
- Total tokens — total context tokens across all queries
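The aggregates above reduce to a few lines of arithmetic over the 18 per-query top-1 scores. A sketch for concreteness (illustrative only; `benchmark_report.py` may compute them differently):

```python
# Sketch: aggregate stats from the 18 per-query top-1 similarity scores.
def aggregate(top1_scores):
    top1_avg = sum(top1_scores) / len(top1_scores)  # mean of the best similarity per query
    floor = min(top1_scores)                        # worst-case retrieval
    spread = top1_avg - floor                       # consistency measure; tighter = better
    return {"top1_avg": top1_avg, "floor": floor, "spread": spread}
```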
Answer Quality (per query)
- LLM response generated by `claude-opus-4-6` using top-200 retrieved chunks
- Response character count
- Qualitative evaluation in `llm_response_evaluation.md`
| Round | Directory | Description |
|---|---|---|
| Baseline | `jobs/` | Raw export including AI Analysis, Interview Insights, EEO, benefits boilerplate |
| No-Boilerplate | `jobs/jobs-no-boilerplate/` | Cleaned — AI sections removed, boilerplate patterns stripped, ~27% size reduction |
| Model | Top-1 | Floor | Spread | Tokens |
|---|---|---|---|---|
| e5-large-instruct | 0.879 | 0.853 | 0.025 | 25,992 |
| gemini-embedding-001 | 0.848 | 0.808 | 0.040 | 24,885 |
| bge-base-en-v1.5 | 0.742 | 0.669 | 0.073 | 33,936 |
| text-embedding-3-small | 0.598 | 0.501 | 0.097 | 38,496 |
| text-embedding-3-large | 0.571 | 0.475 | 0.096 | 37,895 |
Note: Absolute similarity scores are not comparable across models due to different embedding space geometries. What matters is the ranking of retrieved chunks, not the raw number.
After stripping AI Analysis sections, Interview Insights, EEO text, benefits boilerplate, and structural artifacts. ~35% reduction in chunk count (51,545 to 33,409 vectors for HF models, 36,523 for Gemini).
| Model | Top-1 | Floor | Spread | Tokens |
|---|---|---|---|---|
| e5-large-instruct | 0.872 | 0.838 | 0.034 | 36,099 |
| gemini-embedding-001 | 0.850 | 0.805 | 0.045 | 27,770 |
| bge-base-en-v1.5 | 0.724 | 0.655 | 0.069 | 44,365 |
| text-embedding-3-small | 0.582 | 0.478 | 0.104 | 44,475 |
| text-embedding-3-large | 0.561 | 0.462 | 0.099 | 43,583 |
Key finding: Top-1 scores barely changed, but spreads widened for most models (E5: 0.025 to 0.034, +36%). Fewer chunks but denser content per retrieval window, resulting in higher token counts per query.
# 1. Embed (same data, same chunking, different model)
python rag_pipeline.py --step embed --limit 0 --embed-model <name>
# 2. Run benchmark (18 queries)
python benchmark.py --run "<model>_baseline" --embed-model <name>
# 3. Generate report
python benchmark_report.py --json benchmark_results/<model>/<run>.json

Benchmark artifacts:

- `benchmark.py` — Runs 18 queries against saved pickle runs, records metrics + LLM responses
- `benchmark_report.py` — Computes aggregate stats and per-query breakdowns from benchmark JSONs
- `benchmark_results/` — One subdirectory per model, containing benchmark JSON output
- `llm_response_evaluation.md` — Qualitative evaluation of LLM response quality across all 5 models and 18 queries
- `clean_boilerplate.py` — Data cleaning script for progressive iteration testing
- `runs/*.pkl` — One pickle per embedding run
- `raw.log` — Raw commands for each embedding round
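To sweep all five models, the three commands above can be looped. A hedged sketch (the report JSON path is an assumption based on the `benchmark_results/<model>/<run>.json` template and may not match the actual filenames):

```python
# Hedged sketch: repeat embed -> benchmark -> report for every embedding model.
import subprocess

MODELS = ["gemini", "openai_small", "openai_large", "e5", "bge"]

for name in MODELS:
    subprocess.run(["python", "rag_pipeline.py", "--step", "embed",
                    "--limit", "0", "--embed-model", name], check=True)
    subprocess.run(["python", "benchmark.py", "--run", f"{name}_baseline",
                    "--embed-model", name], check=True)
    # Assumed output path; adjust to wherever benchmark.py actually writes its JSON.
    subprocess.run(["python", "benchmark_report.py",
                    "--json", f"benchmark_results/{name}/{name}_baseline.json"], check=True)
```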