A RAG (Retrieval-Augmented Generation) pipeline and embedding model benchmark over ~5,559 job listing markdown files. The pipeline loads job .md files, chunks them, embeds them via multiple embedding providers, and answers questions using cosine similarity retrieval + LLM generation.
The benchmark evaluates 5 embedding models across 18 queries designed to test different retrieval capabilities, with progressive data cleaning iterations to measure how data quality affects retrieval.
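Retrieval itself is plain cosine similarity between the query embedding and every chunk embedding. A minimal sketch of that idea (illustrative only; `top_k_chunks` is not a function in this repo):

```python
# Minimal sketch: rank all chunk embeddings by cosine similarity to the query embedding.
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    q = np.asarray(query_vec) / np.linalg.norm(query_vec)
    m = np.asarray(chunk_vecs)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q                        # cosine similarity of every chunk to the query
    best = np.argsort(sims)[::-1][:k]   # indices of the k most similar chunks
    return [(chunks[i], float(sims[i])) for i in best]
```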
# Activate the venv (Python 3.14)
source venv/bin/activate
# Run RAG pipeline steps individually
python rag_pipeline.py --step load # Load .md files from jobs/
python rag_pipeline.py --step chunk # Chunk documents
python rag_pipeline.py --step embed # Embed chunks
python rag_pipeline.py --step query --query "your question" # Full retrieval + generation
python rag_pipeline.py --step all --query "your question" # All steps end-to-end
# Useful flags
--limit N # Process first N files (default: 5, 0 = all ~5559)
--chunk-size N # Chunk size in chars (default: 1000)
--overlap N # Chunk overlap in chars (default: 200)
--embed-model <name> # Embedding model: gemini, openai_small, openai_large, e5, bge
# Run benchmark (18 queries against a saved pickle run)
python benchmark.py --run "<model>_baseline" --embed-model <name>
# Generate benchmark report (aggregate stats + per-query breakdown)
python benchmark_report.py --json benchmark_results/<model>/<run>.json
# Clean boilerplate from job markdown files
python clean_boilerplate.py # Clean jobs/jobs-no-boilerplate/
python clean_boilerplate.py --target-dir jobs/jobs-structured # Clean a specific directory
python clean_boilerplate.py --dry-run # Preview without writing
# Export jobs from Postgres to markdown
node generate-md.js # Full export (requires Postgres env vars)
node generate-structured-md.js   # ID-filtered export to jobs/jobs-structured/

Data flow: Postgres DB → generate-md.js → jobs/*.md → rag_pipeline.py (load → chunk → embed → query)
- `generate-md.js` — Node script that connects to a Postgres `jobs` table and writes one `.md` file per job into `jobs/`. Each file has structured sections: Job Details, AI Analysis, Interview Insights.
- `generate-structured-md.js` — ID-filtered export that only fetches jobs matching existing file IDs in `jobs/`, outputs to `jobs/jobs-structured/` without AI Analysis or Interview Insights sections.
- `rag_pipeline.py` — Main pipeline. Four sequential steps, each runnable independently:
  - load — Glob `jobs/*.md`, read into memory
  - chunk — Two-pass splitting: `MarkdownHeaderTextSplitter` (by `#`, `##`, `###`) then `RecursiveCharacterTextSplitter` (sketched below, after this list)
  - embed — Supports 5 embedding providers (see Models below). Computes and displays cosine similarity matrix.
  - query — Embeds the query, cosine-similarity ranks all chunks, takes top-5, builds augmented prompt, calls `claude-opus-4-6` for generation.
- `benchmark.py` — Runs all 18 queries against a saved pickle, records similarity scores + LLM responses, outputs JSON.
- `benchmark_report.py` — Generates aggregate stats (top-1, floor, spread, tokens) and per-query breakdowns from benchmark JSON files.
- `clean_boilerplate.py` — Strips noise from job markdown files (AI Analysis sections, Interview Insights, EEO text, benefits boilerplate, structural artifacts).
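For concreteness, a minimal sketch of the load and chunk steps described above, using the two splitters and the default CLI parameters (chunk_size=1000, overlap=200). This is an illustration, not the pipeline's exact code:

```python
# Hedged sketch of load + chunk: glob jobs/*.md, split on markdown headers, then by size.
from glob import glob

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = []
for path in glob("jobs/*.md"):  # load step: read every job file into memory
    with open(path, encoding="utf-8") as f:
        text = f.read()
    sections = header_splitter.split_text(text)              # pass 1: split by #, ##, ###
    chunks.extend(char_splitter.split_documents(sections))   # pass 2: size-bounded chunks with overlap
```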
| Directory | Contents |
|---|---|
| `jobs/` | Original raw export from Postgres (~5,559 files). Baseline. Do not edit. |
| `jobs/jobs-no-boilerplate/` | Cleaned files — AI Analysis, Interview Insights, EEO, benefits boilerplate stripped. ~27% size reduction. |
| `jobs/jobs-structured/` | Re-exported from Postgres with only job content fields (no AI Analysis/Interview Insights). Nearly identical to no-boilerplate after cleaning. |
- `GOOGLE_API_KEY` — Required for Gemini embeddings + LLM generation
- `OPENAI_API_KEY` — Required for OpenAI embeddings
- `HF_TOKEN` — Required for HuggingFace Inference API (e5, bge)
- `GROQ_API_KEY` — Required for `groqmachine.py`
- `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` — Required for `generate-md.js`
- Python: `langchain`, `langchain-google-genai`, `langchain-text-splitters`, `openai`, `huggingface_hub`, `groq`
- Node: `pg` (PostgreSQL client)
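A hedged sketch of how the `--embed-model` names might map onto those dependencies and keys. The HF model repo IDs are assumptions, and the Gemini path via `langchain-google-genai` is omitted for brevity:

```python
# Hedged sketch: dispatch --embed-model names to an embedding provider.
import os

from huggingface_hub import InferenceClient
from openai import OpenAI

def embed_texts(texts, embed_model="openai_small"):
    """Return one embedding vector per input text."""
    if embed_model in ("openai_small", "openai_large"):
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        model = "text-embedding-3-small" if embed_model == "openai_small" else "text-embedding-3-large"
        resp = client.embeddings.create(model=model, input=texts)
        return [d.embedding for d in resp.data]
    if embed_model in ("e5", "bge"):
        client = InferenceClient(token=os.environ["HF_TOKEN"])
        # Assumed repo IDs; the project may pin different checkpoints.
        repo = {"e5": "intfloat/multilingual-e5-large-instruct", "bge": "BAAI/bge-base-en-v1.5"}[embed_model]
        return [client.feature_extraction(t, model=repo) for t in texts]
    raise ValueError(f"unknown embed model: {embed_model}")
```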
Evaluate how different embedding models affect retrieval quality and answer usefulness across identical data, chunking, and queries. Each model embeds the same ~5,559 job files using the same chunking parameters (chunk_size=1000, overlap=200). All 18 benchmark queries run against every model, and results are compared across two data iterations: baseline (raw) and no-boilerplate (cleaned).
| Model | Dims | Cost | Provider | Notes |
|---|---|---|---|---|
| e5-large-instruct | 1024 | free (HF Inference) | BAAI/HuggingFace | Highest baseline accuracy (0.879 top-1) |
| gemini-embedding-001 | 3072 | $0.15/1M tok | Google | Structural matching problem on analytical queries |
| bge-base-en-v1.5 | 768 | free (HF Inference) | BAAI/HuggingFace | Reliable, good coverage |
| text-embedding-3-small | 1536 | $0.02/1M tok | OpenAI | Low similarity scores but surprisingly good answers |
| text-embedding-3-large | 3072 | $0.13/1M tok | OpenAI | No clear advantage over small in response quality |
Each query tests a specific retrieval + reasoning capability. 3 queries per category, same queries across every model and data iteration.
| Q | Category | Query | Tests |
|---|---|---|---|
| 1 | synthesis | What does a typical senior ML engineer role look like in terms of day-to-day responsibilities? | Synthesize patterns across multiple job descriptions into a coherent picture |
| 2 | synthesis | What tech stack do companies building LLM-powered products typically require? | Extract and combine technical requirements from LLM-related roles |
| 3 | synthesis | What does the interview process look like for AI engineering roles based on these listings? | Pull and synthesize interview details scattered across descriptions |
| 4 | comparison | How do junior versus senior AI roles differ in what they expect candidates to know? | Compare and contrast requirements across seniority levels |
| 5 | comparison | What is the difference between what startups and large companies look for in machine learning engineers? | Distinguish company-stage signals and compare expectations |
| 6 | comparison | How do roles focused on building AI products from scratch differ from those integrating existing models or APIs? | Semantic depth, build vs integrate distinction across descriptions |
| 7 | inference | Which roles seem to expect someone who can work independently with minimal supervision? | Infer autonomy expectations from indirect language cues |
| 8 | inference | Based on the job descriptions, which roles are more research-oriented versus production engineering? | Classify roles by inferred focus without explicit labels |
| 9 | inference | Which jobs sound like they want a full-stack engineer who also does ML, rather than a pure ML researcher? | Infer hybrid role expectations from combined skill signals |
| 10 | pattern | What soft skills keep appearing across AI and ML engineering job descriptions? | Identify recurring non-technical requirements across retrieved chunks |
| 11 | pattern | What tools and frameworks are most commonly mentioned alongside LLM or RAG work? | Extract co-occurring technical terms in a specific subdomain |
| 12 | pattern | What benefits beyond salary do AI companies highlight to attract engineering candidates? | Identify perks and cultural signals across multiple listings |
| 13 | nuanced-retrieval | Find roles where the focus is on data quality and pipeline reliability rather than model building | Retrieve based on semantic intent, not keyword overlap with ML terms |
| 14 | nuanced-retrieval | Jobs that emphasize mentorship, career growth, or a strong engineering culture | Retrieve on soft cultural signals buried in descriptions |
| 15 | nuanced-retrieval | Roles that involve deploying models to production and managing inference at scale | Distinguish MLOps/deployment focus from training/research focus |
| 16 | analysis | Based on these job listings, what skills would you recommend someone learn to be competitive for AI engineering roles? | LLM must reason about market signals and form a recommendation |
| 17 | analysis | Which job descriptions seem the most well-written and informative versus vague and generic? | LLM judges content quality, requires meta-reasoning about the text itself |
| 18 | analysis | Based on the requirements listed, which roles seem the hardest to fill and why? | Infer hiring difficulty from requirement complexity and specificity |
Retrieval Quality (per query)
- Top-1 similarity score
- Top-5 similarity scores with source file references
- Boilerplate flag per retrieved chunk
Aggregate (per model run)
- Top-1 avg — mean of best similarity across all 18 queries
- Floor — lowest top-1 score (worst-case retrieval)
- Spread — top-1 avg minus floor (consistency measure; tighter = better)
- Total tokens — total context tokens across all queries
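The aggregates above reduce to a few lines of arithmetic over the 18 per-query top-1 scores. A sketch for concreteness (illustrative only; `benchmark_report.py` may compute them differently):

```python
# Sketch: aggregate stats from the 18 per-query top-1 similarity scores.
def aggregate(top1_scores):
    top1_avg = sum(top1_scores) / len(top1_scores)  # mean of the best similarity per query
    floor = min(top1_scores)                        # worst-case retrieval
    spread = top1_avg - floor                       # consistency measure; tighter = better
    return {"top1_avg": top1_avg, "floor": floor, "spread": spread}
```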
Answer Quality (per query)
- LLM response generated by `claude-opus-4-6` using top-200 retrieved chunks
- Response character count
- Qualitative evaluation in `llm_response_evaluation.md`
| Round | Directory | Description |
|---|---|---|
| Baseline | `jobs/` | Raw export including AI Analysis, Interview Insights, EEO, benefits boilerplate |
| No-Boilerplate | `jobs/jobs-no-boilerplate/` | Cleaned — AI sections removed, boilerplate patterns stripped, ~27% size reduction |
| Model | Top-1 | Floor | Spread | Tokens |
|---|---|---|---|---|
| e5-large-instruct | 0.879 | 0.853 | 0.025 | 25,992 |
| gemini-embedding-001 | 0.848 | 0.808 | 0.040 | 24,885 |
| bge-base-en-v1.5 | 0.742 | 0.669 | 0.073 | 33,936 |
| text-embedding-3-small | 0.598 | 0.501 | 0.097 | 38,496 |
| text-embedding-3-large | 0.571 | 0.475 | 0.096 | 37,895 |
Note: Absolute similarity scores are not comparable across models due to different embedding space geometries. What matters is the ranking of retrieved chunks, not the raw number.
After stripping AI Analysis sections, Interview Insights, EEO text, benefits boilerplate, and structural artifacts. ~35% reduction in chunk count (51,545 to 33,409 vectors for HF models, 36,523 for Gemini).
| Model | Top-1 | Floor | Spread | Tokens |
|---|---|---|---|---|
| e5-large-instruct | 0.872 | 0.838 | 0.034 | 36,099 |
| gemini-embedding-001 | 0.850 | 0.805 | 0.045 | 27,770 |
| bge-base-en-v1.5 | 0.724 | 0.655 | 0.069 | 44,365 |
| text-embedding-3-small | 0.582 | 0.478 | 0.104 | 44,475 |
| text-embedding-3-large | 0.561 | 0.462 | 0.099 | 43,583 |
Key finding: Top-1 scores barely changed, but spreads widened for most models (E5: 0.025 to 0.034, +36%). Fewer chunks but denser content per retrieval window, resulting in higher token counts per query.
# 1. Embed (same data, same chunking, different model)
python rag_pipeline.py --step embed --limit 0 --embed-model <name>
# 2. Run benchmark (18 queries)
python benchmark.py --run "<model>_baseline" --embed-model <name>
# 3. Generate report
python benchmark_report.py --json benchmark_results/<model>/<run>.json

Benchmark artifacts:

- `benchmark.py` — Runs 18 queries against saved pickle runs, records metrics + LLM responses
- `benchmark_report.py` — Computes aggregate stats and per-query breakdowns from benchmark JSONs
- `benchmark_results/` — One subdirectory per model, containing benchmark JSON output
- `llm_response_evaluation.md` — Qualitative evaluation of LLM response quality across all 5 models and 18 queries
- `clean_boilerplate.py` — Data cleaning script for progressive iteration testing
- `runs/*.pkl` — One pickle per embedding run
- `raw.log` — Raw commands for each embedding round
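To sweep all five models, the three commands above can be looped. A hedged sketch (the report JSON path is an assumption based on the `benchmark_results/<model>/<run>.json` template and may not match the actual filenames):

```python
# Hedged sketch: repeat embed -> benchmark -> report for every embedding model.
import subprocess

MODELS = ["gemini", "openai_small", "openai_large", "e5", "bge"]

for name in MODELS:
    subprocess.run(["python", "rag_pipeline.py", "--step", "embed",
                    "--limit", "0", "--embed-model", name], check=True)
    subprocess.run(["python", "benchmark.py", "--run", f"{name}_baseline",
                    "--embed-model", name], check=True)
    # Assumed output path; adjust to wherever benchmark.py actually writes its JSON.
    subprocess.run(["python", "benchmark_report.py",
                    "--json", f"benchmark_results/{name}/{name}_baseline.json"], check=True)
```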