Five minds enter. They argue, challenge, cross-examine. Only the truth walks out.
verd spawns multiple AI models from different families — each with a specialized role — has them debate your question across rounds, then a stronger judge delivers the final verdict with strengths, issues, and actionable fixes.
Use it everywhere: CLI for code reviews, MCP inside Claude Code and Cursor, and Slack as @verd in any conversation.
Requires Python 3.11+.
```bash
pip install verd
verd setup
```

The setup wizard walks you through provider selection (OpenRouter, LiteLLM, or other) and outputs the exact config you need — for both CLI (.env) and MCP (JSON to paste into your editor config).
verd runs multiple models in parallel (Claude, Gemini, GPT, DeepSeek) so it needs a multi-provider router. OpenRouter is the easiest — one key, all models. LiteLLM proxy works too.
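Under the hood, "one key, all models" works because the router exposes a single OpenAI-compatible endpoint and dispatches on model id. A minimal sketch of the idea (illustrative model ids; verd handles this internally, you never write this code):

```python
# Sketch: one OpenRouter key reaches every model family via a single
# OpenAI-compatible endpoint. Model ids here are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

for model in ["anthropic/claude-sonnet-4-6", "google/gemini-3.1-pro-preview", "openai/gpt-4.1"]:
    reply = client.chat.completions.create(
        model=model,  # the router dispatches to the right provider
        messages=[{"role": "user", "content": "Can this auth middleware be bypassed?"}],
    )
    print(model, "->", reply.choices[0].message.content[:80])
```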
CLI
verd "can this auth middleware be bypassed?" -f auth.py middleware.py
verdh "should we merge this?" -gb main # deep mode — 5 models, 3 roundsMCP (Claude Code / Cursor) — use verd, verdl, verdh as tools directly in chat:
- `verdh based on the context above and this file, do you think we can proceed?`
- `is this approach correct given what we discussed, use verdl`
Slack — mention @verd in any channel or thread:
- `@verd what do you think?` — reads thread context, debates, replies
- `@verd deep is this secure?` — uses verdh (5 models, 3 rounds)
- `/verd Kafka, SQS, or RabbitMQ for our event pipeline?` — slash command with live progress
verd is for critical decisions and deep analysis — not simple lookups. If a single model can answer it, verd is overkill.
```text
FAIL 77% In-memory rate limiter is unsafe for production
claude:FAIL gpt:FAIL gemini:FAIL gpt:FAIL (FULL)
+ Conceptually correct sliding-window logic
- Global dict is unsynchronized — race conditions in multi-thread servers
- Per-user lists grow without bounds — memory leak / DoS vector
! gpt-5-mini caught the risk of system clock jumps with time.time()
→ Move state to Redis with atomic operations
completed in 69.3s • 22,449 tokens • ~$0.07
```
Vote breakdown, unique catches (!), dissent, strengths, issues, and actionable fixes — all in one view.
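With `--json`, the same fields arrive as structured data. A hypothetical shape, inferred from the verdict view above; the actual schema may differ:

```python
# Hypothetical --json output shape, inferred from the verdict view above.
# Field names and nesting are assumptions, not verd's documented schema.
verdict = {
    "verdict": "FAIL",
    "confidence": 0.77,
    "votes": {"claude": "FAIL", "gpt": "FAIL", "gemini": "FAIL"},
    "strengths": ["Conceptually correct sliding-window logic"],
    "issues": [
        "Global dict is unsynchronized — race conditions in multi-thread servers",
        "Per-user lists grow without bounds — memory leak / DoS vector",
    ],
    "unique_catches": [{"model": "gpt-5-mini", "issue": "system clock jumps with time.time()"}],
    "dissent": [],
    "fixes": ["Move state to Redis with atomic operations"],
}
```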
| Command | Debaters | Roles | Rounds | Speed | Cost |
|---|---|---|---|---|---|
| verdl | 2 + judge | analyst, devils_advocate | 1 | ~15s+ | ~$0.01 |
| verd | 4 + judge | analyst, devils_advocate, logic_checker, pragmatist | 2 | ~30s+ | ~$0.05+ |
| verdh | 5 + judge | analyst, devils_advocate, logic_checker, fact_checker, pragmatist | 3 | ~60s+ | ~$0.25+ |
Tested on the Martian Code Review Benchmark — 50 real PRs from Cal.com, Discourse, Grafana, Keycloak, and Sentry with expert-labeled golden comments. No code-review-specific tuning.
| Mode | Precision | Recall | F1 Score | Avg Issues Flagged |
|---|---|---|---|---|
| GPT-5.4 (alone) | 13.0% | 70.6% | 21.9% | 14.6 |
| Claude Opus 4.6 (alone) | 18.5% | 69.9% | 29.2% | 10.1 |
| verdh (5-model debate) | 29.1% | 64.0% | 40.0% | 5.9 |
+37% F1 over Claude solo. 57% more precise. 42% fewer false positives.
- Your question + content gets sent to multiple AI models in parallel
- Each model has a specialized role (analyst, devils_advocate, logic_checker, fact_checker, pragmatist)
- Models see each other's responses and cross-examine for 1-3 rounds (see the sketch after this list)
- Anti-groupthink prompts ensure models hold their ground when they have evidence — consensus without new evidence is rejected
- A stronger judge model synthesizes the debate, weighting each reviewer by their role
- Confidence is calculated from vote distribution — a fact_checker's dissent lowers confidence more than a devils_advocate's expected pushback
- You get: verdict, vote breakdown, strengths, issues, unique catches, dissent, and actionable fixes
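In rough pseudocode, the pipeline looks like the sketch below. It is an illustration of the protocol as described, not verd's source: `ask()` and `judge()` are hypothetical stubs standing in for the real role-prompted model calls.

```python
# Rough sketch of the debate protocol described above (illustrative only).
import asyncio

ROLES = ["analyst", "devils_advocate", "logic_checker", "fact_checker", "pragmatist"]

async def ask(role: str, question: str, content: str, seen: list[str] | None = None) -> str:
    """Hypothetical helper: one role-prompted model call (stubbed)."""
    return f"{role}: position on {question!r}" + (" (after cross-examination)" if seen else "")

async def judge(question: str, transcript: list[str]) -> dict:
    """Hypothetical helper: a stronger model synthesizes the verdict (stubbed)."""
    return {"verdict": "FAIL", "votes": transcript}

async def debate(question: str, content: str, rounds: int = 3) -> dict:
    # Round 1: every debater answers independently, in parallel.
    transcript = await asyncio.gather(*[ask(r, question, content) for r in ROLES])
    # Rounds 2..n: each debater sees the others' answers and cross-examines.
    # Anti-groupthink prompting goes here: hold evidence-backed positions,
    # reject consensus that adds no new evidence.
    for _ in range(rounds - 1):
        transcript = await asyncio.gather(
            *[ask(r, question, content, seen=transcript) for r in ROLES]
        )
    # Finally, a stronger judge synthesizes the transcript into a verdict.
    return await judge(question, transcript)

if __name__ == "__main__":
    print(asyncio.run(debate("can this auth middleware be bypassed?", "<file>", rounds=2)))
```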
The key insight: different model families have different blind spots and training biases. Claude spots nuance GPT misses. Gemini catches logic errors DeepSeek overlooks. More importantly, if the same model writes the review and judges its quality, it's likely to agree with itself. Cross-model diversity means the judge is a genuine quality gate, not a model grading its own homework. And the debate surfaces each unique catch along with the name of the model that found it.
| Role | Job | Example catch |
|---|---|---|
| analyst | Balanced initial assessment, main arguments for and against | "The architecture is sound but the auth flow has a gap" |
| devils_advocate | Find what others miss — edge cases, hidden assumptions, failure modes | "What happens when the token expires mid-transaction?" |
| logic_checker | Verify reasoning quality — fallacies, off-by-one, race conditions | "The pagination math is wrong: total_pages needs ceil division" |
| fact_checker | Web-grounded verification — do these APIs/libraries actually work? | "That library was deprecated in v3, use the new API" |
| pragmatist | Real-world practicality — will this ship? What's the ops burden? | "This works but adds 3 new infra dependencies your team has never operated" |
The judge weighs each reviewer's input by role — a fact_checker citing sources carries more weight than a devils_advocate pushing back.
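For intuition, role weighting and vote distribution might combine like this. The weights and formula are illustrative assumptions, not verd's published configuration:

```python
# Illustrative role weights. Assumed values for the sketch, not verd's
# actual configuration.
ROLE_WEIGHT = {
    "fact_checker": 1.5,     # sourced dissent counts heavily
    "logic_checker": 1.2,
    "analyst": 1.0,
    "pragmatist": 1.0,
    "devils_advocate": 0.6,  # pushback is expected, so it moves confidence less
}

def confidence(votes: dict[str, str], verdict: str) -> float:
    """Weighted share of role votes that agree with the judge's verdict."""
    total = sum(ROLE_WEIGHT[role] for role in votes)
    agree = sum(ROLE_WEIGHT[role] for role, v in votes.items() if v == verdict)
    return agree / total

votes = {
    "analyst": "FAIL",
    "devils_advocate": "PASS",  # lone dissent from the expected contrarian
    "logic_checker": "FAIL",
    "fact_checker": "FAIL",
    "pragmatist": "FAIL",
}
print(f"{confidence(votes, 'FAIL'):.0%}")  # ~89% with these assumed weights
```

With weights like these, a lone devils_advocate dissent barely dents confidence, while a fact_checker flipping to PASS would drag it down much further.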
Override models via env vars or CLI flags. Per-tier env vars let you set different models for each mode:
```bash
VERDL_JUDGE=o4-mini VERDL_DEBATERS=gpt-4.1-mini,gemini-3.1-flash-lite-preview
VERD_JUDGE=o3 VERD_DEBATERS=claude-sonnet-4-6,gpt-4.1,gemini-3.1-pro-preview,gpt-4.1-mini
VERDH_JUDGE=o3 VERDH_DEBATERS=claude-opus-4-6,deepseek-r1,gemini-3.1-pro-preview,sonar-pro,gpt-4.1
```

Or use VERD_JUDGE / VERD_DEBATERS as a global override for all tiers. verd setup generates the right config for your provider.
```text
-c TEXT                 inline content string
-f FILE [FILE ...]      one or more files to evaluate
-d [DIR]                read all files in a directory (default: current dir)
-g                      use unstaged git diff as content
-gs                     use staged git diff as content
-gb REF                 use git diff REF...HEAD as content (e.g. main)
-a / --all              scan all files, skip smart selection (use with -d)
--ext EXT [EXT ...]     filter by extension (use with -d)
--exclude PATTERN       glob patterns to exclude (use with -d)
-q / --quiet            hide debate transcript, show only verdict
--json                  output raw JSON
--judge MODEL           override judge model
--debaters MODEL ...    override debater models
--budget USD            max cost in USD — abort if estimate exceeds budget
--timeout SECONDS       override timeout per model call
--version               show version and exit
```
```bash
verd setup   # select "MCP" and your provider
```

This prints the exact JSON to paste into ~/.claude/settings.json (Claude Code) or ~/.cursor/mcp.json (Cursor), with the correct absolute path to verd-mcp and model overrides for your provider. Then use verd, verdl, or verdh as tools directly in chat.
Install with Slack dependencies:
pip install "verd[slack]"Create a Slack app with Socket Mode enabled, add bot scopes (app_mentions:read, channels:history, groups:history, chat:write, reactions:write, im:history, im:write, users:read), then:
```bash
export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
export SLACK_SIGNING_SECRET=...
verd-slack
```

Optional: restrict access via environment variables:
```bash
export VERD_ALLOWED_CHANNELS=C123,C456   # empty = all channels
export VERD_ALLOWED_USERS=U123,U456      # empty = all users
```
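The "empty = all" semantics amount to a gate like this sketch (assumed behavior, not verd's source):

```python
# Sketch of the assumed allowlist behavior: an empty variable disables the check.
import os

def _allowed(env_var: str, value: str) -> bool:
    entries = {e for e in os.environ.get(env_var, "").split(",") if e}
    return not entries or value in entries

def is_permitted(channel_id: str, user_id: str) -> bool:
    return (_allowed("VERD_ALLOWED_CHANNELS", channel_id)
            and _allowed("VERD_ALLOWED_USERS", user_id))
```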