Evaluation Infrastructure for AI Agents
A hands-on, interactive AI-evals course for product folks who want to develop product sense from real-life applications.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and stronger model defenses.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.
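As a rough illustration of that experimental setup, here is a minimal TypeScript sketch comparing two prompt strategies via simulated users and a difference-in-means effect estimate; the metric and function names are placeholders, not the framework's actual API.

```ts
// Hypothetical sketch: each simulated user returns a behavioral metric
// (e.g. task-completion rate) for the prompt strategy it was assigned to.
type SimulatedUser = () => Promise<number>;

async function runArm(users: SimulatedUser[]): Promise<number[]> {
  return Promise.all(users.map((u) => u()));
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Average treatment effect of prompt B over prompt A on the chosen metric,
// assuming simulated users are randomly assigned to the two arms.
export async function estimateEffect(armA: SimulatedUser[], armB: SimulatedUser[]) {
  const [a, b] = await Promise.all([runArm(armA), runArm(armB)]);
  return { effect: mean(b) - mean(a), meanA: mean(a), meanB: mean(b) };
}
```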
Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.
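A minimal sketch of how those three pieces can fit together, assuming a simple weighted blend of a deterministic check and a judge score plus a baseline comparison; the names, weights, and tolerance are illustrative, not taken from the repo.

```ts
type EvalCase = { input: string; expected: string };
type Judge = (input: string, output: string) => Promise<number>; // returns 0..1

async function scoreCase(c: EvalCase, output: string, judge: Judge): Promise<number> {
  // Deterministic component: exact match (could be a regex or schema check instead).
  const deterministic = output.trim() === c.expected.trim() ? 1 : 0;
  // Subjective component: an LLM judge grades the output.
  const judged = await judge(c.input, output);
  return 0.5 * deterministic + 0.5 * judged;
}

// Regression gate: fail the run if the new score drops below the stored baseline.
function regressionCheck(score: number, baseline: number, tolerance = 0.02): void {
  if (score < baseline - tolerance) {
    throw new Error(`Regression: ${score.toFixed(3)} < baseline ${baseline.toFixed(3)}`);
  }
}
```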
Hands-on Agentic AI learning project — ReAct agents, memory systems, evals, and multi-agent architecture. Built as a structured AI PM curriculum.
Multi-agent system orchestrating an AI-driven software team using the Claude Agents SDK. Agents take on defined roles and collaborate autonomously on software tasks.
An AI-powered data extraction tool designed for the Job Intelligence Engine project.
Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal evaluation.
EvalLoop is a self-improving agent that iterates on its own outputs using evals plus automatic policy patches. It runs a task, scores the result against a rubric, updates its rules, and re-runs until it hits a target score, with a UI showing attempts, score trends, violations, and policy diffs.
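The run, score, patch, re-run cycle described above might look roughly like this; the hook names and policy representation here are hypothetical, not EvalLoop's actual API.

```ts
interface Attempt { output: string; score: number; violations: string[] }

interface Hooks {
  runTask(task: string, policy: string[]): Promise<string>;
  scoreAgainstRubric(output: string): Promise<{ score: number; violations: string[] }>;
  proposePolicyPatch(policy: string[], violations: string[]): Promise<string[]>;
}

async function evalLoop(
  task: string,
  policy: string[],          // the current rules the agent must follow
  targetScore: number,
  hooks: Hooks,
  maxAttempts = 5,
): Promise<Attempt[]> {
  const attempts: Attempt[] = [];
  for (let i = 0; i < maxAttempts; i++) {
    const output = await hooks.runTask(task, policy);                     // run the task
    const { score, violations } = await hooks.scoreAgainstRubric(output); // score vs. rubric
    attempts.push({ output, score, violations });
    if (score >= targetScore) break;                                      // stop once the target is hit
    policy = await hooks.proposePolicyPatch(policy, violations);          // patch the rules and retry
  }
  return attempts; // attempt history drives the UI: score trend, violations, policy diffs
}
```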
Portfolio project showing product strategy, evals, roadmap, and GTM for a GenAI coding assistant
CLI release gate for structured AI changes.
Build deterministic local contract tests for Foundry workflows with TypeScript, schema validation, telemetry, and proof artifact export
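To illustrate that pattern (schema validation plus a deterministic proof artifact), here is a sketch using zod for validation and a SHA-256 hash as the artifact; the schema fields and artifact format are assumptions, not the repo's actual conventions.

```ts
import { z } from "zod";
import { createHash } from "node:crypto";

// Illustrative schema for a structured deployment report (fields are assumed).
const DeploymentReport = z.object({
  contract: z.string(),
  address: z.string().regex(/^0x[0-9a-fA-F]{40}$/),
  chainId: z.number().int(),
  gasUsed: z.number().nonnegative(),
});

export function checkReport(raw: unknown): { ok: boolean; proofHash?: string; errors?: string[] } {
  const parsed = DeploymentReport.safeParse(raw);
  if (!parsed.success) {
    return { ok: false, errors: parsed.error.issues.map((i) => i.message) };
  }
  // Deterministic "proof artifact": a stable hash over the validated payload,
  // suitable for committing alongside telemetry or logs.
  const proofHash = createHash("sha256")
    .update(JSON.stringify(parsed.data))
    .digest("hex");
  return { ok: true, proofHash };
}
```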