evals
Here are 206 public repositories matching this topic...
AI Observability & Evaluation
Updated Apr 18, 2026 - Python
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
Updated Mar 19, 2026 - Python
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
Updated Apr 19, 2026 - Python
AI observability platform for production LLM and agent systems.
Updated Apr 18, 2026 - Python
Evaluation and Tracking for LLM Experiments and AI Agents
Updated Apr 17, 2026 - Python
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
Updated Apr 19, 2026 - TypeScript
Development platform to debug, chat, inspect, and evaluate MCP servers, MCP apps, and ChatGPT apps.
Updated Apr 19, 2026 - TypeScript
Harbor is a framework for running agent evaluations and creating and using RL environments.
Updated Apr 19, 2026 - Python
Open-source, production-ready customer service platform with built-in evals and monitoring.
Updated Jan 12, 2026 - TypeScript
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Updated Apr 16, 2026 - Python
AI system design guide for engineers building production AI systems and evals.
Updated Apr 6, 2026
Test Generation for Prompts
Updated Apr 17, 2026 - TeX
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Updated Jul 12, 2025 - Jupyter Notebook
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Feb 15, 2026 - TypeScript
AgentEval is the comprehensive .NET toolkit for AI agent evaluation—tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison—built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo and DeepEval do for Python, AgentEval does for .NET
Updated Apr 18, 2026 - C#
A benchmark suite for evaluating how coding models solve real React Native tasks.
Updated Apr 17, 2026 - TypeScript