evals
Here are 206 public repositories matching this topic...
AI Observability & Evaluation
Updated Apr 18, 2026 - Python
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
Updated Mar 19, 2026 - Python
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
Updated Apr 19, 2026 - Python
AI observability platform for production LLM and agent systems.
Updated Apr 18, 2026 - Python
Evaluation and Tracking for LLM Experiments and AI Agents
Updated Apr 17, 2026 - Python
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
Updated Apr 19, 2026 - TypeScript
Development platform to debug, chat, inspect, and evaluate MCP servers, MCP apps, and ChatGPT apps.
Updated Apr 19, 2026 - TypeScript
Harbor is a framework for running agent evaluations and creating and using RL environments.
Updated Apr 19, 2026 - Python
Open-source, production-ready customer service platform with built-in evals and monitoring.
Updated Jan 12, 2026 - TypeScript
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Updated Apr 16, 2026 - Python
AI system design guide for engineers building production AI systems and evals.
Updated Apr 6, 2026
Test Generation for Prompts
Updated Apr 17, 2026 - TeX
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Updated Jul 12, 2025 - Jupyter Notebook
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Feb 15, 2026 - TypeScript
AgentEval is the comprehensive .NET toolkit for AI agent evaluation—tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison—built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo and DeepEval do for Python, AgentEval does for .NET
Updated Apr 18, 2026 - C#
A benchmark suite for evaluating how coding models solve real React Native tasks.
Updated Apr 17, 2026 - TypeScript