Summary
Built-in budget evaluator for per-agent cumulative cost tracking.
Motivation
Retries and recursive tool chains pile up fast -- a 3-layer retry loop is 64 API calls from one user request. Current evaluators are stateless, so there's no way to express "deny after $X total spend" without maintaining a separate counter service outside the control plane.
Current behavior
Controls evaluate step content (regex, list, JSON, SQL) but can't track cumulative state across evaluations. Cost enforcement requires a custom evaluator with external state management.
Expected behavior
A built-in budget evaluator that:
- Tracks cumulative cost per agent (in-memory, or via PostgreSQL for persistence)
- Config:
max_cost_usd, cost_per_1k_input_tokens, cost_per_1k_output_tokens
- On post-stage evaluation, reads token counts from
step.output or step.context, accumulates
- Returns
matched=True when ceiling is hit
- Pairs with existing actions --
deny for hard stop, steer with steering_context: {fallback_model: "..."} for degradation
Proposed solution
Should work as a regular evaluator. confidence could just be spent / limit (0-1 utilization ratio).
The stateful part is new -- current evaluators are stateless -- but the SQL evaluator already caches query analysis across calls, so there's precedent for evaluator-level state.
Additional context
LMK if a PR for this makes sense.
Summary
Built-in budget evaluator for per-agent cumulative cost tracking.
Motivation
Retries and recursive tool chains pile up fast -- a 3-layer retry loop is 64 API calls from one user request. Current evaluators are stateless, so there's no way to express "deny after $X total spend" without maintaining a separate counter service outside the control plane.
Current behavior
Controls evaluate step content (regex, list, JSON, SQL) but can't track cumulative state across evaluations. Cost enforcement requires a custom evaluator with external state management.
Expected behavior
A built-in
budgetevaluator that:max_cost_usd,cost_per_1k_input_tokens,cost_per_1k_output_tokensstep.outputorstep.context, accumulatesmatched=Truewhen ceiling is hitdenyfor hard stop,steerwithsteering_context: {fallback_model: "..."}for degradationProposed solution
Should work as a regular evaluator.
confidencecould just bespent / limit(0-1 utilization ratio).The stateful part is new -- current evaluators are stateless -- but the SQL evaluator already caches query analysis across calls, so there's precedent for evaluator-level state.
Additional context
LMK if a PR for this makes sense.