feat: ci-metrics data quality + dashboard improvements by ludamad · Pull Request #20489 · AztecProtocol/aztec-packages

ludamad · 2026-02-13T14:07:12Z

Summary

Test success tracking: New test_daily_stats table tracks daily passed/failed/flaked counts per test command without persisting individual passed events. Backfills from existing test_events on startup.
Instance type fix: log_ci_run now prefers EC2_INSTANCE_TYPE env var over metadata endpoint (which fails inside Docker containers). Fixes 99% of CI runs showing "unknown" instance type.
CloudTrail backfill: Resolves historical unknown instance types by querying CloudTrail RunInstances events and matching by timestamp proximity. Recalculates costs with correct instance rates.
CI Insights chart: "Test outcomes per day" now shows stacked bars for successes (green), flakes (orange), and failures (red).
Time period display: All API responses include period metadata; dashboard headers show the active date range.
Proxy timeout: Increased from 60s to 180s for slow BigQuery billing fetches.
Single worker: ci-metrics reduced to 1 gunicorn worker to avoid redundant cache warmups and SQLite lock contention.
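A minimal sketch of how a daily aggregate table such as test_daily_stats could count outcomes without storing individual passed events. The column names (day, test_cmd, passed, failed, flaked) and the helpers record_outcome and daily_totals are assumptions for illustration, not the PR's actual schema or code.

```python
import sqlite3

# Hypothetical schema: one row per (day, test command) holding counters only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_daily_stats (
    day      TEXT NOT NULL,
    test_cmd TEXT NOT NULL,
    passed   INTEGER NOT NULL DEFAULT 0,
    failed   INTEGER NOT NULL DEFAULT 0,
    flaked   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (day, test_cmd)
)
"""

_STATUSES = {"passed", "failed", "flaked"}


def record_outcome(conn, day, test_cmd, status):
    """Increment one counter for (day, test_cmd); no per-event row is kept."""
    if status not in _STATUSES:
        raise ValueError(f"unknown status: {status}")
    # The column name comes from the fixed allow-list above, so interpolating
    # it is safe; the values themselves are bound as parameters.
    conn.execute(
        f"""
        INSERT INTO test_daily_stats (day, test_cmd, {status})
        VALUES (?, ?, 1)
        ON CONFLICT (day, test_cmd) DO UPDATE SET {status} = {status} + 1
        """,
        (day, test_cmd),
    )


def daily_totals(conn, day):
    """Return (passed, failed, flaked) summed over all test commands for a day."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(passed), 0),
               COALESCE(SUM(failed), 0),
               COALESCE(SUM(flaked), 0)
        FROM test_daily_stats WHERE day = ?
        """,
        (day,),
    ).fetchone()
    return tuple(row)
```

Keeping only counters means the table grows with the number of distinct test commands per day rather than with the number of test executions, which is what makes tracking successes affordable.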

Test plan

Deploy to bastion via ci3/dashboard/deploy.sh
Verify test_daily_stats table created and backfilled
Verify CI Insights chart shows successes bar
Verify period labels appear on all dashboard pages
Verify new CI runs report correct instance types
Check CloudTrail resolution logs: sudo journalctl -u rkapp | grep cloudtrail
Verify namespace billing still loads within timeout

- Fix subprocess race condition with fcntl file lock
- Warm billing caches on startup with --preload
- Add test timings link to all dashboard nav bars
- Reduce gunicorn workers from 100 to 50
- Add METRICS_DB_PATH env var for SQLite location
- Fix Content-Encoding stripping for proxied responses
- Kill stale ci-metrics process before restart
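As an illustration of the file-lock approach, here is a minimal sketch of serializing a critical section across processes with an advisory lock. It assumes fcntl.flock on a dedicated lock file; the helper name exclusive_lock and the lock path are hypothetical, and the PR's actual locking code may differ (for example, it might use fcntl.lockf).

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor releases the flock held through it.
        os.close(fd)
```

A worker would wrap the racy step, for example `with exclusive_lock("/tmp/ci-metrics.lock"): start_subprocess()`, where both the path and start_subprocess are placeholders.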

- Track test successes via daily aggregate table (test_daily_stats) without persisting individual passed events; backfill from existing test_events
- Fix instance type detection in log_ci_run to prefer EC2_INSTANCE_TYPE env var over metadata endpoint (which fails in Docker)
- Add CloudTrail backfill to resolve unknown instance types for historical CI runs and recalculate costs
- Add test success counts to CI Insights chart (stacked bar: successes, flakes, failures)
- Add time period metadata to all API responses and display in dashboard headers (ci-insights, cost-overview, test-timings)
- Use test_daily_stats for CI performance endpoint counts (proper aggregation across weekly/monthly granularity)
- Increase proxy timeout to 180s for slow BigQuery fetches
- Reduce ci-metrics to 1 worker to avoid redundant cache warmups
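The instance type lookup order can be sketched as follows. This is an assumption-laden illustration, not the real log_ci_run (which may be written differently, possibly in shell): the function name detect_instance_type and its injectable fetch_metadata parameter are hypothetical, and the metadata URL is the standard EC2 IMDSv1 path, which may need a session token on hosts that enforce IMDSv2.

```python
import os
import urllib.request

# Standard EC2 instance metadata path (IMDSv1 style, no session token).
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-type"


def _fetch_from_metadata(timeout=1.0):
    """Query the metadata endpoint; typically unreachable inside Docker."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode()


def detect_instance_type(environ=None, fetch_metadata=None):
    """Prefer the EC2_INSTANCE_TYPE env var, then the metadata endpoint."""
    environ = os.environ if environ is None else environ
    from_env = environ.get("EC2_INSTANCE_TYPE", "").strip()
    if from_env:
        return from_env
    fetch = fetch_metadata or _fetch_from_metadata
    try:
        return fetch().strip() or "unknown"
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return "unknown"
```

Passing the variable into the container (for example with `docker run -e EC2_INSTANCE_TYPE=...`) is what lets the first branch succeed where the metadata request would fail.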

…ange
- CloudTrail resolver now joins RunInstances + CreateTags events by instance ID, then matches to ci_runs via Dashboard and Name tags instead of bare timestamp proximity
- Restore merge_train_failure_slack_notify to match base branch

The previous CloudTrail resolver had four issues causing near-zero match rates:
1. Single-pass event fetching hit the 5000-event pagination limit, missing most RunInstances events beyond ~16 days. Now fetches in daily chunks.
2. CreateTags filter discarded Name-only events (line 126 of aws_request_instance_type), losing the Name tag for ~90% of instances. Now accumulates all tags first, then filters by Group=build-instance.
3. Name tag parsing couldn't handle INSTANCE_POSTFIX suffixes (e.g. pr-123_arm64_a1-fast). Now uses regex to extract branch name regardless of postfix format.
4. Matching window was 10 minutes (only matched first CI step). Now allows 90 minutes to match all steps on an instance.

Tested against real data: resolves 4187/4638 (90%) unknown instance types across 90 days of CloudTrail history.
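The "accumulate all tags first, then filter" fix and the 90-minute window can be sketched like this. The event dictionaries below are a simplified, hypothetical shape (not the real CloudTrail record format), the function names are illustrative, and matching on an exact Name value stands in for the regex-based branch extraction and Dashboard-tag matching that the resolver actually performs.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MATCH_WINDOW = timedelta(minutes=90)  # was 10 minutes before the fix


def collect_build_instances(events):
    """Join RunInstances and CreateTags events by instance ID.

    Tags from every CreateTags event are merged before any filtering, so a
    Name-only event is no longer dropped for lacking the Group tag.
    """
    info = defaultdict(lambda: {"tags": {}})
    for ev in events:
        entry = info[ev["instance_id"]]
        if ev["kind"] == "RunInstances":
            entry["instance_type"] = ev["instance_type"]
            entry["launched_at"] = ev["launched_at"]
        elif ev["kind"] == "CreateTags":
            entry["tags"].update(ev["tags"])
    # Filter only after all tags have been accumulated.
    return {
        iid: e
        for iid, e in info.items()
        if e["tags"].get("Group") == "build-instance" and "instance_type" in e
    }


def resolve_instance_type(run_name, run_started_at, instances):
    """Return the type of the latest matching instance launched within the window."""
    candidates = [
        e
        for e in instances.values()
        if e["tags"].get("Name") == run_name
        and timedelta(0) <= run_started_at - e["launched_at"] <= MATCH_WINDOW
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e["launched_at"])["instance_type"]
```

With a 10-minute window only the first CI step after launch would match; widening it lets later steps on the same instance resolve too, at the cost of relying on the tags to disambiguate.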

The API was reading CI runs from a Redis+SQLite hybrid, but the hourly Redis sync used INSERT OR REPLACE, which overwrote CloudTrail-enriched instance_type and cost_usd back to empty values. Now:
- get_ci_runs() reads exclusively from SQLite
- sync_ci_runs_to_sqlite() uses ON CONFLICT DO UPDATE that preserves enriched fields (only overwrites if Redis has non-empty values)
- app.py calls updated to drop unused Redis connection argument
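A minimal sketch of an upsert that keeps enriched values when the incoming row is empty. Only instance_type and cost_usd are named in the description; the ci_runs columns run_id and status, and the helper sync_rows, are assumptions made for this example.

```python
import sqlite3

# Hypothetical table layout; only instance_type and cost_usd come from the PR text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ci_runs (
    run_id        TEXT PRIMARY KEY,
    status        TEXT,
    instance_type TEXT,
    cost_usd      REAL
)
"""

# Unlike INSERT OR REPLACE (which deletes and rewrites the whole row), this
# updates column by column and falls back to the stored value whenever the
# incoming value is NULL or, for instance_type, an empty string.
UPSERT = """
INSERT INTO ci_runs (run_id, status, instance_type, cost_usd)
VALUES (:run_id, :status, :instance_type, :cost_usd)
ON CONFLICT (run_id) DO UPDATE SET
    status        = excluded.status,
    instance_type = COALESCE(NULLIF(excluded.instance_type, ''), ci_runs.instance_type),
    cost_usd      = COALESCE(excluded.cost_usd, ci_runs.cost_usd)
"""


def sync_rows(conn, rows):
    """Upsert an iterable of row dicts without clobbering enriched fields."""
    conn.executemany(UPSERT, rows)
```

The same pattern extends to any other enriched column: wrap the incoming value in COALESCE (and NULLIF for empty strings) so that a blank from the sync source never replaces a resolved value.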

- Add hardcoded rates for m6a.xlarge/4xlarge/8xlarge/24xlarge that were missing, causing 192-vCPU fallback ($100+ instead of ~$8 for 8xlarge)
- Make pricing discovery dynamic: query DB for distinct instance types so newly resolved types get live pricing automatically
- Add recalculate_all_costs() to fix historical cost data

Instead of guessing 192 vCPUs (which massively overestimates), return None so the cost shows as unknown rather than a fabricated number.
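The fallback order described here can be sketched as below. The function name estimate_cost_usd, its parameters, and the rate values in the usage note are placeholders for illustration, not real AWS prices or the PR's actual code.

```python
from typing import Mapping, Optional


def estimate_cost_usd(
    instance_type: Optional[str],
    duration_secs: float,
    hourly_rates: Mapping[str, float],
    vcpus: Optional[int] = None,
    per_vcpu_hourly_rate: Optional[float] = None,
) -> Optional[float]:
    """Estimate a run's cost, or return None when it cannot be known."""
    hours = duration_secs / 3600
    rate = hourly_rates.get(instance_type) if instance_type else None
    if rate is not None:
        # Best case: a known hourly rate for this instance type.
        return round(rate * hours, 4)
    if vcpus and per_vcpu_hourly_rate is not None:
        # Second best: a real vCPU count with a per-vCPU rate.
        return round(vcpus * per_vcpu_hourly_rate * hours, 4)
    # Neither is known: report "unknown" instead of assuming 192 vCPUs.
    return None
```

For example, `estimate_cost_usd("unknown", 1800, {"m6a.8xlarge": 2.0})` returns None, so the dashboard can render the cost as unknown rather than an inflated guess.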

ludamad requested a review from charlielye as a code owner February 13, 2026 14:07

ludamad changed the base branch from next to merge-train/spartan February 13, 2026 14:07

ludamad added 2 commits February 13, 2026 15:21

ludamad force-pushed the ad/fix/ci-metrics-deploy branch from 363fa18 to 77aff10 Compare February 13, 2026 15:30

ludamad changed the title ~~fix: ci-metrics deployment issues~~ feat: ci-metrics data quality + dashboard improvements Feb 13, 2026

ludamad added 5 commits February 13, 2026 15:49

fix: return unknown cost when instance type and vCPUs are both unknown

0245302

ludamad force-pushed the ad/fix/ci-metrics-deploy branch from 9a6cabb to 0245302 Compare February 13, 2026 17:51

ludamad wants to merge 7 commits into merge-train/spartan from ad/fix/ci-metrics-deploy