feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396
feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396amir-deris wants to merge 15 commits into
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3396 +/- ##
==========================================
+ Coverage 59.26% 59.28% +0.02%
==========================================
Files 2110 2112 +2
Lines 174242 174389 +147
==========================================
+ Hits 103259 103394 +135
- Misses 62053 62062 +9
- Partials 8930 8933 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
bdchatham
left a comment
There was a problem hiding this comment.
Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):
- Counter and histogram names will double-suffix under the Prometheus exporter (
_total_total,_seconds_seconds). app_tx_count_totaldouble-counts every tx via the unlabeled + labeledAddpair.txGasUsed/txGasWantedasInt64Gaugeis last-write-wins under OCC concurrency.
Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.
Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.
| // InitAnteMetrics registers all OTel instruments for the ante package. | ||
| // Safe to call concurrently; instruments are registered exactly once. | ||
| func InitAnteMetrics() { | ||
| appAnteMetrics.mu.Lock() |
| var millisecondBuckets = metric.WithExplicitBucketBoundaries( | ||
| 0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1, | ||
| ) |
There was a problem hiding this comment.
nit: I'd add 0.25, 0.5, 1.0 — commitDuration is on this set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through BaseApp.Commit → cms.Commit(true) — the foreground write blocks during L0→L1 stalls). 100ms ceiling buckets that exactly when you most want tail visibility.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2358a11. Configure here.
| metrics.IncrTxProcessTypeCounter(metrics.SYNCHRONOUS) | ||
| utilmetrics.IncrTxProcessTypeCounter(utilmetrics.SYNCHRONOUS) // TODO(PLT-327): remove once app_tx_process_type_total verified | ||
| appMetrics.txProcessType.Add(ctx.Context(), 1, | ||
| otelmetric.WithAttributes(attribute.String("type", utilmetrics.SYNCHRONOUS))) |
There was a problem hiding this comment.
Giga per-tx type label mismatches block-level label
Medium Severity
The OTel txProcessType counter in ProcessTxsSynchronousGiga uses utilmetrics.SYNCHRONOUS ("synchronous") for the per-tx attribute, but the block-level blockProcessDuration metric in the same function correctly uses utilmetrics.SYNCHRONOUS_GIGA ("synchronous_giga"). Since SYNCHRONOUS_GIGA was specifically introduced in this PR to differentiate giga from regular synchronous processing, the per-tx counter appears to be a copy-paste from ProcessTxsSynchronousV2 that wasn't updated. This makes it impossible to distinguish giga-synchronous from regular-synchronous transactions in the new OTel app_tx_process_type_total metric.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 2358a11. Configure here.
| appMetrics.txProcessType.Add(ctx, 1, otelmetric.WithAttributes(attribute.String("type", "sequential"))) | ||
| appMetrics.txGas.Add(ctx, 100, otelmetric.WithAttributes(attribute.String("type", "gas_used"))) | ||
|
|
||
| appMetrics.optimisticProcessing.Add(ctx, 1, otelmetric.WithAttributes(attribute.Bool("enabled", true))) |
There was a problem hiding this comment.
Test uses Bool attribute, production uses String
Low Severity
The test uses attribute.Bool("enabled", true) for the optimisticProcessing counter, but all production call sites use attribute.String("enabled", "true") and attribute.String("enabled", "false"). In OTel, attribute.Bool and attribute.String produce different attribute types, which would generate distinct time series in Prometheus. The test doesn't reflect actual production usage and would mask type mismatches if upgraded to use a real meter provider.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 2358a11. Configure here.
There was a problem hiding this comment.
yup, we want this one fixed too


Summary
Migrates telemetry in the
appandapp/antepackages from the legacytelemetry/utilmetricshelpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 forevmrpc.app/metrics.gowith a struct-based OTel instrument set registered viaotel.Meter("app"), initialized once inNewAppafterSetupOtelMetricsProvider()app/ante/metrics.gowith a lazily-initializedanteMetricsstruct viasync.OnceValue(fires on first CheckTx, after the global MeterProvider is set)telemetry.MeasureSince/utilmetricscalls inabci.goandinvariance.gowith dual-emitted OTel instruments; legacy calls remain withTODO(PLT-327)markers until dashboards are migratedctx/ctx.Context()through all.Record()and.Add()call sites to support OTel's context-based propagationGaugeSeidVersionAndCommitcall with anapp_build_infoobservable gauge that fires on Prometheus scrape — no per-block overheadNew metrics (OTel naming convention, exported via the process-wide MeterProvider)
ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):
app_abci_begin_block_duration_secondsapp_abci_end_block_duration_secondsapp_abci_module_end_block_duration_secondsapp_abci_check_tx_duration_secondsapp_abci_deliver_tx_duration_secondsapp_abci_deliver_batch_tx_duration_secondsapp_block_process_duration_seconds— block tx processing duration by execution typeTransaction counters and gas:
app_tx_count_totalapp_tx_process_type_total— by execution type labelapp_tx_gas_total— cumulative gas by type labelapp_tx_gas_used/app_tx_gas_wanted— per-tx gaugesApp-level flow counters:
app_optimistic_processing_total— cache hit (enabled=true) vs missapp_failed_total_gas_wanted_check_totalapp_giga_fallback_to_v2_totalLight invariance:
app_lightinvariance_supply_duration_secondsapp_lightinvariance_supply_invalid_key_totalapp_lightinvariance_supply_unmarshal_failure_totalBuild info:
app_build_info— observable gauge, always 1, labels:seid_version,commitAnte:
app_pending_nonce_total— pending nonce events by type (added, expired, rejected, accepted)Migration note
Legacy metrics (
telemetry.MeasureSince,utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotatedTODO(PLT-327)and will be removed once dashboards are verified against the new OTel series.Note
Medium Risk
Touches hot-path ABCI/tx-processing and invariance-check code to add new OpenTelemetry metrics; behavior should be unchanged, but any mistakes could impact performance or introduce panics in consensus-critical flows.
Overview
Adds a new
appMetrics/appAnteMetricsOpenTelemetry instrument set (histograms, counters, and anapp_build_infoobservable gauge) and initializes it once during app construction.Updates ABCI handlers (
BeginBlock/EndBlock/CheckTx/DeliverTx/DeliverTxBatch/Commit), block tx execution paths, EVM pending-nonce tracking, and light invariance checks to dual-emit the new OTel metrics alongside existingtelemetry/utilmetricscalls (annotated withTODO(PLT-327)for later removal), threadingcontext.Contextthrough metric calls.Extends
utils/metricslabels to distinguish new execution types (e.g.SYNCHRONOUS_GIGA,OCC_GIGA) and adds basic noop-provider tests to ensure metric registration/use doesn’t panic.Reviewed by Cursor Bugbot for commit 2358a11. Bugbot is set up for automated code reviews on this repo. Configure here.