
feat(app): migrate app and ante telemetry to OpenTelemetry Meter API #3396

Open

amir-deris wants to merge 15 commits into main from amir/plt-323-migrate-app-metrics-to-otel-meter-api
Open


Conversation

Contributor

@amir-deris amir-deris commented May 5, 2026

Summary

Migrates telemetry in the app and app/ante packages from the legacy telemetry/utilmetrics helpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 for evmrpc.

  • Adds app/metrics.go with a struct-based OTel instrument set registered via otel.Meter("app"), initialized once in NewApp after SetupOtelMetricsProvider()
  • Adds app/ante/metrics.go with a lazily-initialized anteMetrics struct via sync.OnceValue (fires on first CheckTx, after the global MeterProvider is set)
  • Replaces telemetry.MeasureSince / utilmetrics calls in abci.go and invariance.go with dual-emitted OTel instruments; legacy calls remain with TODO(PLT-327) markers until dashboards are migrated
  • Threads ctx/ctx.Context() through all .Record() and .Add() call sites to support OTel's context-based propagation
  • Replaces the per-BeginBlock GaugeSeidVersionAndCommit call with an app_build_info observable gauge that fires on Prometheus scrape — no per-block overhead

New metrics (OTel naming convention, exported via the process-wide MeterProvider)

ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):

  • app_abci_begin_block_duration_seconds
  • app_abci_end_block_duration_seconds
  • app_abci_module_end_block_duration_seconds
  • app_abci_check_tx_duration_seconds
  • app_abci_deliver_tx_duration_seconds
  • app_abci_deliver_batch_tx_duration_seconds
  • app_block_process_duration_seconds — block tx processing duration by execution type

Transaction counters and gas:

  • app_tx_count_total
  • app_tx_process_type_total — by execution type label
  • app_tx_gas_total — cumulative gas by type label
  • app_tx_gas_used / app_tx_gas_wanted — per-tx gauges

App-level flow counters:

  • app_optimistic_processing_total — cache hit (enabled=true) vs miss
  • app_failed_total_gas_wanted_check_total
  • app_giga_fallback_to_v2_total

Light invariance:

  • app_lightinvariance_supply_duration_seconds
  • app_lightinvariance_supply_invalid_key_total
  • app_lightinvariance_supply_unmarshal_failure_total

Build info:

  • app_build_info — observable gauge, always 1, labels: seid_version, commit

Ante:

  • app_pending_nonce_total — pending nonce events by type (added, expired, rejected, accepted)

Migration note

Legacy metrics (telemetry.MeasureSince, utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotated TODO(PLT-327) and will be removed once dashboards are verified against the new OTel series.


Note

Medium Risk
Touches hot-path ABCI/tx-processing and invariance-check code to add new OpenTelemetry metrics; behavior should be unchanged, but any mistakes could impact performance or introduce panics in consensus-critical flows.

Overview
Adds a new appMetrics/appAnteMetrics OpenTelemetry instrument set (histograms, counters, and an app_build_info observable gauge) and initializes it once during app construction.

Updates ABCI handlers (BeginBlock/EndBlock/CheckTx/DeliverTx/DeliverTxBatch/Commit), block tx execution paths, EVM pending-nonce tracking, and light invariance checks to dual-emit the new OTel metrics alongside existing telemetry/utilmetrics calls (annotated with TODO(PLT-327) for later removal), threading context.Context through metric calls.

Extends utils/metrics labels to distinguish new execution types (e.g. SYNCHRONOUS_GIGA, OCC_GIGA) and adds basic noop-provider tests to ensure metric registration/use doesn’t panic.

Reviewed by Cursor Bugbot for commit 2358a11.


github-actions Bot commented May 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | May 12, 2026, 4:17 AM |

@amir-deris amir-deris changed the title Added otel metrics for app and app/ante feat(app): migrate app and ante telemetry to OpenTelemetry Meter API May 6, 2026

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 67.20000% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.28%. Comparing base (09d0d60) to head (2358a11).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| app/app.go | 45.90% | 32 Missing and 1 partial ⚠️ |
| app/invariance.go | 30.76% | 18 Missing ⚠️ |
| app/metrics.go | 89.62% | 9 Missing and 2 partials ⚠️ |
| app/abci.go | 75.00% | 10 Missing ⚠️ |
| app/ante/evm_checktx.go | 0.00% | 8 Missing ⚠️ |
| app/ante/metrics.go | 77.77% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
+ Coverage   59.26%   59.28%   +0.02%     
==========================================
  Files        2110     2112       +2     
  Lines      174242   174389     +147     
==========================================
+ Hits       103259   103394     +135     
- Misses      62053    62062       +9     
- Partials     8930     8933       +3     
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 50.81% <67.20%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| app/ante/metrics.go | 77.77% <77.77%> (ø) |
| app/ante/evm_checktx.go | 34.72% <0.00%> (-0.66%) ⬇️ |
| app/abci.go | 63.19% <75.00%> (+0.69%) ⬆️ |
| app/metrics.go | 89.62% <89.62%> (ø) |
| app/invariance.go | 55.81% <30.76%> (+14.77%) ⬆️ |
| app/app.go | 68.17% <45.90%> (-1.07%) ⬇️ |

Contributor

@bdchatham bdchatham left a comment


Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):

  1. Counter and histogram names will double-suffix under the Prometheus exporter (_total_total, _seconds_seconds).
  2. app_tx_count_total double-counts every tx via the unlabeled + labeled Add pair.
  3. txGasUsed / txGasWanted as Int64Gauge is last-write-wins under OCC concurrency.

Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.

Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.

Comment thread app/metrics.go Outdated
Comment thread app/metrics.go Outdated
Comment thread app/metrics.go Outdated
Comment thread app/abci.go Outdated
Comment thread app/ante/metrics.go Outdated
Comment thread app/app.go Outdated
Comment thread app/ante/metrics.go Outdated
// InitAnteMetrics registers all OTel instruments for the ante package.
// Safe to call concurrently; instruments are registered exactly once.
func InitAnteMetrics() {
appAnteMetrics.mu.Lock()
Collaborator


Use sync.Once instead?

@amir-deris amir-deris requested a review from masih May 7, 2026 23:12
Comment thread app/metrics.go
Comment on lines +24 to 26
var millisecondBuckets = metric.WithExplicitBucketBoundaries(
0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1,
)
Contributor

@bdchatham bdchatham May 7, 2026


nit: I'd add 0.25, 0.5, and 1.0: commitDuration is on this bucket set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through BaseApp.Commit into cms.Commit(true): the foreground write blocks during L0→L1 stalls). A 100ms ceiling lumps everything above it into the overflow bucket exactly when you most want tail visibility.

Comment thread app/metrics.go

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



Comment thread app/app.go
metrics.IncrTxProcessTypeCounter(metrics.SYNCHRONOUS)
utilmetrics.IncrTxProcessTypeCounter(utilmetrics.SYNCHRONOUS) // TODO(PLT-327): remove once app_tx_process_type_total verified
appMetrics.txProcessType.Add(ctx.Context(), 1,
otelmetric.WithAttributes(attribute.String("type", utilmetrics.SYNCHRONOUS)))

Giga per-tx type label mismatches block-level label

Medium Severity

The OTel txProcessType counter in ProcessTxsSynchronousGiga uses utilmetrics.SYNCHRONOUS ("synchronous") for the per-tx attribute, but the block-level blockProcessDuration metric in the same function correctly uses utilmetrics.SYNCHRONOUS_GIGA ("synchronous_giga"). Since SYNCHRONOUS_GIGA was specifically introduced in this PR to differentiate giga from regular synchronous processing, the per-tx counter appears to be a copy-paste from ProcessTxsSynchronousV2 that wasn't updated. This makes it impossible to distinguish giga-synchronous from regular-synchronous transactions in the new OTel app_tx_process_type_total metric.



Collaborator


@amir-deris is this one worth addressing?

Comment thread app/metrics_test.go
appMetrics.txProcessType.Add(ctx, 1, otelmetric.WithAttributes(attribute.String("type", "sequential")))
appMetrics.txGas.Add(ctx, 100, otelmetric.WithAttributes(attribute.String("type", "gas_used")))

appMetrics.optimisticProcessing.Add(ctx, 1, otelmetric.WithAttributes(attribute.Bool("enabled", true)))


Test uses Bool attribute, production uses String

Low Severity

The test uses attribute.Bool("enabled", true) for the optimisticProcessing counter, but all production call sites use attribute.String("enabled", "true") and attribute.String("enabled", "false"). In OTel, attribute.Bool and attribute.String produce different attribute types, which would generate distinct time series in Prometheus. The test doesn't reflect actual production usage and would mask type mismatches if upgraded to use a real meter provider.



Collaborator


yup, we want this one fixed too

