feat(app): migrate app and ante telemetry to OpenTelemetry Meter API by amir-deris · Pull Request #3396 · sei-protocol/sei-chain

amir-deris · 2026-05-05T22:13:16Z

Summary

Migrates telemetry in the app and app/ante packages from the legacy telemetry/utilmetrics helpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 for evmrpc.

Adds app/metrics.go with a struct-based OTel instrument set registered via otel.Meter("app"), initialized once in NewApp after SetupOtelMetricsProvider()
Adds app/ante/metrics.go with a lazily-initialized anteMetrics struct via sync.OnceValue (fires on first CheckTx, after the global MeterProvider is set)
Replaces telemetry.MeasureSince / utilmetrics calls in abci.go and invariance.go with dual-emitted OTel instruments; legacy calls remain with TODO(PLT-327) markers until dashboards are migrated
Threads ctx/ctx.Context() through all .Record() and .Add() call sites to support OTel's context-based propagation
Replaces the per-BeginBlock GaugeSeidVersionAndCommit call with an app_build_info observable gauge that fires on Prometheus scrape — no per-block overhead

New metrics (OTel naming convention, exported via the process-wide MeterProvider)

ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):

app_abci_begin_block_duration_seconds
app_abci_end_block_duration_seconds
app_abci_module_end_block_duration_seconds
app_abci_check_tx_duration_seconds
app_abci_deliver_tx_duration_seconds
app_abci_deliver_batch_tx_duration_seconds
app_block_process_duration_seconds — block tx processing duration by execution type

Transaction counters and gas:

app_tx_count_total
app_tx_process_type_total — by execution type label
app_tx_gas_total — cumulative gas by type label
app_tx_gas_used / app_tx_gas_wanted — per-tx gauges

App-level flow counters:

app_optimistic_processing_total — cache hit (enabled=true) vs miss
app_failed_total_gas_wanted_check_total
app_giga_fallback_to_v2_total

Light invariance:

app_lightinvariance_supply_duration_seconds
app_lightinvariance_supply_invalid_key_total
app_lightinvariance_supply_unmarshal_failure_total

Build info:

app_build_info — observable gauge, always 1, labels: seid_version, commit

Ante:

app_pending_nonce_total — pending nonce events by type (added, expired, rejected, accepted)

Migration note

Legacy metrics (telemetry.MeasureSince, utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotated TODO(PLT-327) and will be removed once dashboards are verified against the new OTel series.

Note

Medium Risk
Touches hot-path ABCI/tx-processing and invariance-check code to add new OpenTelemetry metrics; behavior should be unchanged, but any mistakes could impact performance or introduce panics in consensus-critical flows.

Overview
Adds a new appMetrics/appAnteMetrics OpenTelemetry instrument set (histograms, counters, and an app_build_info observable gauge) and initializes it once during app construction.

Updates ABCI handlers (BeginBlock/EndBlock/CheckTx/DeliverTx/DeliverTxBatch/Commit), block tx execution paths, EVM pending-nonce tracking, and light invariance checks to dual-emit the new OTel metrics alongside existing telemetry/utilmetrics calls (annotated with TODO(PLT-327) for later removal), threading context.Context through metric calls.

Extends utils/metrics labels to distinguish new execution types (e.g. SYNCHRONOUS_GIGA, OCC_GIGA) and adds basic noop-provider tests to ensure metric registration/use doesn’t panic.

^{Reviewed by Cursor Bugbot for commit 2358a11. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-05-05T22:14:22Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	May 12, 2026, 4:17 AM

codecov · 2026-05-06T21:04:33Z

Codecov Report

❌ Patch coverage is 67.20000% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.28%. Comparing base (09d0d60) to head (2358a11).

Files with missing lines	Patch %	Lines
app/app.go	45.90%	32 Missing and 1 partial ⚠️
app/invariance.go	30.76%	18 Missing ⚠️
app/metrics.go	89.62%	9 Missing and 2 partials ⚠️
app/abci.go	75.00%	10 Missing ⚠️
app/ante/evm_checktx.go	0.00%	8 Missing ⚠️
app/ante/metrics.go	77.77%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
+ Coverage   59.26%   59.28%   +0.02%     
==========================================
  Files        2110     2112       +2     
  Lines      174242   174389     +147     
==========================================
+ Hits       103259   103394     +135     
- Misses      62053    62062       +9     
- Partials     8930     8933       +3

Flag	Coverage Δ
sei-chain-pr	`50.81% <67.20%> (?)`
sei-db	`70.41% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
app/ante/metrics.go	`77.77% <77.77%> (ø)`
app/ante/evm_checktx.go	`34.72% <0.00%> (-0.66%)`	⬇️
app/abci.go	`63.19% <75.00%> (+0.69%)`	⬆️
app/metrics.go	`89.62% <89.62%> (ø)`
app/invariance.go	`55.81% <30.76%> (+14.77%)`	⬆️
app/app.go	`68.17% <45.90%> (-1.07%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

bdchatham

Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):

Counter and histogram names will double-suffix under the Prometheus exporter (_total_total, _seconds_seconds).
app_tx_count_total double-counts every tx via the unlabeled + labeled Add pair.
txGasUsed / txGasWanted as Int64Gauge is last-write-wins under OCC concurrency.

Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.

Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.

masih · 2026-05-07T08:07:19Z

+// InitAnteMetrics registers all OTel instruments for the ante package.
+// Safe to call concurrently; instruments are registered exactly once.
+func InitAnteMetrics() {
+	appAnteMetrics.mu.Lock()


Use sync.Once instead?

bdchatham · 2026-05-07T23:31:01Z

+var millisecondBuckets = metric.WithExplicitBucketBoundaries(
+	0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1,
 )


nit: I'd add 0.25, 0.5, 1.0 — commitDuration is on this set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through BaseApp.Commit → cms.Commit(true) — the foreground write blocks during L0→L1 stalls). 100ms ceiling buckets that exactly when you most want tail visibility.

…er-api

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 2358a11. Configure here.}

cursor · 2026-05-12T04:25:03Z

-		metrics.IncrTxProcessTypeCounter(metrics.SYNCHRONOUS)
+		utilmetrics.IncrTxProcessTypeCounter(utilmetrics.SYNCHRONOUS) // TODO(PLT-327): remove once app_tx_process_type_total verified
+		appMetrics.txProcessType.Add(ctx.Context(), 1,
+			otelmetric.WithAttributes(attribute.String("type", utilmetrics.SYNCHRONOUS)))


Giga per-tx type label mismatches block-level label

Medium Severity

The OTel txProcessType counter in ProcessTxsSynchronousGiga uses utilmetrics.SYNCHRONOUS ("synchronous") for the per-tx attribute, but the block-level blockProcessDuration metric in the same function correctly uses utilmetrics.SYNCHRONOUS_GIGA ("synchronous_giga"). Since SYNCHRONOUS_GIGA was specifically introduced in this PR to differentiate giga from regular synchronous processing, the per-tx counter appears to be a copy-paste from ProcessTxsSynchronousV2 that wasn't updated. This makes it impossible to distinguish giga-synchronous from regular-synchronous transactions in the new OTel app_tx_process_type_total metric.

Additional Locations (1)

app/app.go#L1465-L1467

^{Reviewed by Cursor Bugbot for commit 2358a11. Configure here.}

@amir-deris is this one worth addressinng?

cursor · 2026-05-12T04:25:03Z

+	appMetrics.txProcessType.Add(ctx, 1, otelmetric.WithAttributes(attribute.String("type", "sequential")))
+	appMetrics.txGas.Add(ctx, 100, otelmetric.WithAttributes(attribute.String("type", "gas_used")))
+
+	appMetrics.optimisticProcessing.Add(ctx, 1, otelmetric.WithAttributes(attribute.Bool("enabled", true)))


Test uses Bool attribute, production uses String

Low Severity

The test uses attribute.Bool("enabled", true) for the optimisticProcessing counter, but all production call sites use attribute.String("enabled", "true") and attribute.String("enabled", "false"). In OTel, attribute.Bool and attribute.String produce different attribute types, which would generate distinct time series in Prometheus. The test doesn't reflect actual production usage and would mask type mismatches if upgraded to use a real meter provider.

Additional Locations (1)

app/app.go#L1351-L1353

^{Reviewed by Cursor Bugbot for commit 2358a11. Configure here.}

yup, we want this one fixed too

Added otel metrics for app and app/ante

defc4b2

Fixed build errors, passed correct context

55a79e1

amir-deris changed the title ~~Added otel metrics for app and app/ante~~ feat(app): migrate app and ante telemetry to OpenTelemetry Meter API May 6, 2026

amir-deris added 2 commits May 6, 2026 13:45

Added metric for commit duration

391a489

Added test coverage

b7b4868

amir-deris self-assigned this May 6, 2026

amir-deris added the non-app-hash-breaking label May 6, 2026

amir-deris requested review from bdchatham and masih May 6, 2026 21:02

bdchatham reviewed May 6, 2026

View reviewed changes

Comment thread app/metrics.go Outdated

Comment thread app/metrics.go Outdated

Comment thread app/metrics.go Outdated

Comment thread app/abci.go Outdated

Comment thread app/ante/metrics.go Outdated

Comment thread app/app.go Outdated

amir-deris added 2 commits May 6, 2026 17:46

Fixing the failed race detection tests

9aa174c

Refactored ante metrics to use package level variable

352089b

masih reviewed May 7, 2026

View reviewed changes

amir-deris added 4 commits May 7, 2026 14:59

Addressed some pr feedback

c1bcbef

Added Giga vs OCC attributes for block process duration

ccd7515

Addressing more pr feedback

5d61836

Used sync.once instead of mutex

6903eef

amir-deris requested a review from masih May 7, 2026 23:12

bdchatham reviewed May 7, 2026

View reviewed changes

Updated histogram comment and added new boundary to bucket

7739026

bdchatham approved these changes May 7, 2026

View reviewed changes

masih and others added 2 commits May 8, 2026 09:37

Merge branch 'main' into amir/plt-323-migrate-app-metrics-to-otel-met…

e096ccf

…er-api

Merge branch 'main' into amir/plt-323-migrate-app-metrics-to-otel-met…

6b728cd

…er-api

cursor Bot reviewed May 11, 2026

View reviewed changes

Comment thread app/metrics.go

amir-deris added 2 commits May 11, 2026 21:15

added some extra buckets for millisecond tail latency

4d4b0d3

Merge branch 'main' into amir/plt-323-migrate-app-metrics-to-otel-met…

2358a11

…er-api

cursor Bot reviewed May 12, 2026

View reviewed changes

Conversation

amir-deris commented May 5, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New metrics (OTel naming convention, exported via the process-wide MeterProvider)

Migration note

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

bdchatham left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

masih May 7, 2026

Choose a reason for hiding this comment

Uh oh!

bdchatham May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 12, 2026

Choose a reason for hiding this comment

Giga per-tx type label mismatches block-level label

Uh oh!

masih May 12, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 12, 2026

Choose a reason for hiding this comment

Test uses Bool attribute, production uses String

Uh oh!

masih May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

amir-deris commented May 5, 2026 •

edited by cursor Bot

Loading

github-actions Bot commented May 5, 2026 •

edited

Loading

codecov Bot commented May 6, 2026 •

edited

Loading

bdchatham May 7, 2026 •

edited

Loading