
TraceDB: Snapshot-backed state for the trace baker #3360

Open
Kbhat1 wants to merge 9 commits into main from pr/trace-snapshot-main



Kbhat1 (Contributor) commented May 1, 2026

Describe your changes and provide context

  • Adds optional snapshot-backed trace baking so the baker can replay from in-memory memiavl state instead of SS-pebble.
  • Refcounts memiavl snapshots and releases trace leases through geth's existing StateReleaseFunc, avoiding GC finalizers.
  • Opt-in via trace_bake_use_snapshot; falls back when the backend cannot provide a snapshot.

Testing performed to validate your change

  • Verified on node
  • Unit tests
  • Long-running node on mainnet
  • go test ./sei-db/state_db/sc/memiavl -run 'TreeCopy|Snapshot' -count=1
(Screenshot attached: 2026-05-07, 12:59:13 PM)

Note

Medium Risk
Introduces optional snapshot-backed state for debug_trace* baking and changes memiavl snapshot lifecycle/refcounting, which can affect memory usage and correctness of trace/state replay if mishandled (though it is opt-in with fallbacks).

Overview
Adds an opt-in path (evm.trace_bake_use_snapshot, evm.trace_bake_snapshot_window) for the trace baker to replay blocks against in-memory memiavl snapshots instead of SS-pebble, including wiring snapshot capture at EndBlock and closing retained snapshots on app shutdown.

Plumbs a new TraceContextProvider through the EVM RPC servers and tracing backend so debug_trace* can build contexts from leased snapshots and return a release function via geth’s StateReleaseFunc.
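
The lease-and-release flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `provider`, `traceState`, and `stateAt` are hypothetical names, and `releaseFunc` only mirrors the shape of geth's `tracers.StateReleaseFunc` (a plain `func()`).

```go
package main

import (
	"fmt"
	"sync"
)

// releaseFunc has the same shape as geth's tracers.StateReleaseFunc (func()).
type releaseFunc func()

// traceState is a hypothetical stand-in for the context a trace runs against.
type traceState struct {
	source string
	height int64
}

// provider tracks outstanding leases per retained snapshot height.
type provider struct {
	mu     sync.Mutex
	leases map[int64]int // key present means a snapshot is retained
}

// stateAt leases the snapshot at height if one is retained; otherwise it
// falls back to the slower store and returns a no-op release.
func (p *provider) stateAt(height int64) (traceState, releaseFunc) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.leases[height]; !ok {
		return traceState{source: "ss-pebble", height: height}, func() {}
	}
	p.leases[height]++
	var once sync.Once
	release := func() {
		// sync.Once makes a repeated release harmless for this lease.
		once.Do(func() {
			p.mu.Lock()
			defer p.mu.Unlock()
			p.leases[height]--
		})
	}
	return traceState{source: "snapshot", height: height}, release
}

func main() {
	p := &provider{leases: map[int64]int{10: 0}}
	st, release := p.stateAt(10)
	fmt.Println(st.source, p.leases[10])
	release()
	release() // second call does nothing
	fmt.Println(p.leases[10])
	st, _ = p.stateAt(11) // no snapshot retained at 11
	fmt.Println(st.source)
}
```

Returning the release as a closure keeps the caller from needing to know whether it got a snapshot or the fallback; either way it calls the function once the trace is done.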

Extends storev2/rootmulti and SeiDB committers to support O(1) snapshot Copy() and safe release of snapshot refs, and adds refcounting to memiavl snapshot mmap lifecycles with regression/unit tests covering rewrite/reload and snapshot-store eviction behavior.

Reviewed by Cursor Bugbot for commit 0e3215d.

@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch 4 times, most recently from 1764fc9 to c173794 Compare May 1, 2026 15:42
@Kbhat1 Kbhat1 changed the base branch from main to pr/trace-baker-main May 1, 2026 15:46
@Kbhat1 Kbhat1 changed the title Snapshot-backed state for the trace baker TraceDB: Snapshot-backed state for the trace baker May 1, 2026
@Kbhat1 Kbhat1 force-pushed the pr/trace-baker-main branch from 7b8a363 to af0de0f Compare May 1, 2026 15:59
@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch 2 times, most recently from 2412434 to e0a56bd Compare May 1, 2026 18:22
@Kbhat1 Kbhat1 force-pushed the pr/trace-baker-main branch from af0de0f to 6981a66 Compare May 1, 2026 19:05
@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch from e0a56bd to c5e3d21 Compare May 1, 2026 19:07
@Kbhat1 Kbhat1 force-pushed the pr/trace-baker-main branch from 6981a66 to fe9ec89 Compare May 1, 2026 21:02
@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch from c5e3d21 to 4bfc441 Compare May 1, 2026 21:04
@sei-protocol sei-protocol deleted a comment from github-actions Bot May 4, 2026
@Kbhat1 Kbhat1 force-pushed the pr/trace-baker-main branch from ae34e85 to f861506 Compare May 4, 2026 21:03
@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch from 4bfc441 to cf01744 Compare May 4, 2026 21:05

github-actions Bot commented May 4, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | May 11, 2026, 2:37 PM |

@Kbhat1 Kbhat1 force-pushed the pr/trace-baker-main branch from 7c210c4 to 27dc9b0 Compare May 4, 2026 21:40
@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch from cf01744 to c28c23f Compare May 4, 2026 21:41

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.22%. Comparing base (654d40b) to head (0e3215d).

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3360      +/-   ##
==========================================
- Coverage   59.25%   59.22%   -0.03%     
==========================================
  Files        2110     2110              
  Lines      174181   174058     -123     
==========================================
- Hits       103210   103094     -116     
+ Misses      62044    62032      -12     
- Partials     8927     8932       +5     
Flag Coverage Δ
sei-db 70.41% <ø> (ø)


Files with missing lines Coverage Δ
app/app.go 69.77% <ø> (+0.08%) ⬆️
evmrpc/config/config.go 74.31% <ø> (ø)
evmrpc/server.go 88.46% <ø> (ø)
evmrpc/simulate.go 74.58% <ø> (+0.21%) ⬆️
evmrpc/tracers.go 67.95% <ø> (ø)
sei-cosmos/storev2/rootmulti/store.go 65.30% <ø> (ø)
sei-db/state_db/sc/composite/store.go 68.78% <ø> (-5.34%) ⬇️
sei-db/state_db/sc/memiavl/db.go 66.66% <ø> (+0.44%) ⬆️
sei-db/state_db/sc/memiavl/snapshot.go 62.42% <ø> (-0.18%) ⬇️
sei-db/state_db/sc/memiavl/store.go 90.69% <ø> (-2.80%) ⬇️
... and 3 more

... and 28 files with indirect coverage changes


Base automatically changed from pr/trace-baker-main to main May 8, 2026 18:31
Kbhat1 and others added 5 commits May 8, 2026 17:17
Stacks on the trace baker PR. Captures an O(1) memiavl snapshot of
the SC tree at EndBlock and serves trace re-execution from in-RAM
state instead of SS-pebble.

memiavl: refcount *Snapshot. Tree.Copy() Acquires; Snapshot.Close
unmaps only on the final release. Without this a held copy was a
use-after-munmap waiting to happen — the background snapshot rewrite
calls Tree.ReplaceWith → snapshot.Close mid-flight, segfaulting any
held copy. The internal rewrite goroutine also drops its clone's
ref so the refcount can reach zero.
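
A minimal sketch of the refcounting idea in this paragraph, assuming a simplified `snapshot` type: the real `*Snapshot` wraps mmap'd files, which the `unmapped` flag stands in for here. All names below are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// snapshot is a hypothetical stand-in for memiavl's *Snapshot.
type snapshot struct {
	mu       sync.Mutex
	refs     int
	unmapped bool // models munmap of the backing files
}

// newSnapshot starts with one reference, owned by the live tree.
func newSnapshot() *snapshot { return &snapshot{refs: 1} }

// Acquire takes an extra reference, as Tree.Copy() is described as doing.
func (s *snapshot) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refs == 0 {
		return errors.New("snapshot already closed")
	}
	s.refs++
	return nil
}

// Close drops one reference and unmaps only on the final release.
func (s *snapshot) Close() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refs == 0 {
		return errors.New("snapshot over-close")
	}
	s.refs--
	if s.refs == 0 {
		s.unmapped = true
	}
	return nil
}

func main() {
	s := newSnapshot()
	_ = s.Acquire() // a held copy takes its own reference
	_ = s.Close()   // background rewrite replaces the tree and drops its ref
	fmt.Println("unmapped after rewrite:", s.unmapped)
	_ = s.Close() // the held copy is released
	fmt.Println("unmapped after release:", s.unmapped)
	fmt.Println("extra close:", s.Close())
}
```

In this model the rewrite's `Close` no longer unmaps memory that a held copy is still reading, and an unbalanced extra `Close` surfaces as an error instead of passing silently.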

Committer interface gains Copy(). memiavl delegates to *DB.Copy.
composite returns nil when flatkv is engaged so the snapshot path
silently falls back.
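
The nil-means-fall-back contract can be illustrated like this. It is a sketch only: the interface and types below are hypothetical and much smaller than the real Committer, and the `*treeCopy` return type is an assumption.

```go
package main

import "fmt"

// treeCopy is a hypothetical handle for an O(1) copy of the SC tree.
type treeCopy struct{ version int64 }

// committer mirrors only the Copy() method described above.
type committer interface {
	Copy() *treeCopy
}

type memiavlCommitter struct{ version int64 }

func (m memiavlCommitter) Copy() *treeCopy { return &treeCopy{version: m.version} }

// compositeCommitter returns nil when flatkv is engaged, signalling that no
// snapshot is available.
type compositeCommitter struct {
	inner         committer
	flatKVEnabled bool
}

func (c compositeCommitter) Copy() *treeCopy {
	if c.flatKVEnabled {
		return nil
	}
	return c.inner.Copy()
}

// traceSource picks the snapshot when Copy() yields one, else the fallback.
func traceSource(c committer) string {
	if cp := c.Copy(); cp != nil {
		return fmt.Sprintf("snapshot@%d", cp.version)
	}
	return "ss-pebble"
}

func main() {
	mem := memiavlCommitter{version: 7}
	fmt.Println(traceSource(mem))
	fmt.Println(traceSource(compositeCommitter{inner: mem, flatKVEnabled: true}))
}
```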

storev2 rootmulti adds SnapshotSCStore + CacheMultiStoreFromCommitter.

EVM keeper: TraceSnapshotStore (bounded by-height map) and EndBlock
capture keyed by snapshot.Version() (= H-1 at EndBlock(H)).
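
A bounded by-height map of this kind might look like the sketch below. The eviction rule (drop everything at or below newest minus window) and all names are assumptions for illustration; the real TraceSnapshotStore may differ, and this version is not safe for concurrent use.

```go
package main

import "fmt"

// lease is a hypothetical handle; release stands in for dropping snapshot refs.
type lease struct {
	version int64
	release func()
}

// snapshotStore keeps roughly the last window versions, keyed by height.
type snapshotStore struct {
	window int64
	byH    map[int64]lease
}

func newSnapshotStore(window int64) *snapshotStore {
	return &snapshotStore{window: window, byH: make(map[int64]lease)}
}

// Put stores l under its version and evicts anything outside the window.
func (s *snapshotStore) Put(l lease) {
	if old, ok := s.byH[l.version]; ok {
		old.release() // replacing an entry releases the old one
	}
	s.byH[l.version] = l
	cutoff := l.version - s.window
	for h, old := range s.byH {
		if h <= cutoff {
			old.release()
			delete(s.byH, h) // deleting during range is allowed in Go
		}
	}
}

// Get returns the lease stored for height h, if any.
func (s *snapshotStore) Get(h int64) (lease, bool) {
	l, ok := s.byH[h]
	return l, ok
}

func main() {
	released := 0
	s := newSnapshotStore(3)
	for v := int64(1); v <= 5; v++ {
		s.Put(lease{version: v, release: func() { released++ }})
	}
	_, has2 := s.Get(2)
	_, has4 := s.Get(4)
	fmt.Println("released:", released, "has 2:", has2, "has 4:", has4)
}
```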

App: SnapshotAwareRPCContextProvider builds the sdk.Context directly
from the snapshot CMS to skip the throwaway CacheMultiStoreWithVersion
that CreateQueryContext would otherwise make.

Configurable via [evm]:
  trace_bake_use_snapshot     (default false)
  trace_bake_snapshot_window  (default 64)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Kbhat1 and others added 3 commits May 8, 2026 17:17
Point to the existing memiavl MemNode gauges and trace-baker counters that
operators should watch when enabling the snapshot path on high-throughput
nodes. No new metrics — just signposts to ones that already exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kbhat1 Kbhat1 force-pushed the pr/trace-snapshot-main branch from 9bba87f to aacf715 Compare May 8, 2026 21:25

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit aacf715.

Comment thread evmrpc/simulate.go
Resolve conflict in sei-db/state_db/sc/types/types.go by keeping both
the Copy() addition from this branch and the Importer doc comment from
main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kbhat1 Kbhat1 requested a review from blindchaser May 11, 2026 20:00
Comment thread evmrpc/server.go
homeDir string,
stateStore types.StateStore,
isPanicOrSyntheticTxFunc func(ctx context.Context, hash common.Hash) (bool, error), // used in *ExcludeTraceFail endpoints
traceCtxProviders ...TraceContextProvider,
Contributor

Only the first element is ever read. A non-variadic *TraceContextProvider parameter (or a small options struct) avoids the "what if someone passes two" ambiguity and keeps the signature self-documenting.
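
One way the options-struct variant could look, as a sketch rather than a proposed patch: `serverOptions`, `newServer`, and the simplified `TraceContextProvider` signature below are all hypothetical.

```go
package main

import "fmt"

// TraceContextProvider here is a hypothetical, simplified signature.
type TraceContextProvider func(height int64) string

// serverOptions holds optional dependencies; a nil field selects the default.
type serverOptions struct {
	TraceCtxProvider TraceContextProvider
}

type server struct {
	homeDir  string
	traceCtx TraceContextProvider
}

func defaultTraceContext(height int64) string {
	return fmt.Sprintf("default@%d", height)
}

// newServer takes exactly one optional provider, so "two providers" cannot
// be expressed at the call site.
func newServer(homeDir string, opts serverOptions) *server {
	p := opts.TraceCtxProvider
	if p == nil {
		p = defaultTraceContext
	}
	return &server{homeDir: homeDir, traceCtx: p}
}

func main() {
	plain := newServer("/tmp/sei", serverOptions{})
	snap := newServer("/tmp/sei", serverOptions{
		TraceCtxProvider: func(h int64) string { return fmt.Sprintf("snapshot@%d", h) },
	})
	fmt.Println(plain.traceCtx(5), snap.traceCtx(5))
}
```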

}

// Close releases all retained snapshots.
func (s *TraceSnapshotStore) Close() {
Contributor

TraceSnapshotStore.Close() returns nothing and silently discards per-snapshot release errors:

x/evm/keeper/trace_snapshot.go:97-112

_ = releaser.ReleaseSnapshotRefs()

Same as the WARN above: refcount mismatches are real bugs. Either return error or log at WARN level on close to keep ops visibility.
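
A sketch of the "return error" option, assuming `ReleaseSnapshotRefs` returns an `error` (the quoted `_ =` assignment suggests a single return value, but its type is not shown). The store and fake types are hypothetical; `errors.Join` needs Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errOverClose is a hypothetical error used only to drive the example.
var errOverClose = errors.New("snapshot over-close")

// snapshotReleaser models the releaser seen in the quoted line.
type snapshotReleaser interface {
	ReleaseSnapshotRefs() error
}

type fakeSnapshot struct{ err error }

func (f fakeSnapshot) ReleaseSnapshotRefs() error { return f.err }

type traceSnapshotStore struct {
	snaps map[int64]snapshotReleaser
}

// Close releases every retained snapshot and reports all failures together
// instead of discarding them.
func (s *traceSnapshotStore) Close() error {
	var errs []error
	for h, r := range s.snaps {
		if err := r.ReleaseSnapshotRefs(); err != nil {
			errs = append(errs, fmt.Errorf("release snapshot %d: %w", h, err))
		}
		delete(s.snaps, h)
	}
	return errors.Join(errs...) // nil when nothing failed
}

func main() {
	s := &traceSnapshotStore{snaps: map[int64]snapshotReleaser{
		10: fakeSnapshot{},
		11: fakeSnapshot{err: errOverClose},
	}}
	if err := s.Close(); err != nil {
		fmt.Println("WARN close:", err)
	}
}
```

The caller can then decide whether to log at WARN or propagate, which keeps the store itself free of a logger dependency.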

Comment thread x/evm/keeper/abci.go
defer telemetry.ModuleMeasureSince(types.ModuleName, time.Now(), telemetry.MetricKeyEndBlocker)
// Bake height-1: at EndBlock(N) the indexer's safe latest is N-1, so
// N-1 is the most recent block guaranteed to be queryable.
// Bake height-1: at EndBlock(N) the indexer's safe latest is N-1. When
Contributor

EndBlock snapshot semantics are subtle — comment is dense, easy to mis-read

The off-by-one here is correct but non-obvious: storev2/rootmulti.flush() doesn't run until Commit(), so at EndBlock(N) the SC tree state is state_after_commit_of_(N-1) and snap.Version() == N-1. The baker then traces H=N-1, whose initializeBlock calls ctxProvider(H-1) = ctxProvider(N-2), which finds snap[N-2] Put at the previous EndBlock(N-1). Worth a one-line "lined up because rs.flush is called from Commit, not from EndBlock" in the comment to save the next reader a half-hour.
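
The height bookkeeping in this comment can be checked with a small model. It assumes, as the comment implies, that the snapshot is Put before the bake runs within the same EndBlock; `endBlock` is a hypothetical helper, not code from the PR.

```go
package main

import "fmt"

// endBlock models one EndBlock(n): it records the snapshot captured there and
// reports which height is baked and whether the leases for H-1 and H would hit.
func endBlock(n int64, captured map[int64]bool) (baked int64, prevHit, curHit bool) {
	captured[n-1] = true // tree is still at n-1 because flush runs in Commit
	baked = n - 1        // the baker traces H = n-1
	return baked, captured[baked-1], captured[baked]
}

func main() {
	captured := map[int64]bool{}
	for n := int64(100); n <= 102; n++ {
		h, prevHit, curHit := endBlock(n, captured)
		fmt.Printf("EndBlock(%d): put snap[%d], bake H=%d, snap[%d] hit=%v, snap[%d] hit=%v\n",
			n, n-1, h, h-1, prevHit, h, curHit)
	}
}
```

Only the very first EndBlock after enabling the feature misses on `snap[H-1]`, since no earlier EndBlock captured it; from the second one on both lookups line up.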

Also: initializeBlock calls the provider twice — once for prevBlockHeight (H-1) and once for blockNumber (H) (for WithNextMs). That means a single trace leases both snap[H-1] and snap[H]. As long as TraceBakeSnapshotWindow >= 2 this is fine, but if an operator misconfigures window=1 the second lease will miss and silently fall through to SS-pebble for oracle_mem/WithNextMs. Consider clamping window to >= 2 (or whatever the documented minimum is) at config-load time.
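
The suggested clamp could be as small as the following sketch. The minimum of 2 comes from the comment above; the function and constant names are hypothetical, and reporting whether the value was changed is just one way to let the caller log a warning.

```go
package main

import "fmt"

// minSnapshotWindow is the smallest window that can hold both snap[H-1] and
// snap[H] for a single trace, per the review comment.
const minSnapshotWindow = 2

// clampSnapshotWindow returns the effective window and whether the configured
// value had to be raised.
func clampSnapshotWindow(window int) (clamped int, changed bool) {
	if window < minSnapshotWindow {
		return minSnapshotWindow, true
	}
	return window, false
}

func main() {
	for _, w := range []int{0, 1, 2, 64} {
		got, changed := clampSnapshotWindow(w)
		fmt.Printf("configured=%d effective=%d clamped=%v\n", w, got, changed)
	}
}
```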

defer close(ch)
// Release per-tree snapshot refs; don't call cloned.Close() which
// would also tear down the live db's writer pool and stream handler.
defer func() { _ = cloned.MultiTree.Close() }()
Contributor

_ = will silently hide the new "Snapshot over-close" path. Since over-close is a real bug (it means refcounts are unbalanced), at least:

defer func() {
	if err := cloned.MultiTree.Close(); err != nil {
		logger.Error("failed to release cloned snapshot refs after rewrite", "err", err)
	}
}()
