feat(sequencer): catchup from base #3057
base: main
Conversation
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude finished @julienrbrt's task in 1m 57s — View job

Code Review Complete

I've completed a comprehensive review of the sequencer catch-up mechanism. While the overall architecture is sound and the syncer-side logic is well implemented, there are critical issues that must be addressed before merging.

Critical Issues (Must Fix)

1. Non-Monotonic Block Timestamps During Catch-Up 🚨 HIGH PRIORITY

Problem: The current implementation reuses the same DA epoch timestamp for all blocks produced within that epoch. This violates the fundamental blockchain requirement that timestamps be strictly monotonic (each block's timestamp must be greater than its parent's).

Scenario:

```
// Epoch 100 has DA end time T = 12:00:00
// Sequencer produces 3 blocks from forced txs in this epoch:
Block N:   timestamp = 12:00:00 ✓
Block N+1: timestamp = 12:00:00 ✗ INVALID (same as parent)
Block N+2: timestamp = 12:00:00 ✗ INVALID (same as parent)
```

Additionally, if an empty DA epoch is encountered (no forced inclusion txs), currentDAEndTime is not updated, so the next block reuses the timestamp of the last non-empty epoch.

Impact: Most execution layers (EVM, Cosmos SDK, etc.) will reject blocks with non-monotonic timestamps, causing the chain to halt. This is a denial-of-service condition.

Evidence from code:

```go
// sequencer.go:365-371
timestamp := time.Now()
if c.catchingUp && !c.currentDAEndTime.IsZero() {
	timestamp = c.currentDAEndTime // ❌ Reused for every block in the epoch
}
```

Recommended fix: Implement timestamp advancement for multiple blocks per epoch.
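One possible approach, shown below as a minimal sketch rather than the PR's actual code (the catchUpClock type, its fields, and the package name are hypothetical), is to base the timestamp on the DA epoch end time but never let it fall behind the parent block:

```go
package single

import "time"

// catchUpClock sketches one way to keep block timestamps strictly monotonic
// while replaying DA epochs during catch-up.
type catchUpClock struct {
	lastBlockTime time.Time // timestamp of the previously produced block
}

// next prefers the DA epoch end time, but if that value does not advance
// (several blocks produced from one epoch, or an empty epoch that leaves the
// end time unchanged), it bumps the parent timestamp by a minimal step so
// the execution layer's strict monotonicity check still passes.
func (cl *catchUpClock) next(daEndTime time.Time) time.Time {
	ts := daEndTime
	if !ts.After(cl.lastBlockTime) {
		ts = cl.lastBlockTime.Add(time.Millisecond)
	}
	cl.lastBlockTime = ts
	return ts
}
```

A fixed increment keeps catch-up blocks close to the DA timestamp while preserving ordering; a real fix would also have to handle the switch back to time.Now() once the sequencer is caught up.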
2. Data Race on Catch-Up State Fields 🚨 MEDIUM-HIGH PRIORITY

Problem: The fields catchingUp and currentDAEndTime are updated by GetNextBatch and fetchNextDAEpoch on the block production loop, while IsCatchingUp() can be called concurrently by other components (RPC, monitoring, metrics) without any synchronization.

Impact: Under the Go memory model this is a data race and therefore undefined behavior; a torn read of the time.Time struct can yield a corrupted timestamp and invalid blocks.

Evidence:

```go
// sequencer.go:66-69 - No atomic protection
catchingUp       bool
currentDAEndTime time.Time

// sequencer.go:425 - Concurrent read
func (c *Sequencer) IsCatchingUp() bool {
	return c.catchingUp // ❌ Unsynchronized read
}
```

Recommended fix: Use proper synchronization:

```go
// Option 1: atomic.Bool + atomic.Pointer[time.Time]
catchingUp       atomic.Bool
currentDAEndTime atomic.Pointer[time.Time]

// Option 2: sync.RWMutex protecting both fields
mu               sync.RWMutex
catchingUp       bool
currentDAEndTime time.Time
```

3. Magic Number Should Be Named Constant

Problem: The catch-up detection logic uses an epoch-based comparison, but the Gemini review also flagged a bare 30-second minimum threshold as a magic number.

Recommendation: Define it as a package-level constant:

```go
const (
	catchUpEpochThreshold = 1 // Number of epochs that triggers catch-up
)
```

Major Issues (Should Fix)

4. Incomplete Catch-Up State Management

Issue: The catch-up state is only evaluated once per sequencer lifecycle (the early return at line 528 if already catching up).
Recommendation: Consider periodic re-evaluation, or retry logic with backoff for the DA height query (see the sketch below).
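A rough sketch of what such retry-with-backoff logic could look like; this is illustrative only, and queryDAHeight is a placeholder for whatever call the sequencer actually makes to the DA layer:

```go
package single

import (
	"context"
	"fmt"
	"time"
)

// fetchDAHeightWithBackoff retries a DA height query with exponential backoff
// instead of giving up after a single failure.
func fetchDAHeightWithBackoff(ctx context.Context, queryDAHeight func(context.Context) (uint64, error)) (uint64, error) {
	const maxAttempts = 5
	backoff := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		height, err := queryDAHeight(ctx)
		if err == nil {
			return height, nil
		}
		if attempt == maxAttempts {
			return 0, fmt.Errorf("querying DA height after %d attempts: %w", attempt, err)
		}
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		}
	}
	return 0, nil // unreachable
}
```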
5. Checkpoint Persistence Timing

Issue: The checkpoint is saved after every GetNextBatch call, even during catch-up when many epochs are processed in rapid succession.

Recommendation: Consider batching checkpoint updates or using async persistence during catch-up mode.

6. Test Logger Verbosity

Issue: Tests use zerolog.New(zerolog.NewConsoleWriter()), which produces verbose console output on every run.

Recommendation:

```go
// Use this for production-like logging that only shows on failure:
logger := zerolog.New(zerolog.NewTestWriter(t))

// Or this if logs aren't needed:
logger := zerolog.Nop()
```

Minor Issues & Suggestions

7. Incomplete Error Context

8. Documentation Clarity

The comments are excellent but could benefit from further detail.
9. Missing Metrics

Consider adding metrics for catch-up mode.
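As an illustration only (metric names are hypothetical and use the Prometheus client, which the review does not specify), instrumentation along these lines would cover the basics:

```go
package single

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical catch-up metrics; names, labels, and thresholds are illustrative.
var (
	catchUpActive = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sequencer_catching_up",
		Help: "1 while the sequencer is replaying missed DA epochs, 0 otherwise.",
	})
	catchUpEpochsReplayed = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "sequencer_catchup_epochs_replayed_total",
		Help: "Number of DA epochs replayed since the sequencer restarted.",
	})
	catchUpDALag = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sequencer_catchup_da_height_lag",
		Help: "Difference between the latest DA height and the height being replayed.",
	})
)

func init() {
	prometheus.MustRegister(catchUpActive, catchUpEpochsReplayed, catchUpDALag)
}
```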
This would help operators monitor sequencer health and catch-up performance.

10. Syncer DA Height Advancement Logic

Positive: The incremental DA height advancement logic is well designed and correctly handles the catch-up scenario. The comments clearly explain why the syncer advances by one epoch at a time instead of jumping straight to the latest DA height.

Suggestion: Consider adding a metric to track when this special catch-up advancement is triggered, as it's an important operational signal that a sequencer was restarted after extended downtime.
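As a rough illustration of the rule described above (not the syncer's actual implementation; the function and parameter names are made up):

```go
// nextDAHeight advances one epoch at a time while behind, rather than jumping
// straight to the latest DA height, so every missed epoch is fetched and
// verified in order.
func nextDAHeight(currentDAHeight, latestDAHeight, epochSize uint64) uint64 {
	if currentDAHeight >= latestDAHeight {
		return latestDAHeight // already caught up
	}
	next := currentDAHeight + epochSize
	if next > latestDAHeight {
		next = latestDAHeight
	}
	return next
}
```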
Positive Aspects ✅

Test Coverage Analysis

Coverage: 62.85% patch coverage, with 39 lines missing coverage. Several high-value paths are among the uncovered lines.

Recommendation: Add integration tests that exercise the catch-up path end to end.
Architecture Assessment

Overall Design: The catch-up mechanism is architecturally sound; the approach is correct and aligns with the goal of ensuring sequencer nodes produce identical blocks to base sequencing nodes during downtime.

Key Strength: The separation between sequencer catch-up (producing blocks) and syncer verification (consuming blocks) is clean and maintainable.

Recommendations

Priority 1 (Block Merge): Fix the non-monotonic timestamps and the data race.

Priority 2 (Before Release):

Priority 3 (Nice to Have):

Summary

The implementation demonstrates a good understanding of the problem and provides a solid foundation. However, the timestamp monotonicity issue is critical and will cause chain halts in production. The data race is also serious and violates Go's memory model. Once these two issues are fixed, the PR will be in much better shape. The test coverage is good, and the architectural approach is sound.

Verdict: ❌ Request Changes - Critical issues must be addressed before merge.
Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a critical "catch-up" feature for the sequencer, designed to enhance its resilience and consistency following periods of inactivity or downtime. By intelligently replaying missed Data Availability (DA) epochs, the sequencer can resynchronize its state, ensuring that all forced inclusion transactions are processed in the correct order and with accurate timestamps, thereby maintaining a consistent view of the blockchain's history even when operating in a decentralized environment. This mechanism allows the sequencer to seamlessly transition back to normal operation without compromising data integrity.

Highlights
Codecov Report

❌ Patch coverage is 62.85%, with 39 lines in your changes missing coverage.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3057      +/-   ##
==========================================
+ Coverage   61.83%   61.89%   +0.06%
==========================================
  Files         111      111
  Lines       11123    11223     +100
==========================================
+ Hits         6878     6947      +69
- Misses       3506     3537      +31
  Partials      739      739
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Code Review
This pull request introduces a robust catch-up mechanism for the sequencer, designed to handle restarts after extended downtime. While the overall approach for consuming and verifying catch-up blocks in the syncer, including the incremental advancement of DAHeight, is well-implemented and tested, the implementation of catch-up mode in the single sequencer has significant flaws. Specifically, it produces non-monotonic block timestamps when multiple blocks are generated for a single DA epoch or when empty epochs are encountered, which will likely cause the execution layer to reject blocks and halt the chain. Additionally, there is a data race on the new catch-up state fields due to a lack of synchronization primitives. Minor suggestions for code clarity and testing experience were also noted.
```go
timestamp := time.Now()
if c.catchingUp && !c.currentDAEndTime.IsZero() {
	timestamp = c.currentDAEndTime
}
```
During catch-up mode, the sequencer uses the timestamp from the last fetched DA epoch as the block timestamp. However, multiple blocks can be produced for a single DA epoch (e.g., if the epoch contains many transactions that don't fit in one block). In this case, all blocks produced for that epoch will have the exact same timestamp (c.currentDAEndTime). Furthermore, if an empty DA epoch is encountered, c.currentDAEndTime is not updated, causing the next block to reuse the timestamp from the previous non-empty epoch.
Most blockchain execution layers (like EVM) require that each new block has a timestamp strictly greater than its parent. If the sequencer produces multiple blocks with the same timestamp, the execution layer will reject them, halting the block production process and causing a Denial of Service.
```go
catchingUp bool
// currentDAEndTime is the DA epoch end timestamp from the last fetched epoch.
// Used as the block timestamp during catch-up to match based sequencing behavior.
currentDAEndTime time.Time
```
The fields catchingUp and currentDAEndTime are accessed concurrently without proper synchronization. GetNextBatch and fetchNextDAEpoch (which update these fields) can be called by the block production loop, while IsCatchingUp() can be called concurrently by other components (e.g., RPC, monitoring, or metrics).
In Go, concurrent read/write access to non-atomic fields is a data race, which leads to undefined behavior. Specifically, a race on time.Time (which is a struct) can result in a corrupted timestamp being read, potentially leading to invalid blocks and a Denial of Service.
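A minimal sketch of the RWMutex option, assuming the existing field names (this is not the PR's code, and the catchUpState wrapper is hypothetical):

```go
package single

import (
	"sync"
	"time"
)

// catchUpState guards the catch-up fields behind a RWMutex so the block
// production loop (writer) and RPC/metrics callers (readers) never race.
type catchUpState struct {
	mu               sync.RWMutex
	catchingUp       bool
	currentDAEndTime time.Time
}

// set is called by the block production path when entering or leaving catch-up.
func (s *catchUpState) set(catchingUp bool, daEndTime time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.catchingUp = catchingUp
	s.currentDAEndTime = daEndTime
}

// IsCatchingUp is safe to call concurrently from other components.
func (s *catchUpState) IsCatchingUp() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.catchingUp
}

// daEndTime returns a consistent snapshot of the epoch end timestamp.
func (s *catchUpState) daEndTime() time.Time {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.currentDAEndTime
}
```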
pkg/sequencers/single/sequencer.go
```go
if catchUpThreshold < 30*time.Second {
	catchUpThreshold = 30 * time.Second
}
```
The value 30*time.Second is used as a minimum threshold for the catch-up logic. To improve readability and maintainability, consider defining this as a named constant at the package level, for example: const minCatchUpThreshold = 30 * time.Second, and using that constant here. This makes the intent of the value clearer and centralizes its definition.
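For example, a sketch of the suggested refactor (the helper function is only for illustration):

```go
package single

import "time"

// minCatchUpThreshold is the smallest downtime window that triggers catch-up
// mode; smaller configured values are clamped up to it.
const minCatchUpThreshold = 30 * time.Second

// clampCatchUpThreshold applies the named constant in place of the bare
// 30*time.Second literal from the original check.
func clampCatchUpThreshold(threshold time.Duration) time.Duration {
	if threshold < minCatchUpThreshold {
		return minCatchUpThreshold
	}
	return threshold
}
```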
```go
// This test uses maxBytes to limit how many txs are fetched, triggering the unprocessed txs scenario.
func TestSequencer_CatchUp_DetectsOldEpoch(t *testing.T) {
	ctx := context.Background()
	logger := zerolog.New(zerolog.NewConsoleWriter())
```
Using zerolog.New(zerolog.NewConsoleWriter()) in tests can lead to verbose output in the console during test runs. It's generally better to use a logger that integrates with the testing framework, such as zerolog.New(zerolog.NewTestWriter(t)), which only shows logs for failed tests, or zerolog.Nop() if logs are not needed. This practice is followed in other new tests in this file (e.g., TestSequencer_CatchUp_UsesDATimestamp). Applying this consistently would improve the developer experience. This comment also applies to lines 1309, 1457, and 1676.
```diff
- logger := zerolog.New(zerolog.NewConsoleWriter())
+ logger := zerolog.New(zerolog.NewTestWriter(t))
```
Closes: #3016
Opus did a pretty good job. I'll be improving it.