Conversation
…e is drained before marking the node as caught up in failover.
📝 WalkthroughWalkthroughTwo modifications enable the block syncing pipeline to track pending events: a new Changes
Sequence DiagramsequenceDiagram
participant Node as Node
participant Failover as Failover Loop
participant Syncer as Syncer
participant DA as DA Status
participant P2P as P2P Status
loop Catch-up Check
Failover->>DA: Check if caught up
DA-->>Failover: daCaughtUp status
Failover->>P2P: Check if caught up
P2P-->>Failover: p2pCaughtUp status
Failover->>Syncer: PendingCount()
Syncer-->>Failover: pending height events count
alt All conditions met: daCaughtUp && p2pCaughtUp && pipelineDrained
Failover->>Node: Report fully caught up
Note over Node: Block production can resume
else Any condition false
Note over Failover: Continue waiting for sync
end
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude finished @alpe's task in 2m 21s —— View job Code Review
SummaryThis is a targeted, minimal fix for a race condition in the failover catchup phase. The previous condition declared "caught up" as soon as What's Good
Observations / Concerns1. Pipeline partially covered — pending cache not counted
A scenario where this matters: In practice the A more complete guard could be: pipelineDrained := f.bc.Syncer == nil ||
(f.bc.Syncer.PendingCount() == 0 && f.bc.Syncer.PendingCacheCount() == 0)2. Mild TOCTOU window (accepted) Even after 3. No regression test The PR has no test covering the func TestWaitForCatchupDrainsChannel(t *testing.T) {
// Set daCaughtUp=true, p2pCaughtUp=true, but PendingCount > 0
// Assert that waitForCatchup does NOT return caught-up
// Then drain the channel and assert it returns true
}4. Minor: The method is placed between VerdictThe fix is correct and directly addresses the stated race condition. The concern about the pending cache is a real edge case worth discussing — depending on how the node behaves during catchup with out-of-order blocks, it may or may not be worth a follow-up. The core of the PR is sound and ready to merge with this in mind. |
|
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@node/failover.go`:
- Around line 318-319: The current catchup check uses Syncer.PendingCount()
which only looks at channel buffer and misses pending cache entries and
in-flight processing; add a dedicated idle mechanism on Syncer (e.g., WaitIdle()
or IdleChan/IsIdle()) that atomically ensures: heightInCh is empty, an in-flight
counter (incremented/decremented in processHeightEvent and when dequeuing from
heightInCh) is zero, and the pending cache (populated by SetPendingEvent and
polled in pendingWorkerLoop) is empty; update Syncer to maintain the in-flight
counter and a condition or channel that is signaled when all three are zero,
expose a blocking WaitIdle() or boolean IsIdle() used in failover.go (replace
pipelineDrained := f.bc.Syncer == nil || f.bc.Syncer.PendingCount() == 0 with a
call to the new idle API) so catchup waits for true quiescence.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2f0c2f1f-a1ec-4274-a820-a1c7a3703f46
📒 Files selected for processing (2)
block/internal/syncing/syncer.gonode/failover.go
| pipelineDrained := f.bc.Syncer == nil || f.bc.Syncer.PendingCount() == 0 | ||
| if daCaughtUp && p2pCaughtUp && pipelineDrained { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) PendingCount implementation ==="
rg -n -A5 'func \(s \*Syncer\) PendingCount\(\) int' block/internal/syncing/syncer.go
echo
echo "=== 2) Syncer struct definition and state fields ==="
rg -n -B2 -A10 'type Syncer struct' block/internal/syncing/syncer.go
echo
echo "=== 3) heightInCh and pipeline queue operations ==="
rg -n -B2 -A2 'heightInCh' block/internal/syncing/syncer.go | head -80
echo
echo "=== 4) Failover.go catchup logic ==="
rg -n -B5 -A5 'pipelineDrained' node/failover.goRepository: evstack/ev-node
Length of output: 3017
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) processHeightEvent implementation ==="
rg -n -A20 'func \(s \*Syncer\) processHeightEvent' block/internal/syncing/syncer.go
echo
echo "=== 2) Cache operations and pending events ==="
rg -n -B2 -A2 'cache\.GetNextPendingEvent\|SetPendingEvent\|pendingWorkerLoop' block/internal/syncing/syncer.go | head -100
echo
echo "=== 3) Check if Syncer has any Done/Idle/Quiescent signals ==="
rg -n 'sync\.WaitGroup\|chan.*done\|chan.*idle\|chan.*quiescent' block/internal/syncing/syncer.go
echo
echo "=== 4) Any existing waiters for pipeline quiescence ==="
rg -n 'WaitForIdle\|WaitForCompletion\|WaitForQuiescence' block/internal/syncing/syncer.goRepository: evstack/ev-node
Length of output: 869
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) Full processHeightEvent implementation ==="
rg -n -A50 'func \(s \*Syncer\) processHeightEvent' block/internal/syncing/syncer.go | head -80
echo
echo "=== 2) pendingWorkerLoop - cache to channel pipeline ==="
rg -n -A30 'func \(s \*Syncer\) pendingWorkerLoop' block/internal/syncing/syncer.go
echo
echo "=== 3) Check Syncer fields for tracking work state ==="
sed -n '46,150p' block/internal/syncing/syncer.goRepository: evstack/ev-node
Length of output: 6599
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) processPendingEvents to confirm cache polling ==="
rg -n -A15 'func \(s \*Syncer\) processPendingEvents' block/internal/syncing/syncer.go
echo
echo "=== 2) Check cache.GetNextPendingEvent calls ==="
rg -n 'GetNextPendingEvent' block/internal/syncing/syncer.go
echo
echo "=== 3) How failover uses/waits for syncer ==="
rg -n -B5 -A10 'bc\.Syncer.*PendingCount\|bc\.Stop\|Syncer.*idle\|Syncer.*done' node/failover.go | head -80Repository: evstack/ev-node
Length of output: 874
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) Failover.bc.Stop and syncer lifecycle ==="
rg -n -B10 -A10 'bc\.Stop\(\)' node/failover.go
echo
echo "=== 2) Check if syncer has any Stop or WaitAll methods ==="
rg -n 'func \(s \*Syncer\) Stop\|func \(s \*Syncer\) Wait' block/internal/syncing/syncer.go
echo
echo "=== 3) Full catchup logic context in failover ==="
sed -n '300,330p' node/failover.go
echo
echo "=== 4) Verify cache.GetNextPendingEvent can be non-nil when heightInCh empty ==="
rg -n -B5 -A10 'cache.GetNextPendingEvent\|pendingEvents\|cache.*map' block/internal/syncing/syncer.go | head -60Repository: evstack/ev-node
Length of output: 1013
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) Syncer Stop method ==="
rg -n 'func \(s \*Syncer\) Stop' block/internal/syncing/syncer.go
echo
echo "=== 2) All public Syncer methods ==="
rg -n '^func \(s \*Syncer\)' block/internal/syncing/syncer.go | head -30
echo
echo "=== 3) Look at sync.WaitGroup usage in syncer ==="
rg -n 'wg\.Add\|wg\.Wait\|wg\.Done' block/internal/syncing/syncer.go
echo
echo "=== 4) Full catchup decision logic in failover ==="
sed -n '300,330p' node/failover.goRepository: evstack/ev-node
Length of output: 2363
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== 1) Syncer.Stop() method ==="
sed -n '209,250p' block/internal/syncing/syncer.go
echo
echo "=== 2) Syncer.Start() - understand wg usage ==="
sed -n '166,210p' block/internal/syncing/syncer.go
echo
echo "=== 3) Full failover catchup check context ==="
sed -n '295,330p' node/failover.go
echo
echo "=== 4) Understanding cache state - pending events structure ==="
rg -n 'pendingEventsMu\|pendingEvents\s*map' block/internal/syncing/ -A2Repository: evstack/ev-node
Length of output: 3638
Pipeline can complete catchup before pending cache events are processed.
PendingCount() only measures channel buffer length (line 471-472) and misses two important pipeline states:
-
Pending cache events: When
processHeightEvent()encounters an out-of-sequence block, it stores it in cache viaSetPendingEvent()(line 599). ThependingWorkerLoop()polls this cache every 10ms (line 481) and re-queues events toheightInChwhen the sequence is complete (line 1109-1111). These cache events don't appear inPendingCount(). -
In-flight processing: Events dequeued from
heightInCh(line 354) remain in-flight duringprocessHeightEvent()execution, which performs I/O and state validation (lines 584-602). This work isn't tracked by channel depth.
The evidence: Stop() calls wg.Wait() (line 216) to wait for all worker goroutines, then after goroutine completion, drains heightInCh in a loop (lines 221-237). This pattern shows the channel can be empty while the pipeline hasn't reached true quiescence.
Line 318 in failover.go declares catchup complete as soon as PendingCount() == 0, which creates a race where pending cache events or in-flight processing aren't accounted for.
Implement a dedicated idle signal in Syncer that waits for all three stages: empty channel, no in-flight processing, and empty pending cache.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@node/failover.go` around lines 318 - 319, The current catchup check uses
Syncer.PendingCount() which only looks at channel buffer and misses pending
cache entries and in-flight processing; add a dedicated idle mechanism on Syncer
(e.g., WaitIdle() or IdleChan/IsIdle()) that atomically ensures: heightInCh is
empty, an in-flight counter (incremented/decremented in processHeightEvent and
when dequeuing from heightInCh) is zero, and the pending cache (populated by
SetPendingEvent and polled in pendingWorkerLoop) is empty; update Syncer to
maintain the in-flight counter and a condition or channel that is signaled when
all three are zero, expose a blocking WaitIdle() or boolean IsIdle() used in
failover.go (replace pipelineDrained := f.bc.Syncer == nil ||
f.bc.Syncer.PendingCount() == 0 with a call to the new idle API) so catchup
waits for true quiescence.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3133 +/- ##
==========================================
+ Coverage 60.81% 60.82% +0.01%
==========================================
Files 113 113
Lines 11692 11695 +3
==========================================
+ Hits 7111 7114 +3
Misses 3777 3777
Partials 804 804
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
* main: fix(node): race on caught up (#3133)
Overview
Add
Syncer.PendingCountand use it to ensure the sync pipeline is drained before marking the node as caught up in failover.Summary by CodeRabbit