
refactor(store,cache)!: optimize cache restore as O(1)#3134

Draft
julienrbrt wants to merge 5 commits into main from julien/cache

Conversation


@julienrbrt julienrbrt commented Mar 4, 2026

Overview

Optimize cache restore to O(1) + prep v1.0.0.


coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 50539669-3e24-4f3d-9566-b0ed8fe1e54c



github-actions bot commented Mar 4, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Mar 4, 2026, 9:11 PM |


claude bot commented Mar 4, 2026

Claude finished @julienrbrt's task in 2m 40s.


Code Review

This PR replaces the O(n) cache restore (scanning block history) with a compact binary snapshot key (storeKeyPrefix + "__snap") that captures only the in-flight DA inclusion window. A single GetMetadata call on startup recovers the state. The design is sound and the test suite is thorough. Below are my findings.


Issues

1. Bounded spin loop in setMaxDAHeight may silently drop updates

generic_cache.go:172-181

```go
for range 1_000 {
	current := c.maxDAHeight.Load()
	if daHeight <= current {
		return
	}
	if c.maxDAHeight.CompareAndSwap(current, daHeight) {
		return
	}
}
```

If 1,000 CAS attempts all fail under extreme concurrent contention the update is silently dropped. For a correctness-critical field like maxDAHeight, this is dangerous—DaHeight() is used to seed the DA retriever on restart and could return a stale value. An unbounded loop is the standard pattern here:

```go
for {
	current := c.maxDAHeight.Load()
	if daHeight <= current {
		return
	}
	if c.maxDAHeight.CompareAndSwap(current, daHeight) {
		return
	}
}
```



2. persistSnapshot silently swallows store errors

generic_cache.go:232

```go
_ = c.store.SetMetadata(ctx, c.snapshotKey(), encodeSnapshot(entries))
```

A failed snapshot write means the node restarts without the latest in-flight window. There's no logging, no metric—the failure is completely invisible. Even a debug-level log would help operators diagnose an I/O issue before it becomes a mysterious DA height regression after restart.


3. hashByHeightMu is used inconsistently — potential race between deleteAllForHeight and setDAIncluded

generic_cache.go:193-208 vs generic_cache.go:149-156

deleteAllForHeight holds hashByHeightMu while doing read-then-remove on hashByHeight. However, setDAIncluded reads and writes hashByHeight without holding this lock. A concurrent deleteAllForHeight(h) and setDAIncluded(newHash, daH, h) can result in the new real-hash entry being removed immediately after being added (if deleteAllForHeight reads the updated hashByHeight and then removes the newly written hash). The hashByHeightMu should also be held in setDAIncluded when operating on the same height.


4. PR title says "O(n)" — should be "O(1)"

The PR title reads "Optimize cache restore as O(n)" but the actual optimization goes from O(n) → O(1). The body and commit messages are correct; the title has the typo.


5. SaveToStore always returns nil even when persistSnapshot fails

generic_cache.go:301-307

SaveToStore calls persistSnapshot which discards errors (see point 2). Since SaveToStore is called at shutdown (manager.go:331-342), a silent failure here leaves no record of the missed flush. The method could at minimum propagate the store error through persistSnapshot.


Observations

persistSnapshot snapshot consistency

persistSnapshot calls c.hashByHeight.Keys() and then c.hashByHeight.Peek(h) per key—two separate lock-released operations. Between them a concurrent mutation can add or remove an entry, making the snapshot slightly stale. Because every subsequent mutation rewrites the snapshot, this resolves quickly, but it means a snapshot written immediately after a crash could omit an entry added concurrently. This is acceptable given the self-healing design, but worth documenting.

HeightPlaceholderKey custom hex encoding

generic_cache.go:288-298

The manual byte-by-byte hex encoding is correct and efficient, but harder to audit than fmt.Sprintf("%s__h/%016x", prefix, height). Since this runs only during restore (a cold path), readability may outweigh the micro-optimisation. Not a blocker.

Snapshot binary format lacks a version/magic byte

The 16-byte-per-entry format is compact, but there is no version marker. A future format change (e.g. adding a third field) would silently corrupt existing snapshots or require a migration. A single leading version byte would be cheap insurance.

ClearFromStore recreates ephemeral caches unnecessarily

manager.go:382-384

txCache and pendingEventsCache are already ephemeral (created with nil store). Recreating them in ClearFromStore is harmless but redundant.


What's Good

  • The snapshot-encode/decode logic (encodeSnapshot / decodeSnapshot) is clean, robust against malformed input, and well-tested.
  • The placeholder key design (__h/<hex>) elegantly bridges the gap between restart and DA-retriever re-fire without requiring a store scan.
  • The orphan-eviction in setDAIncluded (removing the old placeholder when the real hash arrives) prevents an unbounded leak across restarts—and there's a dedicated test for it (TestCache_NoPlaceholderLeakAfterRefire).
  • initDAHeightFromStore correctly seeds DaHeight() from the finalized-tip metadata so the DA retriever doesn't re-scan from genesis after a restart or ClearCache.
  • The submitter fallback (GetHeaderDAIncludedByHeight) is properly exercised by two end-to-end restart tests.
  • The store interface cleanup (removing unused DA-height query methods) is a solid housekeeping improvement.
  • Test coverage for the restart scenarios is comprehensive and readable.

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 90.62500% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.00%. Comparing base (c0bc141) to head (caae822).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| block/internal/cache/manager.go | 66.66% | 4 Missing and 2 partials ⚠️ |
| block/internal/cache/generic_cache.go | 95.45% | 1 Missing and 2 partials ⚠️ |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #3134      +/-   ##
==========================================
+ Coverage   60.87%   61.00%   +0.12%     
==========================================
  Files         113      113              
  Lines       11695    11677      -18     
==========================================
+ Hits         7119     7123       +4     
+ Misses       3773     3758      -15     
+ Partials      803      796       -7     
```
| Flag | Coverage Δ |
|---|---|
| combined | 61.00% <90.62%> (+0.12%) ⬆️ |

@julienrbrt julienrbrt changed the title refactor(store,cache)!: optimize cache restore as O(n) refactor(store,cache)!: optimize cache restore as O(1) Mar 4, 2026