refactor(dataset-raw): bucketed dump partitions #1888
Open
Conversation
Review Summary
Critical
- `BlockStreamerWithRetry` does not delegate `bucket_size()`: the new bucketed partitioning path is unreachable in production because `.with_retry()` wraps the client before it reaches `dump_ranges`, and the retry wrapper inherits the default `None` return. This makes the Solana `bucket_size` implementation and `split_and_partition_bucketed` effectively dead code.
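A minimal sketch of the fix (trait and type names taken from the review; bodies and the slots-per-epoch value are illustrative assumptions, not the PR's actual code) — the retry wrapper must forward `bucket_size()` to the wrapped streamer instead of inheriting the trait's default:

```rust
use std::num::NonZeroU64;

// Trait shape per the PR description; the default returns None.
trait BlockStreamer {
    fn bucket_size() -> Option<NonZeroU64> {
        None
    }
}

// Stand-in for the Solana client, which fetches blocks in epoch-sized buckets.
struct SolanaClient;

impl BlockStreamer for SolanaClient {
    fn bucket_size() -> Option<NonZeroU64> {
        NonZeroU64::new(432_000) // illustrative slots-per-epoch value
    }
}

struct BlockStreamerWithRetry<S>(S);

impl<S: BlockStreamer> BlockStreamer for BlockStreamerWithRetry<S> {
    fn bucket_size() -> Option<NonZeroU64> {
        // Delegate; omitting this override reinstates the default `None`
        // and makes the bucketed path dead code behind `.with_retry()`.
        S::bucket_size()
    }
}

fn main() {
    let wrapped = <BlockStreamerWithRetry<SolanaClient>>::bucket_size();
    assert_eq!(wrapped, NonZeroU64::new(432_000));
    println!("bucket_size delegated: {:?}", wrapped);
}
```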
Bugs / Robustness
- Silent fallback in Solana `bucket_size()`: `NonZeroU64::new(DEFAULT_SLOTS_PER_EPOCH)` should use `.expect()` or a const assertion instead of silently returning `None` if the constant were ever 0.
- Potential arithmetic overflow in Phase 1: `this_end + 1` and the bucket end calculation can overflow near `u64::MAX` (theoretical, but worth guarding).
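Both robustness items can be sketched together (constant value and function names are illustrative assumptions, not the PR's actual code): fail loudly on a zero constant, and saturate instead of wrapping near `u64::MAX`.

```rust
use std::num::NonZeroU64;

const DEFAULT_SLOTS_PER_EPOCH: u64 = 432_000; // illustrative value

// Fail loudly instead of silently degrading to the un-bucketed path.
fn solana_bucket_size() -> Option<NonZeroU64> {
    Some(
        NonZeroU64::new(DEFAULT_SLOTS_PER_EPOCH)
            .expect("DEFAULT_SLOTS_PER_EPOCH must be non-zero"),
    )
}

// Overflow-safe end of the bucket containing `block`: saturates at
// u64::MAX instead of wrapping when the next bucket start overflows.
fn bucket_end(block: u64, bucket_size: NonZeroU64) -> u64 {
    let b = bucket_size.get();
    (block / b)
        .checked_add(1)
        .and_then(|next| next.checked_mul(b))
        .map(|next_start| next_start - 1)
        .unwrap_or(u64::MAX)
}

fn main() {
    let b = NonZeroU64::new(10).unwrap();
    assert_eq!(bucket_end(5, b), 9);
    assert_eq!(bucket_end(u64::MAX, b), u64::MAX); // saturates, no wrap
    println!("ok");
}
```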
Consistency / Guidelines
- Missing `min_partition_blocks` assertion: `split_and_partition_bucketed` lacks the invariant check that `split_and_partition` has for non-leftover partitions.
- Missing `# Panics` documentation: both partition functions use `assert!` without documenting panic conditions per `docs/code/rust-documentation.md`.
- Documentation mismatch in `extractors.md`: `bucket_size()` is shown as a required method (no body) but has a default impl; the checklist says "four required methods" but there are now five.
Testing
- Additional test cases suggested: `n = 1`, `n` greater than the number of buckets, non-contiguous ranges with gaps across buckets, and a case exercising the Phase 3 merge logic.
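To make the suggested edge cases concrete, here is a hypothetical bucket-boundary splitter (not the PR's `split_and_partition_bucketed`, whose signature is unknown) with the kind of edge-case assertions the review asks for, and a `# Panics` section per the documentation guideline:

```rust
use std::num::NonZeroU64;

/// Splits a contiguous, inclusive block range into chunks that each lie
/// entirely within one bucket of size `bucket`.
///
/// # Panics
///
/// Panics if `start > end`.
fn split_on_bucket_boundaries(start: u64, end: u64, bucket: NonZeroU64) -> Vec<(u64, u64)> {
    assert!(start <= end, "invalid range: {start} > {end}");
    let b = bucket.get();
    let mut chunks = Vec::new();
    let mut cur = start;
    loop {
        // Last block of the bucket containing `cur`, saturating at u64::MAX.
        let bucket_end = (cur / b)
            .checked_add(1)
            .and_then(|n| n.checked_mul(b))
            .map(|next| next - 1)
            .unwrap_or(u64::MAX);
        let chunk_end = bucket_end.min(end);
        chunks.push((cur, chunk_end));
        if chunk_end == end {
            return chunks;
        }
        cur = chunk_end + 1;
    }
}

fn main() {
    let b = NonZeroU64::new(10).unwrap();
    // Range spanning three buckets.
    assert_eq!(
        split_on_bucket_boundaries(5, 25, b),
        vec![(5, 9), (10, 19), (20, 25)]
    );
    // Single-block range.
    assert_eq!(split_on_bucket_boundaries(7, 7, b), vec![(7, 7)]);
    println!("ok");
}
```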
- In preparation for a Solana extractor refactoring, introduce `fn bucket_size() -> Option<NonZeroU64>`, a new method on the `BlockStreamer` trait.
- The return value of this method (if `Some`) is the size of the buckets in which this extractor fetches blocks. This matters for determining how to split block ranges for parallel streaming. For example, the Solana extractor benefits from knowing that each instance of it is the only one responsible for the buckets that encompass the range that the instance needs to extract.
- Refactors the dump logic to respect `bucket_size` (if `Some`) when partitioning block ranges that should be dumped.
Force-pushed from 3a3b34a to fd3e758