Skip to content

refactor(dataset-raw): bucketed dump partitions#1888

Open
sistemd wants to merge 5 commits intomainfrom
sistemd/block-streamer-bucket-size
Open

refactor(dataset-raw): bucketed dump partitions#1888
sistemd wants to merge 5 commits intomainfrom
sistemd/block-streamer-bucket-size

Conversation

@sistemd
Copy link
Contributor

@sistemd sistemd commented Mar 3, 2026

  • In preparation for a Solana extractor refactoring, introduce fn bucket_size() -> Option<NonZeroU64> - a new method on the BlockStreamer trait.
  • The return value of this method (if Some) is the size of the buckets in which this extractor fetches blocks. This matters for determining how to split block ranges for parallel streaming. For example, the Solana extractor would benefit from knowing that each instance of it is the only one responsible for buckets that encompass the range that the instance needs to extract.
  • Refactors the dump logic to respect bucket_size (if Some) when partitioning block ranges that should be dumped.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Critical

  • BlockStreamerWithRetry does not delegate bucket_size(): The new bucketed partitioning path is unreachable in production because .with_retry() wraps the client before it reaches dump_ranges, and the retry wrapper inherits the default None return. This makes the Solana bucket_size implementation and split_and_partition_bucketed effectively dead code.

Bugs / Robustness

  • Silent fallback in Solana bucket_size(): NonZeroU64::new(DEFAULT_SLOTS_PER_EPOCH) should use .expect() or a const assertion instead of silently returning None if the constant were ever 0.
  • Potential arithmetic overflow in Phase 1: this_end + 1 and the bucket end calculation can overflow near u64::MAX (theoretical, but worth guarding).

Consistency / Guidelines

  • Missing min_partition_blocks assertion: split_and_partition_bucketed lacks the invariant check that split_and_partition has for non-leftover partitions.
  • Missing # Panics documentation: Both partition functions use assert! without documenting panic conditions per docs/code/rust-documentation.md.
  • Documentation mismatch in extractors.md: bucket_size() shown as a required method (no body) but has a default impl; checklist says "four required methods" but there are now five.

Testing

  • Additional test cases suggested: n=1, n > number of buckets, non-contiguous ranges with gaps across buckets, and a case exercising Phase 3 merge logic.

sistemd added 5 commits March 3, 2026 18:53
- In preparation for a Solana extractor refactoring, introduce `fn
bucket_size() -> Option<NonZeroU64>` - a new method on the
`BlockStreamer` trait.
- The return value of this method (if `Some`) is the size of the
buckets in which this extractor fetches blocks. This matters for
determining how to split block ranges for parallel streaming.
For example, the Solana extractor would benefit from knowing that
each instance of it is the only one responsible for buckets that
encompass the range that the instance needs to extract.
- Refactors the dump logic to respect `bucket_size` (if `Some`) when
partitioning block ranges that should be dumped.
@sistemd sistemd force-pushed the sistemd/block-streamer-bucket-size branch from 3a3b34a to fd3e758 Compare March 3, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant