feat(storage): Use raw proto access for read resumption strategy#1764
googlyrahman wants to merge 1 commit into googleapis:main
Conversation
Summary of Changes: This pull request significantly enhances the performance of data ingestion by optimizing how Protobuf messages are handled in hot paths.
Code Review
This pull request refactors the read resumption strategy to use raw protobuf message access instead of proto-plus wrappers, aiming for a significant performance improvement in critical I/O paths. The changes correctly adapt field presence checks to use the raw proto API (e.g., HasField), which is a good improvement for both performance and correctness. I have one point of feedback regarding a type inconsistency introduced in the shared state, which could lead to issues.
Note: Security Review did not run due to the size of the PR.
google/cloud/storage/asyncio/retry/reads_resumption_strategy.py
/gemini review
Code Review
This pull request improves performance in the read resumption strategy by bypassing the proto-plus library and accessing the underlying raw protobuf messages directly in the hot path. The changes correctly use HasField for checking optional scalar fields and introduce local variables to reduce repeated attribute access, which are good micro-optimizations. The test files have also been updated with stylistic formatting improvements.
I have one suggestion to improve the robustness and consistency of the code when checking for the presence of a message field.
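The "local variables to reduce repeated attribute access" micro-optimization mentioned above can be sketched in isolation. `DownloadState` and the two loops below are hypothetical stand-ins for illustration, not the PR's actual code:

```python
class DownloadState:
    """Hypothetical stand-in for per-download shared state."""

    def __init__(self):
        self.bytes_received = 0


state = DownloadState()


def accumulate_repeated(chunks):
    # Resolves state.bytes_received twice on every iteration.
    for size in chunks:
        state.bytes_received = state.bytes_received + size


def accumulate_hoisted(chunks):
    # Hoist the attribute into a local, write it back once at the end.
    received = state.bytes_received
    for size in chunks:
        received += size
    state.bytes_received = received


chunks = [2 * 1024 * 1024] * 1000  # 1000 hypothetical 2MB chunks

state.bytes_received = 0
accumulate_repeated(chunks)
total_repeated = state.bytes_received

state.bytes_received = 0
accumulate_hoisted(chunks)
total_hoisted = state.bytes_received

# Both variants produce the same result; the hoisted one avoids
# per-iteration attribute lookups in the hot loop.
assert total_repeated == total_hoisted
```

Both functions are behaviorally equivalent for a single-threaded loop; hoisting just trades repeated attribute lookups for one read and one write.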
```diff
-if response.read_handle:
-    state["read_handle"] = response.read_handle
+proto = getattr(response, "_pb", response)
+if proto.read_handle:
```
For consistency and robustness, it's better to use HasField() to check for the presence of message fields on raw protobuf objects, as you've done for read_range. While checking the truthiness of proto.read_handle might work, HasField("read_handle") is the explicitly supported method and avoids any ambiguity.
```diff
-if proto.read_handle:
+if proto.HasField("read_handle"):
```
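The `getattr(response, "_pb", response)` pattern in the snippet above degrades gracefully whether it receives a proto-plus wrapper or an already-raw message. A minimal sketch with hypothetical stand-in classes (real proto-plus wrappers expose the underlying raw message as `._pb`):

```python
class RawMessage:
    """Hypothetical stand-in for a raw protobuf message (no _pb attribute)."""

    def __init__(self, read_handle=None):
        self.read_handle = read_handle


class WrappedMessage:
    """Hypothetical stand-in for a proto-plus wrapper around a raw message."""

    def __init__(self, pb):
        self._pb = pb


raw = RawMessage(read_handle="handle-123")
wrapped = WrappedMessage(raw)

# getattr with a default unwraps a proto-plus object and passes a raw
# message through unchanged, so the same code handles both.
assert getattr(wrapped, "_pb", wrapped) is raw
assert getattr(raw, "_pb", raw) is raw
```

The default argument makes the unwrap a no-op when the object is already raw, which is why the hot path can use it unconditionally.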
The proto-plus library provides protocol buffer message classes that behave like native Python types. While convenient, my profiling confirms this abstraction comes with a significant performance penalty in our hot paths. Every time we access a field on a proto-plus object, the library triggers dynamic lookups and wrapper instantiation. In our hot data ingestion loop, this overhead accumulates rapidly over multiple chunks.
My benchmarking shows that accessing the underlying C++ Protobuf directly is ~2x faster than going through the proto-plus wrapper (measured over 30,000 iterations). While 30,000 operations might sound high, it corresponds to downloading just 60GB of data in 2MB chunks, with only one attribute access per chunk; a high-performance workload downloading at 1GB/s covers that in about 60 seconds. Additionally, this wrapper overhead introduces measurable latency not just for data access, but for every metadata check and state update that repeats per chunk.
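The ~2x figure cannot be reproduced without the real proto-plus and protobuf runtimes, but the shape of such a micro-benchmark looks roughly like this. `PlusWrapper` is a hypothetical stand-in that routes attribute access through `__getattr__`, mimicking proto-plus's dynamic lookup; any numbers it produces are illustrative, not the PR's measurements:

```python
import timeit


class RawPB:
    """Hypothetical stand-in for a raw protobuf message."""

    __slots__ = ("read_handle",)

    def __init__(self):
        self.read_handle = b"h"


class PlusWrapper:
    """Hypothetical stand-in for a proto-plus wrapper."""

    def __init__(self, pb):
        self._pb = pb

    def __getattr__(self, name):
        # Only called for attributes not found normally, so every field
        # access pays an extra dynamic-lookup hop, like proto-plus.
        return getattr(self._pb, name)


wrapped = PlusWrapper(RawPB())

N = 30_000  # roughly one access per 2MB chunk of a 60GB download
t_wrapped = timeit.timeit(lambda: wrapped.read_handle, number=N)
t_raw = timeit.timeit(lambda: wrapped._pb.read_handle, number=N)
print(f"wrapped: {t_wrapped:.4f}s  raw: {t_raw:.4f}s")
```

Going through `._pb` skips the `__getattr__` hop entirely, which is the same shortcut the PR takes in the hot loop.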
This PR bypasses proto-plus in our critical I/O loops and interacts directly with the underlying C++ Protobuf structures. This eliminates the "wrapper tax" without changing the external behavior of the application.