
feat(storage): Use raw proto access for read resumption strategy#1764

Open
googlyrahman wants to merge 1 commit intogoogleapis:mainfrom
googlyrahman:proto-wrapper

Conversation

@googlyrahman

The proto-plus library provides protocol buffer message classes that behave like native Python types. While convenient, my profiling confirms this abstraction comes with a significant performance penalty in our hot paths. Every time we access a field on a proto-plus object, the library triggers dynamic lookups and wrapper instantiation. In our hot data ingestion loop, this overhead accumulates rapidly over multiple chunks.
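
A rough way to see this kind of per-access cost is to time attribute reads through a delegating wrapper against reads on the wrapped object. The sketch below uses hypothetical stand-in classes (`RawMessage`, `Wrapper`) rather than proto-plus or protobuf themselves, so the absolute numbers and ratio will not match the figures in this PR; it only illustrates the shape of the measurement.

```python
import timeit


class RawMessage:
    """Hypothetical stand-in for a raw protobuf message with one field."""

    __slots__ = ("read_id",)

    def __init__(self, read_id):
        self.read_id = read_id


class Wrapper:
    """Hypothetical stand-in for a wrapper that resolves fields dynamically."""

    def __init__(self, pb):
        self._pb = pb

    def __getattr__(self, name):
        # Invoked only when normal lookup fails, which is the case for
        # every field that lives on the wrapped message.
        return getattr(self._pb, name)


def time_access(obj, iterations=30_000):
    """Return the seconds spent reading obj.read_id `iterations` times."""
    return timeit.timeit(lambda: obj.read_id, number=iterations)


wrapped = Wrapper(RawMessage(read_id=7))
raw = getattr(wrapped, "_pb", wrapped)
print(f"wrapper: {time_access(wrapped):.4f}s  raw: {time_access(raw):.4f}s")
```

Both objects return the same value; only the lookup path differs, which is what the timing isolates.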

My benchmarking shows that accessing the underlying C++ Protobuf directly is ~2x faster than going through the proto-plus wrapper (measured over 30,000 iterations). While 30,000 operations might sound high, at one attribute access per 2MB chunk it corresponds to downloading just 60GB of data. For a high-performance workload (e.g., downloading at 1GB/s), that takes only 60s. Additionally, this wrapper overhead adds measurable latency not just to data access but to every metadata check and state update that repeats per chunk.
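
The sizing argument above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 1000 MB) and exactly one field access per chunk; the constant names are illustrative.

```python
CHUNK_SIZE_MB = 2            # chunk size assumed in the description
ITERATIONS = 30_000          # number of field accesses benchmarked
THROUGHPUT_MB_PER_S = 1_000  # 1 GB/s in decimal units

total_mb = CHUNK_SIZE_MB * ITERATIONS     # data covered by 30,000 chunks
total_gb = total_mb / 1_000               # same amount in GB
seconds = total_mb / THROUGHPUT_MB_PER_S  # time to download at 1 GB/s

print(f"{total_gb:.0f} GB downloaded in {seconds:.0f} s")
```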

This PR bypasses proto-plus in our critical IO loops and interacts directly with the underlying C++ Protobuf structures. This eliminates the "wrapper tax" without changing the external behavior of the application.

@product-auto-label product-auto-label bot added the "size: s" (Pull request size is small.) and "api: storage" (Issues related to the googleapis/python-storage API.) labels Mar 3, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of data ingestion by optimizing how Protobuf messages are handled in hot paths. By switching from proto-plus wrappers to direct C++ Protobuf access, the system avoids substantial overhead, leading to faster processing of responses and more efficient data downloads without altering external application behavior.

Highlights

  • Performance Optimization: Bypassed the proto-plus library in critical I/O loops to directly interact with underlying C++ Protobuf structures, eliminating performance overhead associated with wrapper instantiation and dynamic lookups.
  • Raw Protobuf Access: Modified reads_resumption_strategy.py to retrieve the raw Protobuf object using getattr(response, '_pb', response) and access fields directly from it.
  • Robust Field Checks: Replaced implicit boolean checks for optional Protobuf fields (e.g., read_range, crc32c) with explicit HasField() calls for improved reliability and consistency.

Changelog

  • google/cloud/storage/asyncio/retry/reads_resumption_strategy.py
    • Introduced direct access to the underlying C++ Protobuf object via getattr(response, '_pb', response) to improve performance.
    • Updated all subsequent field accesses (e.g., read_handle, object_data_ranges, read_range, read_id, read_offset, checksummed_data, crc32c) to utilize the raw Protobuf object or intermediate variables derived from it.
    • Replaced direct truthiness checks for optional Protobuf fields with explicit HasField() method calls for more accurate and performant checks.
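
The `getattr(response, '_pb', response)` pattern mentioned above can be sketched as follows. The classes here are hypothetical stand-ins, not the real proto-plus or protobuf types; the only assumption carried over from the PR is that a wrapped response exposes the raw message as `_pb`. The fallback to `response` itself means the same code path works when a raw message is passed in directly.

```python
class RawResponse:
    """Hypothetical stand-in for a raw protobuf response message."""

    def __init__(self, read_id):
        self.read_id = read_id


class WrappedResponse:
    """Hypothetical stand-in for a wrapper exposing the raw message as _pb."""

    def __init__(self, pb):
        self._pb = pb


def unwrap(response):
    """Return the raw message if one is attached, else the object itself."""
    return getattr(response, "_pb", response)


raw = RawResponse(read_id=3)
print(unwrap(WrappedResponse(raw)) is raw)  # wrapped input yields the raw message
print(unwrap(raw) is raw)                   # raw input passes through unchanged
```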


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the read resumption strategy to use raw protobuf message access instead of proto-plus wrappers, aiming for a significant performance improvement in critical I/O paths. The changes correctly adapt field presence checks to use the raw proto API (e.g., HasField), which is a good improvement for both performance and correctness. I have one point of feedback regarding a type inconsistency introduced in the shared state, which could lead to potential issues.

Note: Security Review did not run due to the size of the PR.

@product-auto-label product-auto-label bot added the "size: m" (Pull request size is medium.) label and removed the "size: s" (Pull request size is small.) label Mar 3, 2026
@googlyrahman googlyrahman changed the title from "Use raw proto access for read resumption strategy" to "feat(storage): Use raw proto access for read resumption strategy" Mar 3, 2026
@googlyrahman googlyrahman force-pushed the proto-wrapper branch 2 times, most recently from d06e240 to 08eaf33 Compare March 3, 2026 13:43
@googlyrahman googlyrahman marked this pull request as ready for review March 3, 2026 13:45
@googlyrahman googlyrahman requested review from a team as code owners March 3, 2026 13:45
@googlyrahman

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves performance in the read resumption strategy by bypassing the proto-plus library and accessing the underlying raw protobuf messages directly in the hot path. The changes correctly use HasField for checking optional scalar fields and introduce local variables to reduce repeated attribute access, which are good micro-optimizations. The test files have also been updated with stylistic formatting improvements.

I have one suggestion to improve the robustness and consistency of the code when checking for the presence of a message field.

- if response.read_handle:
-     state["read_handle"] = response.read_handle
+ proto = getattr(response, "_pb", response)
+ if proto.read_handle:

medium

For consistency and robustness, it's better to use HasField() to check for the presence of message fields on raw protobuf objects, as you've done for read_range. While checking the truthiness of proto.read_handle might work, HasField("read_handle") is the explicitly supported method and avoids any ambiguity.

Suggested change
- if proto.read_handle:
+ if proto.HasField("read_handle"):
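
To illustrate why an explicit presence check can differ from a truthiness check, here is a toy model. `ReadHandle` and `ToyResponse` are hypothetical classes, not the protobuf API; the sketch relies only on the Python rule that an object defining neither `__bool__` nor `__len__` is always truthy. Whether the installed protobuf runtime's raw message classes behave this way is an assumption to verify, not something this example proves.

```python
class ReadHandle:
    """Hypothetical sub-message; defines no __bool__ or __len__."""

    def __init__(self):
        self.handle = b""


class ToyResponse:
    """Hypothetical message that tracks which fields were explicitly set."""

    def __init__(self):
        # Reading an unset sub-message still yields a default instance.
        self.read_handle = ReadHandle()
        self._present = set()

    def set_read_handle(self, handle):
        self.read_handle.handle = handle
        self._present.add("read_handle")

    def HasField(self, name):
        # Explicit presence: true only if the field was actually set.
        return name in self._present


empty = ToyResponse()
print(bool(empty.read_handle))        # default instance is still truthy
print(empty.HasField("read_handle"))  # but the field was never set

filled = ToyResponse()
filled.set_read_handle(b"abc")
print(filled.HasField("read_handle"))
```

In this toy model, a truthiness check would store an empty default handle into the shared state, while the `HasField` check would skip it.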
