Skip to content

Comments

Add env variables to configure manifest cache#2993

Open
kris-gaudel wants to merge 5 commits intoapache:mainfrom
kris-gaudel:kris-gaudel/configurable-manifest-cache
Open

Add env variables to configure manifest cache#2993
kris-gaudel wants to merge 5 commits intoapache:mainfrom
kris-gaudel:kris-gaudel/configurable-manifest-cache

Conversation

@kris-gaudel
Copy link
Contributor

@kris-gaudel kris-gaudel commented Jan 31, 2026

Closes #2952

Rationale for this change

Through discussion in issue #2325, we realized that there was a memory leak in the manifest cache. PR #2951 fixed this memory leak, but we decided that it would be best for developer experience if we developers could configure the cache for their needs.

This includes the ability to disable/enable the manifest cache, and configure its size

Are these changes tested?

Yes - unit tests are included to verify these changes

Are there any user-facing changes?

Yes:

  • PYICEBERG_MANIFEST__CACHE__SIZE can be used to configure the size of the manifest cache

@kris-gaudel
Copy link
Contributor Author

kris-gaudel commented Jan 31, 2026

@kevinjqliu @Fokko @rambleraptor Please lmk what you think!

@thomas-pfeiffer
Copy link

@kris-gaudel I have a few questions:

  • Would it make sense to list the new configuration parameters somewhere in mkdocs/docs/configuration.md, so that it shows up in the documentation as well?
  • Would it make sense to allow PYICEBERG_MANIFEST__CACHE__SIZE to be set to 0 to effectively disable the cache?
  • What's the added benefit of having a _ManifestCacheManager class?
    • Wouldn't _manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=int(environ.get("PYICEBERG_MANIFEST__CACHE__SIZE", 128))) do the trick as well? (obviously an oversimplified example)

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kris-gaudel for raising this. I think it is reasonable to make this configurable. But I think we should reduce the number of changes and simplify where possible. For example, do we need PYICEBERG_MANIFEST__CACHE__ENABLED or can we set PYICEBERG_MANIFEST__CACHE__SIZE to 0 instead.

@kris-gaudel kris-gaudel force-pushed the kris-gaudel/configurable-manifest-cache branch from 9ceb651 to 4cad1c3 Compare February 21, 2026 23:18
@kris-gaudel kris-gaudel force-pushed the kris-gaudel/configurable-manifest-cache branch from f3b9f3d to f411547 Compare February 22, 2026 00:34
@kris-gaudel kris-gaudel requested review from Fokko and geruh February 22, 2026 19:31
Copy link

@thomas-pfeiffer thomas-pfeiffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional remark:
I suggest to add paragraph to /mkdocs/docs/configuration.md to document the new configuration option. Something like:

## Manifest Caching

PyIceberg by default caches the Iceberg Manifest files locally. 
By default up to the `128` least recently used Manifests are cached, leveraging up to approx. X MB of memory. 
(Indicative calculation only. X MB in total = 128 Manifests * average size of X KB per Manifest. Your Manifest's sizes may be different.) 

The maximum cache size can be customised by setting  `PYICEBERG_MANIFEST_CACHE_SIZE` environment variable to a positive number. 
To deactivate the Manifest caching completely, set the max. cache size to `0`; 
To achieve an unlimited / unbound Manifest cache, set the max. cache size to `math.inf` or something comparable.

@kevinjqliu / @Fokko / @kris-gaudel - Do you have a rough ball-park number for the size of the cached Manifests? I would like to give a rough indication to users how much memory the cache may consume. Or is that a bad idea?

return hash(self.manifest_path)


# Global cache for ManifestFile objects, keyed by manifest_path.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail, but I would keep these original comments. Or is there a specific reason to remove them?

# share manifests after append operations.
_manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=128)

# Lock for thread-safe cache access

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail, but I would keep these original comments. Or is there a specific reason to remove them?



def clear_manifest_cache() -> None:
"""Clear the manifest cache."""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific use case / scenario, where one would use this method?

It might be helpful to mention them here, as I assume in most cases, users wouldn't ever call this method - except for very specific scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make manifest cache size configurable and allow for disabling

4 participants