feat(caching): implement negative stat cache to optimize polling of missing files#4729
feat(caching): implement negative stat cache to optimize polling of missing files#4729alleaditya wants to merge 57 commits into
Conversation
…issing files This change adds negative caching to StatCache, short-circuits LookUpChild on negative hits, and adds metrics/integration tests to verify.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces negative stat caching to optimize performance in scenarios where applications frequently poll for non-existent files. By caching negative results (404s) and short-circuiting lookups in the filesystem layer, the change significantly reduces redundant backend network traffic. The implementation includes configurable TTL settings and comprehensive integration tests to ensure correctness and efficiency. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request implements negative stat-cache functionality to optimize lookups for non-existent files and directories, reducing redundant backend GCS requests. Key changes include updates to LookUpChild in dir.go to short-circuit on confirmed negative hits, and logic in fast_stat_bucket.go to manage negative cache entries. Documentation and tests were also added to support this feature. Review feedback suggests removing speculative negative caching in insertListing to avoid correctness issues with paginated GCS listings and adding a warning in the documentation regarding the risks of using an infinite TTL for negative entries.
…s blocking CI - Remove speculative negative caching from listing path to avoid pagination bugs (addressed review comment). - Add warning about infinite negative TTL usage in semantics.md (addressed review comment). - Fix deadlock in createFile's defer in fs.go when error occurs before child inode is minted. - Fix data race in downloader Job by deep-copying MinObject, preventing race with reader thread. - Fix missing lock in fake bucket's MoveObject, resolving race with StatObject.
…hing, add doc warning - Revert data race fixes in downloader Job (moved to separate PR). - Revert name length checks and deadlock fix in fs.go (moved to separate PR). - Revert name length checks in dir.go (moved to separate PR), keeping only LookUpChild short-circuiting. - Revert MoveObject lock fix in fake bucket (moved to separate PR), keeping FetchOnlyFromCache checks. - Remove speculative negative caching on empty directory listing to avoid pagination bugs (addressed review comment). - Add warning about infinite TTL usage for negative caching in semantics.md (addressed review comment).
…_bucket Address PR comments by avoiding changes to inode/dir.go and limiting negative caching to fast_stat_bucket. Skip negative caching for HNS buckets.
…circuit for HNS" This reverts commit 006f3d9.
…ion, fix lock issue, and explicitly pass TTL
…uite cache pollution flake
efb0561 to
beab398
Compare
The dentry_cache integration tests were failing on the CI pipeline because `s.T().Name()` includes a slash (e.g. `SuiteName/TestName`). This was causing the tests to create implicit directories in GCS. When one test deleted its file, the implicit directory was negatively cached by the Stat Cache. Subsequent tests in the same suite then failed because they attempted to access files inside the negatively cached implicit directory. To fix this, `s.T().Name()` has been replaced with specific test names that do not contain slashes, ensuring no implicit directories are created and preventing test cross-talk.
…ries in legacy code paths
3343e28 to
b62cc6b
Compare
Description
This change implements negative entry caching (non-existent path caching) to optimize workloads that aggressively poll missing files (e.g., JupyterLab).
• Activation: Controlled by
metadata-cache: negative-ttl-secsconfig parameter (or--metadata-cache-negative-ttl-secsflag). It is enabled by default with a 5-second TTL. Setting it to0disables the feature.• Proactive Listing Cache: Empty directory listings (
ListObjectsreturning 0 results) are proactively cached as negative directory entries to fully protect against implicit directory network probes.• VFS Routing:
LookUpChildtracks definitive negative cache hits and short-circuits immediately in memory, avoiding network fallback.Benchmark Results
A custom benchmarking script (
bench.sh) was executed against a locally mounted test bucket to measure performance over sustained 10-second polling windows.The metrics demonstrate that negative caching successfully eliminates redundant network traffic across all lookup scenarios, including those requiring implicit directory checking.
1. Implicit Dirs vs Explicit Dirs (Throughput & Network Calls)
In
--implicit-dirsscenarios, FUSE previously required both aStatObjectand a fallbackListObjectscall per lookup. With this change, both are aggressively short-circuited:StatObjectcalls: 120,340ListObjectscalls: 1StatObjectcalls: 6ListObjectscalls: 1Throughput: +64%
StatObjectcalls: 61,606ListObjectscalls: 61,607StatObjectcalls: 2ListObjectscalls: 3Throughput: +51%
(Note: Network calls exactly match mathematically expected TTL expiration rates - approximately 1 network call per 5 seconds).
Link to the issue in case of a bug fix.
https://b.corp.google.com/issues/511786738
Testing details
--implicit-dirs, validating exact TTL expiration counts and throughput improvements.WrappedSaysNotFound_NegativeCachingDisabled,CacheHit_Negative_Disabled_FetchOnly, andEmptyListing_NegativeCaching. EnhancedTestGCSMetrics_RequestCount_NegativeCachingShortCircuitto strictly verifyListObjectsshort-circuiting.Any backward incompatible change? If so, please explain.
No.