Skip to content

[meta] MetaLocalBackend::ListKeys support cursor based multi-turn scan#62

Open
oldsharp wants to merge 4 commits intomainfrom
rc/test/fix-flaky-test-trim_cache
Open

[meta] MetaLocalBackend::ListKeys support cursor based multi-turn scan#62
oldsharp wants to merge 4 commits intomainfrom
rc/test/fix-flaky-test-trim_cache

Conversation

@oldsharp
Copy link
Copy Markdown
Collaborator

The original ListKeys implementation used an index-based cursor approach which ad several problems:

  1. Non-deterministic iteration order: unordered_map doesn't guarantee consistent iteration order.
  2. Unreliable cursor: If keys are added/deleted between scan calls, or if rehashing occurs, the index-based approach would miss or duplicate keys.
  3. The test had a bug: The loop condition current_cursor != SCAN_BASE_CURSOR was false from the start, so the multi-turn scan was never actually tested.

This patch replaced the index-based cursor with a key-based cursor approach:

  1. Collect and sort keys: All keys are collected into a sorted vector for deterministic ordering.
  2. Use key as cursor: The cursor is the string representation of the last returned key.
  3. Binary search for position: Use std::upper_bound to efficiently find the starting position.
  4. Return next cursor: The next cursor is the last returned key, or "0" (SCAN_BASE_CURSOR) when done.

Also fixed the test to properly verify multi-turn cursor-based scanning:

  • Changed from a for loop with incorrect condition to a do-while loop.
  • Added proper collection of all keys across multiple turns.
  • Added iteration count limit to prevent infinite loops.
  • Verified that all keys are collected after multiple scan turns.

The original ListKeys implementation used an index-based cursor approach which
ad several problems:

1. Non-deterministic iteration order: unordered_map doesn't guarantee
   consistent iteration order.
2. Unreliable cursor: If keys are added/deleted between scan calls, or
   if rehashing occurs, the index-based approach would miss or duplicate keys.
3. The test had a bug: The loop condition current_cursor != SCAN_BASE_CURSOR
   was false from the start, so the multi-turn scan was never actually tested.

This patch replaced the index-based cursor with a key-based cursor approach:
1. Collect and sort keys: All keys are collected into a sorted vector for
   deterministic ordering.
2. Use key as cursor: The cursor is the string representation of the last
   returned key.
3. Binary search for position: Use std::upper_bound to efficiently find the
   starting position.
4. Return next cursor: The next cursor is the last returned key, or
   "0" (SCAN_BASE_CURSOR) when done.

Also fixed the test to properly verify multi-turn cursor-based scanning:
- Changed from a for loop with incorrect condition to a do-while loop.
- Added proper collection of all keys across multiple turns.
- Added iteration count limit to prevent infinite loops.
- Verified that all keys are collected after multiple scan turns.
@oldsharp oldsharp self-assigned this Mar 18, 2026
Copy link
Copy Markdown

@qoderai qoderai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Review Summary

The change to a key-based cursor with deterministic ordering and the corresponding test fix address real correctness issues in the old ListKeys implementation and make the behavior more robust and predictable for basic multi-turn scans.

🛡️ Key Risks & Issues

  • ListKeys now builds and sorts the full key set on every call, which is a clear correctness win but a potential scalability regression compared to the previous early-exit iterator. For large tables or frequent scans, this could noticeably increase CPU and lock-hold time; if MetaLocalBackend is used in hot paths, it might be worth considering an ordered data structure or cached index to keep determinism without full-table sort per page.
  • Cursor semantics in MetaLocalBackend are now explicitly key-based (using the last key and upper_bound), which differs from the opaque cursor behavior in MetaRedisBackend and Redis SCAN. This is fine if intentional, but it would be good to document the divergence in the MetaStorageBackend contract so callers understand that local vs Redis backends may not have identical cursor behavior in edge cases.
  • The updated TestListKeys loop is much safer than the original buggy one, but it no longer verifies that each page respects the per-call limit or that cursors progress sensibly; it only checks that all keys are eventually collected and that the loop terminates.

🧪 Verification Advice

  • For larger key counts, consider manually exercising ListKeys with different limits (1, 10, 100, large) and checking that page sizes respect the limit and that the cursor sequence behaves as expected (no infinite loops, no regressions where everything comes back in a single page).
  • If MetaLocalBackend is used in performance-sensitive scenarios, benchmarking ListKeys before and after this change with large keyspaces (e.g., tens or hundreds of thousands of keys) would help quantify the impact of the full-table sort per call.

💡 Thoughts & Suggestions

  • If deterministic ordering is important but ListKeys is on a hot path, we might explore using an ordered container for keys or maintaining a separate sorted index, so each scan page does not require rebuilding and sorting the entire key set.
  • Strengthening the stepwise ListKeys test slightly (by asserting per-call keys.size() <= limit and, optionally, some simple invariants on cursor progression) would help ensure the new cursor-based behavior stays correct over time without adding much complexity to the tests.

🤖 Generated by QoderView workflow run

@github-actions github-actions bot added the ai reviewed AI has reviewed this PR label Mar 18, 2026
@oldsharp oldsharp removed the ai reviewed AI has reviewed this PR label Mar 18, 2026
Copy link
Copy Markdown

@qoderai qoderai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Review Summary

The move to a key-based cursor with a sorted in-memory cache plus the strengthened tests addresses the original nondeterministic iteration and the buggy one-shot scan test. The new tests do a nice job of checking cursor monotonicity, per-page limits, and cache invalidation on writes.

🛡️ Key Risks & Issues

  • Cursor validation is currently inconsistent between limit > 0 and limit <= 0: for positive limits, non-base cursors are validated and "invalid_cursor" returns EC_BADARGS, while for limit <= 0 the function returns EC_OK and silently resets the cursor to SCAN_BASE_CURSOR without ever parsing it. This is a behavior change from the previous implementation (which validated the cursor regardless of limit) and could hide client-side bugs where a bad cursor is passed along with a zero or negative limit.
  • The locking strategy around ListKeys is now stricter: we take mutex_ and hold it through RebuildSortedCacheUnderLock (which walks and sorts the full key set) and the subsequent page extraction. This makes the behavior more deterministic, but at the cost of serializing all writers on the same mutex for the duration of large scans or cache rebuilds. It’s not a correctness bug, but in write-heavy environments with large keyspaces the longer critical section could become a noticeable bottleneck.

🧪 Verification Advice

  • Extend TestListKeys with a case that exercises ListKeys("invalid_cursor", 0, ...) (and optionally negative limits) so the cursor handling behavior is intentional and protected by tests, either expecting EC_BADARGS uniformly or documenting/locking in the new permissive behavior.
  • If MetaLocalBackend is used in performance-sensitive code paths, consider a simple benchmark that compares ListKeys before and after this change for large tables (e.g., tens of thousands of keys) under mixed read/write load to confirm that the extra time holding mutex_ is acceptable.

💡 Thoughts & Suggestions

  • If we want consistent API semantics, it might be worth validating non-base cursors before the limit <= 0 early return, so that any invalid cursor always yields EC_BADARGS regardless of the limit. Alternatively, a short comment in the interface explaining that limit <= 0 effectively “resets” scanning and ignores the input cursor would clarify the intended contract.
  • The new cache behavior test is a good addition; if we decide to evolve the cache further (e.g., to handle updates without invalidation or to relax ordering slightly), it may be useful to distinguish tests that assert the public contract (no missing/duplicated keys, eventual termination) from those that assert specific internal pagination patterns, so we have room to optimize the internals later without rewriting tests.

🤖 Generated by QoderView workflow run

@github-actions github-actions bot added the ai reviewed AI has reviewed this PR label Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai reviewed AI has reviewed this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant