YTDB-604: Lazy RID-only iteration for MATCH traversal steps #863
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a crucial performance optimization for the database's MATCH engine. By implementing a lazy loading strategy for intermediate traversal steps, the system now avoids materializing full Result objects when only Record IDs are needed. This change drastically cuts down on disk I/O, CPU cycles spent on deserialization, and memory overhead from short-lived objects, leading to more efficient execution of complex graph traversals, especially in scenarios with large datasets.
Code Review
This pull request introduces a significant optimization for the MATCH query engine by implementing a "RID-only path" for VertexFromLinkBagIterable. A new ridIterator() method is added to VertexFromLinkBagIterable which allows iterating over RecordId objects directly from a LinkBag without loading the full entities, applying class and RID filters. The MatchEdgeTraverser.toExecutionStream method is updated to utilize this new iterator, enabling lazy loading of entities in MATCH traversals. Comprehensive unit tests have been added to validate the functionality and lazy-loading behavior of the new ridIterator() and its integration with the MATCH execution stream. I have no feedback to provide as there were no review comments.
Test Count Gate Results
Tolerance: 5% drop allowed per module
Overall: ✅ 18060 tests (baseline: 18048, +12)
Coverage Gate Results
Thresholds: 85% line, 70% branch
Line Coverage: ✅ 100.0% (32/32 lines)
Branch Coverage: ✅ 100.0% (20/20 branches)
3039c3f to ce57cae
JMH LDBC Benchmark Comparison
[Result tables not preserved: Single-Thread Results, Multi-Thread Results, Scalability (MT/ST ratio)]
Hi @sandrawar, please profile the regressions with async-profiler on a Hetzner CCX33 node and find out what caused them.
ce57cae to eaf7698
…lter exists The unconditional ridIterator() path caused ~50% regression on IC4 and smaller regressions on IC1/IC2 MT because the WHERE filter forces loading every entity anyway, making the lazy ResultInternal path more expensive than eager VertexFromLinkBagIterator loading (extra isBlob() schema lookups per entity). Now ridOnlyPath=true only when filter==null.
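The rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `RidOnlyPathChoice`, `useRidOnlyPath`, and `traverse` are hypothetical names, RIDs are modeled as plain `Long` values, and entities as `Map<String, Object>`. It only shows why deferring the load pays off when no WHERE filter is present, and why the eager path is chosen otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch; not the MatchEdgeTraverser implementation. */
final class RidOnlyPathChoice {

  /** Mirrors the commit message: the RID-only path is used only when there is no filter. */
  static boolean useRidOnlyPath(Predicate<Map<String, Object>> whereFilter) {
    return whereFilter == null;
  }

  /**
   * Returns one supplier per surviving vertex.
   * Without a filter the load is deferred until the supplier is called;
   * with a filter every entity must be loaded anyway, so it is loaded eagerly.
   */
  static List<Supplier<Map<String, Object>>> traverse(
      List<Long> rids,
      Function<Long, Map<String, Object>> loader,
      Predicate<Map<String, Object>> whereFilter) {
    List<Supplier<Map<String, Object>>> out = new ArrayList<>();
    for (Long rid : rids) {
      if (useRidOnlyPath(whereFilter)) {
        // Deferred: storage is touched only if a consumer asks for the entity.
        out.add(() -> loader.apply(rid));
      } else {
        // Eager: the filter needs the entity's properties right now.
        Map<String, Object> entity = loader.apply(rid);
        if (whereFilter.test(entity)) {
          out.add(() -> entity);
        }
      }
    }
    return out;
  }
}
```

In this sketch, a traversal without a filter performs no loads until a supplier is invoked, while a filtered traversal performs one load per candidate regardless of which path is taken, so the lazy wrapper would add overhead without saving any I/O.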
PR Title:
YTDB-604: Lazy RID-only iteration for MATCH traversal steps
Motivation:
Async-profiler data from LDBC IC5 (128K+ traversals) showed that the MATCH engine loads every intermediate vertex from storage (loadEntity()) even when only the RID is needed for traversal to the next hop. This causes unnecessary disk I/O, deserialization (EntityImpl.deserializeProperties(), 1.45% CPU), and GC pressure from short-lived ResultInternal objects wrapping full entities.
Most intermediate MATCH steps only need the RID; properties are only read at the final projection (RETURN post.title). By deferring loadEntity() to first property access, we skip I/O entirely for vertices that are just traversal waypoints or get rejected by downstream WHERE filters.
The fix adds ridIterator() to VertexFromLinkBagIterable, which yields bare RecordId objects from the LinkBag without touching storage. MatchEdgeTraverser.toExecutionStream() uses this path for VertexFromLinkBagIterable results. ResultInternal's existing lazy loading handles the rest: getIdentity() returns the RID immediately, getProperty() triggers loadEntity() on first access.
Class and RID pre-filters are preserved (both operate on the RID, no I/O needed).
No behavioral change for non-MATCH consumers: iterator() still returns loaded Vertex objects.
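The mechanism described above can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in rather than the actual YouTrackDB code: `Rid` approximates RecordId, `LazyResult` approximates the lazy-loading behavior attributed to ResultInternal, and the static `ridIterator` helper approximates a RID-level pre-filter over a LinkBag, modeled as a plain `Iterable<Rid>`.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical stand-ins; none of these types are the real YouTrackDB classes. */
final class LazyRidSketch {

  /** Simplified record id: a collection id plus a position. */
  record Rid(int collectionId, long position) {}

  /** Holds a RID and loads the entity only on first property access. */
  static final class LazyResult {
    private final Rid rid;
    private final Function<Rid, Map<String, Object>> loader;
    private Map<String, Object> entity; // stays null until getProperty() is called

    LazyResult(Rid rid, Function<Rid, Map<String, Object>> loader) {
      this.rid = rid;
      this.loader = loader;
    }

    /** Returns the RID without touching storage. */
    Rid getIdentity() {
      return rid;
    }

    /** Triggers the load on first access, then reuses the loaded entity. */
    Object getProperty(String name) {
      if (entity == null) {
        entity = loader.apply(rid);
      }
      return entity.get(name);
    }
  }

  /** Iterates RIDs that pass a RID-level pre-filter; no entity is loaded. */
  static Iterator<Rid> ridIterator(Iterable<Rid> linkBag, Predicate<Rid> preFilter) {
    Iterator<Rid> source = linkBag.iterator();
    return new Iterator<Rid>() {
      private Rid pending; // next accepted RID, or null if not yet looked up

      @Override
      public boolean hasNext() {
        while (pending == null && source.hasNext()) {
          Rid candidate = source.next();
          if (preFilter.test(candidate)) {
            pending = candidate;
          }
        }
        return pending != null;
      }

      @Override
      public Rid next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Rid result = pending;
        pending = null;
        return result;
      }
    };
  }
}
```

In this sketch, a pre-filter such as `rid -> rid.collectionId() == 10` plays the role of a class filter that can be decided from the RID alone, and a vertex that is only used as a traversal waypoint never invokes the loader.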