LSM scan() and compaction previously collected all entries from all sources into a single Vec, sorted it in O(n log n) time, then deduplicated it. This allocated O(n) memory for the combined Vec. The new MergeIterator uses a BinaryHeap to k-way merge pre-sorted sources in O(n log k) time with O(k) heap memory, where k = number of sources. Keys are moved (not cloned) through the heap via std::mem::take. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
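As an illustration of the k-way merge described in this commit, here is a minimal sketch using only the standard library. The names `MergeIter`, `Head`, and `Entry` are hypothetical and are not taken from `merge_iter.rs`. It also assumes that a lower source index means a newer source, so the newer version of a duplicate key wins. Because the popped heap item is owned here, its key is moved out directly rather than through `std::mem::take`.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// (key, value) pair; both are plain byte vectors in this sketch.
pub type Entry = (Vec<u8>, Vec<u8>);

/// Current head entry of one source, as stored in the heap.
struct Head {
    key: Vec<u8>,
    value: Vec<u8>,
    source: usize, // assumption: lower index = newer source
}

impl PartialEq for Head {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key && self.source == other.source
    }
}
impl Eq for Head {}
impl PartialOrd for Head {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Head {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap, so the comparison is reversed:
        // the smallest key pops first, ties go to the lowest source index.
        other
            .key
            .cmp(&self.key)
            .then_with(|| other.source.cmp(&self.source))
    }
}

/// Merges k pre-sorted sources; the heap never holds more than k items.
pub struct MergeIter<I: Iterator<Item = Entry>> {
    sources: Vec<I>,
    heap: BinaryHeap<Head>,
}

impl<I: Iterator<Item = Entry>> MergeIter<I> {
    pub fn new(mut sources: Vec<I>) -> Self {
        let mut heap = BinaryHeap::with_capacity(sources.len());
        for (source, it) in sources.iter_mut().enumerate() {
            if let Some((key, value)) = it.next() {
                heap.push(Head { key, value, source });
            }
        }
        MergeIter { sources, heap }
    }

    /// Pushes the next entry of `source` onto the heap, if there is one.
    fn refill(&mut self, source: usize) {
        if let Some((key, value)) = self.sources[source].next() {
            self.heap.push(Head { key, value, source });
        }
    }
}

impl<I: Iterator<Item = Entry>> Iterator for MergeIter<I> {
    type Item = Entry;

    fn next(&mut self) -> Option<Entry> {
        let head = self.heap.pop()?;
        self.refill(head.source);
        // Discard older versions of the same key from the other sources.
        while self.heap.peek().map_or(false, |h| h.key == head.key) {
            if let Some(dup) = self.heap.pop() {
                self.refill(dup.source);
            }
        }
        // The key and value are moved out of the heap item, not cloned.
        Some((head.key, head.value))
    }
}
```

Each output entry costs one pop and at most one push per consumed source entry, which is where the O(n log k) bound comes from.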
The memtable previously used RwLock<BTreeMap> which serialized all writers behind a single write lock. Replace with crossbeam-skiplist SkipMap for O(log n) lock-free concurrent inserts and reads. Multiple threads can now insert and scan the memtable simultaneously without any lock contention, eliminating the primary write bottleneck. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
io_uring's submit_and_wait(N) is a blocking syscall that was previously called directly on Tokio worker threads, stalling async task execution. Now all io_uring operations (read, write, sync, batch) are wrapped in tokio::task::spawn_blocking to execute on the dedicated blocking pool. Batch operations are extracted into standalone functions (read_batch_blocking, write_batch_blocking) for clarity and ownership. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Physical planner now detects equi-join conditions (Column = Column across left/right inputs) and uses HashJoin instead of NestedLoopJoin. Build phase materializes right side into a HashMap, probe phase does O(1) lookups per left row. Supports Inner, Left, Right, and Full outer joins. Non-equi conditions and cross joins fall back to nested loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
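The build and probe phases can be sketched as follows for the inner-join case. This is a simplified illustration, not the executor in `hash_join.rs`: rows are plain `Vec<i64>`, the join keys are single column indices, and the function name `hash_join_inner` is hypothetical.

```rust
use std::collections::HashMap;

/// A row is a plain vector of integers in this sketch.
pub type Row = Vec<i64>;

/// Inner equi-join on `left[left_key] == right[right_key]`.
pub fn hash_join_inner(
    left: &[Row],
    right: &[Row],
    left_key: usize,
    right_key: usize,
) -> Vec<Row> {
    // Build phase: map each right-side key to the indices of its rows.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (i, row) in right.iter().enumerate() {
        table.entry(row[right_key]).or_default().push(i);
    }

    // Probe phase: one expected O(1) lookup per left row.
    let mut out = Vec::new();
    for l in left {
        if let Some(matches) = table.get(&l[left_key]) {
            for &i in matches {
                let mut joined = l.clone();
                joined.extend_from_slice(&right[i]);
                out.push(joined);
            }
        }
    }
    out
}
```

Outer joins would additionally need to track which rows found no match and pad them with NULLs, which this sketch leaves out.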
Previously only L0→L1 compaction existed, behind a single global mutex. Now two independent mutexes (L0 vs L1+) allow L0→L1 and L1+→L2+ compactions to run concurrently. After each L0→L1 compaction, a cascading check runs L1→L2 and L2→L3 compaction if size thresholds are exceeded. This prevents unbounded level growth and the write stall death spiral. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
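The cascading check can be pictured with the sketch below. Everything in it is an assumption made for illustration: the function name `cascade_after_l0`, the fixed four levels, the per-level byte limits, and the "move the whole level down" stand-in for a real compaction. It omits the two mutexes and only shows the threshold logic.

```rust
/// Runs the cascading check after an L0→L1 compaction.
/// `level_bytes[i]` is the current size of level i and `limits[i]` its threshold.
/// Returns the (from, to) pairs of the compactions that were triggered.
pub fn cascade_after_l0(level_bytes: &mut [u64; 4], limits: &[u64; 4]) -> Vec<(usize, usize)> {
    let mut ran = Vec::new();
    // Check L1→L2 first, then L2→L3, so growth in L2 caused by the
    // first step is seen by the second.
    for level in 1..3 {
        if level_bytes[level] > limits[level] {
            // Stand-in for a real compaction: move all bytes one level down.
            level_bytes[level + 1] += level_bytes[level];
            level_bytes[level] = 0;
            ran.push((level, level + 1));
        }
    }
    ran
}
```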
Replace independent 64MB block cache and 64MB memtable flush threshold with a single configurable total_cache_bytes budget (default 128MB). Block cache capacity dynamically adjusts based on memtable pressure, giving reads more cache when writes are idle and vice versa. Also update README.md to better communicate performance goals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
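The commit does not spell out the rebalancing rule, so the sketch below is only one plausible policy: the block cache gets whatever the memtables are not currently using, subject to a floor. The struct `MemoryBudget`, the field `min_block_cache_bytes`, and the method `block_cache_capacity` are hypothetical names; only `total_cache_bytes` comes from the commit message.

```rust
/// Hypothetical shared budget between memtables and the block cache.
pub struct MemoryBudget {
    /// Total bytes shared by both consumers (128 MB by default per the commit).
    pub total_cache_bytes: usize,
    /// Assumed floor so reads always keep some cache.
    pub min_block_cache_bytes: usize,
}

impl MemoryBudget {
    /// Capacity to hand to the block cache given current memtable usage.
    pub fn block_cache_capacity(&self, memtable_bytes: usize) -> usize {
        self.total_cache_bytes
            .saturating_sub(memtable_bytes)
            .max(self.min_block_cache_bytes)
    }
}
```

Under this assumed rule, an idle write path leaves the full budget to reads, and heavy memtable pressure shrinks the cache down to the floor.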
build_snapshot() and get_current_snapshot() held log_mutex while scanning the entire LSM tree, blocking all Raft appends (writes) for the duration. At 1M rows this caused permanent 0 TPS stalls. The lock is unnecessary: snapshots only read non-Raft-prefixed keys, log operations only touch Raft-prefixed keys, and OpenRaft serializes state machine operations. Also removes redundant flush() before scan since LSM reads include active and immutable memtables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request introduces significant performance optimizations and new features to the RooDB database system. The changes focus on reducing lock contention, improving memory management, and adding query execution optimizations.
Changes:
- Replaced RwLock-based memtable with lock-free crossbeam-skiplist for concurrent writes
- Implemented unified memory budget system where block cache and memtables dynamically share a configurable memory pool
- Added hash join executor for O(n+m) equi-join performance vs O(n*m) nested loop joins
- Offloaded io_uring blocking operations to Tokio's blocking thread pool to prevent worker thread starvation
- Implemented cascading level compaction (L1→L2, L2→L3) beyond just L0→L1
- Added k-way merge iterator for efficient merging during compaction and scans
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/storage/lsm/merge_iter.rs | New k-way merge iterator using min-heap for O(n log k) merging |
| src/storage/lsm/memtable.rs | Replaced BTreeMap+RwLock with lock-free SkipMap for concurrent access |
| src/storage/lsm/engine.rs | Added unified memory budget, dynamic cache rebalancing, and level compaction |
| src/storage/lsm/compaction.rs | Added L1+→L2+ compaction logic and refactored to use merge iterator |
| src/storage/lsm/block_cache.rs | Added dynamic capacity adjustment with set_capacity() |
| src/io/uring.rs | Offloaded blocking io_uring operations to spawn_blocking |
| src/executor/hash_join.rs | New hash join executor with build/probe phases |
| src/planner/physical.rs | Added HashJoin plan node and equi-key extraction logic |
| src/planner/explain.rs | Added EXPLAIN output for HashJoin |
| src/planner/cost.rs | Added cost estimation for HashJoin |
| src/executor/engine.rs | Integrated HashJoin executor |
| src/raft/lsm_storage.rs | Removed unnecessary flush before snapshot and log_mutex from snapshot ops |
| tests/storage_tests.rs | Added tests for unified memory budget |
| tests/* | Updated all LsmConfig instantiations to use Default::default() |
| src/main.rs | Updated LsmConfig usage |
| src/bin/roodb_init.rs | Updated LsmConfig usage |
| README.md | Rewrote to emphasize performance focus |
| Cargo.toml | Added crossbeam-skiplist dependency |
Int(1) == Float(1.0) was true via PartialEq, but they hashed differently because Hash used std::mem::discriminant (different for Int vs Float). This caused hash join lookups to silently miss matches on mixed-type numeric keys. Fix by giving Int and Float a shared discriminant tag and hashing both as f64 bits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
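The fix relies on the rule that values which compare equal must hash equally. Below is a minimal sketch of that idea on a stand-in `Datum` enum; the variant set, the tag values, and the helpers `norm_bits` and `hash_of` are assumptions, not the project's actual definitions. Normalizing `-0.0` to `0.0` before hashing is an extra precaution of this sketch, since the two compare equal as floats but have different bit patterns.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the database's value type.
#[derive(Debug, Clone)]
pub enum Datum {
    Null,
    Int(i64),
    Float(f64),
    Text(String),
}

impl PartialEq for Datum {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Datum::Null, Datum::Null) => true,
            (Datum::Int(a), Datum::Int(b)) => a == b,
            (Datum::Float(a), Datum::Float(b)) => a == b,
            // Mixed numeric comparison: compare as f64.
            (Datum::Int(a), Datum::Float(b)) | (Datum::Float(b), Datum::Int(a)) => {
                (*a as f64) == *b
            }
            (Datum::Text(a), Datum::Text(b)) => a == b,
            _ => false,
        }
    }
}

/// Bit pattern used for hashing; folds -0.0 into 0.0.
fn norm_bits(f: f64) -> u64 {
    if f == 0.0 { 0.0f64.to_bits() } else { f.to_bits() }
}

impl Hash for Datum {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            Datum::Null => 0u8.hash(state),
            // Int and Float share tag 1 and both hash as f64 bits.
            Datum::Int(i) => {
                1u8.hash(state);
                norm_bits(*i as f64).hash(state);
            }
            Datum::Float(f) => {
                1u8.hash(state);
                norm_bits(*f).hash(state);
            }
            Datum::Text(s) => {
                2u8.hash(state);
                s.hash(state);
            }
        }
    }
}

/// Convenience helper for checking hash agreement.
pub fn hash_of(d: &Datum) -> u64 {
    let mut h = DefaultHasher::new();
    d.hash(&mut h);
    h.finish()
}
```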
…for UringIO
IoUring 0.7+ explicitly implements Send+Sync, making the unsafe impls unnecessary. Replace with compile-time static assertions that will catch any future regression if the io_uring crate ever removes those impls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
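A compile-time assertion of this kind can be written with the standard library alone. The sketch below uses a stand-in type, `UringIoSketch`, because the real `UringIO` wraps the external io_uring crate; the helper name `assert_send_sync` is also an assumption about how the check might be spelled.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the real I/O handle; its fields are Send + Sync.
pub struct UringIoSketch {
    pub ring: Arc<Mutex<Vec<u8>>>,
}

/// Compiles only if `T` is both Send and Sync.
pub const fn assert_send_sync<T: Send + Sync>() {}

// Evaluated at compile time: if a field ever stops being Send + Sync,
// the build fails here instead of relying on an `unsafe impl`.
const _: () = assert_send_sync::<UringIoSketch>();
```

Swapping a field for a non-thread-safe type such as `std::rc::Rc<u8>` would turn the `const _` line into a compile error.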
Hash join incorrectly matched NULL keys because Datum::PartialEq treats NULL == NULL as true. Skip inserting NULL-keyed rows into the build-side hash table and skip probe lookups for NULL-keyed rows. Outer joins naturally emit NULL-keyed rows as unmatched with padded NULLs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
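The NULL handling can be sketched on a left outer join. This is an illustration under simplifying assumptions, not the code in `hash_join.rs`: NULL is modeled as `None`, rows are `Vec<Option<i64>>`, and the function name `hash_join_left` and its `right_width` parameter are hypothetical.

```rust
use std::collections::HashMap;

/// A row of nullable integers; `None` stands for SQL NULL.
pub type Row = Vec<Option<i64>>;

/// Left outer equi-join that never matches NULL keys.
pub fn hash_join_left(
    left: &[Row],
    right: &[Row],
    left_key: usize,
    right_key: usize,
    right_width: usize,
) -> Vec<Row> {
    // Build phase: rows with a NULL key are not inserted at all.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (i, row) in right.iter().enumerate() {
        if let Some(k) = row[right_key] {
            table.entry(k).or_default().push(i);
        }
    }

    let mut out = Vec::new();
    for l in left {
        // Probe phase: a NULL key skips the lookup entirely.
        let matches = l[left_key].and_then(|k| table.get(&k));
        match matches {
            Some(indices) => {
                for &i in indices {
                    let mut joined = l.clone();
                    joined.extend_from_slice(&right[i]);
                    out.push(joined);
                }
            }
            None => {
                // Unmatched left row: pad the right side with NULLs.
                let mut joined = l.clone();
                joined.extend(std::iter::repeat(None).take(right_width));
                out.push(joined);
            }
        }
    }
    out
}
```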
…ity, not snapshot Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update design.md: SkipMap memtable, dynamic flush threshold, 25 executor operators, smart physical planner, LSM bloom/block_cache/merge_iter/memory budget, init module, txn purge+timeouts, protocol prepared/metrics/starttls, scheduler file listing, sql privileges, corrected constants table. Update LOCKING.md: split compaction mutexes, manifest RwLock, block_cache Mutex, reader_cache RwLock, scheduler locks, node-ID-based row IDs, snapshot locking clarification. Update SQL.md: type aliases, constraint enforcement, expressions/operators, NULL semantics, functions with unimplemented notes, prepared statements, query optimizations, system tables, SET statements, limitations, system variables expansion. Remove #[allow(dead_code)] from scheduler engine config field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No description provided.