Performance Fixes & Updates #11

Merged
jgarzik merged 17 commits into main from updates on Feb 13, 2026

Conversation

jgarzik (Owner) commented Feb 13, 2026

No description provided.

jgarzik and others added 9 commits February 13, 2026 14:15
LSM scan() and compaction previously collected all entries from all
sources into a single Vec, sorted it O(n log n), then deduplicated.
This allocated O(n) memory for the combined Vec.

New MergeIterator uses a BinaryHeap to k-way merge pre-sorted sources
in O(n log k) time with O(k) heap memory, where k = number of sources.
Keys are moved (not cloned) through the heap via std::mem::take.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
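
The k-way merge described in the commit above can be sketched as follows. This is a minimal illustration, not the code from `merge_iter.rs`: the names `MergeIter`, `HeapItem`, and `Entry` are hypothetical, and the sketch assumes that a lower source index means a newer source, so the newest version of a duplicated key wins. Keys are moved out of the heap by destructuring here, whereas the commit mentions `std::mem::take`.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// (key, value) pair; a stand-in for the real entry type.
pub type Entry = (Vec<u8>, Vec<u8>);

/// Head of one source, waiting in the heap.
struct HeapItem {
    key: Vec<u8>,
    value: Vec<u8>,
    src: usize, // assumption: lower index = newer source
}

impl Ord for HeapItem {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap, so both comparisons are reversed:
        // the smallest key pops first, and on ties the lowest src pops first.
        other.key.cmp(&self.key).then_with(|| other.src.cmp(&self.src))
    }
}
impl PartialOrd for HeapItem {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for HeapItem {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for HeapItem {}

/// Merges k pre-sorted sources; the heap holds at most one item per source.
pub struct MergeIter<I: Iterator<Item = Entry>> {
    sources: Vec<I>,
    heap: BinaryHeap<HeapItem>,
}

impl<I: Iterator<Item = Entry>> MergeIter<I> {
    pub fn new(sources: Vec<I>) -> Self {
        let mut this = Self {
            heap: BinaryHeap::with_capacity(sources.len()),
            sources,
        };
        for src in 0..this.sources.len() {
            this.refill(src);
        }
        this
    }

    /// Pull the next entry from source `src` into the heap, if any.
    fn refill(&mut self, src: usize) {
        if let Some((key, value)) = self.sources[src].next() {
            self.heap.push(HeapItem { key, value, src });
        }
    }
}

impl<I: Iterator<Item = Entry>> Iterator for MergeIter<I> {
    type Item = Entry;

    fn next(&mut self) -> Option<Entry> {
        let HeapItem { key, value, src } = self.heap.pop()?;
        self.refill(src);
        // Drop older versions of the same key without cloning the key.
        while self.heap.peek().map_or(false, |top| top.key == key) {
            let dup = self.heap.pop().expect("peeked item exists");
            self.refill(dup.src);
        }
        Some((key, value))
    }
}
```

Each `next()` call does a constant number of heap operations per consumed entry, which is where the O(n log k) bound comes from; the heap never holds more than k items.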
The memtable previously used RwLock<BTreeMap> which serialized all
writers behind a single write lock. Replace with crossbeam-skiplist
SkipMap for O(log n) lock-free concurrent inserts and reads.

Multiple threads can now insert and scan the memtable simultaneously
without any lock contention, eliminating the primary write bottleneck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
io_uring's submit_and_wait(N) is a blocking syscall that was previously
called directly on Tokio worker threads, stalling async task execution.
Now all io_uring operations (read, write, sync, batch) are wrapped in
tokio::task::spawn_blocking to execute on the dedicated blocking pool.

Batch operations are extracted into standalone functions
(read_batch_blocking, write_batch_blocking) for clarity and ownership.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Physical planner now detects equi-join conditions (Column = Column across
left/right inputs) and uses HashJoin instead of NestedLoopJoin. The build phase
materializes the right side into a HashMap; the probe phase does an expected
O(1) lookup per left row. Supports Inner, Left, Right, and Full outer joins.
Non-equi conditions and cross joins fall back to nested loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
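
The build and probe phases described in the commit above can be illustrated with a small inner-join sketch. It is not the executor from `hash_join.rs`: the `Row` type, the function name `hash_join_inner`, and the single integer key column are simplifying assumptions.

```rust
use std::collections::HashMap;

/// Simplified row: a flat list of integer columns (assumption).
pub type Row = Vec<i64>;

/// Inner equi-join on `left[left_key] = right[right_key]`.
pub fn hash_join_inner(
    left: &[Row],
    right: &[Row],
    left_key: usize,
    right_key: usize,
) -> Vec<Row> {
    // Build phase: materialize the right side, grouped by join key.
    let mut table: HashMap<i64, Vec<&Row>> = HashMap::new();
    for row in right {
        table.entry(row[right_key]).or_default().push(row);
    }

    // Probe phase: one hash lookup per left row.
    let mut out = Vec::new();
    for l in left {
        if let Some(matches) = table.get(&l[left_key]) {
            for &r in matches {
                let mut joined = l.clone();
                joined.extend_from_slice(r);
                out.push(joined);
            }
        }
    }
    out
}
```

With n left rows and m right rows, the build costs O(m) and the probe costs O(n) expected plus the size of the output, compared with O(n*m) comparisons for a nested loop.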
Previously only L0→L1 compaction existed behind a single global mutex.
Now: two independent mutexes (L0 vs L1+) allow L0→L1 and L1+→L2+
compactions to run concurrently. After each L0→L1, cascading check
runs L1→L2 and L2→L3 compaction if size thresholds are exceeded.
Prevents unbounded level growth and the write stall death spiral.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
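
The cascading check described in the commit above might look roughly like the sketch below. It is a toy model under stated assumptions: the function name `cascade_after_l0`, the per-level byte limits, and the choice to model a compaction as moving all bytes of a level into the next one are all hypothetical, and the two compaction mutexes are left out.

```rust
/// Runs after an L0→L1 compaction has finished.
/// `level_bytes[i]` is the current size of level i and `level_limit[i]`
/// its (hypothetical) size threshold. Returns the (from, to) pairs that ran.
pub fn cascade_after_l0(level_bytes: &mut [u64], level_limit: &[u64]) -> Vec<(usize, usize)> {
    let mut ran = Vec::new();
    // Start at L1 and stop before the last level, which has nowhere to spill.
    for lvl in 1..level_bytes.len().saturating_sub(1) {
        if level_bytes[lvl] > level_limit[lvl] {
            // Toy model of a compaction: push the whole level down by one.
            level_bytes[lvl + 1] += level_bytes[lvl];
            level_bytes[lvl] = 0;
            ran.push((lvl, lvl + 1));
        }
    }
    ran
}
```

Because levels are visited in ascending order, bytes pushed into L2 by the L1→L2 step are already counted when L2 is checked against its own threshold.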
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace independent 64MB block cache and 64MB memtable flush threshold
with a single configurable total_cache_bytes budget (default 128MB).
Block cache capacity dynamically adjusts based on memtable pressure,
giving reads more cache when writes are idle and vice versa.

Also update README.md to better communicate performance goals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
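
One simple way to express "block cache capacity adjusts based on memtable pressure" is to give the cache whatever the memtables are not using. The commit does not spell out the actual policy, so the sketch below is only an assumption: `total_cache_bytes` comes from the commit, while `block_cache_capacity`, `memtable_bytes`, and the `min_cache_bytes` floor are hypothetical.

```rust
/// Block cache capacity under a shared budget (assumed policy):
/// whatever the memtables do not currently use, but never below a floor.
pub fn block_cache_capacity(
    total_cache_bytes: usize,
    memtable_bytes: usize,
    min_cache_bytes: usize,
) -> usize {
    total_cache_bytes
        .saturating_sub(memtable_bytes) // no underflow if memtables overshoot
        .max(min_cache_bytes)
}
```

The result would be passed to something like the `set_capacity()` method that the review summary lists for `block_cache.rs`. Note that with a floor, the combined usage can briefly exceed the total budget when memtables are nearly full.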
build_snapshot() and get_current_snapshot() held log_mutex while scanning
the entire LSM tree, blocking all Raft appends (writes) for the duration.
At 1M rows this caused permanent 0 TPS stalls. The lock is unnecessary:
snapshots only read non-Raft-prefixed keys, log operations only touch
Raft-prefixed keys, and OpenRaft serializes state machine operations.
Also removes redundant flush() before scan since LSM reads include
active and immutable memtables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jgarzik jgarzik requested a review from Copilot February 13, 2026 18:01
@jgarzik jgarzik self-assigned this Feb 13, 2026
@jgarzik jgarzik added the bug and enhancement labels Feb 13, 2026
Copilot AI left a comment

Pull request overview

This pull request introduces significant performance optimizations and new features to the RooDB database system. The changes focus on reducing lock contention, improving memory management, and adding query execution optimizations.

Changes:

  • Replaced RwLock-based memtable with lock-free crossbeam-skiplist for concurrent writes
  • Implemented unified memory budget system where block cache and memtables dynamically share a configurable memory pool
  • Added hash join executor for O(n+m) equi-join performance vs O(n*m) nested loop joins
  • Offloaded io_uring blocking operations to Tokio's blocking thread pool to prevent worker thread starvation
  • Implemented cascading level compaction (L1→L2, L2→L3) beyond just L0→L1
  • Added k-way merge iterator for efficient merging during compaction and scans

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| src/storage/lsm/merge_iter.rs | New k-way merge iterator using min-heap for O(n log k) merging |
| src/storage/lsm/memtable.rs | Replaced BTreeMap+RwLock with lock-free SkipMap for concurrent access |
| src/storage/lsm/engine.rs | Added unified memory budget, dynamic cache rebalancing, and level compaction |
| src/storage/lsm/compaction.rs | Added L1+→L2+ compaction logic and refactored to use merge iterator |
| src/storage/lsm/block_cache.rs | Added dynamic capacity adjustment with set_capacity() |
| src/io/uring.rs | Offloaded blocking io_uring operations to spawn_blocking |
| src/executor/hash_join.rs | New hash join executor with build/probe phases |
| src/planner/physical.rs | Added HashJoin plan node and equi-key extraction logic |
| src/planner/explain.rs | Added EXPLAIN output for HashJoin |
| src/planner/cost.rs | Added cost estimation for HashJoin |
| src/executor/engine.rs | Integrated HashJoin executor |
| src/raft/lsm_storage.rs | Removed unnecessary flush before snapshot and log_mutex from snapshot ops |
| tests/storage_tests.rs | Added tests for unified memory budget |
| tests/* | Updated all LsmConfig instantiations to use Default::default() |
| src/main.rs | Updated LsmConfig usage |
| src/bin/roodb_init.rs | Updated LsmConfig usage |
| README.md | Rewrote to emphasize performance focus |
| Cargo.toml | Added crossbeam-skiplist dependency |

Comment thread src/executor/hash_join.rs
Comment thread src/storage/lsm/engine.rs
Comment thread src/io/uring.rs Outdated
Comment thread src/storage/lsm/compaction.rs
Comment thread src/executor/hash_join.rs
Comment thread src/raft/lsm_storage.rs Outdated
Comment thread src/storage/lsm/memtable.rs
Comment thread src/storage/lsm/compaction.rs
Comment thread src/storage/lsm/engine.rs
Comment thread src/executor/hash_join.rs
jgarzik and others added 8 commits February 13, 2026 18:16
Int(1) == Float(1.0) was true via PartialEq, but they hashed differently
because Hash used std::mem::discriminant (different for Int vs Float).
This caused hash join lookups to silently miss matches on mixed-type
numeric keys. Fix by giving Int and Float a shared discriminant tag and
hashing both as f64 bits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
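
The fix described in the commit above restores the rule that values which compare equal must hash equal. A minimal sketch of that idea follows; the `Datum` enum here is a reduced stand-in with hypothetical variants, not the project's type. Normalizing `-0.0` to `0.0` is an extra assumption of this sketch (the commit does not mention it), and NaN and integers beyond f64 precision are ignored.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Reduced stand-in for a SQL value type (hypothetical).
#[derive(Debug, Clone)]
pub enum Datum {
    Null,
    Int(i64),
    Float(f64),
    Text(String),
}

impl PartialEq for Datum {
    fn eq(&self, other: &Self) -> bool {
        use Datum::*;
        match (self, other) {
            (Null, Null) => true,
            (Int(a), Int(b)) => a == b,
            (Float(a), Float(b)) => a == b,
            // Mixed numeric comparison: compare as f64.
            (Int(a), Float(b)) | (Float(b), Int(a)) => (*a as f64) == *b,
            (Text(a), Text(b)) => a == b,
            _ => false,
        }
    }
}
// Sketch only: NaN breaks reflexivity, which a real implementation must handle.
impl Eq for Datum {}

/// Bits used for hashing any numeric value; -0.0 is folded into 0.0
/// because the two compare equal (assumption of this sketch).
fn numeric_bits(f: f64) -> u64 {
    if f == 0.0 { 0.0f64.to_bits() } else { f.to_bits() }
}

impl Hash for Datum {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            Datum::Null => 0u8.hash(state),
            // Int and Float share tag 1 and both hash as f64 bits.
            Datum::Int(i) => {
                1u8.hash(state);
                numeric_bits(*i as f64).hash(state);
            }
            Datum::Float(f) => {
                1u8.hash(state);
                numeric_bits(*f).hash(state);
            }
            Datum::Text(s) => {
                2u8.hash(state);
                s.hash(state);
            }
        }
    }
}

/// Helper for checking hash agreement outside of a HashMap.
pub fn hash_of(d: &Datum) -> u64 {
    let mut h = DefaultHasher::new();
    d.hash(&mut h);
    h.finish()
}
```

With a derived or discriminant-based `Hash`, `Int(1)` and `Float(1.0)` would land in different buckets and a `HashMap` lookup would miss even though `==` returns true.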
…for UringIO

IoUring 0.7+ explicitly implements Send+Sync, making the unsafe impls
unnecessary. Replace with compile-time static assertions that will catch
any future regression if the io_uring crate ever removes those impls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
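
A compile-time assertion of this kind can be written with a generic function that is only callable for `Send + Sync` types. The sketch below uses a hypothetical stand-in struct, `UringIoSketch`, in place of the real `UringIO`; the actual assertion in the PR may be written differently.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the real UringIO type.
pub struct UringIoSketch {
    #[allow(dead_code)]
    ring: Mutex<Vec<u8>>, // placeholder for the wrapped io_uring handle
}

/// Compiles only when T is both Send and Sync.
pub const fn assert_send_sync<T: Send + Sync>() {}

// Evaluated at compile time: if a field ever stops being Send + Sync,
// the build fails here instead of relying on an `unsafe impl`.
const _: () = assert_send_sync::<UringIoSketch>();

/// Runtime-callable variant, handy in tests.
pub fn is_send_sync<T: Send + Sync>() -> bool {
    true
}
```

Unlike `unsafe impl Send`, this pattern asserts nothing that the compiler has not already proven, so it cannot hide a real thread-safety problem.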
Hash join incorrectly matched NULL keys because Datum::PartialEq treats
NULL == NULL as true. Skip inserting NULL-keyed rows into the build-side
hash table and skip probe lookups for NULL-keyed rows. Outer joins
naturally emit NULL-keyed rows as unmatched with padded NULLs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
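
The NULL handling described in the commit above can be sketched for a left outer join as follows. The `Row` type with `Option<i64>` columns (where `None` models SQL NULL), the function name `hash_join_left`, and the `right_width` parameter are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Simplified row; None models SQL NULL (assumption).
pub type Row = Vec<Option<i64>>;

/// Left outer equi-join that never matches NULL keys.
pub fn hash_join_left(
    left: &[Row],
    right: &[Row],
    left_key: usize,
    right_key: usize,
    right_width: usize,
) -> Vec<Row> {
    // Build: rows with a NULL key are not inserted at all.
    let mut table: HashMap<i64, Vec<&Row>> = HashMap::new();
    for row in right {
        if let Some(k) = row[right_key] {
            table.entry(k).or_default().push(row);
        }
    }

    let mut out = Vec::new();
    for l in left {
        // Probe: a NULL key skips the lookup and counts as unmatched.
        let matches = l[left_key].and_then(|k| table.get(&k));
        match matches {
            Some(rows) => {
                for &r in rows {
                    let mut joined = l.clone();
                    joined.extend_from_slice(r);
                    out.push(joined);
                }
            }
            None => {
                // Outer join: emit the left row padded with NULLs.
                let mut joined = l.clone();
                joined.extend(std::iter::repeat(None).take(right_width));
                out.push(joined);
            }
        }
    }
    out
}
```

Keying the table on `i64` rather than `Option<i64>` makes the NULL = NULL match impossible by construction, which mirrors the "skip insert, skip probe" approach in the commit.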
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ity, not snapshot

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update design.md: SkipMap memtable, dynamic flush threshold, 25 executor
operators, smart physical planner, LSM bloom/block_cache/merge_iter/memory
budget, init module, txn purge+timeouts, protocol prepared/metrics/starttls,
scheduler file listing, sql privileges, corrected constants table.

Update LOCKING.md: split compaction mutexes, manifest RwLock, block_cache
Mutex, reader_cache RwLock, scheduler locks, node-ID-based row IDs,
snapshot locking clarification.

Update SQL.md: type aliases, constraint enforcement, expressions/operators,
NULL semantics, functions with unimplemented notes, prepared statements,
query optimizations, system tables, SET statements, limitations, system
variables expansion.

Remove #[allow(dead_code)] from scheduler engine config field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jgarzik jgarzik merged commit b370f89 into main Feb 13, 2026
10 checks passed
@jgarzik jgarzik deleted the updates branch February 13, 2026 19:53
2 participants