Skip to content

Latest commit

 

History

History
54 lines (42 loc) · 3.67 KB

File metadata and controls

54 lines (42 loc) · 3.67 KB

Phase 5: Distributed Optimization

Overview

Phase 5 introduced high-level optimizations to reduce network latency and enable complex multi-shard query patterns.

Key Components

1. Shard Pruning (distributed/distributed_executor.cpp)

Optimized query routing based on partitioning keys.

  • Predicate Analysis: Detects filters on sharding keys (e.g., WHERE id = 100).
  • Targeted Dispatch: Routes fragments only to the specific node owning the shard, avoiding cluster-wide broadcasts.

2. Aggregation Merging

Implemented coordination for distributed analytics.

  • Partial Aggregation: Data nodes compute local counts and sums.
  • Global Merge: The coordinator identifies aggregate functions in the SELECT list and merges partial results from all shards into a final result set.

3. Broadcast Join Orchestration

Developed a prototype for cross-shard JOINs.

  • Table Fetching: Coordinator retrieves full data from a smaller table across all shards.
  • Broadcasting: Pushes the gathered data to the ShuffleBuffer of every node in the cluster.
  • Local Execution: Rewrites the query so each node joins its local shard with the broadcasted buffer data.

4. Shuffle Infrastructure

Enabled inter-node data movement.

  • BufferScanOperator: A physical operator that reads from in-memory shuffle buffers instead of heap files.
  • ClusterManager Buffering: Thread-safe staging area for data received via PushData RPCs.

Lessons Learned

  • Broadcast joins are highly effective for small-to-large table joins but require careful consideration of coordinator memory limits.
  • Merging aggregates at the coordinator is a bottleneck for very large clusters; future work could explore tree-based merging.

Status: 100% Test Pass

All scenarios, including distributed transactions (2PC) and join orchestration, have been verified with automated integration tests.

Phase 5 Completion: QueryExecutor Integration

The vectorized execution engine was wired into QueryExecutor via set_parallel(true) mode, enabling SELECT queries to optionally use the vectorized batch path:

  • QueryExecutor::set_parallel(true) — enables vectorized batch execution
  • QueryExecutor::set_storage_manager() — provides StorageManager for ColumnarTable lookups
  • build_vectorized_plan() — constructs operator tree (Scan → Filter → HashJoin → GroupBy → Project)
  • execute_select() — branches on use_vectorized flag between Volcano (tuple) and vectorized (batch) paths
  • Join type support: VectorizedHashJoinOperator supports INNER, LEFT, RIGHT, and FULL outer joins via JoinType enum. RIGHT and FULL outer joins use right_matched_ bitmap and emit_unmatched_right_rows() to emit unmatched right rows at end of probe.
  • Constraint: Sort/Limit queries fall back to Volcano path since SortOperator/LimitOperator don't inherit from VectorizedOperator

Phase 5 Extension: Cost-Based Volcano/Vectorized Chooser (PR #75)

The flag-based chooser (parallel_ && storage_manager_ && !has_sort_or_limit) was replaced with a cost-based heuristic using per-table statistics:

  • ANALYZE TABLE — single-pass scan collects min/max/NDV/null_count stats stored in ColumnInfo (catalog)
  • RowEstimator (optimizer/row_estimator.cpp) — row count estimation from column statistics
  • kVectorizedRowThreshold = 10000 — heuristic: Vectorized batch execution outperforms Volcano above ~10k rows
  • Chooser guard — checks ExprType::Column to skip estimation for JOINs, subqueries, and aliased tables (Volcano fallback handles these correctly)
  • Text NDV — 64-char prefix truncation limits memory; note that long shared prefixes are underestimated

Status: 100% Test Pass