feat: Add memory-limited execution for NestedLoopJoinExec #21448

Open
viirya wants to merge 3 commits into apache:main from viirya:nlj-memory-limited-execution

@viirya (Member) commented Apr 7, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

NestedLoopJoinExec currently fails with an OOM error when the left (build) side exceeds the memory budget. This PR adds a spill-to-disk fallback so the query can complete instead of crashing.

What changes are included in this PR?

When collecting the left input through OnceFut (collect_left_input) fails with ResourcesExhausted, each partition independently falls back to a memory-limited multi-pass strategy:

  1. Re-executes the left child to get a fresh stream
  2. Buffers left data in chunks sized to fit the memory budget
  3. Spills the right side to disk on the first pass and re-reads it on subsequent passes

The fallback is transparent: if memory is sufficient, the existing OnceFut path is used with zero overhead. It is currently gated to join types that don't require global right-side bitmap tracking (INNER, LEFT, LEFT SEMI, LEFT ANTI, LEFT MARK). RIGHT, FULL, RIGHT SEMI, RIGHT ANTI, and RIGHT MARK joins retain the existing OOM behavior until a cross-chunk right bitmap is added.
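For readers who want the three steps in concrete form, here is a minimal standalone sketch of the multi-pass idea for an INNER join. It is not the code in this PR: `multi_pass_inner_join`, `join_chunk`, the `l < r` predicate, and the text-based spill file are all hypothetical, and a fixed row count (`chunk_rows`) stands in for the memory budget.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::Path;

/// Emit every (l, r) pair in the buffered left chunk that satisfies `l < r`.
fn join_chunk(chunk: &[i64], r: i64, out: &mut Vec<(i64, i64)>) {
    out.extend(chunk.iter().filter(|&&l| l < r).map(|&l| (l, r)));
}

/// Inner nested loop join that never holds more than `chunk_rows` left rows.
/// Assumption: a row count stands in for the real memory budget.
fn multi_pass_inner_join(
    left: impl Iterator<Item = i64>,
    right: impl Iterator<Item = i64>,
    chunk_rows: usize,
    spill_path: &Path,
) -> io::Result<Vec<(i64, i64)>> {
    assert!(chunk_rows > 0, "a chunk must hold at least one row");
    let mut left = left.peekable();
    let mut right = Some(right); // the live right stream, used on the first pass only
    let mut out = Vec::new();

    while left.peek().is_some() {
        // Buffer the next left chunk.
        let chunk: Vec<i64> = left.by_ref().take(chunk_rows).collect();

        if let Some(first_pass) = right.take() {
            // First pass: join against the live right stream while spilling it.
            let mut spill = BufWriter::new(File::create(spill_path)?);
            for r in first_pass {
                writeln!(spill, "{r}")?;
                join_chunk(&chunk, r, &mut out);
            }
            spill.flush()?;
        } else {
            // Later passes: re-read the right side from the spill file.
            for line in BufReader::new(File::open(spill_path)?).lines() {
                let r = line?
                    .parse::<i64>()
                    .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
                join_chunk(&chunk, r, &mut out);
            }
        }
    }
    Ok(out)
}
```

Calling it with five left rows and `chunk_rows = 2` makes three passes over the right side: one live pass and two from the spill file. The output order differs from a single-pass join, but the set of pairs is the same.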

Are these changes tested?

Unit tests

Are there any user-facing changes?

No

viirya added 3 commits April 7, 2026 12:02
Implement multi-pass execution strategy for NestedLoopJoinExec when the
left (build) side exceeds the memory budget. Instead of failing with OOM,
the operator now:

1. Buffers left-side data in chunks that fit within memory limits
2. Spills the right side to disk on the first pass via SpillManager
3. Re-reads the right side from the spill file for each subsequent
   left chunk

This is enabled automatically when disk spilling is available and the
right side has a single partition. Multi-partition right side falls back
to the existing OnceFut-based path.

Phase 1 supports INNER, LEFT, LEFT SEMI, LEFT ANTI, and LEFT MARK join
types. RIGHT/FULL joins with global right bitmap tracking will follow in
a later phase.

Tracking issue: apache#15760

Co-authored-by: Isaac

RIGHT, FULL, RIGHT SEMI, RIGHT ANTI, and RIGHT MARK joins require
tracking which right-side rows have been matched across ALL left
chunks. The current implementation only tracks right-side matches
per-batch within a single left chunk, which would silently produce
incorrect results in multi-pass mode.

Gate the memory-limited path on `!need_produce_right_in_final()` so
these join types fall back to the standard OnceFut path. A global
right bitmap spanning all left chunks will be added in Phase 3.

Co-authored-by: Isaac
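To illustrate why the gate in this commit is needed, the sketch below models a RIGHT ANTI join processed one left chunk at a time. It is a simplified model, not the operator's code: the `JoinType` enum is a local stand-in, `need_produce_right_in_final` only mirrors the name used in the commit message, and `right_anti_join` and `unmatched` are hypothetical helpers.

```rust
/// Local stand-in for the join type enum (assumption, not the DataFusion type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinType {
    Inner, Left, Right, Full,
    LeftSemi, RightSemi, LeftAnti, RightAnti, LeftMark, RightMark,
}

/// True for join types whose output depends on which right rows matched
/// across the whole left side.
fn need_produce_right_in_final(join_type: JoinType) -> bool {
    matches!(
        join_type,
        JoinType::Right
            | JoinType::Full
            | JoinType::RightSemi
            | JoinType::RightAnti
            | JoinType::RightMark
    )
}

/// Right rows whose bitmap entry is still false.
fn unmatched(right: &[i64], matched: &[bool]) -> Vec<i64> {
    right.iter().zip(matched).filter(|(_, m)| !**m).map(|(r, _)| *r).collect()
}

/// RIGHT ANTI join on equality, one left chunk per pass.
/// `global_bitmap = false` models match tracking that is reset for every chunk.
fn right_anti_join(left_chunks: &[Vec<i64>], right: &[i64], global_bitmap: bool) -> Vec<i64> {
    let mut out = Vec::new();
    let mut matched = vec![false; right.len()];
    for chunk in left_chunks {
        if !global_bitmap {
            matched.fill(false); // matches from earlier chunks are forgotten
        }
        for (i, r) in right.iter().enumerate() {
            if chunk.contains(r) {
                matched[i] = true;
            }
        }
        if !global_bitmap {
            out.extend(unmatched(right, &matched)); // emitted once per chunk
        }
    }
    if global_bitmap {
        out.extend(unmatched(right, &matched)); // emitted once, after all chunks
    }
    out
}
```

With left chunks `[1]` and `[2]` and right rows `[1, 2, 3]`, the global bitmap yields only `3`, while the per-chunk variant also emits `1` and `2` and repeats `3`. That is the silent miscount the gate avoids.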
Instead of deciding the execution path at execute() time, always
attempt to load all left data in memory via OnceFut first. If that
fails with ResourcesExhausted, each partition independently falls
back to memory-limited mode by:

1. Re-executing the left child to get a fresh stream
2. Setting up SpillManager for right-side spilling
3. Switching to incremental chunked loading

This removes the right_partition_count == 1 restriction — fallback
now works regardless of how many right partitions exist. Each
partition independently re-executes the left child on OOM.

The fallback is gated on:
- Disk manager supports temp files
- Join type supports multi-pass (!need_produce_right_in_final)

Co-authored-by: Isaac
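The decision described in this commit can be pictured with a small sketch. The names here are hypothetical: `JoinError` stands in for the engine's error type, and `ExecPath` and `choose_path` exist only for illustration. The point is that the in-memory attempt always runs first, and only a ResourcesExhausted failure that passes both gates is turned into the fallback.

```rust
/// Stand-in for the engine's error type (assumption for this sketch).
#[derive(Debug, PartialEq)]
enum JoinError {
    ResourcesExhausted(String),
    Other(String),
}

/// Which execution path a partition ends up on.
#[derive(Debug, PartialEq)]
enum ExecPath {
    InMemory,
    MemoryLimited,
}

/// `collect_left` is the outcome of loading the whole left side in memory.
fn choose_path(
    collect_left: Result<(), JoinError>,
    disk_supports_temp_files: bool,
    join_supports_multi_pass: bool,
) -> Result<ExecPath, JoinError> {
    match collect_left {
        // Enough memory: stay on the existing path.
        Ok(()) => Ok(ExecPath::InMemory),
        // Out of memory and both gates pass: fall back instead of failing.
        Err(JoinError::ResourcesExhausted(_))
            if disk_supports_temp_files && join_supports_multi_pass =>
        {
            Ok(ExecPath::MemoryLimited)
        }
        // Any other error, or a gate that fails, surfaces unchanged.
        Err(e) => Err(e),
    }
}
```

If either gate fails, the original ResourcesExhausted error is returned as-is, which matches the "retain the existing OOM behavior" wording in the PR description.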
github-actions bot added the physical-plan label (Changes to the physical-plan crate) Apr 7, 2026
@2010YOUY01 (Contributor) left a comment

Thank you — this idea looks good to me. I’ve left a few comments; this should be ready to go once the repeated left-side evaluation issue is addressed.

// ========================================================================
/// Left input stream for incremental buffering (memory-limited mode only).
/// None when using the standard OnceFut path.
left_stream: Option<SendableRecordBatchStream>,

A minor style point, possibly for a follow-up: I noticed that in multiple places the memory-limited execution logic uses Option fields to implicitly track the current fallback state.

It might be clearer to introduce a dedicated state enum, for example:

enum NljMemLimitedState {
    Unsupported,      // this join type cannot use memory-limited fallback
    FirstRightPass,
    SubsequentRightPass,
    // ...
}

This would make the execution state explicit, and the Option fields could then be used purely as payload (or for sanity checks), rather than also acting as implicit state flags.
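To make the suggestion concrete, here is one possible shape for such an enum with an explicit transition function. This is only a sketch: the `Done` variant, the `pass` payload, and `on_right_exhausted` are hypothetical additions for illustration and are not part of the suggestion above or of the PR.

```rust
/// Explicit fallback state; payload lives in the variant that needs it.
#[derive(Debug, PartialEq)]
enum NljMemLimitedState {
    /// This join type cannot use the memory-limited fallback.
    Unsupported,
    /// Right side is read from its input and spilled as it goes.
    FirstRightPass,
    /// Right side is re-read from the spill file; `pass` counts from 2.
    SubsequentRightPass { pass: usize },
    /// All left chunks have been joined (hypothetical terminal state).
    Done,
}

impl NljMemLimitedState {
    /// Transition taken when the right side has been fully consumed for the
    /// current left chunk. `left_has_more` says whether another chunk remains.
    fn on_right_exhausted(self, left_has_more: bool) -> Self {
        match self {
            Self::Unsupported => Self::Unsupported,
            Self::FirstRightPass | Self::SubsequentRightPass { .. } if !left_has_more => {
                Self::Done
            }
            Self::FirstRightPass => Self::SubsequentRightPass { pass: 2 },
            Self::SubsequentRightPass { pass } => Self::SubsequentRightPass { pass: pass + 1 },
            Self::Done => Self::Done,
        }
    }
}
```

With the state spelled out like this, a `match` on the enum replaces checks such as "is `left_stream` set?", and the compiler flags any state a handler forgets to cover.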

left_reservation: Option<MemoryReservation>,
/// A batch that couldn't be added to the current chunk due to memory limit.
/// It will be the first batch in the next chunk.
left_stashed_batch: Option<RecordBatch>,

This is also optional: I think it would be simpler to eliminate this special case and push the stashed batch directly onto the end of left_pending_batches, since it is already in memory.

/// Incrementally polls the left stream and accumulates batches until:
/// - Memory reservation fails (chunk is full, more data remains)
/// - Left stream is exhausted (this is the last/only chunk)
fn handle_buffering_left_memory_limited(

I think this issue should be addressed first:

In the regular in-memory path, the left side is evaluated only once, buffered, and shared across all partitions via OnceFut<JoinLeftData>.

In contrast, the memory-limited fallback re-evaluates the left side for each partition. This can be inefficient and may even increase memory pressure. Ideally, the left side would be evaluated only once, as in the in-memory path. (I can imagine this is tricky because of how OnceFut is used; I'm still thinking about whether there is an easy way to do it.)
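The difference between the two paths can be shown with a synchronous analogy. This sketch uses threads and `std::sync::OnceLock` as a stand-in for the async OnceFut, so it only illustrates the "evaluate once, share everywhere" property and is not a proposal for the fix; `evaluate_left` and `run_partitions` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Pretend to evaluate the left child; counts how often that happens.
fn evaluate_left(evaluations: &AtomicUsize) -> Vec<i64> {
    evaluations.fetch_add(1, Ordering::SeqCst);
    (0..1_000).collect()
}

/// Runs one thread per partition and returns how many times the left side
/// was evaluated. `share_left = true` models the in-memory path, `false`
/// models a fallback where every partition re-executes the left child.
fn run_partitions(partitions: usize, share_left: bool) -> usize {
    let evaluations = Arc::new(AtomicUsize::new(0));
    let shared: Arc<OnceLock<Arc<Vec<i64>>>> = Arc::new(OnceLock::new());

    let handles: Vec<_> = (0..partitions)
        .map(|_| {
            let evaluations = Arc::clone(&evaluations);
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let left: Arc<Vec<i64>> = if share_left {
                    // Only the first caller runs the closure; the rest reuse it.
                    Arc::clone(shared.get_or_init(|| Arc::new(evaluate_left(&evaluations))))
                } else {
                    Arc::new(evaluate_left(&evaluations))
                };
                left.len()
            })
        })
        .collect();

    for handle in handles {
        assert_eq!(handle.join().unwrap(), 1_000);
    }
    evaluations.load(Ordering::SeqCst)
}
```

With four partitions, the shared variant evaluates the left side once and the per-partition variant evaluates it four times, each holding its own copy. That duplication is the source of the extra memory pressure mentioned above.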


// ========================================================================
// Memory-limited execution tests
// ========================================================================

I recommend adding some end-to-end tests: in sqllogictest (slt) files, run NLJ queries that use generate_series() as the table source, and check the spill metrics in the EXPLAIN ANALYZE output.
