posix: build the io pool from the clamped size, not the raw request #1827

Open
EylonKrause wants to merge 1 commit into ai-dynamo:main from EylonKrause:fix/posix-io-pool-size

@EylonKrause EylonKrause commented Jun 24, 2026

What?

nixlPosixIOQueueImpl (src/plugins/posix/io_queue.h) sized its ios_ pool and
free_ios_ free-list from the raw ios_pool_size constructor argument. The base
class nixlPosixIOQueue clamps that value to
[MIN_IOS_POOL_SIZE (64), MAX_IOS_POOL_SIZE] and stores the clamped result in
ios_pool_size_. This PR builds the pool from ios_pool_size_ so the pool and
the rest of the class agree.

-          ios_(ios_pool_size) {
-        for (uint32_t i = 0; i < ios_pool_size; i++) {
+          ios_(ios_pool_size_) {
+        for (uint32_t i = 0; i < ios_pool_size_; i++) {
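
Reading the clamped member inside the derived initializer list is safe because C++ constructs base-class subobjects before any derived-class members. The sketch below illustrates that ordering with hypothetical stand-in classes (QueueBaseSketch, QueueImplSketch); the element type, the kMaxPool value, and the accessor names are assumptions for illustration and are not taken from io_queue.h.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; only the clamp and the initialization order are modeled.
constexpr uint32_t kMinPool = 64;    // MIN_IOS_POOL_SIZE, per the PR description
constexpr uint32_t kMaxPool = 4096;  // assumed value; the real MAX_IOS_POOL_SIZE is not stated here

class QueueBaseSketch {
public:
    explicit QueueBaseSketch(uint32_t requested)
        : ios_pool_size_(std::clamp(requested, kMinPool, kMaxPool)) {}

protected:
    uint32_t ios_pool_size_;  // clamped value, visible to derived classes
};

class QueueImplSketch : public QueueBaseSketch {
public:
    explicit QueueImplSketch(uint32_t ios_pool_size)
        : QueueBaseSketch(ios_pool_size),
          // The base subobject is fully constructed at this point, so
          // ios_pool_size_ already holds the clamped value.
          ios_(ios_pool_size_) {
        for (uint32_t i = 0; i < ios_pool_size_; i++) {
            free_ios_.push_back(&ios_[i]);
        }
    }

    std::size_t poolEntries() const { return ios_.size(); }
    std::size_t freeEntries() const { return free_ios_.size(); }
    uint32_t clampedSize() const { return ios_pool_size_; }

private:
    std::vector<int> ios_;        // placeholder element type
    std::vector<int*> free_ios_;  // free list of pointers into ios_
};
```

Under these assumptions, constructing QueueImplSketch with a request of 1 gives a pool of 64 entries and a free list of 64 pointers, matching clampedSize().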

Why?

The io_uring and Linux-AIO completion handlers detect "all I/O finished" with
if (free_ios_.size() == ios_pool_size_) return NIXL_SUCCESS;
(io_uring_io_queue.cpp:137, linux_aio_io_queue.cpp:149,179). ios_pool_size_ is
the clamped value, but the pool was built with the raw count. For a requested
ios_pool_size of 1..63 (settable via the ios_pool_size backend param) the pool
holds fewer than 64 entries, so free_ios_.size() can never reach ios_pool_size_.
As a result, doCheckCompleted never returns NIXL_SUCCESS and the transfer never
completes (hang).

Reproduction

Output of a self-contained model of the base clamp, the pool construction, and the "all free" check:

buggy:  requested=1  -> pool_entries=1,  ios_pool_size_=64  -> all-free reachable: NO (hang)
        requested=32 -> pool_entries=32, ios_pool_size_=64  -> all-free reachable: NO (hang)
fixed:  requested=1  -> pool_entries=64, ios_pool_size_=64  -> all-free reachable: YES
        requested=32 -> pool_entries=64, ios_pool_size_=64  -> all-free reachable: YES
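
The model itself is not included in the PR text. A minimal sketch of what such a model could look like follows; it is not the author's code. PoolModel, buildPool, and formatRow are hypothetical names, and kModelMaxPool is an assumed upper bound, since the real MAX_IOS_POOL_SIZE is not given here.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

constexpr uint32_t kModelMinPool = 64;    // MIN_IOS_POOL_SIZE, per the description
constexpr uint32_t kModelMaxPool = 4096;  // assumed; not taken from the source

struct PoolModel {
    uint32_t pool_entries;    // entries ios_ and free_ios_ are built with
    uint32_t ios_pool_size_;  // clamped value stored by the base class

    // Once every I/O has completed, free_ios_.size() equals pool_entries,
    // so the "all free" check only fires when the two values match.
    bool allFreeReachable() const { return pool_entries == ios_pool_size_; }
};

// use_clamped = false models the old constructor (raw request),
// use_clamped = true models the fixed constructor (clamped member).
PoolModel buildPool(uint32_t requested, bool use_clamped) {
    const uint32_t clamped = std::clamp(requested, kModelMinPool, kModelMaxPool);
    return PoolModel{use_clamped ? clamped : requested, clamped};
}

// Formats one row in the same shape as the table above.
std::string formatRow(uint32_t requested, bool use_clamped) {
    const PoolModel m = buildPool(requested, use_clamped);
    return "requested=" + std::to_string(requested) +
           " -> pool_entries=" + std::to_string(m.pool_entries) +
           ", ios_pool_size_=" + std::to_string(m.ios_pool_size_) +
           " -> all-free reachable: " +
           (m.allFreeReachable() ? "YES" : "NO (hang)");
}
```

Calling formatRow(1, false) and formatRow(32, false) models the buggy rows, and passing true models the fixed rows.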

How (verification)

  • Confirmed the before/after with the model above.
  • Compiled the io-queue consumers (posix_aio_io_queue.cpp, io_uring_io_queue.cpp,
    which instantiate nixlPosixIOQueueImpl) in-tree with
    -Dsanitizer=address,undefined (exit 0; liburing builds via the meson subproject).

Related Issues

None.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed an internal sizing inconsistency in the POSIX I/O queue so its storage now matches the effective pool size used by the rest of the component.
    • Improved reliability when initializing queue resources, reducing the chance of mismatched allocation behavior.

nixlPosixIOQueueImpl sized its ios_ pool and free_ios_ free-list from the
raw ios_pool_size constructor argument, but the base class clamps that
value to [MIN_IOS_POOL_SIZE=64, MAX_IOS_POOL_SIZE] and stores the clamped
result in ios_pool_size_, which the io_uring and Linux-AIO completion
handlers use as the "all ios free" sentinel
(free_ios_.size() == ios_pool_size_).

For a requested ios_pool_size of 1..63 (settable via the ios_pool_size
backend param) the pool holds fewer than 64 entries, so free_ios_.size()
can never equal ios_pool_size_ and doCheckCompleted never returns
NIXL_SUCCESS -- the transfer never completes. Build the pool from the
clamped ios_pool_size_ so the pool size and the completion check agree.

Signed-off-by: Eylon Krause <eylon1909@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi EylonKrause! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 880f9ace-358a-4d55-9d8f-55e19fe576e4

📥 Commits

Reviewing files that changed from the base of the PR and between a9f456b and 1ae1dff.

📒 Files selected for processing (1)
  • src/plugins/posix/io_queue.h

📝 Walkthrough

In nixlPosixIOQueueImpl, the constructor initializer list is updated to size the ios_ vector using the already-clamped ios_pool_size_ member (set by the base-class constructor) rather than the raw ios_pool_size constructor argument.

Changes

POSIX IO Queue Pool Size Fix

| Layer / File(s) | Summary |
|---|---|
| ios_ vector sized from normalized pool size (src/plugins/posix/io_queue.h) | Constructor initializer list changed from the raw ios_pool_size parameter to the clamped ios_pool_size_ member when constructing ios_, aligning the vector size with the base-class normalization. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

A bunny found a clamp applied with care,
But the vector missed it — sized from thin air!
Now ios_pool_size_ sets the count,
No raw parameter to surmount.
🐇 Consistent sizing, hip hooray! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main fix: using the clamped I/O pool size instead of the raw request. |
| Description check | ✅ Passed | The description follows the template with What, Why, and How sections and provides enough context, rationale, and verification details. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
