io_uring: expose multishot recv via IoUringOptions #44670

Draft
aburan28 wants to merge 4 commits into envoyproxy:main from aburan28:multishot-recv/04-config

@aburan28

Summary

Wires the multishot recv read path through the proto config and the bootstrap extension factory so it can be enabled at runtime.

Depends on:

Config surface

default_socket_interface:
  io_uring_options:
    enable_multishot_recv: true
    multishot_recv_buffer_count: 256   # power of two

Each buffer is sized to read_buffer_size, so the per-worker memory cost is multishot_recv_buffer_count * read_buffer_size bytes (default 256 * 8192 = 2 MiB per worker).
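
As a rough sizing aid, the arithmetic above can be sketched as follows. The helper names are hypothetical and not part of Envoy; the process-wide total assumes one buffer ring per worker thread.

```cpp
#include <cstdint>

// Hypothetical helper (not an Envoy API): bytes held by one worker's buffer ring.
constexpr uint64_t multishotRecvBytesPerWorker(uint32_t buffer_count,
                                               uint32_t read_buffer_size) {
  return static_cast<uint64_t>(buffer_count) * read_buffer_size;
}

// Hypothetical helper: process-wide total, assuming one ring per worker.
constexpr uint64_t multishotRecvBytesTotal(uint32_t workers, uint32_t buffer_count,
                                           uint32_t read_buffer_size) {
  return workers * multishotRecvBytesPerWorker(buffer_count, read_buffer_size);
}

// The defaults quoted above: 256 buffers of 8192 bytes is 2 MiB per worker.
static_assert(multishotRecvBytesPerWorker(256, 8192) == 2ull * 1024 * 1024,
              "default configuration costs 2 MiB per worker");
```

With 16 workers and the defaults, this comes to 32 MiB of ring storage across the process.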

Compatibility

Requires kernel 5.19+ for IORING_REGISTER_PBUF_RING and 6.0+ for multishot recv. On older kernels Envoy logs a warning and falls back to the existing readv path, so this option is safe to enable on a heterogeneous fleet.
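
For operators who want to estimate which hosts in a fleet will take the multishot path, the version thresholds above can be sketched as a pre-flight check. This is an illustration only: `classifyKernel` and `MultishotSupport` are hypothetical names, and per the commit descriptions Envoy decides at runtime from the result of buffer-ring setup rather than from a version string. Distribution kernels with backported features may also not match their version number.

```cpp
#include <cstdio>
#include <string>

// Hypothetical classification of what a kernel release is expected to support.
enum class MultishotSupport { None, BufRingOnly, Full };

// Parses "major.minor" from a release string such as "6.1.0-18-amd64"
// (the format reported by `uname -r`) and applies the 5.19 / 6.0 thresholds.
inline MultishotSupport classifyKernel(const std::string& release) {
  int major = 0;
  int minor = 0;
  if (std::sscanf(release.c_str(), "%d.%d", &major, &minor) != 2) {
    return MultishotSupport::None; // Unparseable: assume no support.
  }
  const bool has_buf_ring = major > 5 || (major == 5 && minor >= 19);
  const bool has_multishot_recv = major >= 6;
  if (has_buf_ring && has_multishot_recv) {
    return MultishotSupport::Full;
  }
  return has_buf_ring ? MultishotSupport::BufRingOnly : MultishotSupport::None;
}
```

Only hosts classified as `Full` would be expected to use multishot recv; the others should stay on the readv path.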

The factory constructor gains two new positional arguments. All call sites in source/test have been updated. There is no on-the-wire proto change beyond two new optional fields, so existing configs continue to work.

Test plan

  • Existing IoUringWorkerFactoryImplTest.Basic updated for the new constructor args.
  • Existing IoUringSocketHandleImpl integration test updated for the new constructor args.
  • CI on Linux runs the full io_uring integration tests with multishot off (default), which exercises the existing readv path.
  • (Follow-up) Add a multishot-on integration test once the stack lands.

This is a no-behavior-change preparation step for multishot recv. The
``CompletionCb`` callback type now takes a ``uint32_t flags`` argument
that carries the raw ``cqe->flags`` value from the kernel.

For multishot completions a follow-up change will inspect:
* ``IORING_CQE_F_BUFFER`` — a buffer was selected from a buf-ring; the
  buffer ID is encoded in the upper bits.
* ``IORING_CQE_F_MORE`` — the SQE will produce further completions.

The worker callback ignores ``flags`` for now. Injected completions are
defined to always carry ``flags == 0``.

All ``forEveryCompletion`` callers (worker, impl tests) updated.
``IoUringSocket::on*`` virtual methods are intentionally unchanged in
this commit; only ``onRead`` will need flags, in the multishot recv
change.
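
A minimal sketch of how a consumer might decode the raw ``flags`` value. The constant values mirror ``<linux/io_uring.h>`` but are redefined under local names so the snippet compiles without kernel headers; ``DecodedCqeFlags`` and ``decodeCqeFlags`` are hypothetical and not part of this change.

```cpp
#include <cstdint>
#include <optional>

// Local copies of the kernel UAPI values (see <linux/io_uring.h>).
constexpr uint32_t kCqeFBuffer = 1U << 0;  // IORING_CQE_F_BUFFER
constexpr uint32_t kCqeFMore = 1U << 1;    // IORING_CQE_F_MORE
constexpr uint32_t kCqeBufferShift = 16;   // IORING_CQE_BUFFER_SHIFT

// Hypothetical decoded view of cqe->flags.
struct DecodedCqeFlags {
  std::optional<uint16_t> buffer_id; // Set only when F_BUFFER is present.
  bool more{false};                  // True while the SQE will post more CQEs.
};

inline DecodedCqeFlags decodeCqeFlags(uint32_t flags) {
  DecodedCqeFlags out;
  if ((flags & kCqeFBuffer) != 0) {
    // The buffer ID occupies the upper 16 bits of the flags word.
    out.buffer_id = static_cast<uint16_t>(flags >> kCqeBufferShift);
  }
  out.more = (flags & kCqeFMore) != 0;
  return out;
}
```

An injected completion, which carries ``flags == 0``, decodes to no buffer ID and ``more == false``.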

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>

Adds the kernel-managed buffer ring lifecycle and the ``recv`` multishot
opcode to ``IoUringImpl``. This is the plumbing layer for switching the
io_uring socket read path off the per-read ``readv`` allocation; the
worker change comes in a follow-up PR.

New ``IoUring`` virtuals:

* ``setupBufRing(group_id, count, buf_size)`` — register a buffer ring
  with the kernel. The buffers live in a single contiguous allocation
  owned by ``IoUringImpl``. Validates that ``count`` is a non-zero power
  of two and rejects double-setup. Returns ``IoUringResult::Failed``
  on kernels that lack ``IORING_REGISTER_PBUF_RING`` (older than 5.19).
* ``prepareRecvMultishot(fd, group_id, user_data)`` — submits a recv
  with ``IOSQE_BUFFER_SELECT`` so the kernel pulls a buffer from the
  ring. The same SQE may produce multiple completions, signalled by
  ``IORING_CQE_F_MORE`` in ``cqe->flags``.
* ``getBufferForBid(group_id, bid)`` — look up the storage backing a
  particular kernel-selected buffer; the consumer reads up to ``cqe->res``
  bytes and then recycles the buffer.
* ``recycleBuffer(group_id, bid)`` — return a consumed buffer to the
  ring so the kernel can reuse it.

For now only one buf-ring is supported per ``IoUring`` instance.
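
The validation and lookup rules above can be modelled in user space as follows. This is a sketch under stated assumptions, not the ``IoUringImpl`` code: ``BufRingModel`` and ``SetupResult`` are hypothetical names, no kernel registration is performed, and recycling is omitted because it only matters to the kernel side of the ring.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SetupResult { Ok, Failed };

// User-space model of a single buffer ring backed by one contiguous allocation.
class BufRingModel {
public:
  SetupResult setup(uint32_t count, uint32_t buf_size) {
    const bool power_of_two = count != 0 && (count & (count - 1)) == 0;
    // Reject bad count, bad buf_size, and double-setup.
    if (!power_of_two || buf_size == 0 || !storage_.empty()) {
      return SetupResult::Failed;
    }
    count_ = count;
    buf_size_ = buf_size;
    storage_.resize(static_cast<size_t>(count) * buf_size);
    return SetupResult::Ok;
  }

  // Returns the storage backing `bid`, or nullptr when `bid` is out of range
  // (including when the ring has not been set up).
  uint8_t* bufferForBid(uint16_t bid) {
    if (bid >= count_) {
      return nullptr;
    }
    return storage_.data() + static_cast<size_t>(bid) * buf_size_;
  }

private:
  std::vector<uint8_t> storage_;
  uint32_t count_{0};
  uint32_t buf_size_{0};
};
```

Because every buffer is a fixed-size slice of one allocation, the lookup is a single multiply and add on the buffer ID.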

Test:
* ``SetupBufRingValidatesInputs`` — exercises the rejection paths
  (bad count, bad buf_size, double-setup).
* ``MultishotRecvDeliversBuffersAndStaysArmed`` — end-to-end with a real
  socketpair and a real ring: arm a multishot recv, write twice,
  verify both completions deliver buffers, the bid is in range, the
  data matches, and the SQE stays armed (F_MORE set on the first
  completion). Skips when the kernel lacks buf-ring support.

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>

When the worker is configured with multishot recv enabled and the
kernel/liburing successfully sets up a buf-ring (5.19+), the
``IoUringServerSocket`` read path replaces the per-read
``readv`` SQE + ``uint8_t[]`` allocation with a single
``IORING_OP_RECV`` multishot SQE that pulls buffers from the kernel-
managed ring. Each completion delivers one kernel-selected buffer; the
``BufferFragment`` wrapping it recycles the buffer back to the ring on
release.
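
The recycle-on-release idea can be sketched with a small RAII wrapper. ``RingBufferFragment`` is a hypothetical stand-in, and Envoy's actual ``BufferFragment`` interface is not reproduced here; the point is only that the release callback runs exactly once, when the consumer is done with the bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical wrapper around one kernel-selected buffer. The callback stands
// in for "return this buffer ID to the ring".
class RingBufferFragment {
public:
  RingBufferFragment(const uint8_t* data, size_t size, std::function<void()> on_release)
      : data_(data), size_(size), on_release_(std::move(on_release)) {}

  // Non-copyable so the buffer cannot be recycled twice.
  RingBufferFragment(const RingBufferFragment&) = delete;
  RingBufferFragment& operator=(const RingBufferFragment&) = delete;

  ~RingBufferFragment() {
    if (on_release_) {
      on_release_(); // Hand the buffer back once the upper layer has drained it.
    }
  }

  const uint8_t* data() const { return data_; }
  size_t size() const { return size_; }

private:
  const uint8_t* data_;
  size_t size_;
  std::function<void()> on_release_;
};
```

A slow consumer holds fragments longer, which keeps buffers out of the ring and naturally limits how much the kernel can deliver.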

Mechanics:

* New ``Request::RequestType::RecvMultishot`` distinguishes the
  multishot SQE from a plain ``Read``. The worker's completion dispatch
  routes both to ``onRead`` but holds onto the ``Request*`` while
  ``IORING_CQE_F_MORE`` is set (the kernel reuses the same user_data
  for further completions on the same SQE).
* ``IoUringSocket::onRead`` gains a ``uint32_t flags`` argument carrying
  the raw ``cqe->flags``. The buffer ID is in the upper bits when
  ``IORING_CQE_F_BUFFER`` is set; ``F_MORE`` indicates the SQE is still
  armed.
* ``IoUringServerSocket::onRead`` only clears ``read_req_`` when the
  SQE has terminated. While armed, the bottom-of-function
  ``submitReadRequest`` short-circuits because ``read_req_`` is still
  non-null. When ``F_MORE`` clears, ``read_req_`` is freed and a new
  multishot SQE is submitted.
* ``IoUringWorkerImpl::makeMultishotBufferFragment`` wraps the kernel
  buffer with a release callback that calls ``recycleBuffer`` —
  back-pressure / buffer return is driven by the upper-layer drain.
* On older kernels ``setupBufRing`` returns ``Failed`` and the worker
  silently falls back to the existing ``readv`` path, so the feature
  is safe to ship gated behind a config flag.
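
The arm / stay-armed / re-arm behaviour described in the list above can be sketched as a small state machine. ``ReadArmModel`` is a hypothetical model, not the ``IoUringServerSocket`` code: a boolean stands in for ``read_req_`` being non-null, and a counter stands in for SQE submissions.

```cpp
#include <cstdint>

constexpr uint32_t kMoreFlag = 1U << 1; // Same value as IORING_CQE_F_MORE.

// Hypothetical model of the read-request lifecycle for one socket.
class ReadArmModel {
public:
  // Short-circuits while a request is outstanding, like submitReadRequest
  // does when read_req_ is still non-null.
  void submitReadRequest() {
    if (armed_) {
      return;
    }
    armed_ = true;
    ++submissions_;
  }

  void onRead(uint32_t cqe_flags) {
    if ((cqe_flags & kMoreFlag) == 0) {
      armed_ = false; // The SQE has terminated: drop the request.
    }
    submitReadRequest(); // Bottom-of-function attempt to re-arm.
  }

  bool armed() const { return armed_; }
  int submissions() const { return submissions_; }

private:
  bool armed_{false};
  int submissions_{0};
};
```

A completion with ``F_MORE`` set leaves the submission count unchanged; a completion without it produces exactly one new submission.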

The worker constructor gains two new defaulted args
(``enable_multishot_recv``, ``multishot_recv_buffer_count``) so all
existing call sites continue to compile unchanged.

Tests:

* ``MultishotRecvSetupAndSubmit`` — buf-ring setup + first submit picks
  the multishot path and produces a ``RecvMultishot`` request.
* ``MultishotRecvFallbackOnUnsupportedKernel`` — when ``setupBufRing``
  fails, the worker falls back to ``prepareReadv``.
* ``MultishotRecvDeliversBufferAndStaysArmed`` — completion with
  ``F_BUFFER | F_MORE`` delivers the buffer to the upper layer and does
  not re-arm the SQE; the buffer is recycled when the upper layer
  drains.
* ``MultishotRecvReArmOnFMoreClear`` — completion with ``F_BUFFER``
  but no ``F_MORE`` triggers a fresh ``prepareRecvMultishot`` to re-arm.

The proto / factory wiring to actually expose this option is in a
follow-up change.

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>

Wires the per-worker multishot recv plumbing through the proto config
and factory so it can be enabled at runtime.

* New ``IoUringOptions.enable_multishot_recv`` (BoolValue, default
  false) and ``IoUringOptions.multishot_recv_buffer_count`` (UInt32Value,
  default 256, must be a power of two).
* ``IoUringWorkerFactoryImpl`` constructor takes the two new args and
  threads them through to ``IoUringWorkerImpl``.
* ``socket_interface_impl.cc`` reads the proto fields and passes them
  to the factory.
* Changelog entry under ``new_features`` describes the option and the
  kernel/liburing requirements (5.19+ for the buf-ring registration,
  6.0+ for multishot recv itself); older kernels silently fall back
  to ``readv``.
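
A sketch of how the two optional fields might resolve to effective values. ``IoUringOptionsModel`` uses ``std::optional`` as a stand-in for the proto wrapper types, and all names here are hypothetical. Returning an empty result for a count that is not a power of two is an assumption made for illustration; this description does not say how the real code reports that error.

```cpp
#include <cstdint>
#include <optional>

// Stand-in for the two new IoUringOptions fields (unset means "use default").
struct IoUringOptionsModel {
  std::optional<bool> enable_multishot_recv;
  std::optional<uint32_t> multishot_recv_buffer_count;
};

struct ResolvedMultishotConfig {
  bool enabled;
  uint32_t buffer_count;
};

// Applies the documented defaults (false, 256) and the power-of-two rule.
inline std::optional<ResolvedMultishotConfig>
resolveMultishotConfig(const IoUringOptionsModel& options) {
  const ResolvedMultishotConfig cfg{options.enable_multishot_recv.value_or(false),
                                    options.multishot_recv_buffer_count.value_or(256)};
  const uint32_t n = cfg.buffer_count;
  if (n == 0 || (n & (n - 1)) != 0) {
    return std::nullopt; // Assumed rejection path for an invalid count.
  }
  return cfg;
}
```

An empty options message resolves to multishot disabled with 256 buffers, which matches the statement that existing configs continue to work.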

Each multishot recv buffer is sized to ``read_buffer_size``, so the
total per-worker memory is
``multishot_recv_buffer_count * read_buffer_size`` bytes (default
256 * 8192 = 2 MiB).

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>
aburan28 requested a deployment to external-contributors on April 27, 2026 03:14 with GitHub Actions (Waiting)
@repokitteh-read-only

Hi @aburan28, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #44670 was opened by aburan28.


@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

