sidecar: per-CPU overrides for servicing restore with NVMe keepalive#3166
Open
emirceski wants to merge 6 commits into microsoft:main from
Conversation
gurasinghMS
reviewed
Mar 31, 2026
mattkur
reviewed
Mar 31, 2026
gurasinghMS
reviewed
Mar 31, 2026
During servicing restore with NVMe keepalive, sidecar was disabled
entirely if any devices had mapped interrupts — all VPs fell back to
sequential Linux onlining, even if only a few CPUs had outstanding IO.
This change makes it selective: only CPUs with outstanding IO are
excluded from sidecar startup (kernel-started instead), while the rest
keep sidecar's parallel fan-out. This preserves servicing latency
improvements even when NVMe keepalive is active.
This PR continues Matt Kurjanowicz's PR 2477, which introduced the
per-CPU state concept. `start_aps()` was reworked to map each node's
control page individually via scoped mappings, ensuring node-local
correctness in multi-NUMA topologies.
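The selective behavior described above can be sketched as follows. This is a minimal illustration only; the function and parameter names (`is_kernel_started`, `cpus_with_outstanding_io` as a slice) are assumptions, not the actual `openhcl_boot` API.

```rust
/// Sketch: decide per CPU whether to fall back to kernel start.
/// Names are illustrative, not the real openhcl_boot interface.
fn is_kernel_started(cpu: u32, cpus_with_outstanding_io: &[u32]) -> bool {
    // Only IO-busy CPUs are excluded from sidecar startup;
    // every other CPU keeps sidecar's parallel fan-out.
    cpus_with_outstanding_io.contains(&cpu)
}

fn main() {
    // e.g. 2 of 24 VPs had IO in flight at save time.
    let io_busy = [3, 17];
    let kernel_started = (0..24u32)
        .filter(|cpu| is_kernel_started(*cpu, &io_busy))
        .count();
    // Only the 2 IO-busy CPUs are kernel-started; 22 stay on sidecar.
    assert_eq!(kernel_started, 2);
}
```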
- Add `PerCpuState` and `initial_state` to `SidecarParams` in
`sidecar_defs`, supporting up to 400 CPUs within a single 4 KiB
page. VMs exceeding this fall back to disabling sidecar entirely.
- Replace all-or-nothing sidecar disable in `openhcl_boot` DT parsing
with per-CPU overrides: only CPUs in `cpus_with_outstanding_io` are
kernel-started, the rest stay sidecar-started.
- Update `SidecarConfig` and `boot_cpus=` command line generation to
respect per-CPU overrides when `per_cpu_state_specified` is set.
- Rework `start_aps()` to use scoped per-node control page mappings,
skipping REMOVED VPs with a log message.
- Add `create_keepalive_test_config_custom` helper for flexible NVMe
keepalive test configuration (topology, cmdline, NVMe params).
- Add test `servicing_keepalive_sidecar_with_outstanding_io_very_heavy`:
24 VPs, 2 NUMA nodes, NVMe fault injection with 10s delayed
completions, save with IO in-flight, restore exercises per-CPU
override path. Programmatically asserts: per-CPU override fired
(via `inspect_openhcl("vm/runtime_params/bootshim_logs")`), and
all 24 VPs online after restore.
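For context on the size constraint in the first bullet, the layout can be sketched like this. The type and constant names below are assumptions for illustration, not the actual `sidecar_defs` definitions.

```rust
// Hypothetical sketch of the per-CPU state array; the real
// sidecar_defs layout may differ.
const MAX_SIDECAR_CPUS: usize = 400;
const PAGE_SIZE: usize = 4096;

#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PerCpuState {
    Sidecar = 0, // sidecar-started (parallel fan-out)
    Kernel = 1,  // kernel-started (CPU had outstanding IO)
}

#[repr(C)]
struct SidecarParams {
    // One byte per CPU: 400 bytes, well under a single 4 KiB page,
    // leaving room on the page for other fields and future growth.
    initial_state: [PerCpuState; MAX_SIDECAR_CPUS],
    // ... other boot parameters share the remainder of the page ...
}

fn main() {
    // The whole per-CPU array must fit in one 4 KiB page.
    assert!(core::mem::size_of::<SidecarParams>() <= PAGE_SIZE);
}
```

One byte per CPU is what makes the 400-CPU cap work out: a larger per-CPU record would shrink the cap or spill past the single-page budget.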
smalis-msft
reviewed
Apr 7, 2026
pub const MAX_NODES: usize = 128;

/// The maximum number of supported sidecar CPUs.
/// Keep small to leave space on the SidecarParams page for future fields.
/// VMs with more CPUs fall back to disabling sidecar on restore.
Contributor
Where do we disable sidecar on restore for too many cpus?
Author
In `openhcl/openhcl_boot/src/host_params/dt/mod.rs`, in the `else` block around line 996.
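The fallback this thread refers to amounts to a cap check; a sketch under assumed names (the actual check in the DT parsing code is more involved):

```rust
// Sketch of the restore-time fallback; constant and function
// names are illustrative, not the real openhcl_boot code.
const MAX_SIDECAR_CPUS: usize = 400;

fn sidecar_enabled_on_restore(cpu_count: usize) -> bool {
    // VMs with more CPUs than fit in the per-CPU state page
    // disable sidecar entirely instead of applying overrides.
    cpu_count <= MAX_SIDECAR_CPUS
}

fn main() {
    assert!(sidecar_enabled_on_restore(24));
    assert!(!sidecar_enabled_on_restore(1024));
}
```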