sidecar: per-CPU overrides for servicing restore with NVMe keepalive#3166

Open
emirceski wants to merge 6 commits into microsoft:main from emirceski:sidecar-opt-v2

Conversation

@emirceski emirceski commented Mar 31, 2026

sidecar: per-CPU overrides for servicing restore with NVMe keepalive

During servicing restore with NVMe keepalive, sidecar was disabled
entirely if any devices had mapped interrupts — all VPs fell back to
sequential Linux onlining, even if only a few CPUs had outstanding IO.

This change makes it selective: only CPUs with outstanding IO are
excluded from sidecar startup (kernel-started instead), while the rest
keep sidecar's parallel fan-out. This preserves servicing latency
improvements even when NVMe keepalive is active.

This PR continues Matt Kurjanowicz's PR 2477, which introduced the
per-CPU state concept. start_aps() was reworked to map each node's
control page individually via scoped mappings, ensuring node-local
correctness in multi-NUMA topologies.
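
The selective exclusion described above can be sketched as a simple partition over the VP list. The names here (`CpuStart`, `plan_cpu_starts`, `cpus_with_outstanding_io` as a plain slice) are illustrative, not the PR's actual types:

```rust
/// How a virtual processor is brought online after restore
/// (illustrative names, not the PR's actual types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CpuStart {
    /// Started by the Linux kernel's sequential onlining path.
    Kernel,
    /// Started via sidecar's parallel fan-out.
    Sidecar,
}

/// Partition VPs: only CPUs with outstanding IO are kernel-started;
/// every other CPU keeps sidecar's parallel startup.
fn plan_cpu_starts(vp_count: usize, cpus_with_outstanding_io: &[usize]) -> Vec<CpuStart> {
    (0..vp_count)
        .map(|vp| {
            if cpus_with_outstanding_io.contains(&vp) {
                CpuStart::Kernel
            } else {
                CpuStart::Sidecar
            }
        })
        .collect()
}
```

Under this scheme, a 24-VP VM with two IO-busy CPUs yields 2 kernel-started and 22 sidecar-started VPs, matching the test scenario described below.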

Changes:

- Add `PerCpuState` and `initial_state` to `SidecarParams` in
  `sidecar_defs`, supporting up to 400 CPUs within a single 4 KiB
  page. VMs exceeding this fall back to disabling sidecar entirely.
- Replace the all-or-nothing sidecar disable in `openhcl_boot` DT parsing
  with per-CPU overrides: only CPUs in `cpus_with_outstanding_io` are
  kernel-started; the rest stay sidecar-started.
- Update `SidecarConfig` and `boot_cpus=` command line generation to
  respect per-CPU overrides when `per_cpu_state_specified` is set.
- Rework `start_aps()` to use scoped per-node control page mappings,
  skipping REMOVED VPs with a log message.
- Add a `create_keepalive_test_config_custom` helper for flexible NVMe
  keepalive test configuration (topology, cmdline, NVMe params).
- Add test `servicing_keepalive_sidecar_with_outstanding_io_very_heavy`:
  24 VPs, 2 NUMA nodes, NVMe fault injection with 10 s delayed
  completions, save with IO in flight, restore exercises the per-CPU
  override path. Programmatically asserts that the per-CPU override fired
  (via `inspect_openhcl("vm/runtime_params/bootshim_logs")`) and that
  all 24 VPs are online after restore.
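
Why 400 CPUs fits a single 4 KiB page can be sketched with a compile-time size check. The field layout below is an assumption for illustration only, not the actual `sidecar_defs` definition:

```rust
const PAGE_SIZE: usize = 4096;
const MAX_SIDECAR_CPUS: usize = 400;

/// Illustrative per-CPU startup state, one byte per CPU.
#[repr(u8)]
#[derive(Clone, Copy)]
enum PerCpuState {
    /// Started via sidecar's parallel fan-out.
    Sidecar = 0,
    /// Excluded from sidecar; started by the kernel.
    Kernel = 1,
}

/// Illustrative params layout: the per-CPU array plus a flag must
/// leave room on the single 4 KiB page for the remaining fields.
#[repr(C)]
struct SidecarParams {
    per_cpu_state_specified: u8,
    initial_state: [PerCpuState; MAX_SIDECAR_CPUS],
    // ... other fields share the rest of the page ...
}

// Compile-time guarantee that the struct fits in one page.
const _: () = assert!(core::mem::size_of::<SidecarParams>() <= PAGE_SIZE);
```

Keeping the cap well under the page size leaves headroom for future `SidecarParams` fields, which is the trade-off the code comment in this PR calls out.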

@emirceski emirceski requested a review from a team as a code owner March 31, 2026 18:02
Copilot AI review requested due to automatic review settings March 31, 2026 18:02
@github-actions github-actions bot added the unsafe Related to unsafe code label Mar 31, 2026
Copilot AI review requested due to automatic review settings April 1, 2026 21:14
Copilot AI review requested due to automatic review settings April 1, 2026 21:29

@emirceski emirceski changed the title from "wip: sidecar: per-CPU overrides for servicing restore with NVMe keepalive" to "sidecar: per-CPU overrides for servicing restore with NVMe keepalive" Apr 2, 2026
```rust
pub const MAX_NODES: usize = 128;
/// The maximum number of supported sidecar CPUs.
/// Keep small to leave space on the SidecarParams page for future fields.
/// VMs with more CPUs fall back to disabling sidecar on restore.
```
Contributor:
Where do we disable sidecar on restore for too many cpus?

Author:
In openhcl/openhcl_boot/src/host_params/dt/mod.rs, in the block around the `else` on line 996.
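
The fallback the author points to reduces to a simple capacity check during DT parsing. This is a hypothetical sketch of its shape, not the actual code in dt/mod.rs:

```rust
/// Hypothetical sketch: a VM whose CPU count exceeds the per-CPU
/// state array's capacity cannot use per-CPU overrides, so sidecar
/// is disabled entirely on restore.
fn sidecar_enabled_on_restore(cpu_count: usize, max_sidecar_cpus: usize) -> bool {
    cpu_count <= max_sidecar_cpus
}
```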
