Skip to content

ofi: Add per-operation NIC Tx load balancing (round-robin and random)#1224

Draft
bcmIntc wants to merge 23 commits intoSandia-OpenSHMEM:mainfrom
bcmIntc:bcm_TxLoadBalancing
Draft

ofi: Add per-operation NIC Tx load balancing (round-robin and random)#1224
bcmIntc wants to merge 23 commits intoSandia-OpenSHMEM:mainfrom
bcmIntc:bcm_TxLoadBalancing

Conversation

@bcmIntc
Copy link
Copy Markdown
Collaborator

@bcmIntc bcmIntc commented Mar 31, 2026

This PR extends the OFI transport to support transmit-side NIC
multiplexing: rather than assigning each PE a single NIC at startup,
PEs can now spread individual put/get operations across a set of available
NICs at runtime. Two mutually exclusive policies are provided via new
configure flags:

  • --enable-ofi-tx-load-balancing-round-robin: cycles through NICs
    uniformly; staggered per-PE start position avoids thundering-herd on
    NIC[0].
  • --enable-ofi-tx-load-balancing-random: selects a NIC randomly
    on each operation.

Both modes require --with-hwloc and configure will error out if hwloc
is absent or if both flags are given simultaneously.

Known Issues:

  • Fortran support not fully complete. Fortran is de-featured, no planned fix.

Test Coverage:

philipmarshall21 and others added 23 commits June 11, 2024 14:07
Resolved conflicts by taking TxMuxing (nic_idx) changes from pr-1126
while preserving upstream bug fixes and improvements from HEAD:
- atomic_nbi_c.c4: use shmem_internal_atomic_fetch with nic_idx
- collectives.c: pass count*type_size + nic_idx to atomicv
- data_c.c4: keep team PE translation AND add nic_idx
- shmem_comm.h: use len+nic_idx signature, restore type_size/count
  computation, and preserve shmem_transport_get_wait bug fix
- shmem_team.c: use PE_stride/parent_stride + nic_idx
- transport_ofi.c/h: use per-NIC array architecture from pr-1126

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- transport_ofi.h: add nic_idx to shmem_transport_atomic_fetch_nbi
- shmem_comm.h: add nic_idx to shmem_internal_atomic_fetch_nbi
- collectives.c: add nic_idx to scan_linear and scan_ring functions
- collectives_c.c4: pass nic_idx to team_choose_psync for SCAN ops
- data_c.c4: restore missing shmem_internal_atomic/atomic_set calls
  in shmemx_ctx_signal_add/set and add nic_idx to both
- transport_ofi.c: use array indexing for single-EP counter close

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- atomic_nbi_c.c4: restore non-blocking semantics in SHMEM_DEF_FETCH_NBI
  by using shmem_internal_atomic_fetch_nbi (not blocking _fetch).
  HEAD's bug-fix commit had changed it to _nbi; this preserves that
  while also threading nic_idx from pr-1126.
- shmem_comm.h: pass nic_idx to shmem_transport_get_wait in
  DISABLE_NONFETCH_AMO path of shmem_internal_atomicv
- transport_ofi.c: remove unused close_default_ctx and info variables

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Remove double-close in shmem_transport_fini(): the single-EP block
   that closed put_cntr[0]/get_cntr[0] was redundant after
   shmem_transport_ctx_destroy(&shmem_transport_ctx_default) already
   closes all NIC counters via the per-NIC loop in ctx_destroy.

2. Use calloc (instead of malloc) in shmem_transport_ofi_ctx_init() for
   the per-NIC pointer arrays (fabric, domain, av, ep, put_cntr,
   get_cntr, cq), so partial-init cleanup paths in
   shmem_transport_ctx_destroy will see NULL rather than garbage.

3. Set each pointer to NULL immediately after fi_close in
   shmem_transport_ctx_destroy(), making repeated calls idempotent and
   preventing accidental double-close.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
shmem_transport_ofi_single_ep was introduced in PR Sandia-OpenSHMEM#1200 after pr-1126
diverged, so the merge brought it in without the required per-NIC
adaptations. Three code paths were broken:

1. shmem_transport_init(): force single_ep=0 when num_nics>1 -- TxMuxing
   (multiple NICs) and single-endpoint mode are mutually exclusive.

2. shmem_transport_ofi_ctx_init(): for single-ep default context (idx=0),
   skip opening per-NIC fabric/domain/av/cq/ep; instead open put/get
   counters on the global domainfd, bind them directly to target_ep, and
   share target_ep/target_cq as ctx->ep[0]/ctx->cq[0].  Skip the
   stx_allocate + bind_enable_ep_resources calls for this index since
   target_ep is already set up by shmem_transport_ofi_target_ep_init().

3. shmem_transport_ctx_destroy(): skip fi_close on ep[0] and cq[0] for
   the single-ep default context -- they are shared pointers to target_ep
   and target_cq, which are closed later in shmem_transport_fini().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce a new configure flag --enable-ofi-tx-load-balancing that
controls whether the OFI transport randomly selects among available
NICs for transmit operations. Previously this behavior was always
enabled when building with OFI (USE_OFI). Now it is gated behind the
USE_OFI_TX_LOAD_BALANCING preprocessor define, which is only set when
the new configure option is explicitly requested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduces two new configure options for per-operation NIC selection
when multiple NICs are available:

  --enable-ofi-tx-load-balancing
    Round-robin selection across close NICs. Each PE starts at a
    different NIC offset (my_pe % num_nics) to avoid all PEs
    beginning on NIC 0. Uses a lightweight counter with no PRNG
    overhead.

  --enable-ofi-tx-load-balancing-random
    Random selection via rand_r. Implies the above option (same
    multi-NIC pool setup). Uses the existing per-PE rand seed.

Without either flag, the base multi-rail behavior is unchanged:
each PE is assigned exactly one topologically-close NIC via HWLOC
and nic_idx is always 0.

Adds shmem_internal_nic_rr_idx (declared in shmem_internal.h,
defined and zeroed in init.c, staggered by PE in
shmem_transport_startup after num_nics is finalized).
shmem_internal.h:
- SHMEM_GET_TRANSMIT_NIC_IDX (random): replace float-multiply with
  modulo to eliminate OOB index when rand_r() returns RAND_MAX
  (normalized==1.0 → idx==num_nics)
- SHMEM_GET_TRANSMIT_NIC_IDX (round-robin): use __atomic_fetch_add
  under SHMEM_THREAD_MULTIPLE to prevent a data race on
  shmem_internal_nic_rr_idx; plain increment otherwise

transport_ofi.c (assign_nic_with_hwloc):
- Guard cur_prov->nic and ->bus_attr against NULL before dereferencing
- Add hwloc_bitmap_free(bindset) on all early-return fallback paths
  that were previously leaking the bitmap allocation
- Add NULL checks after every malloc() of provider_list; call
  RAISE_ERROR_MSG on failure
- Add shmem_transport_ofi_close_provs global to track the head of the
  fi_dupinfo'd close-NIC chain built in success paths; set it in both
  the TX-LB and base multi-rail success branches

transport_ofi.c (shmem_transport_ofi_ctx_init):
- Add a single consolidated NULL check covering all per-NIC
  calloc/malloc allocations (fabric, domain, av, ep, put_cntr,
  get_cntr, pending_*_cntr, cq)

transport_ofi.c (shmem_transport_fini):
- Call fi_freeinfo(shmem_transport_ofi_close_provs) to release the
  dup'd NIC chain (fallback paths leave the pointer NULL, avoiding a
  double-free against the fabrics chain freed by fi_freeinfo(fabrics))
- Free provider_list array after use

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename --enable-ofi-tx-load-balancing to
--enable-ofi-tx-load-balancing-round-robin to make the two TX load
balancing modes (round-robin and random) explicit, symmetric, and
mutually exclusive. Both modes now require --with-hwloc and configure
errors out if hwloc is absent or if both flags are given simultaneously.

Replace shmem_internal_my_pe with the intra-node rank (local_rank via
shmem_runtime_get_node_rank) everywhere NICs are assigned during startup.
Using the global PE index caused PEs on different nodes with the same
global rank to always land on the same NIC; local rank distributes them
correctly within each node. The no-hwloc fallback path is also corrected
to allocate a single-entry provider_list and assign one NIC per PE.

Preserve the staggered round-robin start set in shmem_transport_startup
by skipping the nic_rr_idx reset in shmem_internal_randr_init when
either TX load balancing mode is active.

Add a DEBUG_MSG in shmem_transport_ofi_ctx_init to log the domain name
selected for each context and NIC index.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants