ofi: Add per-operation NIC Tx load balancing (round-robin and random)#1224
Draft
bcmIntc wants to merge 23 commits intoSandia-OpenSHMEM:mainfrom
Draft
ofi: Add per-operation NIC Tx load balancing (round-robin and random)#1224bcmIntc wants to merge 23 commits intoSandia-OpenSHMEM:mainfrom
bcmIntc wants to merge 23 commits intoSandia-OpenSHMEM:mainfrom
Conversation
Resolved conflicts by taking TxMuxing (nic_idx) changes from pr-1126 while preserving upstream bug fixes and improvements from HEAD: - atomic_nbi_c.c4: use shmem_internal_atomic_fetch with nic_idx - collectives.c: pass count*type_size + nic_idx to atomicv - data_c.c4: keep team PE translation AND add nic_idx - shmem_comm.h: use len+nic_idx signature, restore type_size/count computation, and preserve shmem_transport_get_wait bug fix - shmem_team.c: use PE_stride/parent_stride + nic_idx - transport_ofi.c/h: use per-NIC array architecture from pr-1126 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- transport_ofi.h: add nic_idx to shmem_transport_atomic_fetch_nbi - shmem_comm.h: add nic_idx to shmem_internal_atomic_fetch_nbi - collectives.c: add nic_idx to scan_linear and scan_ring functions - collectives_c.c4: pass nic_idx to team_choose_psync for SCAN ops - data_c.c4: restore missing shmem_internal_atomic/atomic_set calls in shmemx_ctx_signal_add/set and add nic_idx to both - transport_ofi.c: use array indexing for single-EP counter close Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- atomic_nbi_c.c4: restore non-blocking semantics in SHMEM_DEF_FETCH_NBI by using shmem_internal_atomic_fetch_nbi (not blocking _fetch). HEAD's bug-fix commit had changed it to _nbi; this preserves that while also threading nic_idx from pr-1126. - shmem_comm.h: pass nic_idx to shmem_transport_get_wait in DISABLE_NONFETCH_AMO path of shmem_internal_atomicv - transport_ofi.c: remove unused close_default_ctx and info variables Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Remove double-close in shmem_transport_fini(): the single-EP block that closed put_cntr[0]/get_cntr[0] was redundant after shmem_transport_ctx_destroy(&shmem_transport_ctx_default) already closes all NIC counters via the per-NIC loop in ctx_destroy. 2. Use calloc (instead of malloc) in shmem_transport_ofi_ctx_init() for the per-NIC pointer arrays (fabric, domain, av, ep, put_cntr, get_cntr, cq), so partial-init cleanup paths in shmem_transport_ctx_destroy will see NULL rather than garbage. 3. Set each pointer to NULL immediately after fi_close in shmem_transport_ctx_destroy(), making repeated calls idempotent and preventing accidental double-close. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
shmem_transport_ofi_single_ep was introduced in PR Sandia-OpenSHMEM#1200 after pr-1126 diverged, so the merge brought it in without the required per-NIC adaptations. Three code paths were broken: 1. shmem_transport_init(): force single_ep=0 when num_nics>1 -- TxMuxing (multiple NICs) and single-endpoint mode are mutually exclusive. 2. shmem_transport_ofi_ctx_init(): for single-ep default context (idx=0), skip opening per-NIC fabric/domain/av/cq/ep; instead open put/get counters on the global domainfd, bind them directly to target_ep, and share target_ep/target_cq as ctx->ep[0]/ctx->cq[0]. Skip the stx_allocate + bind_enable_ep_resources calls for this index since target_ep is already set up by shmem_transport_ofi_target_ep_init(). 3. shmem_transport_ctx_destroy(): skip fi_close on ep[0] and cq[0] for the single-ep default context -- they are shared pointers to target_ep and target_cq, which are closed later in shmem_transport_fini(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce a new configure flag --enable-ofi-tx-load-balancing that controls whether the OFI transport randomly selects among available NICs for transmit operations. Previously this behavior was always enabled when building with OFI (USE_OFI). Now it is gated behind the USE_OFI_TX_LOAD_BALANCING preprocessor define, which is only set when the new configure option is explicitly requested. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduces two new configure options for per-operation NIC selection
when multiple NICs are available:
--enable-ofi-tx-load-balancing
Round-robin selection across close NICs. Each PE starts at a
different NIC offset (my_pe % num_nics) to avoid all PEs
beginning on NIC 0. Uses a lightweight counter with no PRNG
overhead.
--enable-ofi-tx-load-balancing-random
Random selection via rand_r. Implies the above option (same
multi-NIC pool setup). Uses the existing per-PE rand seed.
Without either flag, the base multi-rail behavior is unchanged:
each PE is assigned exactly one topologically-close NIC via HWLOC
and nic_idx is always 0.
Adds shmem_internal_nic_rr_idx (declared in shmem_internal.h,
defined and zeroed in init.c, staggered by PE in
shmem_transport_startup after num_nics is finalized).
shmem_internal.h: - SHMEM_GET_TRANSMIT_NIC_IDX (random): replace float-multiply with modulo to eliminate OOB index when rand_r() returns RAND_MAX (normalized==1.0 → idx==num_nics) - SHMEM_GET_TRANSMIT_NIC_IDX (round-robin): use __atomic_fetch_add under SHMEM_THREAD_MULTIPLE to prevent a data race on shmem_internal_nic_rr_idx; plain increment otherwise transport_ofi.c (assign_nic_with_hwloc): - Guard cur_prov->nic and ->bus_attr against NULL before dereferencing - Add hwloc_bitmap_free(bindset) on all early-return fallback paths that were previously leaking the bitmap allocation - Add NULL checks after every malloc() of provider_list; call RAISE_ERROR_MSG on failure - Add shmem_transport_ofi_close_provs global to track the head of the fi_dupinfo'd close-NIC chain built in success paths; set it in both the TX-LB and base multi-rail success branches transport_ofi.c (shmem_transport_ofi_ctx_init): - Add a single consolidated NULL check covering all per-NIC calloc/malloc allocations (fabric, domain, av, ep, put_cntr, get_cntr, pending_*_cntr, cq) transport_ofi.c (shmem_transport_fini): - Call fi_freeinfo(shmem_transport_ofi_close_provs) to release the dup'd NIC chain (fallback paths leave the pointer NULL, avoiding a double-free against the fabrics chain freed by fi_freeinfo(fabrics)) - Free provider_list array after use Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename --enable-ofi-tx-load-balancing to --enable-ofi-tx-load-balancing-round-robin to make the two TX load balancing modes (round-robin and random) explicit, symmetric, and mutually exclusive. Both modes now require --with-hwloc and configure errors out if hwloc is absent or if both flags are given simultaneously. Replace shmem_internal_my_pe with the intra-node rank (local_rank via shmem_runtime_get_node_rank) everywhere NICs are assigned during startup. Using the global PE index caused PEs on different nodes with the same global rank to always land on the same NIC; local rank distributes them correctly within each node. The no-hwloc fallback path is also corrected to allocate a single-entry provider_list and assign one NIC per PE. Preserve the staggered round-robin start set in shmem_transport_startup by skipping the nic_rr_idx reset in shmem_internal_randr_init when either TX load balancing mode is active. Add a DEBUG_MSG in shmem_transport_ofi_ctx_init to log the domain name selected for each context and NIC index.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR extends the OFI transport to support transmit-side NIC
multiplexing: rather than assigning each PE a single NIC at startup,
PEs can now spread individual put/get operations across a set of available
NICs at runtime. Two mutually exclusive policies are provided via new
configure flags:
--enable-ofi-tx-load-balancing-round-robin: cycles through NICsuniformly; staggered per-PE start position avoids thundering-herd on
NIC[0].
--enable-ofi-tx-load-balancing-random: selects a NIC randomlyon each operation.
Both modes require
--with-hwlocand configure will error out if hwlocis absent or if both flags are given simultaneously.
Known Issues:
Test Coverage: