RFC: Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring
Author: Panagiotis Kourdis <panagiotis.kourdis@intel.com>
Date: 2026-05-19
Implementation: meta-pytorch/torchcomms#2633
Summary
This RFC presents the design considerations and rationale for reengineering the performance benchmarking framework. The changes introduce:
- A
CommAdapter abstraction that enables side-by-side benchmarking of torchcomms, c10d, and c10d_torchcomms (c10d backed by torchcomms) across all supported collectives.
- A shared
run_bench_sweep() driver that replaces per-collective timing loops with a single measurement path.
- Structured CSV output with environment metadata on every row.
- A multi-reduce-op sweep covering
SUM, MIN, MAX, AVG, and PRODUCT.
- A new
analyze_perf.py tool for summary reports, side-by-side plots, and automated regression and improvement detection.
The primary motivation is two-fold:
- Provide the data and tooling required to validate the deprecation of
c10d in favor of torchcomms.
- Establish a performance monitoring foundation for
torchcomms after c10d is fully retired enabling data-driven optimization decisions over the lifetime of the library.
Motivation and Context
c10d Deprecation
torch.distributed (c10d) has served as PyTorch's canonical distributed communication layer. torchcomms is its intended replacement: an experimental communications API decoupled from PyTorch core to enable fast prototyping of new collectives and out-of-tree backends, with explicit goals of scaling to 100K+ GPUs via eager initialization and model-specific hints, first-class support for heterogeneous hardware (NCCL/NCCLX, RCCL/RCCLX, XCCL, Gloo), fault-tolerant process groups, one-sided (RDMA) communication, and device-centric collectives.
During the migration period:
- Existing user code may still call
dist.all_reduce() / dist.barrier() etc.
- A compatibility shim (
c10d_torchcomms) routes those calls through torchcomms internally while preserving the torch.distributed APIs.
- New code can also call
torchcomms directly.
Adoption requires demonstrating that neither the migration path (c10d_torchcomms) nor the direct path (torchcomms) introduces performance regressions relative to the baseline that users are migrating away from (c10d).
Without a unified benchmarking harness, this comparison is error-prone: separate timing loops, different warm-up logic, or inconsistent barrier placement can skew results and generate misleading conclusions.
Where the Previous Benchmarking Code Could Be Improved
Each per-collective benchmark (all_reduce_perf.py, all_gather_perf.py, etc.) carried its own timing loop, and that duplication showed up in a few ways:
- Collective calls were wired directly to
torchcomms.TorchComm, so swapping in dist.* meant editing every file by hand. → addressed by CommAdapter.
- Running the c10d or c10d_torchcomms paths needed a separate script, which made head-to-head comparisons harder than they should be. → addressed by
--all-comm-adapters and CommAdapter.
PerfResult exposed only avg_time_us; surfacing min/max would need per-iteration sync, which hadn't been added yet. → addressed by BenchMetrics plus --sync-interval modes.
- Results were printed to stdout in a fixed-width format, so there was no easy path for post-processing or for comparing runs across commits or backends. → addressed by structured CSV output and
analyze_perf.py.
sync_device(device) asked the caller for a device argument instead of using the newer accelerator abstraction. → addressed by sync_device() switching to torch.accelerator.
Design
CommAdapter — Unified Communication Abstraction
CommAdapter(name: "torchcomms" | "c10d" | "c10d_torchcomms")
.init(backend, device, hints)
.run_collective(fn_name, *args, **kwargs, async_op=...)
.get_rank() / get_size() / get_backend() / get_backend_version()
.get_reduce_op(op_name)
.finalize()
The adapter normalizes the behavioral differences between the two APIs:
| Operation |
torchcomms |
c10d |
| Init |
torchcomms.new_comm(backend, device, ...) |
dist.init_process_group(backend, ...) |
| Naming differences |
get_size, all_gather_single, reduce_scatter_single |
get_world_size, all_gather_into_tensor, reduce_scatter_tensor |
gather(output, input) |
(output_list, input_tensor) |
(tensor, gather_list) |
send / recv async |
async_op=True |
isend / irecv |
| Finalize |
comm.finalize() |
dist.destroy_process_group() |
c10d_torchcomms |
— |
sets dist_config.use_torchcomms = True before init |
BenchRecord Hierarchy
The flat PerfParams / PerfResult pair is replaced with a four-level hierarchy that separates concerns cleanly:
BenchRecord
├── EnvInfo — device, torch/torchcomms versions, adapter, backend
├── BenchConfig — message-size range, CSV path, quiet flag
├── BenchParams — collective, ranks, dtype, op, iterations, dispatch mode
└── BenchMetrics — message size, avg/min/max latency, bus bandwidth
EnvInfo is designed for extension. Additional fields such as GPU interconnect topology, network adapter info, driver versions, operating system, etc. can be added without changing the rest of the schema or touching the run path.
BenchRecord is passed through the entire call stack (parse → validate → run_bench_sweep → log_perf_result) as a single argument, eliminating the growing parameter lists that were accumulating in the old code.
Shared run_bench_sweep() Driver
run_bench_sweep(
comm,
record,
device,
collective_name,
setup_fn, # (num_elements, rank, num_ranks, device, dtype) -> (args, kwargs)
bus_bw_fn, # (num_elements, element_size, num_ranks, avg_time_us) -> gbps
)
Every collective delegates its timing loop to this single function. Each collective provides only two narrow callbacks:
setup_fn: allocates the tensors and maps collective arguments.
bus_bw_fn: encodes the bus-bandwidth formula for that specific collective.
For example, the all_reduce callback captures the standard 2(n-1)/n ring-algorithm formula:
def bus_bw(num_elements, element_size, num_ranks, avg_time_us):
algo_bw = (num_elements * element_size) / avg_time_us
return algo_bw * 2.0 * (num_ranks - 1) / num_ranks / 1000.0
This removes the duplicated warmup/measure/print logic from every per-collective file and ensures that every collective uses identical barrier placement, synchronization strategy, and timer arithmetic.
The driver supports three synchronization modes, controlled by sync_interval:
sync_interval == 0: single bulk start/stop, lowest overhead.
sync_interval == 1: per-iteration sync, yields true min/max latency.
sync_interval == N > 1: windowed sync, trades accuracy for lower overhead.
Argument Validation
The CLI parser fails fast on misuse rather than running with surprising defaults:
- Unknown flags or positionals (e.g.
--reduce_op instead of --reduce-op, or all_redcue instead of all_reduce) are rejected at parse time.
--reduce-op <op> is validated against the supported set (sum, min, max, avg, product, plus all to sweep every supported op) and rejected when applied to a non-reduction collective. Errors show the offending collective name and the full list of valid ops.
--reduce-op with the all collective is rejected as redundant — all already sweeps every reduction op for every reduction collective.
--quiet / -q requires --csv; otherwise results would be silently discarded.
Structured CSV Output and analyze_perf.py
Each benchmark can now write a structured CSV. The schema covers test identity, benchmark configuration, results, and full environment metadata, and is defined once in _CSV_COLUMNS so the terminal printer and the CSV writer stay in sync.
analyze_perf.py consumes these CSVs and writes:
<outdir>/
├── perf_summary_<adapter>.{txt,csv} # per-adapter
├── perf_comparison.{txt,csv} # absolute values + % change vs baseline
├── plots/<collective>/ # latency / bus-bandwidth PNGs per collective;
│ └── ... # relative (% change) plots also written here when --baseline is set
└── <adapter>_vs_<baseline>/ # one subdir per pairwise comparison
├── perf_regress.{txt,csv} # rows exceeding --regression-threshold
├── perf_improve.{txt,csv} # rows exceeding --improvement-threshold
└── perf_highlights.{txt,csv} # counts/averages of improved/regressed/neutral points
The pairwise subdir layout keeps the top level uncluttered when comparing multiple adapters at once: with torchcomms, c10d, and c10d_torchcomms all compared against the c10d baseline, two pair directories appear instead of six flat report files.
Impact on the c10d → torchcomms Transition
What This Enables
Apples-to-apples comparison. The CommAdapter ensures that the warmup count, measurement iterations, barrier placement, and synchronization strategy are identical for torchcomms, c10d, and c10d_torchcomms. Earlier, any performance comparison between the two APIs would have been confounded by implementation differences in the test harnesses themselves.
Regression gate. The analyze_perf.py regression detection can be run in CI against a fixed c10d baseline. A PR that regresses torchcomms latency beyond the threshold will produce a perf_regress_torchcomms_vs_c10d.txt artifact, making the regression actionable before the change lands.
Migration-path validation. Most teams will migrate by switching their c10d calls to route through c10d_torchcomms (the compatibility shim) before — if ever — porting code to the native torchcomms API. The shim is therefore the load-bearing path for the transition, and its performance vs. plain c10d is what determines whether migration is a no-cost drop-in. The shim is expected to perform at least as well as c10d — and often better, since it benefits from the same backend improvements as native torchcomms — so the comparison is primarily a regression check: if c10d_torchcomms is no slower than c10d across the sweep, migration is safe; any per-size regression shows up immediately in the relative plots. The native torchcomms numbers serve as a separate signal: the additional headroom available to teams that take the further step of porting to the new API.
Environment traceability. Every CSV row records the device name, torch version, torchcomms version, and backend version. This is essential when comparing numbers collected on different nodes, at different points in the release cycle, or after a backend (NCCL, oneCCL) upgrade.
Reduce-op sweep. The --reduce-op flag and the multi-op sweep in all mode ensure that SUM, MIN, MAX, AVG, and PRODUCT are all covered. Some backends optimize specific reduce operations; this sweep surfaces any asymmetry.
Post-Deprecation: Ongoing Performance Monitoring
Once c10d is fully retired the benchmarking framework does not become obsolete. It becomes the primary instrument for tracking torchcomms performance over time. The design choices (structured CSV output, full environment metadata per row, analyze_perf.py regression detection) were made with this long-term role explicitly in mind.
Longitudinal tracking. Because every CSV row captures the torchcomms version, backend version, device name, and torch version alongside the timing numbers, a historical archive of benchmark runs can be built up across releases. analyze_perf.py --baseline can compare any two runs in that archive, not just torchcomms vs c10d. A performance regression introduced between torchcomms v1.3 and v1.4 will be just as visible as one introduced during the initial migration.
Data-driven optimization. The bus-bandwidth and latency curves produced per collective and per message size directly guide where optimization effort is worth spending. A low bus-bandwidth reading on all_reduce at small message sizes points to latency-bound behavior and motivates kernel fusion or pipelining work. A degraded reduce_scatter curve after a backend upgrade pinpoints the scope of a regression before it reaches users. Without a structured, reproducible benchmark suite this kind of targeted diagnosis requires manual, ad-hoc measurement.
Multi-reduce-op coverage. The multi-op sweep (SUM, MIN, MAX, AVG, PRODUCT) is particularly important post-deprecation: as torchcomms gains new backend-specific fast paths for individual ops, the sweep ensures that an optimization for one op does not silently degrade another.
CLI Changes
The only renamed flag is --window → --sync-interval <n>; job scripts and CI configurations using --window must be updated. Everything else is additive: --reduce-op, --c10d, --c10d-torchcomms, --all-comm-adapters, --csv, and --quiet / -q require no changes to existing invocations, and the positional all argument is still supported.
API Changes Affecting Collective Benchmark Files
Each collective benchmark's public entry point changed signature:
# Before
def run_all_reduce_perf(
comm: torchcomms.TorchComm,
params: PerfParams,
device: torch.device,
) -> None: ...
# After
def run_all_reduce_perf(
comm: CommAdapter,
record: BenchRecord,
device: torch.device,
) -> None: ...
Out-of-tree code that imports and calls these functions directly must be updated. The CommAdapter wrapping of the TorchComm object is a one-line change at the call site:
# Before
comm = torchcomms.new_comm(backend, device)
run_all_reduce_perf(comm, params, device)
# After
comm = CommAdapter("torchcomms")
comm.init(backend, device)
run_all_reduce_perf(comm, record, device)
Deprecation of validate_params / print_perf_header / print_perf_result
The following public helpers in perf_test_helpers.py were removed or renamed:
| Removed |
Replacement |
validate_params(collective, params) |
validate_bench_params(record) |
print_perf_header(rank) |
log_perf_header(record, rank) |
print_perf_result(result, rank) |
log_perf_result(record, rank) |
sync_device(device) |
sync_device() (no argument; uses torch.accelerator) |
PerfResult |
BenchMetrics (part of BenchRecord) |
PerfParams |
BenchParams + BenchConfig (separated concerns) |
Recommended Adoption Steps
-
Update job scripts. Replace --window with --sync-interval. Add --csv to any benchmark invocations whose output is consumed downstream.
-
Establish a c10d baseline. Run the full suite with --c10d --csv on the target hardware before merging any torchcomms changes. The benchmark writes to ./c10d/ by default; archive that directory as the reference baseline for analyze_perf.py --baseline c10d.
-
Add a CI comparison job. After each PR that touches collective implementation, run torchcomms against the archived c10d baseline from step 2:
TEST_BACKEND=nccl torchrun --nproc_per_node=8 \
collective_perf_test.py all --csv --quiet
python analyze_perf.py \
--dir torchcomms:./torchcomms \
--dir c10d:./c10d_baseline \
--baseline c10d \
--regression-threshold 3.0
Fail the job if perf_regress_torchcomms_vs_c10d.{txt,csv} is non-empty.
-
Validate the shim. For each hardware target, collect a c10d_torchcomms run alongside c10d and confirm that the relative latency plots stay within run-to-run noise across all message sizes. If establishing a tighter bound, specify the methodology (e.g. median of N runs) so the threshold is reproducible.
RFC: Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring
Author: Panagiotis Kourdis <panagiotis.kourdis@intel.com>
Date: 2026-05-19
Implementation: meta-pytorch/torchcomms#2633
Summary
This RFC presents the design considerations and rationale for reengineering the performance benchmarking framework. The changes introduce:
CommAdapterabstraction that enables side-by-side benchmarking oftorchcomms,c10d, andc10d_torchcomms(c10d backed by torchcomms) across all supported collectives.run_bench_sweep()driver that replaces per-collective timing loops with a single measurement path.SUM,MIN,MAX,AVG, andPRODUCT.analyze_perf.pytool for summary reports, side-by-side plots, and automated regression and improvement detection.The primary motivation is two-fold:
c10din favor oftorchcomms.torchcommsafterc10dis fully retired enabling data-driven optimization decisions over the lifetime of the library.Motivation and Context
c10d Deprecation
torch.distributed(c10d) has served as PyTorch's canonical distributed communication layer.torchcommsis its intended replacement: an experimental communications API decoupled from PyTorch core to enable fast prototyping of new collectives and out-of-tree backends, with explicit goals of scaling to 100K+ GPUs via eager initialization and model-specific hints, first-class support for heterogeneous hardware (NCCL/NCCLX, RCCL/RCCLX, XCCL, Gloo), fault-tolerant process groups, one-sided (RDMA) communication, and device-centric collectives.During the migration period:
dist.all_reduce()/dist.barrier()etc.c10d_torchcomms) routes those calls through torchcomms internally while preserving thetorch.distributedAPIs.torchcommsdirectly.Adoption requires demonstrating that neither the migration path (
c10d_torchcomms) nor the direct path (torchcomms) introduces performance regressions relative to the baseline that users are migrating away from (c10d).Without a unified benchmarking harness, this comparison is error-prone: separate timing loops, different warm-up logic, or inconsistent barrier placement can skew results and generate misleading conclusions.
Where the Previous Benchmarking Code Could Be Improved
Each per-collective benchmark (
all_reduce_perf.py,all_gather_perf.py, etc.) carried its own timing loop, and that duplication showed up in a few ways:torchcomms.TorchComm, so swapping indist.*meant editing every file by hand. → addressed byCommAdapter.--all-comm-adaptersandCommAdapter.PerfResultexposed onlyavg_time_us; surfacing min/max would need per-iteration sync, which hadn't been added yet. → addressed byBenchMetricsplus--sync-intervalmodes.analyze_perf.py.sync_device(device)asked the caller for adeviceargument instead of using the newer accelerator abstraction. → addressed bysync_device()switching totorch.accelerator.Design
CommAdapter— Unified Communication AbstractionThe adapter normalizes the behavioral differences between the two APIs:
torchcomms.new_comm(backend, device, ...)dist.init_process_group(backend, ...)get_size,all_gather_single,reduce_scatter_singleget_world_size,all_gather_into_tensor,reduce_scatter_tensorgather(output, input)(output_list, input_tensor)(tensor, gather_list)send/recvasyncasync_op=Trueisend/irecvcomm.finalize()dist.destroy_process_group()c10d_torchcommsdist_config.use_torchcomms = Truebefore initBenchRecordHierarchyThe flat
PerfParams/PerfResultpair is replaced with a four-level hierarchy that separates concerns cleanly:EnvInfois designed for extension. Additional fields such as GPU interconnect topology, network adapter info, driver versions, operating system, etc. can be added without changing the rest of the schema or touching the run path.BenchRecordis passed through the entire call stack (parse → validate →run_bench_sweep→log_perf_result) as a single argument, eliminating the growing parameter lists that were accumulating in the old code.Shared
run_bench_sweep()DriverEvery collective delegates its timing loop to this single function. Each collective provides only two narrow callbacks:
setup_fn: allocates the tensors and maps collective arguments.bus_bw_fn: encodes the bus-bandwidth formula for that specific collective.For example, the
all_reducecallback captures the standard2(n-1)/nring-algorithm formula:This removes the duplicated warmup/measure/print logic from every per-collective file and ensures that every collective uses identical barrier placement, synchronization strategy, and timer arithmetic.
The driver supports three synchronization modes, controlled by
sync_interval:sync_interval == 0: single bulk start/stop, lowest overhead.sync_interval == 1: per-iteration sync, yields true min/max latency.sync_interval == N > 1: windowed sync, trades accuracy for lower overhead.Argument Validation
The CLI parser fails fast on misuse rather than running with surprising defaults:
--reduce_opinstead of--reduce-op, orall_redcueinstead ofall_reduce) are rejected at parse time.--reduce-op <op>is validated against the supported set (sum,min,max,avg,product, plusallto sweep every supported op) and rejected when applied to a non-reduction collective. Errors show the offending collective name and the full list of valid ops.--reduce-opwith theallcollective is rejected as redundant —allalready sweeps every reduction op for every reduction collective.--quiet/-qrequires--csv; otherwise results would be silently discarded.Structured CSV Output and
analyze_perf.pyEach benchmark can now write a structured CSV. The schema covers test identity, benchmark configuration, results, and full environment metadata, and is defined once in
_CSV_COLUMNSso the terminal printer and the CSV writer stay in sync.analyze_perf.pyconsumes these CSVs and writes:The pairwise subdir layout keeps the top level uncluttered when comparing multiple adapters at once: with
torchcomms,c10d, andc10d_torchcommsall compared against the c10d baseline, two pair directories appear instead of six flat report files.Impact on the c10d → torchcomms Transition
What This Enables
Apples-to-apples comparison. The
CommAdapterensures that the warmup count, measurement iterations, barrier placement, and synchronization strategy are identical fortorchcomms,c10d, andc10d_torchcomms. Earlier, any performance comparison between the two APIs would have been confounded by implementation differences in the test harnesses themselves.Regression gate. The
analyze_perf.pyregression detection can be run in CI against a fixed c10d baseline. A PR that regressestorchcommslatency beyond the threshold will produce aperf_regress_torchcomms_vs_c10d.txtartifact, making the regression actionable before the change lands.Migration-path validation. Most teams will migrate by switching their
c10dcalls to route throughc10d_torchcomms(the compatibility shim) before — if ever — porting code to the nativetorchcommsAPI. The shim is therefore the load-bearing path for the transition, and its performance vs. plainc10dis what determines whether migration is a no-cost drop-in. The shim is expected to perform at least as well asc10d— and often better, since it benefits from the same backend improvements as nativetorchcomms— so the comparison is primarily a regression check: ifc10d_torchcommsis no slower thanc10dacross the sweep, migration is safe; any per-size regression shows up immediately in the relative plots. The nativetorchcommsnumbers serve as a separate signal: the additional headroom available to teams that take the further step of porting to the new API.Environment traceability. Every CSV row records the device name, torch version, torchcomms version, and backend version. This is essential when comparing numbers collected on different nodes, at different points in the release cycle, or after a backend (NCCL, oneCCL) upgrade.
Reduce-op sweep. The
--reduce-opflag and the multi-op sweep inallmode ensure thatSUM,MIN,MAX,AVG, andPRODUCTare all covered. Some backends optimize specific reduce operations; this sweep surfaces any asymmetry.Post-Deprecation: Ongoing Performance Monitoring
Once c10d is fully retired the benchmarking framework does not become obsolete. It becomes the primary instrument for tracking torchcomms performance over time. The design choices (structured CSV output, full environment metadata per row,
analyze_perf.pyregression detection) were made with this long-term role explicitly in mind.Longitudinal tracking. Because every CSV row captures the torchcomms version, backend version, device name, and torch version alongside the timing numbers, a historical archive of benchmark runs can be built up across releases.
analyze_perf.py --baselinecan compare any two runs in that archive, not just torchcomms vs c10d. A performance regression introduced between torchcomms v1.3 and v1.4 will be just as visible as one introduced during the initial migration.Data-driven optimization. The bus-bandwidth and latency curves produced per collective and per message size directly guide where optimization effort is worth spending. A low bus-bandwidth reading on
all_reduceat small message sizes points to latency-bound behavior and motivates kernel fusion or pipelining work. A degradedreduce_scattercurve after a backend upgrade pinpoints the scope of a regression before it reaches users. Without a structured, reproducible benchmark suite this kind of targeted diagnosis requires manual, ad-hoc measurement.Multi-reduce-op coverage. The multi-op sweep (
SUM,MIN,MAX,AVG,PRODUCT) is particularly important post-deprecation: as torchcomms gains new backend-specific fast paths for individual ops, the sweep ensures that an optimization for one op does not silently degrade another.CLI Changes
The only renamed flag is
--window→--sync-interval <n>; job scripts and CI configurations using--windowmust be updated. Everything else is additive:--reduce-op,--c10d,--c10d-torchcomms,--all-comm-adapters,--csv, and--quiet/-qrequire no changes to existing invocations, and the positionalallargument is still supported.API Changes Affecting Collective Benchmark Files
Each collective benchmark's public entry point changed signature:
Out-of-tree code that imports and calls these functions directly must be updated. The
CommAdapterwrapping of theTorchCommobject is a one-line change at the call site:Deprecation of
validate_params/print_perf_header/print_perf_resultThe following public helpers in
perf_test_helpers.pywere removed or renamed:validate_params(collective, params)validate_bench_params(record)print_perf_header(rank)log_perf_header(record, rank)print_perf_result(result, rank)log_perf_result(record, rank)sync_device(device)sync_device()(no argument; usestorch.accelerator)PerfResultBenchMetrics(part ofBenchRecord)PerfParamsBenchParams+BenchConfig(separated concerns)Recommended Adoption Steps
Update job scripts. Replace
--windowwith--sync-interval. Add--csvto any benchmark invocations whose output is consumed downstream.Establish a c10d baseline. Run the full suite with
--c10d --csvon the target hardware before merging any torchcomms changes. The benchmark writes to./c10d/by default; archive that directory as the reference baseline foranalyze_perf.py --baseline c10d.Add a CI comparison job. After each PR that touches collective implementation, run torchcomms against the archived c10d baseline from step 2:
TEST_BACKEND=nccl torchrun --nproc_per_node=8 \ collective_perf_test.py all --csv --quiet python analyze_perf.py \ --dir torchcomms:./torchcomms \ --dir c10d:./c10d_baseline \ --baseline c10d \ --regression-threshold 3.0Fail the job if
perf_regress_torchcomms_vs_c10d.{txt,csv}is non-empty.Validate the shim. For each hardware target, collect a
c10d_torchcommsrun alongsidec10dand confirm that the relative latency plots stay within run-to-run noise across all message sizes. If establishing a tighter bound, specify the methodology (e.g. median of N runs) so the threshold is reproducible.