Skip to content

[RFC] Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring #2634

@pkourdis

Description

@pkourdis

RFC: Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring

Author: Panagiotis Kourdis <panagiotis.kourdis@intel.com>

Date: 2026-05-19

Implementation: meta-pytorch/torchcomms#2633


Summary

This RFC presents the design considerations and rationale for reengineering the performance benchmarking framework. The changes introduce:

  • A CommAdapter abstraction that enables side-by-side benchmarking of torchcomms, c10d, and c10d_torchcomms (c10d backed by torchcomms) across all supported collectives.
  • A shared run_bench_sweep() driver that replaces per-collective timing loops with a single measurement path.
  • Structured CSV output with environment metadata on every row.
  • A multi-reduce-op sweep covering SUM, MIN, MAX, AVG, and PRODUCT.
  • A new analyze_perf.py tool for summary reports, side-by-side plots, and automated regression and improvement detection.

The primary motivation is two-fold:

  • Provide the data and tooling required to validate the deprecation of c10d in favor of torchcomms.
  • Establish a performance monitoring foundation for torchcomms after c10d is fully retired enabling data-driven optimization decisions over the lifetime of the library.

Motivation and Context

c10d Deprecation

torch.distributed (c10d) has served as PyTorch's canonical distributed communication layer. torchcomms is its intended replacement: an experimental communications API decoupled from PyTorch core to enable fast prototyping of new collectives and out-of-tree backends, with explicit goals of scaling to 100K+ GPUs via eager initialization and model-specific hints, first-class support for heterogeneous hardware (NCCL/NCCLX, RCCL/RCCLX, XCCL, Gloo), fault-tolerant process groups, one-sided (RDMA) communication, and device-centric collectives.

During the migration period:

  • Existing user code may still call dist.all_reduce() / dist.barrier() etc.
  • A compatibility shim (c10d_torchcomms) routes those calls through torchcomms internally while preserving the torch.distributed APIs.
  • New code can also call torchcomms directly.

Adoption requires demonstrating that neither the migration path (c10d_torchcomms) nor the direct path (torchcomms) introduces performance regressions relative to the baseline that users are migrating away from (c10d).

Without a unified benchmarking harness, this comparison is error-prone: separate timing loops, different warm-up logic, or inconsistent barrier placement can skew results and generate misleading conclusions.

Where the Previous Benchmarking Code Could Be Improved

Each per-collective benchmark (all_reduce_perf.py, all_gather_perf.py, etc.) carried its own timing loop, and that duplication showed up in a few ways:

  • Collective calls were wired directly to torchcomms.TorchComm, so swapping in dist.* meant editing every file by hand. → addressed by CommAdapter.
  • Running the c10d or c10d_torchcomms paths needed a separate script, which made head-to-head comparisons harder than they should be. → addressed by --all-comm-adapters and CommAdapter.
  • PerfResult exposed only avg_time_us; surfacing min/max would need per-iteration sync, which hadn't been added yet. → addressed by BenchMetrics plus --sync-interval modes.
  • Results were printed to stdout in a fixed-width format, so there was no easy path for post-processing or for comparing runs across commits or backends. → addressed by structured CSV output and analyze_perf.py.
  • sync_device(device) asked the caller for a device argument instead of using the newer accelerator abstraction. → addressed by sync_device() switching to torch.accelerator.

Design

CommAdapter — Unified Communication Abstraction

CommAdapter(name: "torchcomms" | "c10d" | "c10d_torchcomms")
    .init(backend, device, hints)
    .run_collective(fn_name, *args, **kwargs, async_op=...)
    .get_rank() / get_size() / get_backend() / get_backend_version()
    .get_reduce_op(op_name)
    .finalize()

The adapter normalizes the behavioral differences between the two APIs:

Operation torchcomms c10d
Init torchcomms.new_comm(backend, device, ...) dist.init_process_group(backend, ...)
Naming differences get_size, all_gather_single, reduce_scatter_single get_world_size, all_gather_into_tensor, reduce_scatter_tensor
gather(output, input) (output_list, input_tensor) (tensor, gather_list)
send / recv async async_op=True isend / irecv
Finalize comm.finalize() dist.destroy_process_group()
c10d_torchcomms sets dist_config.use_torchcomms = True before init

BenchRecord Hierarchy

The flat PerfParams / PerfResult pair is replaced with a four-level hierarchy that separates concerns cleanly:

BenchRecord
├── EnvInfo       — device, torch/torchcomms versions, adapter, backend
├── BenchConfig   — message-size range, CSV path, quiet flag
├── BenchParams   — collective, ranks, dtype, op, iterations, dispatch mode
└── BenchMetrics  — message size, avg/min/max latency, bus bandwidth

EnvInfo is designed for extension. Additional fields such as GPU interconnect topology, network adapter info, driver versions, operating system, etc. can be added without changing the rest of the schema or touching the run path.

BenchRecord is passed through the entire call stack (parse → validate → run_bench_sweeplog_perf_result) as a single argument, eliminating the growing parameter lists that were accumulating in the old code.

Shared run_bench_sweep() Driver

run_bench_sweep(
    comm,
    record,
    device,
    collective_name,
    setup_fn,    # (num_elements, rank, num_ranks, device, dtype) -> (args, kwargs)
    bus_bw_fn,   # (num_elements, element_size, num_ranks, avg_time_us) -> gbps
)

Every collective delegates its timing loop to this single function. Each collective provides only two narrow callbacks:

  • setup_fn: allocates the tensors and maps collective arguments.
  • bus_bw_fn: encodes the bus-bandwidth formula for that specific collective.

For example, the all_reduce callback captures the standard 2(n-1)/n ring-algorithm formula:

def bus_bw(num_elements, element_size, num_ranks, avg_time_us):
    algo_bw = (num_elements * element_size) / avg_time_us
    return algo_bw * 2.0 * (num_ranks - 1) / num_ranks / 1000.0

This removes the duplicated warmup/measure/print logic from every per-collective file and ensures that every collective uses identical barrier placement, synchronization strategy, and timer arithmetic.

The driver supports three synchronization modes, controlled by sync_interval:

  • sync_interval == 0: single bulk start/stop, lowest overhead.
  • sync_interval == 1: per-iteration sync, yields true min/max latency.
  • sync_interval == N > 1: windowed sync, trades accuracy for lower overhead.

Argument Validation

The CLI parser fails fast on misuse rather than running with surprising defaults:

  • Unknown flags or positionals (e.g. --reduce_op instead of --reduce-op, or all_redcue instead of all_reduce) are rejected at parse time.
  • --reduce-op <op> is validated against the supported set (sum, min, max, avg, product, plus all to sweep every supported op) and rejected when applied to a non-reduction collective. Errors show the offending collective name and the full list of valid ops.
  • --reduce-op with the all collective is rejected as redundant — all already sweeps every reduction op for every reduction collective.
  • --quiet / -q requires --csv; otherwise results would be silently discarded.

Structured CSV Output and analyze_perf.py

Each benchmark can now write a structured CSV. The schema covers test identity, benchmark configuration, results, and full environment metadata, and is defined once in _CSV_COLUMNS so the terminal printer and the CSV writer stay in sync.

analyze_perf.py consumes these CSVs and writes:

<outdir>/
├── perf_summary_<adapter>.{txt,csv}     # per-adapter
├── perf_comparison.{txt,csv}            # absolute values + % change vs baseline
├── plots/<collective>/                  # latency / bus-bandwidth PNGs per collective;
│   └── ...                              #   relative (% change) plots also written here when --baseline is set
└── <adapter>_vs_<baseline>/             # one subdir per pairwise comparison
    ├── perf_regress.{txt,csv}           # rows exceeding --regression-threshold
    ├── perf_improve.{txt,csv}           # rows exceeding --improvement-threshold
    └── perf_highlights.{txt,csv}        # counts/averages of improved/regressed/neutral points

The pairwise subdir layout keeps the top level uncluttered when comparing multiple adapters at once: with torchcomms, c10d, and c10d_torchcomms all compared against the c10d baseline, two pair directories appear instead of six flat report files.


Impact on the c10d → torchcomms Transition

What This Enables

Apples-to-apples comparison. The CommAdapter ensures that the warmup count, measurement iterations, barrier placement, and synchronization strategy are identical for torchcomms, c10d, and c10d_torchcomms. Earlier, any performance comparison between the two APIs would have been confounded by implementation differences in the test harnesses themselves.

Regression gate. The analyze_perf.py regression detection can be run in CI against a fixed c10d baseline. A PR that regresses torchcomms latency beyond the threshold will produce a perf_regress_torchcomms_vs_c10d.txt artifact, making the regression actionable before the change lands.

Migration-path validation. Most teams will migrate by switching their c10d calls to route through c10d_torchcomms (the compatibility shim) before — if ever — porting code to the native torchcomms API. The shim is therefore the load-bearing path for the transition, and its performance vs. plain c10d is what determines whether migration is a no-cost drop-in. The shim is expected to perform at least as well as c10d — and often better, since it benefits from the same backend improvements as native torchcomms — so the comparison is primarily a regression check: if c10d_torchcomms is no slower than c10d across the sweep, migration is safe; any per-size regression shows up immediately in the relative plots. The native torchcomms numbers serve as a separate signal: the additional headroom available to teams that take the further step of porting to the new API.

Environment traceability. Every CSV row records the device name, torch version, torchcomms version, and backend version. This is essential when comparing numbers collected on different nodes, at different points in the release cycle, or after a backend (NCCL, oneCCL) upgrade.

Reduce-op sweep. The --reduce-op flag and the multi-op sweep in all mode ensure that SUM, MIN, MAX, AVG, and PRODUCT are all covered. Some backends optimize specific reduce operations; this sweep surfaces any asymmetry.

Post-Deprecation: Ongoing Performance Monitoring

Once c10d is fully retired the benchmarking framework does not become obsolete. It becomes the primary instrument for tracking torchcomms performance over time. The design choices (structured CSV output, full environment metadata per row, analyze_perf.py regression detection) were made with this long-term role explicitly in mind.

Longitudinal tracking. Because every CSV row captures the torchcomms version, backend version, device name, and torch version alongside the timing numbers, a historical archive of benchmark runs can be built up across releases. analyze_perf.py --baseline can compare any two runs in that archive, not just torchcomms vs c10d. A performance regression introduced between torchcomms v1.3 and v1.4 will be just as visible as one introduced during the initial migration.

Data-driven optimization. The bus-bandwidth and latency curves produced per collective and per message size directly guide where optimization effort is worth spending. A low bus-bandwidth reading on all_reduce at small message sizes points to latency-bound behavior and motivates kernel fusion or pipelining work. A degraded reduce_scatter curve after a backend upgrade pinpoints the scope of a regression before it reaches users. Without a structured, reproducible benchmark suite this kind of targeted diagnosis requires manual, ad-hoc measurement.

Multi-reduce-op coverage. The multi-op sweep (SUM, MIN, MAX, AVG, PRODUCT) is particularly important post-deprecation: as torchcomms gains new backend-specific fast paths for individual ops, the sweep ensures that an optimization for one op does not silently degrade another.

CLI Changes

The only renamed flag is --window--sync-interval <n>; job scripts and CI configurations using --window must be updated. Everything else is additive: --reduce-op, --c10d, --c10d-torchcomms, --all-comm-adapters, --csv, and --quiet / -q require no changes to existing invocations, and the positional all argument is still supported.

API Changes Affecting Collective Benchmark Files

Each collective benchmark's public entry point changed signature:

# Before
def run_all_reduce_perf(
    comm: torchcomms.TorchComm,
    params: PerfParams,
    device: torch.device,
) -> None: ...

# After
def run_all_reduce_perf(
    comm: CommAdapter,
    record: BenchRecord,
    device: torch.device,
) -> None: ...

Out-of-tree code that imports and calls these functions directly must be updated. The CommAdapter wrapping of the TorchComm object is a one-line change at the call site:

# Before
comm = torchcomms.new_comm(backend, device)
run_all_reduce_perf(comm, params, device)

# After
comm = CommAdapter("torchcomms")
comm.init(backend, device)
run_all_reduce_perf(comm, record, device)

Deprecation of validate_params / print_perf_header / print_perf_result

The following public helpers in perf_test_helpers.py were removed or renamed:

Removed Replacement
validate_params(collective, params) validate_bench_params(record)
print_perf_header(rank) log_perf_header(record, rank)
print_perf_result(result, rank) log_perf_result(record, rank)
sync_device(device) sync_device() (no argument; uses torch.accelerator)
PerfResult BenchMetrics (part of BenchRecord)
PerfParams BenchParams + BenchConfig (separated concerns)

Recommended Adoption Steps

  1. Update job scripts. Replace --window with --sync-interval. Add --csv to any benchmark invocations whose output is consumed downstream.

  2. Establish a c10d baseline. Run the full suite with --c10d --csv on the target hardware before merging any torchcomms changes. The benchmark writes to ./c10d/ by default; archive that directory as the reference baseline for analyze_perf.py --baseline c10d.

  3. Add a CI comparison job. After each PR that touches collective implementation, run torchcomms against the archived c10d baseline from step 2:

    TEST_BACKEND=nccl torchrun --nproc_per_node=8 \
        collective_perf_test.py all --csv --quiet
    python analyze_perf.py \
        --dir torchcomms:./torchcomms \
        --dir c10d:./c10d_baseline \
        --baseline c10d \
        --regression-threshold 3.0

    Fail the job if perf_regress_torchcomms_vs_c10d.{txt,csv} is non-empty.

  4. Validate the shim. For each hardware target, collect a c10d_torchcomms run alongside c10d and confirm that the relative latency plots stay within run-to-run noise across all message sizes. If establishing a tighter bound, specify the methodology (e.g. median of N runs) so the threshold is reproducible.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions