[RFC] Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring

# RFC: Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring

**Author:** Panagiotis Kourdis \<panagiotis.kourdis@intel.com\>

**Date:** 2026-05-19

**Implementation:** [meta-pytorch/torchcomms#2633](https://github.com/meta-pytorch/torchcomms/pull/2633)

---

## Summary

This RFC presents the design considerations and rationale for reengineering the performance benchmarking framework. The changes introduce:

- A `CommAdapter` abstraction that enables side-by-side benchmarking of `torchcomms`, `c10d`, and `c10d_torchcomms` (c10d backed by torchcomms) across all supported collectives.
- A shared `run_bench_sweep()` driver that replaces per-collective timing loops with a single measurement path.
- Structured CSV output with environment metadata on every row.
- A multi-reduce-op sweep covering `SUM`, `MIN`, `MAX`, `AVG`, and `PRODUCT`.
- A new `analyze_perf.py` tool for summary reports, side-by-side plots, and automated regression and improvement detection.

The primary motivation is two-fold:

- Provide the data and tooling required to validate the deprecation of `c10d` in favor of `torchcomms`.
- Establish a performance monitoring foundation for `torchcomms` after `c10d` is fully retired enabling data-driven optimization decisions over the lifetime of the library.

---

## Motivation and Context

### c10d Deprecation

`torch.distributed` (`c10d`) has served as PyTorch's canonical distributed communication layer. `torchcomms` is its intended replacement: an experimental communications API decoupled from PyTorch core to enable fast prototyping of new collectives and out-of-tree backends, with explicit goals of scaling to 100K+ GPUs via eager initialization and model-specific hints, first-class support for heterogeneous hardware (NCCL/NCCLX, RCCL/RCCLX, XCCL, Gloo), fault-tolerant process groups, one-sided (RDMA) communication, and device-centric collectives.

During the migration period:

- Existing user code may still call `dist.all_reduce()` / `dist.barrier()` etc.
- A compatibility shim (`c10d_torchcomms`) routes those calls through torchcomms internally while preserving the `torch.distributed` APIs.
- New code can also call `torchcomms` directly.

Adoption requires demonstrating that neither the migration path (`c10d_torchcomms`) nor the direct path (`torchcomms`) introduces performance regressions relative to the baseline that users are migrating away from (`c10d`).

Without a unified benchmarking harness, this comparison is error-prone: separate timing loops, different warm-up logic, or inconsistent barrier placement can skew results and generate misleading conclusions.

### Where the Previous Benchmarking Code Could Be Improved

Each per-collective benchmark (`all_reduce_perf.py`, `all_gather_perf.py`, etc.) carried its own timing loop, and that duplication showed up in a few ways:

- Collective calls were wired directly to `torchcomms.TorchComm`, so swapping in `dist.*` meant editing every file by hand. → addressed by `CommAdapter`.
- Running the c10d or c10d_torchcomms paths needed a separate script, which made head-to-head comparisons harder than they should be. → addressed by `--all-comm-adapters` and `CommAdapter`.
- `PerfResult` exposed only `avg_time_us`; surfacing min/max would need per-iteration sync, which hadn't been added yet. → addressed by `BenchMetrics` plus `--sync-interval` modes.
- Results were printed to stdout in a fixed-width format, so there was no easy path for post-processing or for comparing runs across commits or backends. → addressed by structured CSV output and `analyze_perf.py`.
- `sync_device(device)` asked the caller for a `device` argument instead of using the newer accelerator abstraction. → addressed by `sync_device()` switching to `torch.accelerator`.

---

## Design

### `CommAdapter` — Unified Communication Abstraction

```
CommAdapter(name: "torchcomms" | "c10d" | "c10d_torchcomms")
    .init(backend, device, hints)
    .run_collective(fn_name, *args, **kwargs, async_op=...)
    .get_rank() / get_size() / get_backend() / get_backend_version()
    .get_reduce_op(op_name)
    .finalize()
```

The adapter normalizes the behavioral differences between the two APIs:

| Operation | torchcomms | c10d |
|---|---|---|
| Init | `torchcomms.new_comm(backend, device, ...)` | `dist.init_process_group(backend, ...)` |
| Naming differences | `get_size`, `all_gather_single`, `reduce_scatter_single` | `get_world_size`, `all_gather_into_tensor`, `reduce_scatter_tensor` |
| `gather(output, input)` | `(output_list, input_tensor)` | `(tensor, gather_list)` |
| `send` / `recv` async | `async_op=True` | `isend` / `irecv` |
| Finalize | `comm.finalize()` | `dist.destroy_process_group()` |
| `c10d_torchcomms` | — | sets `dist_config.use_torchcomms = True` before init |

### `BenchRecord` Hierarchy

The flat `PerfParams` / `PerfResult` pair is replaced with a four-level hierarchy that separates concerns cleanly:

```
BenchRecord
├── EnvInfo       — device, torch/torchcomms versions, adapter, backend
├── BenchConfig   — message-size range, CSV path, quiet flag
├── BenchParams   — collective, ranks, dtype, op, iterations, dispatch mode
└── BenchMetrics  — message size, avg/min/max latency, bus bandwidth
```

`EnvInfo` is designed for extension. Additional fields such as GPU interconnect topology, network adapter info, driver versions, operating system, etc. can be added without changing the rest of the schema or touching the run path.

`BenchRecord` is passed through the entire call stack (parse → validate → `run_bench_sweep` → `log_perf_result`) as a single argument, eliminating the growing parameter lists that were accumulating in the old code.

### Shared `run_bench_sweep()` Driver

```python
run_bench_sweep(
    comm,
    record,
    device,
    collective_name,
    setup_fn,    # (num_elements, rank, num_ranks, device, dtype) -> (args, kwargs)
    bus_bw_fn,   # (num_elements, element_size, num_ranks, avg_time_us) -> gbps
)
```

Every collective delegates its timing loop to this single function. Each collective provides only two narrow callbacks:

- `setup_fn`: allocates the tensors and maps collective arguments.
- `bus_bw_fn`: encodes the bus-bandwidth formula for that specific collective.

For example, the `all_reduce` callback captures the standard `2(n-1)/n` ring-algorithm formula:

```python
def bus_bw(num_elements, element_size, num_ranks, avg_time_us):
    algo_bw = (num_elements * element_size) / avg_time_us
    return algo_bw * 2.0 * (num_ranks - 1) / num_ranks / 1000.0
```

This removes the duplicated warmup/measure/print logic from every per-collective file and ensures that every collective uses identical barrier placement, synchronization strategy, and timer arithmetic.

The driver supports three synchronization modes, controlled by `sync_interval`:

- `sync_interval == 0`: single bulk start/stop, lowest overhead.
- `sync_interval == 1`: per-iteration sync, yields true min/max latency.
- `sync_interval == N > 1`: windowed sync, trades accuracy for lower overhead.

### Argument Validation

The CLI parser fails fast on misuse rather than running with surprising defaults:

- Unknown flags or positionals (e.g. `--reduce_op` instead of `--reduce-op`, or `all_redcue` instead of `all_reduce`) are rejected at parse time.
- `--reduce-op <op>` is validated against the supported set (`sum`,  `min`, `max`, `avg`, `product`, plus `all` to sweep every supported op) and rejected when applied to a non-reduction collective. Errors show the offending collective name and the full list of valid ops.
- `--reduce-op` with the `all` collective is rejected as redundant —  `all` already sweeps every reduction op for every reduction collective.
- `--quiet` / `-q` requires `--csv`; otherwise results would be silently discarded.

### Structured CSV Output and `analyze_perf.py`

Each benchmark can now write a structured CSV. The schema covers test identity, benchmark configuration, results, and full environment metadata, and is defined once in `_CSV_COLUMNS` so the terminal printer and the CSV writer stay in sync.

`analyze_perf.py` consumes these CSVs and writes:

```
<outdir>/
├── perf_summary_<adapter>.{txt,csv}     # per-adapter
├── perf_comparison.{txt,csv}            # absolute values + % change vs baseline
├── plots/<collective>/                  # latency / bus-bandwidth PNGs per collective;
│   └── ...                              #   relative (% change) plots also written here when --baseline is set
└── <adapter>_vs_<baseline>/             # one subdir per pairwise comparison
    ├── perf_regress.{txt,csv}           # rows exceeding --regression-threshold
    ├── perf_improve.{txt,csv}           # rows exceeding --improvement-threshold
    └── perf_highlights.{txt,csv}        # counts/averages of improved/regressed/neutral points
```

The pairwise subdir layout keeps the top level uncluttered when comparing multiple adapters at once: with `torchcomms`, `c10d`, and `c10d_torchcomms` all compared against the c10d baseline, two pair directories appear instead of six flat report files.

---

## Impact on the c10d → torchcomms Transition

### What This Enables

**Apples-to-apples comparison.** The `CommAdapter` ensures that the warmup count, measurement iterations, barrier placement, and synchronization strategy are identical for `torchcomms`, `c10d`, and `c10d_torchcomms`. Earlier, any performance comparison between the two APIs would have been confounded by implementation differences in the test harnesses themselves.

**Regression gate.** The `analyze_perf.py` regression detection can be run in CI against a fixed c10d baseline. A PR that regresses `torchcomms` latency beyond the threshold will produce a `perf_regress_torchcomms_vs_c10d.txt` artifact, making the regression actionable before the change lands.

**Migration-path validation.** Most teams will migrate by switching their `c10d` calls to route through `c10d_torchcomms` (the compatibility shim) before — if ever — porting code to the native `torchcomms` API. The shim is therefore the load-bearing path for the transition, and its performance vs. plain `c10d` is what determines whether migration is a no-cost drop-in. The shim is expected to perform at least as well as `c10d` — and often better, since it benefits from the same backend improvements as native `torchcomms` — so the comparison is primarily a regression check: if `c10d_torchcomms` is no slower than `c10d` across the sweep, migration is safe; any per-size regression shows up immediately in the relative plots. The native `torchcomms` numbers serve as a separate signal: the additional headroom available to teams that take the further step of porting to the new API.

**Environment traceability.** Every CSV row records the device name, torch version, torchcomms version, and backend version. This is essential when comparing numbers collected on different nodes, at different points in the release cycle, or after a backend (NCCL, oneCCL) upgrade.

**Reduce-op sweep.** The `--reduce-op` flag and the multi-op sweep in `all` mode ensure that `SUM`, `MIN`, `MAX`, `AVG`, and `PRODUCT` are all covered. Some backends optimize specific reduce operations; this sweep surfaces any asymmetry.

### Post-Deprecation: Ongoing Performance Monitoring

Once c10d is fully retired the benchmarking framework does not become obsolete. It becomes the primary instrument for tracking torchcomms performance over time. The design choices (structured CSV output, full environment metadata per row, `analyze_perf.py` regression detection) were made with this long-term role explicitly in mind.

**Longitudinal tracking.** Because every CSV row captures the torchcomms version, backend version, device name, and torch version alongside the timing numbers, a historical archive of benchmark runs can be built up across releases. `analyze_perf.py --baseline` can compare any two runs in that archive, not just torchcomms vs c10d. A performance regression introduced between torchcomms v1.3 and v1.4 will be just as visible as one introduced during the initial migration.

**Data-driven optimization.** The bus-bandwidth and latency curves produced per collective and per message size directly guide where optimization effort is worth spending. A low bus-bandwidth reading on `all_reduce` at small message sizes points to latency-bound behavior and motivates kernel fusion or pipelining work. A degraded `reduce_scatter` curve after a backend upgrade pinpoints the scope of a regression before it reaches users. Without a structured, reproducible benchmark suite this kind of targeted diagnosis requires manual, ad-hoc measurement.

**Multi-reduce-op coverage.** The multi-op sweep (`SUM`, `MIN`, `MAX`, `AVG`, `PRODUCT`) is particularly important post-deprecation: as torchcomms gains new backend-specific fast paths for individual ops, the sweep ensures that an optimization for one op does not silently degrade another.

### CLI Changes

The only renamed flag is `--window` → `--sync-interval <n>`; job scripts and CI configurations using `--window` must be updated. Everything else is additive: `--reduce-op`, `--c10d`, `--c10d-torchcomms`, `--all-comm-adapters`, `--csv`, and `--quiet` / `-q` require no changes to existing invocations, and the positional `all` argument is still supported.

### API Changes Affecting Collective Benchmark Files

Each collective benchmark's public entry point changed signature:

```python
# Before
def run_all_reduce_perf(
    comm: torchcomms.TorchComm,
    params: PerfParams,
    device: torch.device,
) -> None: ...

# After
def run_all_reduce_perf(
    comm: CommAdapter,
    record: BenchRecord,
    device: torch.device,
) -> None: ...
```

Out-of-tree code that imports and calls these functions directly must be updated. The `CommAdapter` wrapping of the `TorchComm` object is a one-line change at the call site:

```python
# Before
comm = torchcomms.new_comm(backend, device)
run_all_reduce_perf(comm, params, device)

# After
comm = CommAdapter("torchcomms")
comm.init(backend, device)
run_all_reduce_perf(comm, record, device)
```

### Deprecation of `validate_params` / `print_perf_header` / `print_perf_result`

The following public helpers in `perf_test_helpers.py` were removed or renamed:

| Removed | Replacement |
|---|---|
| `validate_params(collective, params)` | `validate_bench_params(record)` |
| `print_perf_header(rank)` | `log_perf_header(record, rank)` |
| `print_perf_result(result, rank)` | `log_perf_result(record, rank)` |
| `sync_device(device)` | `sync_device()` (no argument; uses `torch.accelerator`) |
| `PerfResult` | `BenchMetrics` (part of `BenchRecord`) |
| `PerfParams` | `BenchParams` + `BenchConfig` (separated concerns) |

---

## Recommended Adoption Steps

1. **Update job scripts.** Replace `--window` with `--sync-interval`. Add    `--csv` to any benchmark invocations whose output is consumed downstream.

2. **Establish a c10d baseline.** Run the full suite with `--c10d --csv`  on the target hardware before merging any torchcomms changes. The benchmark writes to `./c10d/` by default; archive that directory as the reference baseline for `analyze_perf.py --baseline c10d`.

3. **Add a CI comparison job.** After each PR that touches collective implementation, run torchcomms against the archived c10d baseline from step 2:
   ```bash
   TEST_BACKEND=nccl torchrun --nproc_per_node=8 \
       collective_perf_test.py all --csv --quiet
   python analyze_perf.py \
       --dir torchcomms:./torchcomms \
       --dir c10d:./c10d_baseline \
       --baseline c10d \
       --regression-threshold 3.0
   ```
   Fail the job if `perf_regress_torchcomms_vs_c10d.{txt,csv}` is non-empty.

4. **Validate the shim.** For each hardware target, collect a `c10d_torchcomms` run alongside `c10d` and confirm that the relative latency plots stay within run-to-run noise across all message sizes. If establishing a tighter bound, specify the methodology (e.g. median of N runs) so the threshold is reproducible.

Operation	torchcomms	c10d
Init	`torchcomms.new_comm(backend, device, ...)`	`dist.init_process_group(backend, ...)`
Naming differences	`get_size`, `all_gather_single`, `reduce_scatter_single`	`get_world_size`, `all_gather_into_tensor`, `reduce_scatter_tensor`
`gather(output, input)`	`(output_list, input_tensor)`	`(tensor, gather_list)`
`send` / `recv` async	`async_op=True`	`isend` / `irecv`
Finalize	`comm.finalize()`	`dist.destroy_process_group()`
`c10d_torchcomms`	—	sets `dist_config.use_torchcomms = True` before init

Removed	Replacement
`validate_params(collective, params)`	`validate_bench_params(record)`
`print_perf_header(rank)`	`log_perf_header(record, rank)`
`print_perf_result(result, rank)`	`log_perf_result(record, rank)`
`sync_device(device)`	`sync_device()` (no argument; uses `torch.accelerator`)
`PerfResult`	`BenchMetrics` (part of `BenchRecord`)
`PerfParams`	`BenchParams` + `BenchConfig` (separated concerns)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring #2634

RFC: Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring

Summary

Motivation and Context

c10d Deprecation

Where the Previous Benchmarking Code Could Be Improved

Design

`CommAdapter` — Unified Communication Abstraction

`BenchRecord` Hierarchy

Shared `run_bench_sweep()` Driver

Argument Validation

Structured CSV Output and `analyze_perf.py`

Impact on the c10d → torchcomms Transition

What This Enables

Post-Deprecation: Ongoing Performance Monitoring

CLI Changes

API Changes Affecting Collective Benchmark Files

Deprecation of `validate_params` / `print_perf_header` / `print_perf_result`

Recommended Adoption Steps

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[RFC] Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring #2634

Description

RFC: Torchcomms Performance Benchmarking: Migration Validation and Long-Term Monitoring

Summary

Motivation and Context

c10d Deprecation

Where the Previous Benchmarking Code Could Be Improved

Design

CommAdapter — Unified Communication Abstraction

BenchRecord Hierarchy

Shared run_bench_sweep() Driver

Argument Validation

Structured CSV Output and analyze_perf.py

Impact on the c10d → torchcomms Transition

What This Enables

Post-Deprecation: Ongoing Performance Monitoring

CLI Changes

API Changes Affecting Collective Benchmark Files

Deprecation of validate_params / print_perf_header / print_perf_result

Recommended Adoption Steps

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`CommAdapter` — Unified Communication Abstraction

`BenchRecord` Hierarchy

Shared `run_bench_sweep()` Driver

Structured CSV Output and `analyze_perf.py`

Deprecation of `validate_params` / `print_perf_header` / `print_perf_result`