perf improvements #207

Had a bit of fun with GPT-5.2 Pro, Codex 5.3 xhigh, Gemini, and Claude (and my own brain) over the past couple of days to eke out performance (for a weekend project idea that didn't end up panning out). No hard feelings if you reject the drive-by PRs!

Allocations, as mentioned in #160, did not seem to make a huge difference compared to the SIMD side of things, especially after the other optimizations. I may do a PR to address them anyway, since it can be useful.

Recommended merge order if you want any/all of these PRs: 

1. #197  
2. #199  
3. #198  
4. #204 
5. #201  
6. #200 (after this, update avx512-binary: I can do a PR to change `is_empty` in `symbol.rs` from `#[cfg(feature = "benchmarking")]` to `#[allow(dead_code)]`)
7. #202  
8. #203 (after this, I can rebase `perf/skip-hdpc-decode` onto `perf/enc-indices-callback`)
9. #206 (after this, I can rebase the slab allocation)
10. #208 (may need some rebasing on #209 to merge)
11. #209
12. #211 
13. #213 
14. #212
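
For reference, the attribute change mentioned in step 6 might look roughly like the sketch below. The `Symbol` struct here is a hypothetical stand-in (the real type in `symbol.rs` may have different fields and visibility); only the two attributes come from the note above.

```rust
// Hypothetical stand-in for the crate's symbol type; the real struct may differ.
pub struct Symbol {
    value: Vec<u8>,
}

impl Symbol {
    pub fn new(value: Vec<u8>) -> Self {
        Symbol { value }
    }

    // Before: the method only exists when the `benchmarking` feature is enabled.
    // #[cfg(feature = "benchmarking")]
    //
    // After: the method is always compiled, and the unused-code lint is
    // silenced for builds in which nothing calls it.
    #[allow(dead_code)]
    pub(crate) fn is_empty(&self) -> bool {
        self.value.is_empty()
    }
}
```

The practical difference is that code built without the feature can still call `is_empty`, at the cost of compiling a method that may go unused.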

# Full combined benchmarks: 

Baseline: `fork/master` (`e777861`). Combined: `perf/all-combined` (`95b3e3e`).

## Zen 4 — AMD EPYC 9654P 96-Core (bare metal, 256-bit AVX-512 µops)

### codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol += | 22.46 ns | 17.26 ns | **-23.3%** |
| Symbol FMA | 23.18 ns | 23.9 ns | +3.1% (noise?) |
| encode 10KB | 37.00 µs | 10.2 µs | **-72.4%** |
| roundtrip 10KB | 38.56 µs | 11.31 µs | **-70.7%** |
| roundtrip repair 10KB | 80.59 µs | 43.49 µs | **-46.0%** |
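
A minimal sketch of how the Delta columns can be read, assuming they are the relative change from master to combined (the function names are illustrative, not part of the benchmark code). For latency tables a negative value is an improvement; for the throughput tables a positive value is. Because the table values are rounded, recomputing from them can differ from the printed Delta in the last digit (for example, `Symbol +=` comes out at -23.2%).

```rust
/// Relative change from `master` to `combined`, in percent.
fn delta_percent(master: f64, combined: f64) -> f64 {
    (combined - master) / master * 100.0
}

/// Formats the change with an explicit sign and one decimal place.
fn format_delta(master: f64, combined: f64) -> String {
    format!("{:+.1}%", delta_percent(master, combined))
}
```

With the Zen 4 rows above, `format_delta(37.00, 10.2)` gives `-72.4%` for encode 10KB and `format_delta(38.56, 11.31)` gives `-70.7%` for roundtrip 10KB.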

### decode_benchmark (Mbit/s, symbol_size=1280)

| K | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 10 | 2,573 | 2,745 | +6.7% | 2,528 | 2,753 | +8.9% |
| 100 | 3,218 | 3,529 | +9.7% | 3,139 | 3,493 | +11.3% |
| 250 | 3,479 | 3,667 | +5.4% | 3,343 | 3,846 | **+15.0%** |
| 500 | 3,658 | 3,738 | +2.2% | 3,593 | 4,703 | **+30.9%** |
| 1,000 | 3,576 | 3,734 | +4.4% | 3,455 | 4,617 | **+33.6%** |
| 2,000 | 3,287 | 3,478 | +5.8% | 3,330 | 4,322 | **+29.8%** |
| 5,000 | 2,839 | 3,150 | **+11.0%** | 2,889 | 3,800 | **+31.5%** |
| 10,000 | 2,570 | 2,736 | +6.4% | 2,336 | 3,100 | **+32.7%** |
| 20,000 | 1,957 | 2,082 | +6.4% | 1,860 | 2,282 | **+22.7%** |
| 50,000 | 1,377 | 1,455 | +5.7% | 1,175 | 1,451 | **+23.5%** |

### encode_benchmark (Mbit/s, symbol_size=1280)

Master uses `SourceBlockEncoder::new()` (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 4,785 | 7,529 | **+57.3%** |
| 100 | 5,985 | 9,936 | **+66.0%** |
| 1,000 | 5,345 | 9,150 | **+71.2%** |
| 10,000 | 3,617 | 5,549 | **+53.4%** |
| 50,000 | 2,146 | 2,096 | -2.3% (noise?) |
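
To illustrate why a plan cache helps in the comparison above, here is a rough sketch of the general idea: precompute the data-independent work once per K and share it between encoders. This is not the implementation in PR #200; `Plan`, `build_plan`, and `cached_plan` are hypothetical names, and the "plan" is a placeholder for whatever the real precomputation produces.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for a precomputed encoding plan for `k` source symbols.
pub struct Plan {
    pub k: u16,
    pub steps: Vec<u32>,
}

/// Placeholder for the expensive precomputation that depends only on `k`.
fn build_plan(k: u16) -> Plan {
    Plan { k, steps: (0..u32::from(k)).collect() }
}

/// Process-wide cache, created lazily on first use.
static PLAN_CACHE: OnceLock<Mutex<HashMap<u16, Arc<Plan>>>> = OnceLock::new();

/// Returns the shared plan for `k`, building it only on the first request.
pub fn cached_plan(k: u16) -> Arc<Plan> {
    let cache = PLAN_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    let plan = guard.entry(k).or_insert_with(|| Arc::new(build_plan(k)));
    Arc::clone(plan)
}
```

In this sketch, repeated calls with the same K return the same `Arc`, so only the first encoder for a given K pays the planning cost. That would be consistent with the gap between the "no plan" and "plan cache" columns being largest at small K on the other platforms, though the tables alone do not prove the cause.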

---

## Zen 5 — AMD EPYC 9B45 (GCP c4d-standard-4, native 512-bit AVX-512)

### codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 16.93 ns | 14.35 ns | **-15.3%** |
| Symbol += | 21.22 ns | 16.83 ns | **-20.7%** |
| Symbol FMA | 20.40 ns | 15.79 ns | **-22.6%** |
| encode 10KB | 30.82 µs | 7.60 µs | **-75.3%** |
| roundtrip 10KB | 31.62 µs | 7.16 µs | **-77.4%** |
| roundtrip repair 10KB | 65.76 µs | 31.69 µs | **-51.8%** |
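
For readers unfamiliar with the three `Symbol` micro-benchmarks, the sketch below is a scalar reference for what they most likely measure, assuming they map to RFC 6330 octet arithmetic over GF(256) with the field polynomial x^8 + x^4 + x^3 + x^2 + 1. The function names are illustrative; the crate's actual SIMD kernels are structured differently.

```rust
/// Multiplies two GF(256) elements, reducing by x^8 + x^4 + x^3 + x^2 + 1.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut acc = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            acc ^= a; // addition in GF(256) is XOR
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1D; // low byte of the field polynomial
        }
        b >>= 1;
    }
    acc
}

/// "Symbol +=": element-wise addition of two symbols.
fn add_assign(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d ^= *s;
    }
}

/// "mulassign_scalar": multiply every octet of a symbol by one scalar.
fn mul_assign_scalar(dst: &mut [u8], scalar: u8) {
    for d in dst.iter_mut() {
        *d = gf256_mul(*d, scalar);
    }
}

/// "FMA": dst += src * scalar, fused into a single pass.
fn fused_add_assign_mul_scalar(dst: &mut [u8], src: &[u8], scalar: u8) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d ^= gf256_mul(*s, scalar);
    }
}
```

These loops are independent per octet, which is why they vectorize well and why vector width matters for the numbers in these tables.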

### decode_benchmark (Mbit/s, symbol_size=1280)

| K | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 10 | 3,180 | 4,571 | **+43.7%** | 3,170 | 4,112 | **+29.7%** |
| 100 | 4,077 | 6,020 | **+47.6%** | 4,045 | 5,654 | **+39.8%** |
| 250 | 4,506 | 5,500 | **+22.0%** | 4,448 | 5,412 | **+21.7%** |
| 500 | 4,703 | 6,003 | **+27.6%** | 4,703 | 6,542 | **+39.1%** |
| 1,000 | 4,659 | 5,738 | **+23.2%** | 4,514 | 6,428 | **+42.4%** |
| 2,000 | 4,435 | 5,460 | **+23.1%** | 4,435 | 5,974 | **+34.7%** |
| 5,000 | 4,086 | 4,932 | **+20.7%** | 3,938 | 5,279 | **+34.1%** |
| 10,000 | 3,617 | 4,500 | **+24.4%** | 3,255 | 4,321 | **+32.7%** |
| 20,000 | 2,968 | 3,658 | **+23.2%** | 2,646 | 3,288 | **+24.2%** |
| 50,000 | 1,981 | 2,584 | **+30.4%** | 1,669 | 2,190 | **+31.2%** |

### encode_benchmark (Mbit/s, symbol_size=1280)

Master uses `SourceBlockEncoder::new()` (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 4,633 | 18,617 | **+301.8%** |
| 100 | 5,848 | 17,646 | **+201.7%** |
| 250 | 5,982 | 15,738 | **+163.1%** |
| 500 | 5,516 | 15,232 | **+176.1%** |
| 1,000 | 5,317 | 14,719 | **+176.8%** |
| 2,000 | 5,129 | 13,542 | **+164.0%** |
| 5,000 | 4,741 | 15,751 | **+232.3%** |
| 10,000 | 3,938 | 9,669 | **+145.5%** |
| 20,000 | 3,451 | 7,234 | **+109.6%** |
| 50,000 | 2,460 | 3,906 | **+58.8%** |

With explicit pre-built plan on both sides:

| K | master | combined | Delta |
|---|---|---|---|
| 10 | 8,904 | 13,127 | **+47.4%** |
| 100 | 13,291 | 17,646 | **+32.8%** |
| 250 | 12,325 | 15,984 | **+29.7%** |
| 500 | 12,295 | 21,713 | **+76.6%** |
| 1,000 | 12,236 | 14,719 | **+20.3%** |
| 2,000 | 11,674 | 14,106 | **+20.8%** |
| 5,000 | 10,389 | 12,683 | **+22.1%** |
| 10,000 | 8,719 | 10,973 | **+25.8%** |
| 20,000 | 6,829 | 8,878 | **+30.0%** |
| 50,000 | 4,191 | 6,028 | **+43.8%** |

---

## Intel Emerald Rapids — Xeon Platinum 8581C (GCP c4-standard-4)

### codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 26.68 ns | 20.17 ns | **-24.4%** |
| Symbol += | 23.61 ns | 18.26 ns | **-22.7%** |
| Symbol FMA | 30.47 ns | 22.99 ns | **-24.5%** |
| encode 10KB | 44.87 µs | 12.38 µs | **-72.4%** |
| roundtrip 10KB | 45.74 µs | 13.93 µs | **-69.5%** |
| roundtrip repair 10KB | 94.72 µs | 48.57 µs | **-48.7%** |

### decode_benchmark (Mbit/s, symbol_size=1280)

| K | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 10 | 2,081 | 2,884 | **+38.6%** | 2,081 | 2,553 | **+22.7%** |
| 100 | 2,804 | 3,848 | **+37.2%** | 2,804 | 3,554 | **+26.7%** |
| 250 | 2,991 | 3,527 | **+17.9%** | 2,931 | 3,540 | **+20.8%** |
| 500 | 3,209 | 3,836 | **+19.6%** | 3,179 | 4,200 | **+32.1%** |
| 1,000 | 3,106 | 3,666 | **+18.0%** | 2,996 | 4,095 | **+36.7%** |
| 2,000 | 2,877 | 3,408 | **+18.5%** | 2,894 | 3,762 | **+30.0%** |
| 5,000 | 2,543 | 3,014 | **+18.5%** | 2,485 | 3,171 | **+27.6%** |
| 10,000 | 2,282 | 2,713 | **+18.9%** | 2,109 | 2,743 | **+30.1%** |
| 20,000 | 1,949 | 2,320 | **+19.0%** | 1,792 | 2,175 | **+21.4%** |
| 50,000 | 1,464 | 1,710 | **+16.8%** | 1,179 | 1,473 | **+24.9%** |

### encode_benchmark (Mbit/s, symbol_size=1280)

Master uses `SourceBlockEncoder::new()` (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 2,917 | 10,038 | **+244.1%** |
| 100 | 3,776 | 9,655 | **+155.7%** |
| 250 | 3,540 | 8,973 | **+153.5%** |
| 500 | 3,698 | 8,576 | **+131.9%** |
| 1,000 | 3,589 | 7,997 | **+122.8%** |
| 2,000 | 3,397 | 7,254 | **+113.6%** |
| 5,000 | 3,120 | 7,288 | **+133.6%** |
| 10,000 | 2,751 | 5,336 | **+94.0%** |
| 20,000 | 2,435 | 4,283 | **+75.9%** |
| 50,000 | 1,695 | 2,550 | **+50.4%** |

With explicit pre-built plan on both sides:

| K | master | combined | Delta |
|---|---|---|---|
| 10 | 5,595 | 7,366 | **+31.7%** |
| 100 | 8,458 | 9,841 | **+16.3%** |
| 250 | 7,869 | 8,973 | **+14.0%** |
| 500 | 7,731 | 12,006 | **+55.3%** |
| 1,000 | 7,360 | 8,190 | **+11.3%** |
| 2,000 | 6,862 | 7,468 | +8.8% |
| 5,000 | 5,919 | 6,735 | **+13.8%** |
| 10,000 | 5,140 | 5,848 | **+13.8%** |
| 20,000 | 4,672 | 5,279 | **+13.0%** |
| 50,000 | 3,090 | 4,138 | **+33.9%** |

---

## Axion V2 — Google Axion / Neoverse V2 (GCP c4a-standard-4, aarch64)

### codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 36.42 ns | 31.05 ns | **-14.8%** |
| Symbol += | 31.53 ns | 28.84 ns | **-8.5%** |
| Symbol FMA | 37.77 ns | 36.42 ns | -3.6% |
| encode 10KB | 45.00 µs | 17.34 µs | **-61.5%** |
| roundtrip 10KB | 46.23 µs | 18.64 µs | **-59.7%** |
| roundtrip repair 10KB | 95.43 µs | 57.81 µs | **-39.4%** |

### decode_benchmark (Mbit/s, symbol_size=1280)

| K | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 10 | 2,156 | 2,438 | **+13.1%** | 2,124 | 2,426 | **+14.2%** |
| 100 | 2,827 | 3,312 | **+17.2%** | 2,796 | 3,270 | **+16.9%** |
| 250 | 2,931 | 3,279 | **+11.9%** | 2,956 | 3,577 | **+21.0%** |
| 500 | 2,958 | 3,303 | **+11.6%** | 2,949 | 4,132 | **+40.1%** |
| 1,000 | 2,806 | 3,214 | **+14.6%** | 2,821 | 3,666 | **+30.0%** |
| 2,000 | 2,329 | 2,790 | **+19.8%** | 2,430 | 3,068 | **+26.3%** |
| 5,000 | 1,832 | 2,353 | **+28.4%** | 1,949 | 2,298 | **+17.9%** |
| 10,000 | 1,432 | 1,989 | **+38.9%** | 1,585 | 1,710 | +7.9% |
| 20,000 | 1,126 | 1,628 | **+44.5%** | 1,206 | 1,313 | +8.9% |
| 50,000 | 821 | 1,188 | **+44.6%** | 686 | 955 | **+39.1%** |

### encode_benchmark (Mbit/s, symbol_size=1280)

Master uses `SourceBlockEncoder::new()` (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 2,716 | 4,697 | **+72.9%** |
| 100 | 3,566 | 7,010 | **+96.6%** |
| 250 | 3,527 | 6,516 | **+84.7%** |
| 500 | 3,368 | 6,378 | **+89.4%** |
| 1,000 | 3,087 | 6,082 | **+97.0%** |
| 2,000 | 2,790 | 5,345 | **+91.6%** |
| 5,000 | 2,282 | 4,302 | **+88.5%** |
| 10,000 | 1,915 | 3,685 | **+92.5%** |
| 20,000 | 1,588 | 2,872 | **+80.9%** |
| 50,000 | 1,097 | 1,874 | **+70.8%** |

With explicit pre-built plan on both sides:

| K | master | combined | Delta |
|---|---|---|---|
| 10 | 4,096 | 4,654 | **+13.6%** |
| 100 | 6,240 | 7,058 | **+13.1%** |
| 250 | 5,779 | 6,516 | **+12.7%** |
| 500 | 5,701 | 6,378 | **+11.9%** |
| 1,000 | 5,235 | 6,155 | **+17.6%** |
| 2,000 | 4,435 | 5,460 | **+23.1%** |
| 5,000 | 3,191 | 4,173 | **+30.8%** |
| 10,000 | 2,611 | 3,130 | **+19.9%** |
| 20,000 | 2,209 | 2,705 | **+22.4%** |
| 50,000 | 1,738 | 2,194 | **+26.3%** |

---

## Axion N3 — Google Axion / Neoverse N3 (GCP n4a-standard-4, aarch64)

### codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 48.32 ns | 42.89 ns | **-11.2%** |
| Symbol += | 33.19 ns | 29.80 ns | **-10.2%** |
| Symbol FMA | 50.78 ns | 49.06 ns | -3.4% |
| encode 10KB | 50.12 µs | 21.66 µs | **-56.8%** |
| roundtrip 10KB | 52.03 µs | 23.20 µs | **-55.4%** |
| roundtrip repair 10KB | 107.42 µs | 69.62 µs | **-35.2%** |

### decode_benchmark (Mbit/s, symbol_size=1280)

| K | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 10 | 1,724 | 1,928 | **+11.9%** | 1,724 | 1,921 | **+11.4%** |
| 100 | 2,466 | 2,781 | **+12.8%** | 2,454 | 2,759 | **+12.4%** |
| 250 | 2,551 | 2,787 | +9.3% | 2,551 | 3,354 | **+31.5%** |
| 500 | 2,597 | 2,811 | +8.3% | 2,558 | 4,098 | **+60.2%** |
| 1,000 | 2,539 | 2,745 | +8.1% | 2,483 | 3,891 | **+56.7%** |
| 2,000 | 2,208 | 2,436 | **+10.3%** | 2,198 | 3,443 | **+56.6%** |
| 5,000 | 1,938 | 2,123 | +9.6% | 1,885 | 2,864 | **+51.9%** |
| 10,000 | 1,741 | 1,900 | +9.1% | 1,604 | 2,365 | **+47.5%** |
| 20,000 | 1,455 | 1,593 | +9.5% | 1,318 | 1,725 | **+30.9%** |
| 50,000 | 1,051 | 1,188 | **+13.0%** | 864 | 1,197 | **+38.6%** |

### encode_benchmark (Mbit/s, symbol_size=1280)

Master uses `SourceBlockEncoder::new()` (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 2,064 | 3,150 | **+52.6%** |
| 100 | 3,028 | 5,169 | **+70.7%** |
| 250 | 3,000 | 4,918 | **+63.9%** |
| 500 | 2,949 | 4,906 | **+66.3%** |
| 1,000 | 2,910 | 4,724 | **+62.3%** |
| 2,000 | 2,673 | 4,197 | **+57.0%** |
| 5,000 | 2,336 | 3,426 | **+46.7%** |
| 10,000 | 2,043 | 3,255 | **+59.3%** |
| 20,000 | 1,769 | 2,597 | **+46.8%** |
| 50,000 | 1,270 | 1,678 | **+32.1%** |

With explicit pre-built plan on both sides:

| K | master | combined | Delta |
|---|---|---|---|
| 10 | 2,892 | 3,141 | +8.6% |
| 100 | 4,760 | 5,169 | +8.6% |
| 250 | 4,506 | 4,894 | +8.6% |
| 500 | 4,496 | 4,883 | +8.6% |
| 1,000 | 4,435 | 4,813 | +8.5% |
| 2,000 | 3,998 | 4,267 | +6.7% |
| 5,000 | 3,513 | 3,617 | +3.0% |
| 10,000 | 3,130 | 3,277 | +4.7% |
| 20,000 | 2,646 | 2,942 | **+11.1%** |
| 50,000 | 1,993 | 2,230 | **+11.9%** |

---

## Zen 4 Symbol-Size Sweep (decode Mbit/s)

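Verifies that gains hold across the full range of symbol sizes (64-8192 bytes). No significant regressions observed; the only negative deltas (-3.5% and -0.6%, both at K=1,000 with 0% overhead) are possibly noise.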

### K=100

| Symbol Size | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 64 | 234 | 274 | **+17.1%** | 230 | 269 | **+17.0%** |
| 128 | 458 | 530 | **+15.7%** | 451 | 523 | **+16.0%** |
| 256 | 882 | 1,004 | **+13.8%** | 869 | 991 | **+14.0%** |
| 512 | 1,643 | 1,815 | **+10.5%** | 1,605 | 1,781 | **+11.0%** |
| 1024 | 2,490 | 3,028 | **+21.6%** | 2,722 | 2,984 | +9.6% |
| 1280 | 3,259 | 3,554 | +9.0% | 3,188 | 3,529 | **+10.7%** |
| 2048 | 4,431 | 4,782 | +7.9% | 4,127 | 4,716 | **+14.3%** |
| 4096 | 6,269 | 6,551 | +4.5% | 6,156 | 6,509 | +5.7% |
| 8192 | 7,075 | 7,174 | +1.4% | 7,837 | 8,085 | +3.2% |

### K=1,000

| Symbol Size | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 64 | 285 | 316 | **+10.9%** | 280 | 383 | **+36.8%** |
| 128 | 549 | 608 | **+10.7%** | 540 | 730 | **+35.2%** |
| 256 | 1,054 | 1,145 | +8.6% | 1,027 | 1,376 | **+34.0%** |
| 512 | 1,895 | 1,995 | +5.3% | 1,837 | 2,454 | **+33.6%** |
| 1024 | 3,111 | 3,001 | -3.5% | 3,037 | 3,906 | **+28.6%** |
| 1280 | 3,576 | 3,748 | +4.8% | 3,502 | 4,554 | **+30.1%** |
| 2048 | 4,638 | 4,702 | +1.4% | 4,267 | 5,837 | **+36.8%** |
| 4096 | 6,061 | 6,211 | +2.5% | 6,211 | 8,197 | **+32.0%** |
| 8192 | 6,369 | 6,329 | -0.6% | 7,407 | 9,524 | **+28.6%** |

### K=10,000

| Symbol Size | master 0% | combined 0% | Delta | master 5% | combined 5% | Delta |
|---|---|---|---|---|---|---|
| 64 | 245 | 270 | **+10.2%** | 208 | 281 | **+35.1%** |
| 128 | 464 | 518 | **+11.6%** | 404 | 551 | **+36.4%** |
| 256 | 871 | 940 | +7.9% | 752 | 1,002 | **+33.2%** |
| 512 | 1,530 | 1,580 | +3.3% | 1,347 | 1,763 | **+30.9%** |
| 1024 | 2,367 | 2,436 | +2.9% | 2,125 | 2,701 | **+27.1%** |
| 1280 | 2,342 | 2,782 | **+18.8%** | 2,399 | 3,005 | **+25.3%** |
| 2048 | 2,948 | 3,189 | +8.2% | 2,921 | 3,422 | **+17.1%** |
| 4096 | 3,244 | 3,485 | +7.4% | 3,472 | 3,662 | +5.5% |
| 8192 | 3,205 | 3,492 | +9.0% | 4,058 | 4,195 | +3.4% |

---

## Summary

| Workload | Zen 4 | Zen 5 | Intel EMR | Axion V2 | Axion N3 |
|---|---|---|---|---|---|
| Codec encode 10KB | **-72.4%** latency | **-75.3%** latency | **-72.4%** latency | **-61.5%** latency | **-56.8%** latency |
| Codec roundtrip 10KB | **-70.7%** latency | **-77.4%** latency | **-69.5%** latency | **-59.7%** latency | **-55.4%** latency |
| Codec repair 10KB | **-46.0%** latency | **-51.8%** latency | **-48.7%** latency | **-39.4%** latency | **-35.2%** latency |
| Decode 0% overhead | +2-11% tp | **+21-48%** tp | **+17-39%** tp | **+12-45%** tp | +8-13% tp |
| Decode 5% overhead | **+9-34%** tp | **+22-42%** tp | **+21-37%** tp | **+8-40%** tp | **+11-60%** tp |
| Encode (no plan vs plan cache) | **+53-71%** tp (K ≤ 10,000) | **+59-302%** tp | **+50-244%** tp | **+71-97%** tp | **+32-71%** tp |
| Encode (pre-built plan) | N/A | **+20-77%** tp | **+9-55%** tp | **+12-31%** tp | +3-12% tp |
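
The throughput ranges in the summary appear to be the minimum and maximum per-K deltas from the corresponding table, rounded to whole percent. A small sketch of that reduction follows; `summarize_range` is an illustrative helper rather than part of the benchmark tooling, and it is meant for sets where every delta is positive.

```rust
/// Collapses per-K deltas (in percent) into a "+min-max%" range string.
/// Returns `None` for an empty slice. Intended for all-positive inputs;
/// a negative minimum would produce an ambiguous string such as "-2-71%".
fn summarize_range(deltas: &[f64]) -> Option<String> {
    if deltas.is_empty() {
        return None;
    }
    let min = deltas.iter().copied().fold(f64::INFINITY, f64::min);
    let max = deltas.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some(format!("{:+.0}-{:.0}%", min, max))
}
```

Feeding in the Zen 5 decode deltas at 0% overhead (20.7 through 47.6) yields `+21-48%`, matching the summary row.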


