perf improvements #207

@virtuallynathan

Had a bit of fun with GPT-5.2 Pro, Codex 5.3 xhigh, Gemini, and Claude (and my own brain) over the past couple of days to eke out performance (for a weekend project idea that didn't end up panning out). No hard feelings if you reject the drive-by PRs!

Allocations, as mentioned in #160, did not seem to make much difference compared to the SIMD work, especially once the other optimizations were in. I may do a PR to address them anyway, since it can be useful.

Recommended merge order if you want any/all of these PRs:

  1. perf: optimize build profiles, upgrade toolchain to 1.93, fix clippy warnings #197
  2. fix: correct assertion in enc() — a1 < p1, not a < w #199
  3. perf: eliminate Vec allocation in enc_indices via callback pattern #198
  4. perf: add #[inline] annotations to hot Octet/Symbol/octets operations #204
  5. perf: replace get_both_indices with unsafe pointer ops in hot loops #201
  6. perf: add global plan cache for SourceBlockEncoder::new() #200 (after this, update avx512-binary: I can do a PR to change symbol.rs is_empty from #[cfg(feature = "benchmarking")] to #[allow(dead_code)])
  7. perf: add AVX-512 dispatch for all SIMD operations #202
  8. bench: add criterion roundtrip benchmark at various symbol counts #203 (after this, I can rebase perf/skip-hdpc-decode onto perf/enc-indices-callback)
  9. perf: skip HDPC rows during decode when overhead is sufficient #206 (after this, I can rebase the slab allocation)
  10. perf: use contiguous SymbolSlab storage in encoder/decoder hot paths #208 (may need some rebasing on #209 to merge)
  11. perf: reuse scratch buffers for matrix row and column queries #209
  12. perf: use native SymbolSlab in no-HDPC decode path #211
  13. perf: stabilize SymbolSlab inlining + reduce encode allocations #213
  14. Update README #212
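
For readers unfamiliar with the callback pattern named in #198, here is a minimal sketch. The index generator below is a hypothetical stand-in (a simple strided walk), not the actual enc_indices logic, and every function name and parameter is an assumption made for illustration only.

```rust
// Hypothetical stand-in for an index generator; NOT the real enc_indices logic.
// Allocating version: builds a Vec on every call, even if the caller only
// needs to walk the indices once. Panics if `modulus` is zero.
fn indices_alloc(start: usize, step: usize, count: usize, modulus: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(count);
    let mut idx = start % modulus;
    for _ in 0..count {
        out.push(idx);
        idx = (idx + step) % modulus;
    }
    out
}

// Callback version: same walk, but each index is handed to `visit`
// as it is produced, so no heap allocation is needed.
fn indices_callback<F: FnMut(usize)>(
    start: usize,
    step: usize,
    count: usize,
    modulus: usize,
    mut visit: F,
) {
    let mut idx = start % modulus;
    for _ in 0..count {
        visit(idx);
        idx = (idx + step) % modulus;
    }
}

// Example consumer: XOR together the bytes selected by the index walk.
fn xor_selected(symbols: &[u8], start: usize, step: usize, count: usize) -> u8 {
    let mut acc = 0u8;
    indices_callback(start, step, count, symbols.len(), |i| acc ^= symbols[i]);
    acc
}
```

Because the closure is a generic parameter, the compiler can typically inline it into the loop, so the callback form costs about the same as a hand-written loop while avoiding the per-call Vec.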

Full combined benchmarks:

Baseline: fork/master (e777861). Combined: perf/all-combined (95b3e3e).

Zen 4 — AMD EPYC 9654P 96-Core (bare metal, 256-bit AVX-512 µops)

codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
| --- | --- | --- | --- |
| Symbol += | 22.46 ns | 17.26 ns | -23.3% |
| Symbol FMA | 23.18 ns | 23.9 ns | +3.1% (noise?) |
| encode 10KB | 37.00 µs | 10.2 µs | -72.4% |
| roundtrip 10KB | 38.56 µs | 11.31 µs | -70.7% |
| roundtrip repair 10KB | 80.59 µs | 43.49 µs | -46.0% |
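
The Delta column in these tables is consistent with the relative change from master to combined. A minimal sketch of that arithmetic follows; the function names are made up for illustration. Recomputing from the rounded values shown can differ from the listed Delta in the last digit, presumably because the listed figures come from unrounded measurements.

```rust
/// Relative change from `master` to `combined`, in percent.
/// Negative is an improvement for latency (ns, µs);
/// positive is an improvement for throughput (Mbit/s).
fn delta_percent(master: f64, combined: f64) -> f64 {
    (combined - master) / master * 100.0
}

/// Round to one decimal place, matching the precision used in the tables.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

For example, `round1(delta_percent(37.00, 10.2))` reproduces the -72.4% shown for encode 10KB on Zen 4.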

decode_benchmark (Mbit/s, symbol_size=1280)

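| K | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 2,573 | 2,745 | +6.7% | 2,528 | 2,753 | +8.9% |
| 100 | 3,218 | 3,529 | +9.7% | 3,139 | 3,493 | +11.3% |
| 250 | 3,479 | 3,667 | +5.4% | 3,343 | 3,846 | +15.0% |
| 500 | 3,658 | 3,738 | +2.2% | 3,593 | 4,703 | +30.9% |
| 1,000 | 3,576 | 3,734 | +4.4% | 3,455 | 4,617 | +33.6% |
| 2,000 | 3,287 | 3,478 | +5.8% | 3,330 | 4,322 | +29.8% |
| 5,000 | 2,839 | 3,150 | +11.0% | 2,889 | 3,800 | +31.5% |
| 10,000 | 2,570 | 2,736 | +6.4% | 2,336 | 3,100 | +32.7% |
| 20,000 | 1,957 | 2,082 | +6.4% | 1,860 | 2,282 | +22.7% |
| 50,000 | 1,377 | 1,455 | +5.7% | 1,175 | 1,451 | +23.5% |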

encode_benchmark (Mbit/s, symbol_size=1280)

Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).

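| K | master (no plan) | combined (plan cache) | Delta |
| --- | --- | --- | --- |
| 10 | 4,785 | 7,529 | +57.3% |
| 100 | 5,985 | 9,936 | +66.0% |
| 1,000 | 5,345 | 9,150 | +71.2% |
| 10,000 | 3,617 | 5,549 | +53.4% |
| 50,000 | 2,146 | 2,096 | -2.3% (noise?) |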
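
The "no plan" versus "plan cache" distinction can be illustrated with a minimal sketch of a process-wide cache keyed by symbol count. This is not the code from #200: `Plan`, `build_plan`, and `cached_plan` are hypothetical names, the placeholder "plan" is just a vector, and the real plan type and cache policy may well differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-in for a precomputed encoding plan.
#[derive(Debug)]
struct Plan {
    symbol_count: u16,
    ops: Vec<u32>,
}

// Placeholder for the expensive, input-independent work that a plan
// captures; it depends only on the symbol count, not on the data.
fn build_plan(symbol_count: u16) -> Plan {
    Plan {
        symbol_count,
        ops: (0..u32::from(symbol_count)).collect(),
    }
}

// Return a shared plan for `symbol_count`, building it at most once
// per process. Later callers with the same count get the same Arc.
fn cached_plan(symbol_count: u16) -> Arc<Plan> {
    static CACHE: OnceLock<Mutex<HashMap<u16, Arc<Plan>>>> = OnceLock::new();
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    guard
        .entry(symbol_count)
        .or_insert_with(|| Arc::new(build_plan(symbol_count)))
        .clone()
}
```

Holding the lock while the plan is built keeps the sketch short but serializes first-time builds across threads; a real cache might also bound its size. With a cache like this, repeated encoder construction for the same K skips the plan build, which is why the "no plan" baseline and the "plan cache" column are not a like-for-like comparison of the encode path itself.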

Zen 5 — AMD EPYC 9B45 (GCP c4d-standard-4, native 512-bit AVX-512)

codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
| --- | --- | --- | --- |
| Symbol mulassign_scalar | 16.93 ns | 14.35 ns | -15.3% |
| Symbol += | 21.22 ns | 16.83 ns | -20.7% |
| Symbol FMA | 20.40 ns | 15.79 ns | -22.6% |
| encode 10KB | 30.82 µs | 7.60 µs | -75.3% |
| roundtrip 10KB | 31.62 µs | 7.16 µs | -77.4% |
| roundtrip repair 10KB | 65.76 µs | 31.69 µs | -51.8% |

decode_benchmark (Mbit/s, symbol_size=1280)

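| K | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 3,180 | 4,571 | +43.7% | 3,170 | 4,112 | +29.7% |
| 100 | 4,077 | 6,020 | +47.6% | 4,045 | 5,654 | +39.8% |
| 250 | 4,506 | 5,500 | +22.0% | 4,448 | 5,412 | +21.7% |
| 500 | 4,703 | 6,003 | +27.6% | 4,703 | 6,542 | +39.1% |
| 1,000 | 4,659 | 5,738 | +23.2% | 4,514 | 6,428 | +42.4% |
| 2,000 | 4,435 | 5,460 | +23.1% | 4,435 | 5,974 | +34.7% |
| 5,000 | 4,086 | 4,932 | +20.7% | 3,938 | 5,279 | +34.1% |
| 10,000 | 3,617 | 4,500 | +24.4% | 3,255 | 4,321 | +32.7% |
| 20,000 | 2,968 | 3,658 | +23.2% | 2,646 | 3,288 | +24.2% |
| 50,000 | 1,981 | 2,584 | +30.4% | 1,669 | 2,190 | +31.2% |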

encode_benchmark (Mbit/s, symbol_size=1280)

Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
| --- | --- | --- | --- |
| 10 | 4,633 | 18,617 | +301.8% |
| 100 | 5,848 | 17,646 | +201.7% |
| 250 | 5,982 | 15,738 | +163.1% |
| 500 | 5,516 | 15,232 | +176.1% |
| 1,000 | 5,317 | 14,719 | +176.8% |
| 2,000 | 5,129 | 13,542 | +164.0% |
| 5,000 | 4,741 | 15,751 | +232.3% |
| 10,000 | 3,938 | 9,669 | +145.5% |
| 20,000 | 3,451 | 7,234 | +109.6% |
| 50,000 | 2,460 | 3,906 | +58.8% |

With explicit pre-built plan on both sides:

| K | master | combined | Delta |
| --- | --- | --- | --- |
| 10 | 8,904 | 13,127 | +47.4% |
| 100 | 13,291 | 17,646 | +32.8% |
| 250 | 12,325 | 15,984 | +29.7% |
| 500 | 12,295 | 21,713 | +76.6% |
| 1,000 | 12,236 | 14,719 | +20.3% |
| 2,000 | 11,674 | 14,106 | +20.8% |
| 5,000 | 10,389 | 12,683 | +22.1% |
| 10,000 | 8,719 | 10,973 | +25.8% |
| 20,000 | 6,829 | 8,878 | +30.0% |
| 50,000 | 4,191 | 6,028 | +43.8% |

Intel Emerald Rapids — Xeon Platinum 8581C (GCP c4-standard-4)

codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
| --- | --- | --- | --- |
| Symbol mulassign_scalar | 26.68 ns | 20.17 ns | -24.4% |
| Symbol += | 23.61 ns | 18.26 ns | -22.7% |
| Symbol FMA | 30.47 ns | 22.99 ns | -24.5% |
| encode 10KB | 44.87 µs | 12.38 µs | -72.4% |
| roundtrip 10KB | 45.74 µs | 13.93 µs | -69.5% |
| roundtrip repair 10KB | 94.72 µs | 48.57 µs | -48.7% |

decode_benchmark (Mbit/s, symbol_size=1280)

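| K | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 2,081 | 2,884 | +38.6% | 2,081 | 2,553 | +22.7% |
| 100 | 2,804 | 3,848 | +37.2% | 2,804 | 3,554 | +26.7% |
| 250 | 2,991 | 3,527 | +17.9% | 2,931 | 3,540 | +20.8% |
| 500 | 3,209 | 3,836 | +19.6% | 3,179 | 4,200 | +32.1% |
| 1,000 | 3,106 | 3,666 | +18.0% | 2,996 | 4,095 | +36.7% |
| 2,000 | 2,877 | 3,408 | +18.5% | 2,894 | 3,762 | +30.0% |
| 5,000 | 2,543 | 3,014 | +18.5% | 2,485 | 3,171 | +27.6% |
| 10,000 | 2,282 | 2,713 | +18.9% | 2,109 | 2,743 | +30.1% |
| 20,000 | 1,949 | 2,320 | +19.0% | 1,792 | 2,175 | +21.4% |
| 50,000 | 1,464 | 1,710 | +16.8% | 1,179 | 1,473 | +24.9% |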

encode_benchmark (Mbit/s, symbol_size=1280)

Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
| --- | --- | --- | --- |
| 10 | 2,917 | 10,038 | +244.1% |
| 100 | 3,776 | 9,655 | +155.7% |
| 250 | 3,540 | 8,973 | +153.5% |
| 500 | 3,698 | 8,576 | +131.9% |
| 1,000 | 3,589 | 7,997 | +122.8% |
| 2,000 | 3,397 | 7,254 | +113.6% |
| 5,000 | 3,120 | 7,288 | +133.6% |
| 10,000 | 2,751 | 5,336 | +94.0% |
| 20,000 | 2,435 | 4,283 | +75.9% |
| 50,000 | 1,695 | 2,550 | +50.4% |

With explicit pre-built plan on both sides:

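| K | master | combined | Delta |
| --- | --- | --- | --- |
| 10 | 5,595 | 7,366 | +31.7% |
| 100 | 8,458 | 9,841 | +16.3% |
| 250 | 7,869 | 8,973 | +14.0% |
| 500 | 7,731 | 12,006 | +55.3% |
| 1,000 | 7,360 | 8,190 | +11.3% |
| 2,000 | 6,862 | 7,468 | +8.8% |
| 5,000 | 5,919 | 6,735 | +13.8% |
| 10,000 | 5,140 | 5,848 | +13.8% |
| 20,000 | 4,672 | 5,279 | +13.0% |
| 50,000 | 3,090 | 4,138 | +33.9% |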

Axion V2 — Google Axion / Neoverse V2 (GCP c4a-standard-4, aarch64)

codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
| --- | --- | --- | --- |
| Symbol mulassign_scalar | 36.42 ns | 31.05 ns | -14.8% |
| Symbol += | 31.53 ns | 28.84 ns | -8.5% |
| Symbol FMA | 37.77 ns | 36.42 ns | -3.6% |
| encode 10KB | 45.00 µs | 17.34 µs | -61.5% |
| roundtrip 10KB | 46.23 µs | 18.64 µs | -59.7% |
| roundtrip repair 10KB | 95.43 µs | 57.81 µs | -39.4% |

decode_benchmark (Mbit/s, symbol_size=1280)

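| K | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 2,156 | 2,438 | +13.1% | 2,124 | 2,426 | +14.2% |
| 100 | 2,827 | 3,312 | +17.2% | 2,796 | 3,270 | +16.9% |
| 250 | 2,931 | 3,279 | +11.9% | 2,956 | 3,577 | +21.0% |
| 500 | 2,958 | 3,303 | +11.6% | 2,949 | 4,132 | +40.1% |
| 1,000 | 2,806 | 3,214 | +14.6% | 2,821 | 3,666 | +30.0% |
| 2,000 | 2,329 | 2,790 | +19.8% | 2,430 | 3,068 | +26.3% |
| 5,000 | 1,832 | 2,353 | +28.4% | 1,949 | 2,298 | +17.9% |
| 10,000 | 1,432 | 1,989 | +38.9% | 1,585 | 1,710 | +7.9% |
| 20,000 | 1,126 | 1,628 | +44.5% | 1,206 | 1,313 | +8.9% |
| 50,000 | 821 | 1,188 | +44.6% | 686 | 955 | +39.1% |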

encode_benchmark (Mbit/s, symbol_size=1280)

Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
| --- | --- | --- | --- |
| 10 | 2,716 | 4,697 | +72.9% |
| 100 | 3,566 | 7,010 | +96.6% |
| 250 | 3,527 | 6,516 | +84.7% |
| 500 | 3,368 | 6,378 | +89.4% |
| 1,000 | 3,087 | 6,082 | +97.0% |
| 2,000 | 2,790 | 5,345 | +91.6% |
| 5,000 | 2,282 | 4,302 | +88.5% |
| 10,000 | 1,915 | 3,685 | +92.5% |
| 20,000 | 1,588 | 2,872 | +80.9% |
| 50,000 | 1,097 | 1,874 | +70.8% |

With explicit pre-built plan on both sides:

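| K | master | combined | Delta |
| --- | --- | --- | --- |
| 10 | 4,096 | 4,654 | +13.6% |
| 100 | 6,240 | 7,058 | +13.1% |
| 250 | 5,779 | 6,516 | +12.7% |
| 500 | 5,701 | 6,378 | +11.9% |
| 1,000 | 5,235 | 6,155 | +17.6% |
| 2,000 | 4,435 | 5,460 | +23.1% |
| 5,000 | 3,191 | 4,173 | +30.8% |
| 10,000 | 2,611 | 3,130 | +19.9% |
| 20,000 | 2,209 | 2,705 | +22.4% |
| 50,000 | 1,738 | 2,194 | +26.3% |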

Axion N3 — Google Axion / Neoverse N3 (GCP n4a-standard-4, aarch64)

codec_benchmark (criterion, symbol_size=512)

| Metric | master | combined | Delta |
| --- | --- | --- | --- |
| Symbol mulassign_scalar | 48.32 ns | 42.89 ns | -11.2% |
| Symbol += | 33.19 ns | 29.80 ns | -10.2% |
| Symbol FMA | 50.78 ns | 49.06 ns | -3.4% |
| encode 10KB | 50.12 µs | 21.66 µs | -56.8% |
| roundtrip 10KB | 52.03 µs | 23.20 µs | -55.4% |
| roundtrip repair 10KB | 107.42 µs | 69.62 µs | -35.2% |

decode_benchmark (Mbit/s, symbol_size=1280)

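| K | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 1,724 | 1,928 | +11.9% | 1,724 | 1,921 | +11.4% |
| 100 | 2,466 | 2,781 | +12.8% | 2,454 | 2,759 | +12.4% |
| 250 | 2,551 | 2,787 | +9.3% | 2,551 | 3,354 | +31.5% |
| 500 | 2,597 | 2,811 | +8.3% | 2,558 | 4,098 | +60.2% |
| 1,000 | 2,539 | 2,745 | +8.1% | 2,483 | 3,891 | +56.7% |
| 2,000 | 2,208 | 2,436 | +10.3% | 2,198 | 3,443 | +56.6% |
| 5,000 | 1,938 | 2,123 | +9.6% | 1,885 | 2,864 | +51.9% |
| 10,000 | 1,741 | 1,900 | +9.1% | 1,604 | 2,365 | +47.5% |
| 20,000 | 1,455 | 1,593 | +9.5% | 1,318 | 1,725 | +30.9% |
| 50,000 | 1,051 | 1,188 | +13.0% | 864 | 1,197 | +38.6% |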

encode_benchmark (Mbit/s, symbol_size=1280)

Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).

| K | master (no plan) | combined (plan cache) | Delta |
| --- | --- | --- | --- |
| 10 | 2,064 | 3,150 | +52.6% |
| 100 | 3,028 | 5,169 | +70.7% |
| 250 | 3,000 | 4,918 | +63.9% |
| 500 | 2,949 | 4,906 | +66.3% |
| 1,000 | 2,910 | 4,724 | +62.3% |
| 2,000 | 2,673 | 4,197 | +57.0% |
| 5,000 | 2,336 | 3,426 | +46.7% |
| 10,000 | 2,043 | 3,255 | +59.3% |
| 20,000 | 1,769 | 2,597 | +46.8% |
| 50,000 | 1,270 | 1,678 | +32.1% |

With explicit pre-built plan on both sides:

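| K | master | combined | Delta |
| --- | --- | --- | --- |
| 10 | 2,892 | 3,141 | +8.6% |
| 100 | 4,760 | 5,169 | +8.6% |
| 250 | 4,506 | 4,894 | +8.6% |
| 500 | 4,496 | 4,883 | +8.6% |
| 1,000 | 4,435 | 4,813 | +8.5% |
| 2,000 | 3,998 | 4,267 | +6.7% |
| 5,000 | 3,513 | 3,617 | +3.0% |
| 10,000 | 3,130 | 3,277 | +4.7% |
| 20,000 | 2,646 | 2,942 | +11.1% |
| 50,000 | 1,993 | 2,230 | +11.9% |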

Zen 4 Symbol-Size Sweep (decode Mbit/s)

Checks that the gains hold across the full range of symbol sizes (64-8192 bytes). No regressions observed apart from two small negative deltas (-3.5% and -0.6%) at K=1,000 with 0% overhead.

K=100

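| Symbol size (bytes) | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | 234 | 274 | +17.1% | 230 | 269 | +17.0% |
| 128 | 458 | 530 | +15.7% | 451 | 523 | +16.0% |
| 256 | 882 | 1,004 | +13.8% | 869 | 991 | +14.0% |
| 512 | 1,643 | 1,815 | +10.5% | 1,605 | 1,781 | +11.0% |
| 1024 | 2,490 | 3,028 | +21.6% | 2,722 | 2,984 | +9.6% |
| 1280 | 3,259 | 3,554 | +9.0% | 3,188 | 3,529 | +10.7% |
| 2048 | 4,431 | 4,782 | +7.9% | 4,127 | 4,716 | +14.3% |
| 4096 | 6,269 | 6,551 | +4.5% | 6,156 | 6,509 | +5.7% |
| 8192 | 7,075 | 7,174 | +1.4% | 7,837 | 8,085 | +3.2% |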

K=1,000

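| Symbol size (bytes) | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | 285 | 316 | +10.9% | 280 | 383 | +36.8% |
| 128 | 549 | 608 | +10.7% | 540 | 730 | +35.2% |
| 256 | 1,054 | 1,145 | +8.6% | 1,027 | 1,376 | +34.0% |
| 512 | 1,895 | 1,995 | +5.3% | 1,837 | 2,454 | +33.6% |
| 1024 | 3,111 | 3,001 | -3.5% | 3,037 | 3,906 | +28.6% |
| 1280 | 3,576 | 3,748 | +4.8% | 3,502 | 4,554 | +30.1% |
| 2048 | 4,638 | 4,702 | +1.4% | 4,267 | 5,837 | +36.8% |
| 4096 | 6,061 | 6,211 | +2.5% | 6,211 | 8,197 | +32.0% |
| 8192 | 6,369 | 6,329 | -0.6% | 7,407 | 9,524 | +28.6% |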

K=10,000

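| Symbol size (bytes) | master (0% overhead) | combined (0% overhead) | Delta | master (5% overhead) | combined (5% overhead) | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | 245 | 270 | +10.2% | 208 | 281 | +35.1% |
| 128 | 464 | 518 | +11.6% | 404 | 551 | +36.4% |
| 256 | 871 | 940 | +7.9% | 752 | 1,002 | +33.2% |
| 512 | 1,530 | 1,580 | +3.3% | 1,347 | 1,763 | +30.9% |
| 1024 | 2,367 | 2,436 | +2.9% | 2,125 | 2,701 | +27.1% |
| 1280 | 2,342 | 2,782 | +18.8% | 2,399 | 3,005 | +25.3% |
| 2048 | 2,948 | 3,189 | +8.2% | 2,921 | 3,422 | +17.1% |
| 4096 | 3,244 | 3,485 | +7.4% | 3,472 | 3,662 | +5.5% |
| 8192 | 3,205 | 3,492 | +9.0% | 4,058 | 4,195 | +3.4% |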

Summary

| Workload | Zen 4 | Zen 5 | Intel EMR | Axion V2 | Axion N3 |
| --- | --- | --- | --- | --- | --- |
| Codec encode 10KB | -72.4% latency | -75.3% latency | -72.4% latency | -61.5% latency | -56.8% latency |
| Codec roundtrip 10KB | -70.7% latency | -77.4% latency | -69.5% latency | -59.7% latency | -55.4% latency |
| Codec repair 10KB | -46.0% latency | -51.8% latency | -48.7% latency | -39.4% latency | -35.2% latency |
| Decode 0% overhead | +2-11% tp | +21-48% tp | +17-39% tp | +12-45% tp | +8-13% tp |
| Decode 5% overhead | +9-34% tp | +22-42% tp | +21-37% tp | +8-40% tp | +11-60% tp |
| Encode (no plan vs plan cache) | +53-71% tp | +59-302% tp | +50-244% tp | +71-97% tp | +32-71% tp |
| Encode (pre-built plan) | N/A | +20-77% tp | +9-55% tp | +12-31% tp | +3-12% tp |
