Had a bit of fun with GPT-5.2 Pro, Codex 5.3 xhigh, Gemini, and Claude (and my own brain) over the past couple of days to eke out performance (for a weekend project idea that didn't end up panning out). No hard feelings if you reject the drive-by PRs!
Allocations, as mentioned in #160, did not seem to make a huge difference compared to the SIMD side of things, especially after the other optimizations. I may do a PR to address them anyway since it can be useful.
Recommended merge order if you want any/all of these PRs:
perf: optimize build profiles, upgrade toolchain to 1.93, fix clippy warnings #197
fix: correct assertion in enc() — a1 < p1, not a < w #199
perf: eliminate Vec allocation in enc_indices via callback pattern #198
perf: add #[inline] annotations to hot Octet/Symbol/octets operations #204
perf: replace get_both_indices with unsafe pointer ops in hot loops #201
perf: add global plan cache for SourceBlockEncoder::new() #200 (after this, update avx512-binary; I can do a PR to change symbol.rs is_empty from #[cfg(feature = "benchmarking")] to #[allow(dead_code)])
perf: add AVX-512 dispatch for all SIMD operations #202
bench: add criterion roundtrip benchmark at various symbol counts #203 (after this, I can rebase perf/skip-hdpc-decode onto perf/enc-indices-callback)
perf: skip HDPC rows during decode when overhead is sufficient #206 (after this, I can rebase the slab allocation)
perf: use contiguous SymbolSlab storage in encoder/decoder hot paths #208 (may need some rebasing on #209 to merge)
perf: reuse scratch buffers for matrix row and column queries #209
perf: use native SymbolSlab in no-HDPC decode path #211
perf: stabilize SymbolSlab inlining + reduce encode allocations #213
Update README #212
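For anyone unfamiliar with the callback pattern named in #198, here is a minimal sketch of the general idea. It is not the PR's code: `indices_vec`, `for_each_index`, `xor_selected`, and the index sequence itself are hypothetical stand-ins, and it assumes a non-empty symbol list. The point is only that the producer hands each index to a closure instead of collecting into a `Vec` that the caller immediately iterates and drops.

```rust
// Hypothetical stand-ins for illustration; not the crate's real signatures.

/// Allocating version: builds and returns a fresh Vec on every call.
fn indices_vec(start: usize, step: usize, count: usize, modulus: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(count);
    let mut i = start % modulus;
    for _ in 0..count {
        out.push(i);
        i = (i + step) % modulus;
    }
    out
}

/// Callback version: same sequence, but each index is passed to `visit`
/// as it is produced, so no heap allocation is needed.
fn for_each_index<F: FnMut(usize)>(
    start: usize,
    step: usize,
    count: usize,
    modulus: usize,
    mut visit: F,
) {
    let mut i = start % modulus;
    for _ in 0..count {
        visit(i);
        i = (i + step) % modulus;
    }
}

/// Example consumer: XOR the selected symbols into an accumulator.
/// Assumes `symbols` is non-empty and all symbols have equal length.
fn xor_selected(symbols: &[Vec<u8>], start: usize, step: usize, count: usize) -> Vec<u8> {
    let mut acc = vec![0u8; symbols[0].len()];
    for_each_index(start, step, count, symbols.len(), |idx| {
        for (a, b) in acc.iter_mut().zip(&symbols[idx]) {
            *a ^= *b;
        }
    });
    acc
}
```

Because the closure is a generic parameter, the compiler can usually inline it into the loop, so the callback form tends to cost no more than a hand-written loop.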
Full combined benchmarks:
Baseline: fork/master (e777861). Combined: perf/all-combined (95b3e3e).
Zen 4 — AMD EPYC 9654P 96-Core (bare metal, 256-bit AVX-512 µops)
codec_benchmark (criterion, symbol_size=512)
| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol += | 22.46 ns | 17.26 ns | -23.3% |
| Symbol FMA | 23.18 ns | 23.9 ns | +3.1% (noise?) |
| encode 10KB | 37.00 µs | 10.2 µs | -72.4% |
| roundtrip 10KB | 38.56 µs | 11.31 µs | -70.7% |
| roundtrip repair 10KB | 80.59 µs | 43.49 µs | -46.0% |
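The Delta columns appear to be the relative change of combined versus master, so negative is better for latency (ns, µs) and positive is better for throughput (Mbit/s). A minimal sketch under that assumption follows; `delta_percent` and `format_delta` are hypothetical helper names, and a result can differ from the tables in the last digit if the published deltas were computed from unrounded measurements.

```rust
/// Relative change of `combined` versus `master`, in percent.
/// Negative means `combined` is smaller (good for latency);
/// positive means `combined` is larger (good for throughput).
fn delta_percent(master: f64, combined: f64) -> f64 {
    (combined - master) / master * 100.0
}

/// Formats the delta with an explicit sign and one decimal place.
fn format_delta(master: f64, combined: f64) -> String {
    format!("{:+.1}%", delta_percent(master, combined))
}
```

For example, `format_delta(37.00, 10.2)` yields `-72.4%` (the Zen 4 encode 10KB row) and `format_delta(4785.0, 7529.0)` yields `+57.3%` (Zen 4 encode throughput at K=10).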
decode_benchmark (Mbit/s, symbol_size=1280)
| K | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 10 | 2,573 | 2,745 | +6.7% | 2,528 | 2,753 | +8.9% |
| 100 | 3,218 | 3,529 | +9.7% | 3,139 | 3,493 | +11.3% |
| 250 | 3,479 | 3,667 | +5.4% | 3,343 | 3,846 | +15.0% |
| 500 | 3,658 | 3,738 | +2.2% | 3,593 | 4,703 | +30.9% |
| 1,000 | 3,576 | 3,734 | +4.4% | 3,455 | 4,617 | +33.6% |
| 2,000 | 3,287 | 3,478 | +5.8% | 3,330 | 4,322 | +29.8% |
| 5,000 | 2,839 | 3,150 | +11.0% | 2,889 | 3,800 | +31.5% |
| 10,000 | 2,570 | 2,736 | +6.4% | 2,336 | 3,100 | +32.7% |
| 20,000 | 1,957 | 2,082 | +6.4% | 1,860 | 2,282 | +22.7% |
| 50,000 | 1,377 | 1,455 | +5.7% | 1,175 | 1,451 | +23.5% |
encode_benchmark (Mbit/s, symbol_size=1280)
Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).
| K | master | combined | Delta |
|---|---|---|---|
| 10 | 4,785 | 7,529 | +57.3% |
| 100 | 5,985 | 9,936 | +66.0% |
| 1,000 | 5,345 | 9,150 | +71.2% |
| 10,000 | 3,617 | 5,549 | +53.4% |
| 50,000 | 2,146 | 2,096 | -2.3% (noise?) |
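To make the "no plan" versus "plan cache" comparison concrete, here is a rough sketch of what a process-wide plan cache can look like. This is not the code from #200: the `Plan` type, `build_plan`, `cached_plan`, and keying the cache by symbol count alone are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for an expensive, reusable encoding plan.
#[derive(Debug)]
struct Plan {
    symbol_count: u32,
    /// Placeholder for the precomputed work a real plan would hold.
    steps: Vec<u32>,
}

/// Hypothetical "expensive" construction that the cache avoids repeating.
fn build_plan(symbol_count: u32) -> Plan {
    Plan {
        symbol_count,
        steps: (0..symbol_count).collect(),
    }
}

/// Process-wide cache, created lazily on first use.
/// Assumption: a plan depends only on the symbol count.
static PLAN_CACHE: OnceLock<Mutex<HashMap<u32, Arc<Plan>>>> = OnceLock::new();

/// Returns the shared plan for `symbol_count`, building it at most once.
fn cached_plan(symbol_count: u32) -> Arc<Plan> {
    let cache = PLAN_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().expect("plan cache mutex poisoned");
    Arc::clone(
        guard
            .entry(symbol_count)
            .or_insert_with(|| Arc::new(build_plan(symbol_count))),
    )
}
```

In this sketch the first call for a given symbol count pays the full construction cost and later calls only clone an `Arc`, which is why a benchmark that repeatedly constructs encoders for the same K benefits so much. Building while holding the lock keeps the code short but serializes concurrent first-time builds; a real implementation might choose differently.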
Zen 5 — AMD EPYC 9B45 (GCP c4d-standard-4, native 512-bit AVX-512)
codec_benchmark (criterion, symbol_size=512)
| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 16.93 ns | 14.35 ns | -15.3% |
| Symbol += | 21.22 ns | 16.83 ns | -20.7% |
| Symbol FMA | 20.40 ns | 15.79 ns | -22.6% |
| encode 10KB | 30.82 µs | 7.60 µs | -75.3% |
| roundtrip 10KB | 31.62 µs | 7.16 µs | -77.4% |
| roundtrip repair 10KB | 65.76 µs | 31.69 µs | -51.8% |
decode_benchmark (Mbit/s, symbol_size=1280)
| K | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 10 | 3,180 | 4,571 | +43.7% | 3,170 | 4,112 | +29.7% |
| 100 | 4,077 | 6,020 | +47.6% | 4,045 | 5,654 | +39.8% |
| 250 | 4,506 | 5,500 | +22.0% | 4,448 | 5,412 | +21.7% |
| 500 | 4,703 | 6,003 | +27.6% | 4,703 | 6,542 | +39.1% |
| 1,000 | 4,659 | 5,738 | +23.2% | 4,514 | 6,428 | +42.4% |
| 2,000 | 4,435 | 5,460 | +23.1% | 4,435 | 5,974 | +34.7% |
| 5,000 | 4,086 | 4,932 | +20.7% | 3,938 | 5,279 | +34.1% |
| 10,000 | 3,617 | 4,500 | +24.4% | 3,255 | 4,321 | +32.7% |
| 20,000 | 2,968 | 3,658 | +23.2% | 2,646 | 3,288 | +24.2% |
| 50,000 | 1,981 | 2,584 | +30.4% | 1,669 | 2,190 | +31.2% |
encode_benchmark (Mbit/s, symbol_size=1280)
Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).
| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 4,633 | 18,617 | +301.8% |
| 100 | 5,848 | 17,646 | +201.7% |
| 250 | 5,982 | 15,738 | +163.1% |
| 500 | 5,516 | 15,232 | +176.1% |
| 1,000 | 5,317 | 14,719 | +176.8% |
| 2,000 | 5,129 | 13,542 | +164.0% |
| 5,000 | 4,741 | 15,751 | +232.3% |
| 10,000 | 3,938 | 9,669 | +145.5% |
| 20,000 | 3,451 | 7,234 | +109.6% |
| 50,000 | 2,460 | 3,906 | +58.8% |
With explicit pre-built plan on both sides:
| K | master | combined | Delta |
|---|---|---|---|
| 10 | 8,904 | 13,127 | +47.4% |
| 100 | 13,291 | 17,646 | +32.8% |
| 250 | 12,325 | 15,984 | +29.7% |
| 500 | 12,295 | 21,713 | +76.6% |
| 1,000 | 12,236 | 14,719 | +20.3% |
| 2,000 | 11,674 | 14,106 | +20.8% |
| 5,000 | 10,389 | 12,683 | +22.1% |
| 10,000 | 8,719 | 10,973 | +25.8% |
| 20,000 | 6,829 | 8,878 | +30.0% |
| 50,000 | 4,191 | 6,028 | +43.8% |
Intel Emerald Rapids — Xeon Platinum 8581C (GCP c4-standard-4)
codec_benchmark (criterion, symbol_size=512)
| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 26.68 ns | 20.17 ns | -24.4% |
| Symbol += | 23.61 ns | 18.26 ns | -22.7% |
| Symbol FMA | 30.47 ns | 22.99 ns | -24.5% |
| encode 10KB | 44.87 µs | 12.38 µs | -72.4% |
| roundtrip 10KB | 45.74 µs | 13.93 µs | -69.5% |
| roundtrip repair 10KB | 94.72 µs | 48.57 µs | -48.7% |
decode_benchmark (Mbit/s, symbol_size=1280)
| K | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 10 | 2,081 | 2,884 | +38.6% | 2,081 | 2,553 | +22.7% |
| 100 | 2,804 | 3,848 | +37.2% | 2,804 | 3,554 | +26.7% |
| 250 | 2,991 | 3,527 | +17.9% | 2,931 | 3,540 | +20.8% |
| 500 | 3,209 | 3,836 | +19.6% | 3,179 | 4,200 | +32.1% |
| 1,000 | 3,106 | 3,666 | +18.0% | 2,996 | 4,095 | +36.7% |
| 2,000 | 2,877 | 3,408 | +18.5% | 2,894 | 3,762 | +30.0% |
| 5,000 | 2,543 | 3,014 | +18.5% | 2,485 | 3,171 | +27.6% |
| 10,000 | 2,282 | 2,713 | +18.9% | 2,109 | 2,743 | +30.1% |
| 20,000 | 1,949 | 2,320 | +19.0% | 1,792 | 2,175 | +21.4% |
| 50,000 | 1,464 | 1,710 | +16.8% | 1,179 | 1,473 | +24.9% |
encode_benchmark (Mbit/s, symbol_size=1280)
Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).
| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 2,917 | 10,038 | +244.1% |
| 100 | 3,776 | 9,655 | +155.7% |
| 250 | 3,540 | 8,973 | +153.5% |
| 500 | 3,698 | 8,576 | +131.9% |
| 1,000 | 3,589 | 7,997 | +122.8% |
| 2,000 | 3,397 | 7,254 | +113.6% |
| 5,000 | 3,120 | 7,288 | +133.6% |
| 10,000 | 2,751 | 5,336 | +94.0% |
| 20,000 | 2,435 | 4,283 | +75.9% |
| 50,000 | 1,695 | 2,550 | +50.4% |
With explicit pre-built plan on both sides:
| K | master | combined | Delta |
|---|---|---|---|
| 10 | 5,595 | 7,366 | +31.7% |
| 100 | 8,458 | 9,841 | +16.3% |
| 250 | 7,869 | 8,973 | +14.0% |
| 500 | 7,731 | 12,006 | +55.3% |
| 1,000 | 7,360 | 8,190 | +11.3% |
| 2,000 | 6,862 | 7,468 | +8.8% |
| 5,000 | 5,919 | 6,735 | +13.8% |
| 10,000 | 5,140 | 5,848 | +13.8% |
| 20,000 | 4,672 | 5,279 | +13.0% |
| 50,000 | 3,090 | 4,138 | +33.9% |
Axion V2 — Google Axion / Neoverse V2 (GCP c4a-standard-4, aarch64)
codec_benchmark (criterion, symbol_size=512)
| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 36.42 ns | 31.05 ns | -14.8% |
| Symbol += | 31.53 ns | 28.84 ns | -8.5% |
| Symbol FMA | 37.77 ns | 36.42 ns | -3.6% |
| encode 10KB | 45.00 µs | 17.34 µs | -61.5% |
| roundtrip 10KB | 46.23 µs | 18.64 µs | -59.7% |
| roundtrip repair 10KB | 95.43 µs | 57.81 µs | -39.4% |
decode_benchmark (Mbit/s, symbol_size=1280)
| K | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 10 | 2,156 | 2,438 | +13.1% | 2,124 | 2,426 | +14.2% |
| 100 | 2,827 | 3,312 | +17.2% | 2,796 | 3,270 | +16.9% |
| 250 | 2,931 | 3,279 | +11.9% | 2,956 | 3,577 | +21.0% |
| 500 | 2,958 | 3,303 | +11.6% | 2,949 | 4,132 | +40.1% |
| 1,000 | 2,806 | 3,214 | +14.6% | 2,821 | 3,666 | +30.0% |
| 2,000 | 2,329 | 2,790 | +19.8% | 2,430 | 3,068 | +26.3% |
| 5,000 | 1,832 | 2,353 | +28.4% | 1,949 | 2,298 | +17.9% |
| 10,000 | 1,432 | 1,989 | +38.9% | 1,585 | 1,710 | +7.9% |
| 20,000 | 1,126 | 1,628 | +44.5% | 1,206 | 1,313 | +8.9% |
| 50,000 | 821 | 1,188 | +44.6% | 686 | 955 | +39.1% |
encode_benchmark (Mbit/s, symbol_size=1280)
Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).
| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 2,716 | 4,697 | +72.9% |
| 100 | 3,566 | 7,010 | +96.6% |
| 250 | 3,527 | 6,516 | +84.7% |
| 500 | 3,368 | 6,378 | +89.4% |
| 1,000 | 3,087 | 6,082 | +97.0% |
| 2,000 | 2,790 | 5,345 | +91.6% |
| 5,000 | 2,282 | 4,302 | +88.5% |
| 10,000 | 1,915 | 3,685 | +92.5% |
| 20,000 | 1,588 | 2,872 | +80.9% |
| 50,000 | 1,097 | 1,874 | +70.8% |
With explicit pre-built plan on both sides:
| K | master | combined | Delta |
|---|---|---|---|
| 10 | 4,096 | 4,654 | +13.6% |
| 100 | 6,240 | 7,058 | +13.1% |
| 250 | 5,779 | 6,516 | +12.7% |
| 500 | 5,701 | 6,378 | +11.9% |
| 1,000 | 5,235 | 6,155 | +17.6% |
| 2,000 | 4,435 | 5,460 | +23.1% |
| 5,000 | 3,191 | 4,173 | +30.8% |
| 10,000 | 2,611 | 3,130 | +19.9% |
| 20,000 | 2,209 | 2,705 | +22.4% |
| 50,000 | 1,738 | 2,194 | +26.3% |
Axion N3 — Google Axion / Neoverse N3 (GCP n4a-standard-4, aarch64)
codec_benchmark (criterion, symbol_size=512)
| Metric | master | combined | Delta |
|---|---|---|---|
| Symbol mulassign_scalar | 48.32 ns | 42.89 ns | -11.2% |
| Symbol += | 33.19 ns | 29.80 ns | -10.2% |
| Symbol FMA | 50.78 ns | 49.06 ns | -3.4% |
| encode 10KB | 50.12 µs | 21.66 µs | -56.8% |
| roundtrip 10KB | 52.03 µs | 23.20 µs | -55.4% |
| roundtrip repair 10KB | 107.42 µs | 69.62 µs | -35.2% |
decode_benchmark (Mbit/s, symbol_size=1280)
| K | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 10 | 1,724 | 1,928 | +11.9% | 1,724 | 1,921 | +11.4% |
| 100 | 2,466 | 2,781 | +12.8% | 2,454 | 2,759 | +12.4% |
| 250 | 2,551 | 2,787 | +9.3% | 2,551 | 3,354 | +31.5% |
| 500 | 2,597 | 2,811 | +8.3% | 2,558 | 4,098 | +60.2% |
| 1,000 | 2,539 | 2,745 | +8.1% | 2,483 | 3,891 | +56.7% |
| 2,000 | 2,208 | 2,436 | +10.3% | 2,198 | 3,443 | +56.6% |
| 5,000 | 1,938 | 2,123 | +9.6% | 1,885 | 2,864 | +51.9% |
| 10,000 | 1,741 | 1,900 | +9.1% | 1,604 | 2,365 | +47.5% |
| 20,000 | 1,455 | 1,593 | +9.5% | 1,318 | 1,725 | +30.9% |
| 50,000 | 1,051 | 1,188 | +13.0% | 864 | 1,197 | +38.6% |
encode_benchmark (Mbit/s, symbol_size=1280)
Master uses SourceBlockEncoder::new() (no plan). Combined uses global plan cache (PR #200).
| K | master (no plan) | combined (plan cache) | Delta |
|---|---|---|---|
| 10 | 2,064 | 3,150 | +52.6% |
| 100 | 3,028 | 5,169 | +70.7% |
| 250 | 3,000 | 4,918 | +63.9% |
| 500 | 2,949 | 4,906 | +66.3% |
| 1,000 | 2,910 | 4,724 | +62.3% |
| 2,000 | 2,673 | 4,197 | +57.0% |
| 5,000 | 2,336 | 3,426 | +46.7% |
| 10,000 | 2,043 | 3,255 | +59.3% |
| 20,000 | 1,769 | 2,597 | +46.8% |
| 50,000 | 1,270 | 1,678 | +32.1% |
With explicit pre-built plan on both sides:
| K | master | combined | Delta |
|---|---|---|---|
| 10 | 2,892 | 3,141 | +8.6% |
| 100 | 4,760 | 5,169 | +8.6% |
| 250 | 4,506 | 4,894 | +8.6% |
| 500 | 4,496 | 4,883 | +8.6% |
| 1,000 | 4,435 | 4,813 | +8.5% |
| 2,000 | 3,998 | 4,267 | +6.7% |
| 5,000 | 3,513 | 3,617 | +3.0% |
| 10,000 | 3,130 | 3,277 | +4.7% |
| 20,000 | 2,646 | 2,942 | +11.1% |
| 50,000 | 1,993 | 2,230 | +11.9% |
Zen 4 Symbol-Size Sweep (decode Mbit/s)
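Verifies gains hold across the full range of symbol sizes (64-8192 bytes). No regressions observed beyond likely noise: the only negative deltas are -3.5% and -0.6%, both at K=1,000 with 0% overhead.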
K=100
| Symbol size (bytes) | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 64 | 234 | 274 | +17.1% | 230 | 269 | +17.0% |
| 128 | 458 | 530 | +15.7% | 451 | 523 | +16.0% |
| 256 | 882 | 1,004 | +13.8% | 869 | 991 | +14.0% |
| 512 | 1,643 | 1,815 | +10.5% | 1,605 | 1,781 | +11.0% |
| 1024 | 2,490 | 3,028 | +21.6% | 2,722 | 2,984 | +9.6% |
| 1280 | 3,259 | 3,554 | +9.0% | 3,188 | 3,529 | +10.7% |
| 2048 | 4,431 | 4,782 | +7.9% | 4,127 | 4,716 | +14.3% |
| 4096 | 6,269 | 6,551 | +4.5% | 6,156 | 6,509 | +5.7% |
| 8192 | 7,075 | 7,174 | +1.4% | 7,837 | 8,085 | +3.2% |
K=1,000
| Symbol size (bytes) | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 64 | 285 | 316 | +10.9% | 280 | 383 | +36.8% |
| 128 | 549 | 608 | +10.7% | 540 | 730 | +35.2% |
| 256 | 1,054 | 1,145 | +8.6% | 1,027 | 1,376 | +34.0% |
| 512 | 1,895 | 1,995 | +5.3% | 1,837 | 2,454 | +33.6% |
| 1024 | 3,111 | 3,001 | -3.5% | 3,037 | 3,906 | +28.6% |
| 1280 | 3,576 | 3,748 | +4.8% | 3,502 | 4,554 | +30.1% |
| 2048 | 4,638 | 4,702 | +1.4% | 4,267 | 5,837 | +36.8% |
| 4096 | 6,061 | 6,211 | +2.5% | 6,211 | 8,197 | +32.0% |
| 8192 | 6,369 | 6,329 | -0.6% | 7,407 | 9,524 | +28.6% |
K=10,000
| Symbol size (bytes) | master 0% | combined 0% | Delta 0% | master 5% | combined 5% | Delta 5% |
|---|---|---|---|---|---|---|
| 64 | 245 | 270 | +10.2% | 208 | 281 | +35.1% |
| 128 | 464 | 518 | +11.6% | 404 | 551 | +36.4% |
| 256 | 871 | 940 | +7.9% | 752 | 1,002 | +33.2% |
| 512 | 1,530 | 1,580 | +3.3% | 1,347 | 1,763 | +30.9% |
| 1024 | 2,367 | 2,436 | +2.9% | 2,125 | 2,701 | +27.1% |
| 1280 | 2,342 | 2,782 | +18.8% | 2,399 | 3,005 | +25.3% |
| 2048 | 2,948 | 3,189 | +8.2% | 2,921 | 3,422 | +17.1% |
| 4096 | 3,244 | 3,485 | +7.4% | 3,472 | 3,662 | +5.5% |
| 8192 | 3,205 | 3,492 | +9.0% | 4,058 | 4,195 | +3.4% |
Summary
| Workload | Zen 4 | Zen 5 | Intel EMR | Axion V2 | Axion N3 |
|---|---|---|---|---|---|
| Codec encode 10KB | -72.4% latency | -75.3% latency | -72.4% latency | -61.5% latency | -56.8% latency |
| Codec roundtrip 10KB | -70.7% latency | -77.4% latency | -69.5% latency | -59.7% latency | -55.4% latency |
| Codec repair 10KB | -46.0% latency | -51.8% latency | -48.7% latency | -39.4% latency | -35.2% latency |
| Decode 0% overhead | +2-11% throughput | +21-48% throughput | +17-39% throughput | +12-45% throughput | +8-13% throughput |
| Decode 5% overhead | +9-34% throughput | +22-42% throughput | +21-37% throughput | +8-40% throughput | +11-60% throughput |
| Encode (no plan vs plan cache) | +53-71% throughput | +59-302% throughput | +50-244% throughput | +71-97% throughput | +32-71% throughput |
| Encode (pre-built plan) | N/A | +20-77% throughput | +9-55% throughput | +12-31% throughput | +3-12% throughput |