arm64: major decodeBlock performance rework #243

Merged: pierrec merged 7 commits into pierrec:v4 from honeycombio:lizf.arm64-shortcut on May 31, 2026

@lizthegrey lizthegrey commented Apr 21, 2026

Contributor

Summary

Six incremental arm64 decoder optimizations. Commits 1–5 are all scalar (no SVE, no NEON vector regs); commit 6 delegates the long-match bulk-copy case to runtime.memmove to recover the last gap vs. the pure-Go decoder.

Commit 1 — Copy shortcut (41139b4)

Ports the "copy shortcut" optimization from decode_amd64.s. When the literal length is 0..14, there are at least 32 bytes of dst and 16 bytes of src remaining, the match offset is at least 8, the match is within the current block, and the match length is not extended, copy 16 literal + 18 match bytes as straight-line code with no bounds checks or extended-length reads. The match copy is sequenced 8+8+2 so offset == 8 works correctly. Adds one register (R20, dstend32 = dstend - 32).
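The 8+8+2 sequencing can be modelled in plain Go. This is only a sketch of the match half of the shortcut: `shortcutMatchCopy` and `byteCopy` are hypothetical names, the guard conditions are assumed to have passed already, and the real implementation is arm64 assembly.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shortcutMatchCopy models the 18-byte match copy as three sequential
// load/store pairs (8+8+2 bytes). It assumes offset >= 8 and at least
// 18 bytes of room in dst after pos. Hypothetical helper, not the asm.
func shortcutMatchCopy(dst []byte, pos, offset int) {
	m := pos - offset
	// Each load runs after the previous store, so with offset == 8 the
	// second load re-reads the bytes the first store just wrote.
	binary.LittleEndian.PutUint64(dst[pos:], binary.LittleEndian.Uint64(dst[m:]))
	binary.LittleEndian.PutUint64(dst[pos+8:], binary.LittleEndian.Uint64(dst[m+8:]))
	binary.LittleEndian.PutUint16(dst[pos+16:], binary.LittleEndian.Uint16(dst[m+16:]))
}

// byteCopy is the reference LZ4 match semantics: one byte at a time.
func byteCopy(dst []byte, pos, offset, n int) {
	for i := 0; i < n; i++ {
		dst[pos+i] = dst[pos+i-offset]
	}
}

func main() {
	fast := make([]byte, 32)
	copy(fast, "abcdefgh")
	ref := append([]byte(nil), fast...)

	shortcutMatchCopy(fast, 8, 8)
	byteCopy(ref, 8, 8, 18)
	fmt.Println(string(fast[:26]), string(fast) == string(ref))
}
```

The model always writes 18 bytes, which is why the shortcut needs headroom in dst even when the actual match is shorter.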

Commit 2 — Splat fast path for offset 1 and 2 (998937e)

Matches with offset 1 (byte RLE, e.g. zero-padding) or offset 2 (halfword RLE) previously went through copyMatchLoop1's byte-at-a-time loop, dominated by store-to-load forwarding latency. When len >= 8 and offset is 1 or 2, splat the 1- or 2-byte pattern across a 64-bit register and store it 8 bytes at a time, falling through to the byte loop for the tail. Adds bench_rle_test.go with synthetic offset=1..4 RLE benchmarks (RLE1 and RLE2 land in this commit's wheelhouse; RLE3/RLE4 are intentional regression-guards against the pure-Go fallback's exponential-doubling loop for the small-offset path this PR does not modify in asm).
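A minimal Go sketch of the splat idea follows. It builds the 64-bit pattern with a multiplication, which is an assumption made for brevity here rather than the instruction sequence the assembly uses, and `splat` / `splatCopy` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splat repeats a 1- or 2-byte pattern across a 64-bit word.
func splat(pattern []byte) uint64 {
	switch len(pattern) {
	case 1:
		return uint64(pattern[0]) * 0x0101010101010101
	case 2:
		return uint64(binary.LittleEndian.Uint16(pattern)) * 0x0001000100010001
	}
	panic("splat: pattern must be 1 or 2 bytes")
}

// splatCopy expands an offset-1 or offset-2 match of n bytes at dst[pos:].
// It stores 8 bytes at a time and finishes the tail with the byte loop.
func splatCopy(dst []byte, pos, offset, n int) {
	word := splat(dst[pos-offset : pos])
	i := 0
	for ; n-i >= 8; i += 8 {
		binary.LittleEndian.PutUint64(dst[pos+i:], word)
	}
	for ; i < n; i++ {
		// The source byte was either part of the original pattern or
		// was just written by a splat store, so plain cycling is correct.
		dst[pos+i] = dst[pos+i-offset]
	}
}

func main() {
	dst := make([]byte, 32)
	copy(dst, "xy")
	splatCopy(dst, 2, 2, 21)
	fmt.Printf("%q\n", dst[:23])
}
```

Because 8 is a multiple of both 1 and 2, every 8-byte store starts at the same phase of the pattern, so the word never needs rotating between iterations.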

Commit 3 — Widen splat store from 8 to 16 bytes (4ce77f3)

Every arm64 core targeted by this library (Cortex-A72, Neoverse N1/V1/V2, Apple M-series) can retire an STP of two X-registers as a single 16-byte store. Widening the splat loop halves the iteration count and per-iteration loop overhead.

Commit 4 — LDP/STP fast path in shortcut match copy when offset >= 18 (f6dea62)

The 8+8+2 sequencing from commit 1 is only required when offset < 18. A histogram of ~2M matches from real-world compressed columnar data shows ~95% of matches have offset >= 18, so the LDP path is the overwhelmingly common case: four memory ops with no serial dependency chain vs. six in the 8+8+2 fallback. For offset 8..17 the sequencing is preserved.
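The offset threshold can be checked with a small Go model. It is a sketch under the assumption that the LDP path behaves like "read all 18 bytes, then write all 18 bytes"; `loadThenStore18` and `agrees` are hypothetical helpers.

```go
package main

import "fmt"

// loadThenStore18 models the LDP-based path: every source byte is read
// before any destination byte is written.
func loadThenStore18(dst []byte, pos, offset int) {
	var tmp [18]byte
	copy(tmp[:], dst[pos-offset:pos-offset+18])
	copy(dst[pos:pos+18], tmp[:])
}

// agrees reports whether the load-then-store model produces the same
// output as byte-at-a-time LZ4 match expansion for the given offset.
func agrees(offset int) bool {
	const pos = 64
	a := make([]byte, pos+32)
	for j := 0; j < pos; j++ {
		a[j] = byte(j + 1) // non-zero history so stale reads are visible
	}
	b := append([]byte(nil), a...)

	loadThenStore18(a, pos, offset)
	for i := 0; i < 18; i++ {
		b[pos+i] = b[pos+i-offset]
	}
	return string(a) == string(b)
}

func main() {
	for off := 8; off <= 20; off++ {
		fmt.Printf("offset %2d: load-then-store matches byte copy: %v\n", off, agrees(off))
	}
}
```

For offsets below 18 the model reads destination bytes that have not been written yet, which is the case the sequenced 8+8+2 fallback exists to handle.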

Commit 5 — Widen copyMatchLoop8 to 16 bytes/iter for offset >= 32 (0894170)

copyMatchLoop8 has carried a comment claiming "a 16-at-a-time loop doesn't provide a further speedup". On columnar data that's no longer accurate: long matches (length >= 19, ~9% of matches by count) emit ~75% of total match-copy bytes, and ~81% of those long matches have offset >= 32. When len >= 16 and offset >= 32, copy 16 bytes/iter with LDP/STP instead of 8 with MOVD.

internal/lz4block/match_copy_test.go adds hand-assembled (offset, matchlen) matrix tests.

Note for future maintainers: CCMP's immediate is 5-bit unsigned (0..31); Go's assembler silently truncates larger values (e.g. $32 -> $0). The entry uses CCMP offset, $31 + BLS to stay in range.
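The arithmetic behind that note, as a tiny Go illustration. `ccmpImm5` is a hypothetical helper that models keeping only the low five bits of the immediate; it does not call into the assembler.

```go
package main

import "fmt"

// ccmpImm5 models a 5-bit unsigned immediate field: only the low five
// bits of the requested value survive.
func ccmpImm5(v uint) uint { return v & 0x1f }

// wideLoopOK is the intended entry condition, written so that the
// constant stays within 0..31: offset > 31 is the same as offset >= 32.
func wideLoopOK(offset uint) bool { return offset > 31 }

func main() {
	fmt.Println(ccmpImm5(31), ccmpImm5(32)) // 31 0
	fmt.Println(wideLoopOK(31), wideLoopOK(32))
}
```

Comparing against 31 and branching on "lower or same" to the slow path expresses the same predicate as comparing against 32, without needing an out-of-range immediate.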

Commit 6 — call runtime·memmove for long non-overlapping match copies (91e8b6f)

For match copies where offset >= len and len >= 256, jump out of the inline 16 B/iter LDP/STP loop and into runtime.memmove. Go's arm64 memmove is hand-tuned NEON (128-bit reads/writes, prefetch, software-pipelined 64 B/iter) and outruns any scalar loop that fits inline once the call-setup cost is amortized. Requires allocating a 48-byte stack frame (previously NOFRAME); NOSPLIT stays since memmove's stack use is well under the nosplit margin.

Threshold of 256 was chosen to avoid regressing Cortex-A72 (the narrow-core floor): at threshold 64 and 128, A72 regressed 5–7% on long-match columnar workloads because its in-order-ish frontend can't hide the call/spill overhead. At 256 A72 is flat. The three modern Graviton cores and Apple M-series all show net wins at 256.

Below 256 bytes and for overlapping copies (small-offset RLE), the inline scalar paths stay unchanged — memmove can't do the RLE cycling pattern, and the splat paths for offset 1/2 already cover their bandwidth ceiling.
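In Go terms the dispatch looks roughly like the sketch below. `copyMatch` and `memmoveThreshold` are hypothetical names, and the built-in `copy` stands in for the direct call to runtime.memmove that the assembly makes.

```go
package main

import "fmt"

const memmoveThreshold = 256 // value taken from the description above

// copyMatch expands a match of n bytes at dst[pos:] and reports whether
// the bulk path was taken.
func copyMatch(dst []byte, pos, offset, n int) (bulk bool) {
	if offset >= n && n >= memmoveThreshold {
		// Source and destination do not overlap, so a plain block copy
		// gives the same bytes as LZ4's byte-at-a-time semantics.
		copy(dst[pos:pos+n], dst[pos-offset:pos-offset+n])
		return true
	}
	// Overlapping or short: the copy must observe its own output, which
	// a memmove-style copy does not do.
	for i := 0; i < n; i++ {
		dst[pos+i] = dst[pos+i-offset]
	}
	return false
}

func main() {
	dst := make([]byte, 1024)
	for i := 0; i < 400; i++ {
		dst[i] = byte(i)
	}
	fmt.Println("long, non-overlapping:", copyMatch(dst, 400, 400, 300))
	fmt.Println("long, offset 4 (RLE): ", copyMatch(dst, 700, 4, 300))
}
```

The `offset >= n` test is what keeps the bulk path safe: with a smaller offset the match reads bytes that the same copy is still producing.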

Benchmarks

Throughput vs v4.1.26 baseline, 8 runs × 3 s × benchstat-analyzed, on five arm64 cores. G1 = Cortex-A72, G2 = Neoverse N1 (Graviton2), G3 = Neoverse V1 (Graviton3), G4 = Neoverse V2 (Graviton4), M4 = Apple M4:

| Benchmark | G1 | G2 | G3 | G4 | M4 |
|---|---|---|---|---|---|
| UncompressPg1661 | +124% | +142% | +162% | +156% | +129% |
| UncompressTwain | +120% | +134% | +149% | +144% | +82% |
| UncompressColumnarMed | +17% | +107% | +12% | +54% | +52% |
| UncompressColumnarLong | flat | +79% | +21% | +49% | +50% |
| UncompressColumnarShort | +8% | +50% | +21% | +19% | +24% |
| UncompressDigits | +6% | +5% | +8% | +5% | +7% |
| UncompressRLE1 | 36× | 36× | 75× | 72× | 15× |
| UncompressRLE2 | 11× | 15× | 39× | 36× | 57× |
| UncompressRLE3 | flat | +13% | flat | flat | flat |
| UncompressRLE4 | flat | +15% | −16% | flat | +1% |
| UncompressRand | flat | flat | flat | flat | flat |
| geomean | +104% | +150% | +154% | +166% | +132% |

On the G3 UncompressRLE4 −16% regression, and why this PR does not add PCALIGN

An early iteration of this series added a PCALIGN $64 before copyMatchTry4: on the theory that the Arm Neoverse V1 Software Optimization Guide's general advice — "align hot loop entries" / "avoid inner-loop branches crossing 32/64-byte boundaries" — would explain the RLE4 regression. It does not, and the alignment hint has been removed. Details:

  • No frontend stall to fix. perf stat -e cycles,instructions,L1-icache-load-misses,stalled-cycles-frontend on BenchmarkUncompressRLE4 with the regressing binary reports <0.2% frontend-idle cycles and <200k L1-icache misses per 3 s bench. V1's frontend is not bottlenecked on this workload — the whole premise PCALIGN addresses doesn't apply.
  • The hotspot is an STLF-bound store, not a branch/fetch. Line-level go tool pprof -list=decodeBlock puts 99%+ of RLE4 time in four instructions of asm (copyMatchByteLoop's load / 1-byte store / decrement / branch), with 67% of total cycles in the single-byte store alone. This is a classic store-to-load-forwarding dependency chain. Alignment doesn't move stores earlier in the pipeline.
  • Those four instructions are byte-for-byte identical to v4.1.26. No algorithmic change touches offset < 8 long RLE. What's different is that ~200 lines of new asm now precede copyMatchByteLoop:, so its PC has shifted. V1 appears to have a PC-hashed store-buffer-bank selection; at the new PC the RLE4 store falls on a worse bank than at v4.1.26's PC. N1/V2/A72/M4 do not show the same sensitivity. This is an LSU hashing artifact, not a frontend layout artifact.
  • Empirical PCALIGN tuning is darts against a wall. Each of $16, $32, $64 does move RLE4's numbers on V1 (because each shifts copyMatchByteLoop:'s PC by a different amount), but it also shifts every other downstream hot loop. PCALIGN $32 recovered RLE4 by +15% but introduced −2.4% on ColumnarLong and −2% on Digits. Each value wins on one benchmark and loses on another, with no theoretical basis for predicting which.
  • The V1 Optimization Guide's advice assumes a frontend-bound workload. Following it when perf counters show no frontend stalls is applying a remedy to a diagnosis we cannot confirm. Shipping a fragile hardware-generation-specific alignment hint that contradicts measurement, for a synthetic benchmark (offset=4 RLE is ~0% of real columnar-data matches), fails a sniff test — the payoff is small and the downside is that any future change to code placement re-rolls the dice for all cores. So: no PCALIGN.

The same layout-sensitivity caveat applies in the opposite direction to the single-digit-% gains in the matrix above: when this PR is linked into a consumer binary with different surrounding code, the small-% deltas may shift by a few percent either way on V1 specifically. The big-% gains (anything above ~20%) come from genuine algorithmic bandwidth improvements and are robust to layout.

A future commit adding an offset-4 splat path (the same trick as offset 1/2) would turn the G3 RLE4 line into a big-× win on every core and close this issue algorithmically rather than through alignment hints, but it adds a branch-table entry that trades icache pressure against what is ~0% of real columnar-data matches — deferred as a follow-up.

Test plan

  • go test ./... passes on every ref at every core above.
  • GOOS=linux GOARCH=arm64 go build ./... clean.
  • Full TestCompressUncompressBlock round-trips at every commit on every core.
  • New match_copy_test.go (offset, matchlen) matrix passes at every commit.
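To illustrate what an (offset, matchlen) matrix over hand-assembled blocks can look like, here is a self-contained Go sketch. It is not the contents of match_copy_test.go: `buildBlock`, `refDecode` and `runMatrix` are hypothetical, and `refDecode` is a naive reference decoder standing in for the decoder under test.

```go
package main

import "fmt"

// appendLen emits LZ4 extended-length bytes: runs of 255, then a final byte.
func appendLen(b []byte, n int) []byte {
	for ; n >= 255; n -= 255 {
		b = append(b, 255)
	}
	return append(b, byte(n))
}

// nibble clamps a length to the 4-bit token field.
func nibble(n int) int {
	if n > 15 {
		return 15
	}
	return n
}

// buildBlock hand-assembles one sequence (lits, then a match of matchLen
// bytes at offset) plus a final literal-only sequence (tail, under 15 bytes).
func buildBlock(lits []byte, offset, matchLen int, tail []byte) []byte {
	ml := matchLen - 4 // the format stores match length minus 4
	b := []byte{byte(nibble(len(lits))<<4 | nibble(ml))}
	if len(lits) >= 15 {
		b = appendLen(b, len(lits)-15)
	}
	b = append(b, lits...)
	b = append(b, byte(offset), byte(offset>>8)) // little-endian offset
	if ml >= 15 {
		b = appendLen(b, ml-15)
	}
	b = append(b, byte(len(tail)<<4))
	return append(b, tail...)
}

// refDecode is a naive, unchecked reference decoder.
func refDecode(src []byte) []byte {
	var dst []byte
	readLen := func(i, n int) (int, int) {
		if n < 15 {
			return i, n
		}
		for {
			c := int(src[i])
			i++
			n += c
			if c != 255 {
				return i, n
			}
		}
	}
	for i := 0; i < len(src); {
		tok := int(src[i])
		i++
		var ll, ml int
		i, ll = readLen(i, tok>>4)
		dst = append(dst, src[i:i+ll]...)
		i += ll
		if i == len(src) {
			break // the final sequence carries no match
		}
		off := int(src[i]) | int(src[i+1])<<8
		i += 2
		i, ml = readLen(i, tok&15)
		for k := 0; k < ml+4; k++ {
			dst = append(dst, dst[len(dst)-off])
		}
	}
	return dst
}

// runMatrix returns the (offset, matchlen) pairs whose decoded output
// differs from a byte-at-a-time expansion.
func runMatrix() []string {
	lits := make([]byte, 64)
	for i := range lits {
		lits[i] = byte(i + 1)
	}
	tail := []byte("tail-literal")
	var bad []string
	for _, off := range []int{1, 2, 3, 4, 7, 8, 9, 17, 18, 31, 32, 33, 64} {
		for _, ml := range []int{4, 8, 18, 19, 33, 255, 256, 300} {
			want := append([]byte(nil), lits...)
			for k := 0; k < ml; k++ {
				want = append(want, want[len(want)-off])
			}
			want = append(want, tail...)
			if got := refDecode(buildBlock(lits, off, ml, tail)); string(got) != string(want) {
				bad = append(bad, fmt.Sprintf("offset=%d matchlen=%d", off, ml))
			}
		}
	}
	return bad
}

func main() {
	fmt.Println("failing (offset, matchlen) pairs:", runMatrix())
}
```

Swapping `refDecode` for the real decoder turns a failure into a named (offset, matchlen) pair instead of a byte position inside a large corpus. The offsets and lengths chosen here straddle the thresholds mentioned in this PR (8, 18, 32 and 256).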

Motivation

Observed LZ4 decodeBlock consuming roughly a third of CPU in a service decoding compressed columnar data on arm64 Lambdas. The amd64 path already had the copy shortcut; the arm64 port never got it. The splat path, both widenings, the LDP offset-split, and the memmove delegation are all either new vs. amd64 or algorithmically more aggressive, and target store-to-load-forwarding stalls and under-utilized store pipes that hurt narrow ARM cores more than wider x86 cores, plus the call-worthy runtime.memmove NEON path for bulk copies.

Ports the "copy shortcut" optimization from decode_amd64.s to
decode_arm64.s: when literal length is 0..14, there is at least
32 bytes of dst and 16 bytes of src remaining, the match offset is
at least 8, the match is within the current block (not dict), and
the match length is not extended, copy 16 literal + 18 match bytes
as straight-line code with no bounds checks or extended-length
reads.

The shortcut bails to the existing slow paths on any guard failure:
readLitlenDone for dst/src bounds, readMatchlen for extended
matchlen / small offset / dict reference.

Match copy is sequenced as 8+8+2 (not LDP+STP) so that offset == 8
(common 8-byte RLE) works correctly: each load observes the prior
store's effect.

Benchmarks on Apple M4 (darwin/arm64):

  Pg1661  910 MB/s -> 2135 MB/s (+134%)
  Twain  1230 MB/s -> 2234 MB/s  (+81%)
  Digits 4560 MB/s -> 4920 MB/s   (+8%)
  Rand   9260 MB/s -> 9270 MB/s   (flat, all-literal path unchanged)

Adds one register (R20, dstend32 = dstend - 32) used only by the
shortcut guard.

Small-offset matches with offset 1 (byte RLE) or offset 2 (halfword
RLE) previously went through a byte-at-a-time loop that on modern
arm64 cores is bottlenecked by store-to-load forwarding latency.
When len >= 8 and offset is 1 or 2, splat the 1- or 2-byte pattern
across a 64-bit register and emit MOVD stores 8 bytes at a time,
falling through to the byte loop for the 1..7 byte tail. Offsets of 3
and above, and any match with len < 8, fall through unchanged.

match is not advanced during the splat loop: the byte-loop tail
still reads match[k] correctly because for k >= offset the byte at
that position has just been splatted into the output.

Adds bench_rle_test.go exercising synthetic RLE workloads at
period 1..4. Benchmarks on Apple M4 (darwin/arm64) show:

  BenchmarkUncompressRLE1  3950 MB/s -> 30450 MB/s (7.7x)
  BenchmarkUncompressRLE2  1005 MB/s -> 30530 MB/s (30x)
  BenchmarkUncompressRLE3  1360 MB/s -> 1365 MB/s   (flat, not covered)
  BenchmarkUncompressRLE4  1690 MB/s -> 1550 MB/s   (noise)

Non-RLE benchmarks (Pg1661, Twain, Digits, Rand) are unchanged
within noise: the one added branch (CMP $8, len; BLO) is predicted
away on non-RLE match paths.

@lizthegrey lizthegrey changed the title from "arm64: add copy shortcut fast path" to "arm64: port copy shortcut and add splat fast path for small-offset RLE" on Apr 21, 2026

The splat loop for offset 1/2 RLE matches was storing 8 bytes per
iteration. Every arm64 core we target (Graviton2/3/4/5, Apple
M2/M3/M4) can retire an STP of two X-registers as a single 16-byte
store, so writing `STP (tmp3, tmp3), 16(dst)!` per iteration halves
the iteration count and the per-iteration loop overhead without any
new dependency on SVE or NEON registers.

Tail handles 0..15 bytes: falls through to an optional 8-byte MOVD
for len in [8,15], then the existing byte-by-byte loop for 1..7.
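The effect on store counts can be worked out with a few lines of Go. `splatPlan` is a hypothetical helper that only counts stores under the scheme just described; it does not copy any data.

```go
package main

import "fmt"

// splatPlan reports how many stores of each width a splat copy of n bytes
// issues: 16-byte STP iterations, then at most one 8-byte MOVD, then
// single-byte stores for the last 0..7 bytes.
func splatPlan(n int) (stp16, movd8, bytes1 int) {
	stp16 = n / 16
	n %= 16
	if n >= 8 {
		movd8 = 1
		n -= 8
	}
	return stp16, movd8, n
}

func main() {
	for _, n := range []int{8, 31, 1000} {
		s, m, b := splatPlan(n)
		fmt.Printf("len=%d: %d x 16B, %d x 8B, %d x 1B\n", n, s, m, b)
	}
}
```

For a 1000-byte run the previous loop would have issued 125 eight-byte stores; this plan issues 62 sixteen-byte stores plus one eight-byte store.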

Apple M4 (darwin/arm64) benchmarks vs. the 8-byte loop:

  BenchmarkUncompressRLE1  30450 MB/s -> 57838 MB/s (1.90x, offset=1)
  BenchmarkUncompressRLE2  30530 MB/s -> 57920 MB/s (1.90x, offset=2)

All non-RLE benchmarks (Pg1661, Twain, Digits, Rand, RLE3, RLE4)
are unchanged within noise: they don't exercise the splat path.
@lizthegrey

Contributor Author

Things we tried and dropped

Flagging one experiment that did not make the cut, in case there is a preference either way.

NEON VDUP + VST1 [V0.B16, V1.B16] 32-byte splat loop (on top of commit 3, guarded by len >= 32):

After `ORR`-ing the pattern into `tmp3`, splat `tmp3` across a pair of 128-bit NEON registers and store 32 bytes per iteration via a Q-pair `VST1`. Falls back to the existing 16-byte STP loop for `len < 32` and for the 0..31 byte tail.

Apple M4 benchstat vs. the 16-byte STP loop (8 runs each):

| Benchmark | 16B STP | 32B NEON | Δ | p |
|---|---|---|---|---|
| RLE1 | 53.7 GiB/s | 61.5 GiB/s | +14.5% | 0.000 |
| RLE2 | 53.4 GiB/s | 62.1 GiB/s | +16.3% | 0.000 |
| RLE3 | 1.29 GiB/s | 1.27 GiB/s | -1.8% | 0.000 |
| RLE4 | 1.61 GiB/s | 1.58 GiB/s | -2.2% | 0.000 |
| Pg1661 | 1.99 GiB/s | 1.99 GiB/s | flat | 0.44 |
| Digits | 4.62 GiB/s | 4.59 GiB/s | -0.8% | 0.000 |
| Twain | 2.08 GiB/s | 2.08 GiB/s | flat | 0.88 |
| Rand | 8.74 GiB/s | 8.65 GiB/s | -1.0% | 0.000 |

The RLE1/RLE2 wins are real but smaller than the 2× the store-pipe argument predicted — M4's LSU scheduler is apparently already extracting most of the benefit from two back-to-back STP X-pair ops. The regressions on RLE3, RLE4, Digits, and Rand are all small (≤2.2%) and on workloads that never execute the splat path, so they're almost certainly code-layout/BTB effects from adding ~8 instructions to the function.

We opted not to include this in the PR: the RLE win is modest on this core, and the regressions on non-RLE paths violate the "don't regress anything" bar we set internally. We did not get a chance to measure on Neoverse V1/V2 silicon before making the call — if you think the dual-store-pipe cores might show a more favorable ratio and you'd like us to chase it, we're happy to spin up a c7g/c8g and get you that data. Otherwise, happy to leave this on the shelf.

@lizthegrey

Contributor Author

@greatroar tagging you here since you probably have opinions.

The shortcut's 18-byte match copy has been sequenced as three
back-to-back load/store pairs (8+8+2 bytes) so that the RLE cycling
case for offset 8..17 loads data that prior stores just produced. That
sequencing is only required when offset < 18 -- any larger offset
guarantees no overlap between the 18 bytes loaded and the 18 bytes
stored within a single shortcut firing.

Sampling real-world compressed columnar data (int64c/float64c/
varstring.dictc/hexc, ~2M matches across 20 files) shows ~95% of
matches have offset >= 18, so the new path is the common case.
English-text benchmarks show the same shape.

For offset >= 18: one LDP (16B, 1 uop), one MOVHU (2B tail), then
STP + MOVH. Four memory ops, no serial dependency chain -- vs. the
6-op serial chain of the existing 8+8+2.

For offset 8..17: keep the existing 8+8+2 sequence for RLE
aliasing correctness.

Apple M4 benchstat (8 runs each), shortcut+splat+16B-splat baseline
vs. +offset-split:

  BenchmarkUncompressPg1661   1.989 GiB/s -> 2.030 GiB/s  +2.03% (p=0.000)
  BenchmarkUncompressTwain    2.069 GiB/s -> 2.128 GiB/s  +2.85% (p=0.000)
  BenchmarkUncompressDigits   4.603 GiB/s -> 4.628 GiB/s  +0.56% (p=0.007)
  BenchmarkUncompressRLE3     1.310 GiB/s -> 1.327 GiB/s  +1.28% (p=0.000)
  BenchmarkUncompressRand     8.715 GiB/s -> 8.739 GiB/s  +0.28% (p=0.038)
  BenchmarkUncompressRLE1/2/4: within noise (p>=0.083, not sig)

No statistically significant regressions. Wins concentrate on
non-RLE workloads as expected: RLE1/2 don't take the shortcut path
(offset 1/2 splat is used instead), so they see only layout effects.

@lizthegrey lizthegrey force-pushed the lizf.arm64-shortcut branch 2 times, most recently from fe41e96 to f22dda2 on April 21, 2026 05:40

@lizthegrey

Contributor Author

Per-commit load-bearing check for reviewers: does each of the five commits unlock a distinct axis of improvement that the others don't cover? Yes -- summary below, derived from a full 6-ref sweep (baseline + each commit) across five cores (Cortex-A72, Neoverse N1/V1/V2, Apple M4), 8 runs × 3 s each, benchstat-analyzed.

| commit | unique win at this transition |
|---|---|
| 1 shortcut | Pg1661 +121–163% / Twain +73–149% / Digits +6–15% on all 5 cores; no other commit touches the short-lit/short-match text-compression path |
| 2 splat | RLE1 +668–3881%, RLE2 +890–2926% on all 5 cores; zero-gain on these before splat |
| 3 splat-widen | RLE1 +30–90%, RLE2 +31–90% on top of commit 2; halves splat loop iteration count |
| 4 LDP shortcut for offset≥18 | ColumnarMed +6.8% on Neoverse V1, small Pg1661/Twain gains on V2, RLE4 recovery on N1; structural 4-memop path vs 6-memop; smallest delta of the five but real on V1 columnar |
| 5 copyMatchLoop8 widening (+PCALIGN) | ColumnarMed +20–44%, ColumnarLong +14–38%, ColumnarShort +8–31% on all 5 cores; long-match path |

Commit 4 could have been squashed into commit 5 and the end-state would be identical — kept separate because each commit is thematic (one independent mechanism per commit), which is easier to review and easier to revert selectively if needed.

A few intermediate per-step transitions have mild regressions (<~1% or quickly recovered by the next commit); none survive to the final commit, where no (core × benchmark) pair has a statistically significant regression vs v4.1.26. Full numbers are in the PR description table.

@lizthegrey lizthegrey changed the title from "arm64: port copy shortcut and add splat fast path for small-offset RLE" to "arm64: major decodeBlock performance rework" on Apr 21, 2026

copyMatchLoop8 has carried a comment claiming "a 16-at-a-time loop
doesn't provide a further speedup". On columnar / record-oriented
compressed data that comment is no longer accurate: measurements show
long matches (actual match length >= 19, ~9% of matches by count) emit
~75% of total match-copy bytes, and ~81% of those long matches have
offset >= 32 -- comfortably above the aliasing threshold where an
LDP/STP pair can't load past its own just-stored data.

When both conditions hold (len >= 16 and offset >= 32), copy 16 bytes
per iteration with LDP/STP instead of 8 with MOVD. The trailing 0..15
bytes are handled by the same negative-offset trick the 8B path uses,
extended to a 16-byte load via an explicit ADD-then-LDP (AArch64 LDP
lacks the (base)(index) addressing mode the MOVD trailer relies on).
Matches with offset 8..31 or len < 16 keep the existing 8-byte loop
unchanged, so there's no aliasing-correctness change.
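A Go model of the widened loop and its trailer, as a sketch only. `copyMatch16` is a hypothetical name, and the trailer is modelled as "re-copy the last 16 bytes ending exactly at the match end", which is one reading of the negative-offset trick described above.

```go
package main

import "fmt"

// copyMatch16 expands a match of n >= 16 bytes at dst[pos:], 16 bytes per
// step. The model is correct for any offset >= 16; the PR only takes this
// path for offset >= 32.
func copyMatch16(dst []byte, pos, offset, n int) {
	var tmp [16]byte
	i := 0
	for ; n-i >= 16; i += 16 {
		copy(tmp[:], dst[pos+i-offset:pos+i-offset+16]) // models LDP
		copy(dst[pos+i:pos+i+16], tmp[:])               // models STP
	}
	if i < n {
		// Trailer: one more 16-byte copy that ends at pos+n. It overlaps
		// bytes already written and rewrites them with the same values.
		j := n - 16
		copy(tmp[:], dst[pos+j-offset:pos+j-offset+16])
		copy(dst[pos+j:pos+j+16], tmp[:])
	}
}

func main() {
	dst := make([]byte, 128)
	for i := 0; i < 40; i++ {
		dst[i] = byte('a' + i%26)
	}
	copyMatch16(dst, 40, 32, 37)
	fmt.Printf("%q\n", dst[:77])
}
```

The trailer never writes past pos+n, so unlike a rounded-up final iteration it needs no extra headroom in dst.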

Also includes:
- bench_rle_test.go: new BenchmarkUncompressColumnar{Med,Long,Short}
  variants that mimic columnar / record-oriented compressed storage,
  since the existing Pg1661/Twain/Digits/Rand/RLE* benchmarks don't
  meaningfully exercise the long-match copy loops.
- internal/lz4block/match_copy_test.go: hand-assembled (offset, matchlen)
  matrix tests that pin down which combination triggered a failure,
  rather than relying on TestCompressUncompressBlock's "byte N of pg1661
  is wrong" signal. These would have reproduced the CCMP immediate bug
  below on the first out-of-range case instead of surfacing it as deep
  in a large corpus round-trip.

Note for future maintainers: CCMP's immediate operand is 5-bit
unsigned (0..31), not register-sized. Go's assembler silently
truncates a larger immediate (e.g. $32 -> $0), which produces a loop
entered on any offset >= 0 and corrupts output. The entry condition
uses CCMP offset, $31 + BLS so the encoding stays in range.

For match copies where offset >= len (non-overlapping) and len >= 256
bytes, jump out of the inline LDP/STP loop and into runtime·memmove.
Go's arm64 memmove is hand-tuned NEON (128-bit reads/writes, prefetch,
software-pipelined 64 B/iter) and outruns any scalar loop we can
realistically write inline once the call setup cost is amortized.

The threshold of 256 was picked to avoid regressions on narrow
arm64 cores (Cortex-A72 in particular, which never wins from
memmove because its 3-wide in-order-ish frontend can't hide the
call/spill overhead even for 4 KiB copies). For the three modern
Graviton generations (N1, V1, V2) and Apple M-series, the memmove
path is a strict improvement at this threshold.

Below 256 bytes the existing inline 16 B/iter LDP/STP loop stays.
Overlapping match copies (offset < len, i.e. small-offset RLE) also
stay on the inline path -- memmove can't handle that pattern, and
our splat (offset 1/2) and scalar (offset 3..7) paths already cover
the relevant cases.

Requires allocating a 48-byte stack frame so decodeBlock can call
from asm (args at 8/16/24(RSP) per Go's arm64 ABI0; spill dst and
src at 32/40(RSP); reload dstend/dstend16/dstend32/srcend/srcend16/
dict/dictlen/dictend from FP after the call since memmove clobbers
every non-callee-saved register). NOSPLIT stays: memmove's stack
use is well under the nosplit margin.

Benchmark summary (throughput deltas vs pre-memmove asm; threshold
256 measured on 5 different arm64 cores, 8 runs × 3 s × benchstat):

                    A72   G2/N1  G3/V1  G4/V2  M4
  ColumnarMed       flat  -38%   flat   -5.5%  +3%*
  ColumnarLong      flat  -22%   -3%    flat   +1%*
  ColumnarShort     flat  +8%*   -2%    flat   +3%*
  Pg1661            flat  +9%*   flat   flat   +3%*
  Twain             flat  +8%*   flat   flat   +1%*
  Digits            +2%*  +2%*   +4%*   flat   flat
  RLE1/2            flat  flat   flat   flat   flat
  RLE3/4            flat  +14%*  flat*  flat   flat
  Rand              flat  flat   flat   flat   flat

Entries marked * are small enough to be attributable to code-layout
shifts from the added path, not to memmove execution on their input.
The large wins on G2 columnar are the intended target. No statistically
significant regression > 10% anywhere.

Kibble-sampled columnar data shows ~56% of total match-copy bytes
come from matches of length >= 256, and essentially all of those
have offset >> len, so this path fires on the majority of real-world
long-match output.

@lizthegrey lizthegrey force-pushed the lizf.arm64-shortcut branch from c1acfd5 to 91e8b6f on April 21, 2026 16:18

@pierrec

pierrec commented Apr 23, 2026

Owner

Amazing work. Waiting a bit for @greatroar to chime in for any comment.

@pierrec pierrec merged commit a296161 into pierrec:v4 May 31, 2026
9 checks passed
@lizthegrey lizthegrey deleted the lizf.arm64-shortcut branch May 31, 2026 12:42
@lizthegrey

Contributor Author

Thanks much!
