arm64: major decodeBlock performance rework by lizthegrey · Pull Request #243 · pierrec/lz4

lizthegrey · 2026-04-21T01:14:23Z

Summary

Six incremental arm64 decoder optimizations. Commits 1–5 are all scalar (no SVE, no NEON vector regs); commit 6 delegates the long-match bulk-copy case to runtime.memmove to recover the last gap vs. the pure-Go decoder.

Commit 1 — Copy shortcut (`41139b4`)

Ports the "copy shortcut" optimization from decode_amd64.s. When the literal length is 0..14, there are at least 32 bytes of dst and 16 bytes of src remaining, the match offset is at least 8, the match is within the current block, and the match length is not extended, copy 16 literal + 18 match bytes as straight-line code with no bounds checks or extended-length reads. The match copy is sequenced 8+8+2 so offset == 8 works correctly. Adds one register (R20, dstend32 = dstend - 32).

Commit 2 — Splat fast path for offset 1 and 2 (`998937e`)

Matches with offset 1 (byte RLE, e.g. zero-padding) or offset 2 (halfword RLE) previously went through copyMatchLoop1's byte-at-a-time loop, dominated by store-to-load forwarding latency. When len >= 8 and offset is 1 or 2, splat the 1- or 2-byte pattern across a 64-bit register and store it 8 bytes at a time, falling through to the byte loop for the tail. Adds bench_rle_test.go with synthetic offset=1..4 RLE benchmarks (RLE1 and RLE2 land in this commit's wheelhouse; RLE3/RLE4 are intentional regression-guards against the pure-Go fallback's exponential-doubling loop for the small-offset path this PR does not modify in asm).

Commit 3 — Widen splat store from 8 to 16 bytes (`4ce77f3`)

Every arm64 core targeted by this library (Cortex-A72, Neoverse N1/V1/V2, Apple M-series) can retire an STP of two X-registers as a single 16-byte store. Widening the splat loop halves the iteration count and per-iteration loop overhead.

Commit 4 — LDP/STP fast path in shortcut match copy when offset >= 18 (`f6dea62`)

The 8+8+2 sequencing from commit 1 is only required when offset < 18. A histogram of ~2M matches from real-world compressed columnar data shows ~95% of matches have offset >= 18, so the LDP path is the overwhelmingly common case: four memory ops with no serial dependency chain vs. six in the 8+8+2 fallback. For offset 8..17 the sequencing is preserved.

Commit 5 — Widen copyMatchLoop8 to 16 bytes/iter for offset >= 32 (`0894170`)

copyMatchLoop8 has carried a comment claiming "a 16-at-a-time loop doesn't provide a further speedup". On columnar data that's no longer accurate: long matches (length >= 19, ~9% of matches by count) emit ~75% of total match-copy bytes, and ~81% of those long matches have offset >= 32. When len >= 16 and offset >= 32, copy 16 bytes/iter with LDP/STP instead of 8 with MOVD.

internal/lz4block/match_copy_test.go adds hand-assembled (offset, matchlen) matrix tests.

Note for future maintainers: CCMP's immediate is 5-bit unsigned (0..31); Go's assembler silently truncates larger values (e.g. $32 -> $0). The entry uses CCMP offset, $31 + BLS to stay in range.

Commit 6 — call runtime·memmove for long non-overlapping match copies (`91e8b6f`)

For match copies where offset >= len and len >= 256, jump out of the inline 16 B/iter LDP/STP loop and into runtime.memmove. Go's arm64 memmove is hand-tuned NEON (128-bit reads/writes, prefetch, software-pipelined 64 B/iter) and outruns any scalar loop that fits inline once the call-setup cost is amortized. Requires allocating a 48-byte stack frame (previously NOFRAME); NOSPLIT stays since memmove's stack use is well under the nosplit margin.

Threshold of 256 was chosen to avoid regressing Cortex-A72 (the narrow-core floor): at threshold 64 and 128, A72 regressed 5–7% on long-match columnar workloads because its in-order-ish frontend can't hide the call/spill overhead. At 256 A72 is flat. The three modern Graviton cores and Apple M-series all show net wins at 256.

Below 256 bytes and for overlapping copies (small-offset RLE), the inline scalar paths stay unchanged — memmove can't do the RLE cycling pattern, and the splat paths for offset 1/2 already cover their bandwidth ceiling.

Benchmarks

Throughput vs v4.1.26 baseline, 8 runs × 3 s × benchstat-analyzed, on five arm64 cores. G1 = Cortex-A72, G2 = Neoverse N1 (Graviton2), G3 = Neoverse V1 (Graviton3), G4 = Neoverse V2 (Graviton4), M4 = Apple M4:

Benchmark	G1	G2	G3	G4	M4
`UncompressPg1661`	+124%	+142%	+162%	+156%	+129%
`UncompressTwain`	+120%	+134%	+149%	+144%	+82%
`UncompressColumnarMed`	+17%	+107%	+12%	+54%	+52%
`UncompressColumnarLong`	flat	+79%	+21%	+49%	+50%
`UncompressColumnarShort`	+8%	+50%	+21%	+19%	+24%
`UncompressDigits`	+6%	+5%	+8%	+5%	+7%
`UncompressRLE1`	36×	36×	75×	72×	15×
`UncompressRLE2`	11×	15×	39×	36×	57×
`UncompressRLE3`	flat	+13%	flat	flat	flat
`UncompressRLE4`	flat	+15%	−16%	flat	+1%
`UncompressRand`	flat	flat	flat	flat	flat
geomean	+104%	+150%	+154%	+166%	+132%

On the G3 UncompressRLE4 −16% regression, and why this PR does not add `PCALIGN`

An early iteration of this series added a PCALIGN $64 before copyMatchTry4: on the theory that the Arm Neoverse V1 Software Optimization Guide's general advice — "align hot loop entries" / "avoid inner-loop branches crossing 32/64-byte boundaries" — would explain the RLE4 regression. It does not, and the alignment hint has been removed. Details:

No frontend stall to fix. perf stat -e cycles,instructions,L1-icache-load-misses,stalled-cycles-frontend on BenchmarkUncompressRLE4 with the regressing binary reports <0.2% frontend-idle cycles and <200k L1-icache misses per 3 s bench. V1's frontend is not bottlenecked on this workload — the whole premise PCALIGN addresses doesn't apply.
The hotspot is an STLF-bound store, not a branch/fetch. Line-level go tool pprof -list=decodeBlock puts 99%+ of RLE4 time in four bytes of asm — copyMatchByteLoop's load / 1-byte store / decrement / branch — with 67% of total cycles in the single-byte store alone. Classic store-to-load-forwarding dependency chain. Alignment doesn't move stores earlier in the pipeline.
Those four instructions are byte-for-byte identical to v4.1.26. No algorithmic change touches offset < 8 long RLE. What's different is that ~200 lines of new asm now precede copyMatchByteLoop:, so its PC has shifted. V1 appears to have a PC-hashed store-buffer-bank selection; at the new PC the RLE4 store falls on a worse bank than at v4.1.26's PC. N1/V2/A72/M4 do not show the same sensitivity. This is an LSU hashing artifact, not a frontend layout artifact.
Empirical PCALIGN tuning is darts against a wall. Each of $16, $32, $64 does move RLE4's numbers on V1 (because each shifts copyMatchByteLoop:'s PC by a different amount), but it also shifts every other downstream hot loop. PCALIGN $32 recovered RLE4 by +15% but introduced −2.4% on ColumnarLong and −2% on Digits. Each value wins on one benchmark and loses on another, with no theoretical basis for predicting which.
The V1 Optimization Guide's advice assumes a frontend-bound workload. Following it when perf counters show no frontend stalls is applying a remedy to a diagnosis we cannot confirm. Shipping a fragile hardware-generation-specific alignment hint that contradicts measurement, for a synthetic benchmark (offset=4 RLE is ~0% of real columnar-data matches), fails a sniff test — the payoff is small and the downside is that any future change to code placement re-rolls the dice for all cores. So: no PCALIGN.

The same layout-sensitivity caveat applies in the opposite direction to the single-digit-% gains in the matrix above: when this PR is linked into a consumer binary with different surrounding code, the small-% deltas may shift by a few percent either way on V1 specifically. The big-% gains (anything above ~20%) come from genuine algorithmic bandwidth improvements and are robust to layout.

A future commit adding an offset-4 splat path (the same trick as offset 1/2) would turn the G3 RLE4 line into a big-× win on every core and close this issue algorithmically rather than through alignment hints, but it adds a branch-table entry that trades icache pressure against what is ~0% of real columnar-data matches — deferred as a follow-up.

Test plan

go test ./... passes on every ref at every core above.
GOOS=linux GOARCH=arm64 go build ./... clean.
Full TestCompressUncompressBlock round-trips at every commit on every core.
New match_copy_test.go (offset, matchlen) matrix passes at every commit.

Motivation

Observed LZ4 decodeBlock consuming roughly a third of CPU in a service decoding compressed columnar data on arm64 Lambdas. The amd64 path already had the copy shortcut; the arm64 port never got it. The splat path, both widenings, the LDP offset-split, and the memmove delegation are all either new vs. amd64 or algorithmically more aggressive, and target store-to-load-forwarding stalls and under-utilized store pipes that hurt narrow ARM cores more than wider x86 cores, plus the call-worthy runtime.memmove NEON path for bulk copies.

Ports the "copy shortcut" optimization from decode_amd64.s to decode_arm64.s: when literal length is 0..14, there is at least 32 bytes of dst and 16 bytes of src remaining, the match offset is at least 8, the match is within the current block (not dict), and the match length is not extended, copy 16 literal + 18 match bytes as straight-line code with no bounds checks or extended-length reads. The shortcut bails to the existing slow paths on any guard failure: readLitlenDone for dst/src bounds, readMatchlen for extended matchlen / small offset / dict reference. Match copy is sequenced as 8+8+2 (not LDP+STP) so that offset == 8 (common 8-byte RLE) works correctly: each load observes the prior store's effect. Benchmarks on Apple M4 (darwin/arm64): Pg1661 910 MB/s -> 2135 MB/s (+134%) Twain 1230 MB/s -> 2234 MB/s (+81%) Digits 4560 MB/s -> 4920 MB/s (+8%) Rand 9260 MB/s -> 9270 MB/s (flat, all-literal path unchanged) Adds one register (R20, dstend32 = dstend - 32) used only by the shortcut guard.

Small-offset matches with offset 1 (byte RLE) or offset 2 (halfword RLE) previously went through a byte-at-a-time loop that on modern arm64 cores is bottlenecked by store-to-load forwarding latency. When len >= 8 and offset is 1 or 2, splat the 1- or 2-byte pattern across a 64-bit register and emit MOVD stores 8 bytes at a time, falling through to the byte loop for the 1..7 byte tail. offset 3 and len < 8 fall through unchanged. match is not advanced during the splat loop: the byte-loop tail still reads match[k] correctly because for k >= offset the byte at that position has just been splatted into the output. Adds bench_rle_test.go exercising synthetic RLE workloads at period 1..4. Benchmarks on Apple M4 (darwin/arm64) show: BenchmarkUncompressRLE1 3950 MB/s -> 30450 MB/s (7.7x) BenchmarkUncompressRLE2 1005 MB/s -> 30530 MB/s (30x) BenchmarkUncompressRLE3 1360 MB/s -> 1365 MB/s (flat, not covered) BenchmarkUncompressRLE4 1690 MB/s -> 1550 MB/s (noise) Non-RLE benchmarks (Pg1661, Twain, Digits, Rand) are unchanged within noise: the one added branch (CMP $8, len; BLO) is predicted away on non-RLE match paths.

The splat loop for offset 1/2 RLE matches was storing 8 bytes per iteration. Every arm64 core we target (Graviton2/3/4/5, Apple M2/M3/M4) can retire an STP of two X-registers as a single 16-byte store, so writing `STP (tmp3, tmp3), 16(dst)!` per iteration halves the iteration count and the per-iteration loop overhead without any new dependency on SVE or NEON registers. Tail handles 0..15 bytes: falls through to an optional 8-byte MOVD for len in [8,15], then the existing byte-by-byte loop for 1..7. Apple M4 (darwin/arm64) benchmarks vs. the 8-byte loop: BenchmarkUncompressRLE1 30450 MB/s -> 57838 MB/s (1.90x, offset=1) BenchmarkUncompressRLE2 30530 MB/s -> 57920 MB/s (1.90x, offset=2) All non-RLE benchmarks (Pg1661, Twain, Digits, Rand, RLE3, RLE4) are unchanged within noise: they don't exercise the splat path.

lizthegrey · 2026-04-21T03:01:47Z

Things we tried and dropped

Flagging one experiment that did not make the cut, in case there is a preference either way.

NEON VDUP + VST1 [V0.B16, V1.B16] 32-byte splat loop (on top of commit 3, guarded by len >= 32):

After `ORR`-ing the pattern into `tmp3`, splat `tmp3` across a pair of 128-bit NEON registers and store 32 bytes per iteration via a Q-pair `VST1`. Falls back to the existing 16-byte STP loop for `len < 32` and for the 0..31 byte tail.

Apple M4 benchstat vs. the 16-byte STP loop (8 runs each):

Benchmark	16B STP	32B NEON	Δ	p
RLE1	53.7 GiB/s	61.5 GiB/s	+14.5%	0.000
RLE2	53.4 GiB/s	62.1 GiB/s	+16.3%	0.000
RLE3	1.29 GiB/s	1.27 GiB/s	-1.8%	0.000
RLE4	1.61 GiB/s	1.58 GiB/s	-2.2%	0.000
Pg1661	1.99 GiB/s	1.99 GiB/s	flat	0.44
Digits	4.62 GiB/s	4.59 GiB/s	-0.8%	0.000
Twain	2.08 GiB/s	2.08 GiB/s	flat	0.88
Rand	8.74 GiB/s	8.65 GiB/s	-1.0%	0.000

The RLE1/RLE2 wins are real but smaller than the 2× the store-pipe argument predicted — M4's LSU scheduler is apparently already extracting most of the benefit from two back-to-back STP X-pair ops. The regressions on RLE3, RLE4, Digits, and Rand are all small (≤2.2%) and on workloads that never execute the splat path, so they're almost certainly code-layout/BTB effects from adding ~8 instructions to the function.

We opted not to include this in the PR: the RLE win is modest on this core, and the regressions on non-RLE paths violate the "don't regress anything" bar we set internally. We did not get a chance to measure on Neoverse V1/V2 silicon before making the call — if you think the dual-store-pipe cores might show a more favorable ratio and you'd like us to chase it, we're happy to spin up a c7g/c8g and get you that data. Otherwise, happy to leave this on the shelf.

lizthegrey · 2026-04-21T03:10:07Z

@greatroar tagging you here since you probably have opinions.

The shortcut's 18-byte match copy has been sequenced as three back-to-back 8+8+2 memory ops so that the RLE cycling case for offset 8..15 loads data that prior stores just produced. That sequencing is only required when offset < 18 -- any larger offset guarantees no overlap between the 16-byte LDP and the 16-byte STP within a single shortcut firing. Sampling real-world compressed columnar data (int64c/float64c/ varstring.dictc/hexc, ~2M matches across 20 files) shows ~95% of matches have offset >= 18, so the new path is the common case. English-text benchmarks show the same shape. For offset >= 18: one LDP (16B, 1 uop), one MOVHU (2B tail), then STP + MOVH. Four memory ops, no serial dependency chain -- vs. the 6-op serial chain of the existing 8+8+2. For offset 8..17: keep the existing 8+8+2 sequence for RLE aliasing correctness. Apple M4 benchstat (8 runs each), shortcut+splat+16B-splat baseline vs. +offset-split: BenchmarkUncompressPg1661 1.989 GiB/s -> 2.030 GiB/s +2.03% (p=0.000) BenchmarkUncompressTwain 2.069 GiB/s -> 2.128 GiB/s +2.85% (p=0.000) BenchmarkUncompressDigits 4.603 GiB/s -> 4.628 GiB/s +0.56% (p=0.007) BenchmarkUncompressRLE3 1.310 GiB/s -> 1.327 GiB/s +1.28% (p=0.000) BenchmarkUncompressRand 8.715 GiB/s -> 8.739 GiB/s +0.28% (p=0.038) BenchmarkUncompressRLE1/2/4: within noise (p>=0.083, not sig) No statistically significant regressions. Wins concentrate on non-RLE workloads as expected: RLE1/2 don't take the shortcut path (offset 1/2 splat is used instead), so they see only layout effects.

lizthegrey · 2026-04-21T05:50:25Z

Per-commit load-bearing check for reviewers: does each of the five commits unlock a distinct axis of improvement that the others don't cover? Yes -- summary below, derived from a full 6-ref sweep (baseline + each commit) across five cores (Cortex-A72, Neoverse N1/V1/V2, Apple M4), 8 runs × 3 s each, benchstat-analyzed.

commit	unique win at this transition
1 shortcut	Pg1661 +121–163% / Twain +73–149% / Digits +6–15% on all 5 cores — no other commit touches the short-lit/short-match text-compression path
2 splat	RLE1 +668–3881%, RLE2 +890–2926% on all 5 cores — zero-gain on these before splat
3 splat-widen	RLE1 +30–90%, RLE2 +31–90% on top of commit 2 — halves splat loop iteration count
4 LDP shortcut for offset≥18	ColumnarMed +6.8% on Neoverse V1, small Pg1661/Twain gains on V2, RLE4 recovery on N1 — structural 4-memop path vs 6-memop; smallest delta of the five but real on V1 columnar
5 copyMatchLoop8 widening (+PCALIGN)	ColumnarMed +20–44%, ColumnarLong +14–38%, ColumnarShort +8–31% on all 5 cores — long-match path

Commit 4 could have been squashed into commit 5 and the end-state would be identical — kept separate because each commit is thematic (one independent mechanism per commit), which is easier to review and easier to revert selectively if needed.

A few intermediate per-step transitions have mild regressions (<~1% or quickly recovered by the next commit); none survive to the final commit, where no (core × benchmark) pair has a statistically significant regression vs v4.1.26. Full numbers are in the PR description table.

copyMatchLoop8 has carried a comment claiming "a 16-at-a-time loop doesn't provide a further speedup". On columnar / record-oriented compressed data that comment is no longer accurate: measurements show long matches (actual match length >= 19, ~9% of matches by count) emit ~75% of total match-copy bytes, and ~81% of those long matches have offset >= 32 -- comfortably above the aliasing threshold where an LDP/STP pair can't load past its own just-stored data. When both conditions hold (len >= 16 and offset >= 32), copy 16 bytes per iteration with LDP/STP instead of 8 with MOVD. The trailing 0..15 bytes are handled by the same negative-offset trick the 8B path uses, extended to a 16-byte load via an explicit ADD-then-LDP (AArch64 LDP lacks the (base)(index) addressing mode the MOVD trailer relies on). Matches with offset 8..31 or len < 16 keep the existing 8-byte loop unchanged, so there's no aliasing-correctness change. Also includes: - bench_rle_test.go: new BenchmarkUncompressColumnar{Med,Long,Short} variants that mimic columnar / record-oriented compressed storage, since the existing Pg1661/Twain/Digits/Rand/RLE* benchmarks don't meaningfully exercise the long-match copy loops. - internal/lz4block/match_copy_test.go: hand-assembled (offset, matchlen) matrix tests that pin down which combination triggered a failure, rather than relying on TestCompressUncompressBlock's "byte N of pg1661 is wrong" signal. These would have reproduced the CCMP immediate bug below on the first out-of-range case instead of surfacing it as deep in a large corpus round-trip. Note for future maintainers: CCMP's immediate operand is 5-bit unsigned (0..31), not register-sized. Go's assembler silently truncates a larger immediate (e.g. $32 -> $0), which produces a loop entered on any offset >= 0 and corrupts output. The entry condition uses CCMP offset, $31 + BLS so the encoding stays in range.

For match copies where offset >= len (non-overlapping) and len >= 256 bytes, jump out of the inline LDP/STP loop and into runtime·memmove. Go's arm64 memmove is hand-tuned NEON (128-bit reads/writes, prefetch, software-pipelined 64 B/iter) and outruns any scalar loop we can realistically write inline once the call setup cost is amortized. The threshold of 256 was picked to avoid regressions on narrow arm64 cores (Cortex-A72 in particular, which never wins from memmove because its 3-wide in-order-ish frontend can't hide the call/spill overhead even for 4 KiB copies). For the three modern Graviton generations (N1, V1, V2) and Apple M-series, the memmove path is a strict improvement at this threshold. Below 256 bytes the existing inline 16 B/iter LDP/STP loop stays. Overlapping match copies (offset < len, i.e. small-offset RLE) also stay on the inline path -- memmove can't handle that pattern, and our splat (offset 1/2) and scalar (offset 3..7) paths already cover the relevant cases. Requires allocating a 48-byte stack frame so decodeBlock can call from asm (args at 8/16/24(RSP) per Go's arm64 ABI0; spill dst and src at 32/40(RSP); reload dstend/dstend16/dstend32/srcend/srcend16/ dict/dictlen/dictend from FP after the call since memmove clobbers every non-callee-saved register). NOSPLIT stays: memmove's stack use is well under the nosplit margin. Benchmark summary (throughput deltas vs pre-memmove asm; threshold 256 measured on 5 different arm64 cores, 8 runs × 3 s × benchstat): A72 G2/N1 G3/V1 G4/V2 M4 ColumnarMed flat -38% flat -5.5% +3%* ColumnarLong flat -22% -3% flat +1%* ColumnarShort flat +8%* -2% flat +3%* Pg1661 flat +9%* flat flat +3%* Twain flat +8%* flat flat +1%* Digits +2%* +2%* +4%* flat flat RLE1/2 flat flat flat flat flat RLE3/4 flat +14%* flat* flat flat Rand flat flat flat flat flat Entries marked * are small enough to be attributable to code-layout shifts from the added path, not to memmove execution on their input. The large wins on G2 columnar are the intended target. No statistically significant regression > 10% anywhere. Kibble-sampled columnar data shows ~56% of total match-copy bytes come from matches of length >= 256, and essentially all of those have offset >> len, so this path fires on the majority of real-world long-match output.

pierrec · 2026-04-23T07:01:52Z

Amazing work. Waiting a bit for @greatroar to chime in for any comment.

lizthegrey · 2026-05-31T14:10:27Z

Thanks much!

lizthegrey added 2 commits April 20, 2026 18:12

lizthegrey changed the title ~~arm64: add copy shortcut fast path~~ arm64: port copy shortcut and add splat fast path for small-offset RLE Apr 21, 2026

lizthegrey force-pushed the lizf.arm64-shortcut branch 2 times, most recently from fe41e96 to f22dda2 Compare April 21, 2026 05:40

lizthegrey changed the title ~~arm64: port copy shortcut and add splat fast path for small-offset RLE~~ arm64: major decodeBlock performance rework Apr 21, 2026

lizthegrey added 2 commits April 21, 2026 09:18

lizthegrey force-pushed the lizf.arm64-shortcut branch from c1acfd5 to 91e8b6f Compare April 21, 2026 16:18

Merge branch 'v4' into lizf.arm64-shortcut

903773a

pierrec merged commit a296161 into pierrec:v4 May 31, 2026
9 checks passed

lizthegrey deleted the lizf.arm64-shortcut branch May 31, 2026 12:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arm64: major decodeBlock performance rework#243

arm64: major decodeBlock performance rework#243
pierrec merged 7 commits into
pierrec:v4from
honeycombio:lizf.arm64-shortcut

lizthegrey commented Apr 21, 2026 •

edited

Loading

Uh oh!

lizthegrey commented Apr 21, 2026

Uh oh!

lizthegrey commented Apr 21, 2026

Uh oh!

lizthegrey commented Apr 21, 2026

Uh oh!

pierrec commented Apr 23, 2026

Uh oh!

Uh oh!

lizthegrey commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lizthegrey commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Commit 1 — Copy shortcut (41139b4)

Commit 2 — Splat fast path for offset 1 and 2 (998937e)

Commit 3 — Widen splat store from 8 to 16 bytes (4ce77f3)

Commit 4 — LDP/STP fast path in shortcut match copy when offset >= 18 (f6dea62)

Commit 5 — Widen copyMatchLoop8 to 16 bytes/iter for offset >= 32 (0894170)

Commit 6 — call runtime·memmove for long non-overlapping match copies (91e8b6f)

Benchmarks

On the G3 UncompressRLE4 −16% regression, and why this PR does not add PCALIGN

Test plan

Motivation

Uh oh!

lizthegrey commented Apr 21, 2026

Things we tried and dropped

Uh oh!

lizthegrey commented Apr 21, 2026

Uh oh!

lizthegrey commented Apr 21, 2026

Uh oh!

pierrec commented Apr 23, 2026

Uh oh!

Uh oh!

lizthegrey commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lizthegrey commented Apr 21, 2026 •

edited

Loading

Commit 1 — Copy shortcut (`41139b4`)

Commit 2 — Splat fast path for offset 1 and 2 (`998937e`)

Commit 3 — Widen splat store from 8 to 16 bytes (`4ce77f3`)

Commit 4 — LDP/STP fast path in shortcut match copy when offset >= 18 (`f6dea62`)

Commit 5 — Widen copyMatchLoop8 to 16 bytes/iter for offset >= 32 (`0894170`)

Commit 6 — call runtime·memmove for long non-overlapping match copies (`91e8b6f`)

On the G3 UncompressRLE4 −16% regression, and why this PR does not add `PCALIGN`