Nine profiled micro-optimizations: +63% ARM mesh, +26% ARM canada, +17% ARM random vs baseline #24

Open
fcostaoliveira wants to merge 16 commits into kolemannix:main from redis-performance:perf/force-inline-ffc-impl

@fcostaoliveira fcostaoliveira commented May 26, 2026

Contributor

Nine changes from two days of profiled, evidence-based optimization on bare-metal ARM Graviton4 (m8g.metal-24xl) and x86 Xeon (m7i.metal-24xl). Each change was validated with perf record -g before acceptance.

Changes

1. FFC_IMPL_INLINE — force-inline ffc_from_chars_double at call sites (EXP-009)

GCC's IPA-CP pass was creating an out-of-line constprop clone (ffc_from_chars_double_options.constprop.0.isra.0) and calling it as a regular function on every float. The C++ fast_float reference gets full inlining via templates; ffc_from_chars_double's external linkage prevented the equivalent.

FFC_IMPL_INLINE expands to __attribute__((always_inline)) inline in FFC_IMPL translation units, and to nothing otherwise. Applying it to the forward declarations in api.h causes GCC's ipa_early_inline pass to inline both functions at call sites — eliminating per-float call overhead.

ABI note: always_inline suppresses the out-of-line symbol in FFC_IMPL TUs; the ffc_parse_double / ffc_parse_float wrappers remain exported for users who don't define FFC_IMPL.
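
A minimal sketch of how a macro like this could be defined, assuming a GCC- or Clang-compatible compiler. The real definition lives in src/api.h and may differ; `demo_digit_value` is a hypothetical stand-in, not part of ffc.h.

```c
/* Sketch only: force inlining in the implementation TU, nothing elsewhere. */
#if defined(FFC_IMPL) && (defined(__GNUC__) || defined(__clang__))
#define FFC_IMPL_INLINE __attribute__((always_inline)) inline
#else
#define FFC_IMPL_INLINE /* expands to nothing outside the implementation TU */
#endif

/* Hypothetical helper annotated with the macro. `static` keeps this sketch
   link-safe whether or not FFC_IMPL is defined. */
static FFC_IMPL_INLINE int demo_digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

Because the attribute sits on the declaration itself, the compiler does not need to weigh inlining heuristics for each call site.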

2. Local vars in too_many_digits path — GCC DSE of struct stores (EXP-006)

Four fields (int_part_*, fraction_part_*) were stored to the ffc_parsed struct unconditionally then read back in the rare too_many_digits recovery path. This prevented GCC from dead-store-eliminating those struct stores on the non-JSON constprop clone. Restructuring to use locals (which GCC sees as registers) enables DSE. ARM GCC already handled this via more aggressive ISRA.

3. Combined exponent range check in ffc_clinger_fast_path_impl (EXP-012)

Clinger's fast path checked exponent < 0 and exponent > power_of_ten_count with two separate branches. Combining into a single unsigned comparison ((uint64_t)(exponent + 22) <= 44) saves one branch and one comparison instruction on the hot path.
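
The combined check can be illustrated with a small sketch. The bounds of -22 and 22 are inferred from the `(exponent + 22) <= 44` expression above; the `DEMO_` and `demo_` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed fast-path bounds for double, inferred from the "+ 22 <= 44" form. */
#define DEMO_MIN_EXP (-22)
#define DEMO_MAX_EXP 22

/* Two signed comparisons: up to two branches. */
static bool demo_in_range_two_branch(int64_t e) {
    return DEMO_MIN_EXP <= e && e <= DEMO_MAX_EXP;
}

/* One unsigned comparison: exponents below the minimum wrap around to very
   large unsigned values, so they fail the same "<=" test as large exponents. */
static bool demo_in_range_unsigned(int64_t e) {
    return ((uint64_t)e - (uint64_t)DEMO_MIN_EXP) <=
           (uint64_t)(DEMO_MAX_EXP - DEMO_MIN_EXP);
}
```

Doing the subtraction in unsigned arithmetic also sidesteps signed-overflow concerns for extreme exponent values.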

4. Unroll fraction digit loop: 3 nested ifs replace while loop (EXP-015)

The fraction part scanner used a while loop. For inputs where the fraction is ≤3 bytes (common in real-world coordinates), a 3-level nested-if eliminates the loop back-branch and allows GCC to hoist the loop-exit condition out.
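
A rough sketch of the nested-if shape, assuming at most three digits remain when it runs (any further digits would be left unconsumed). The function and helper names are hypothetical and the real scanner in src/parse.h may be structured differently.

```c
#include <stdint.h>

static inline int demo_frac_is_digit(char c) {
    return (unsigned)(c - '0') <= 9u;
}

/* Consume up to three fraction digits with straight-line code.
   Accumulates into *mantissa and returns the new cursor position. */
static const char *demo_scan_fraction_tail(const char *p, const char *pend,
                                           uint64_t *mantissa) {
    if (p != pend && demo_frac_is_digit(*p)) {
        *mantissa = *mantissa * 10 + (uint64_t)(*p - '0');
        ++p;
        if (p != pend && demo_frac_is_digit(*p)) {
            *mantissa = *mantissa * 10 + (uint64_t)(*p - '0');
            ++p;
            if (p != pend && demo_frac_is_digit(*p)) {
                *mantissa = *mantissa * 10 + (uint64_t)(*p - '0');
                ++p;
            }
        }
    }
    return p;
}
```

Each level only executes if the previous one found a digit, so short fractions exit early without ever taking a backward branch.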

5. Straight-line integer scan: nested-ifs replace while loop for 1–4 digits (EXP-026)

Same pattern applied to the integer part. Most real-world floats have 1–4 integer digits; 4-level nested-if eliminates the loop back-branch and exposes more ILP.

6. Extend integer nested-ifs to 5 levels (EXP-028)

Profile showed 5-digit integer parts are common in mesh.txt (3D coordinates like 12345.678). Extending the nested-if from 4→5 levels eliminates the while-loop back-branch for those cases.
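
The integer scan from changes 5 and 6 might look roughly like the sketch below: five straight-line levels, then a while loop only for six or more digits. Names are hypothetical and this is not the code from src/parse.h.

```c
#include <stdint.h>

static inline int demo_int_is_digit(char c) {
    return (unsigned)(c - '0') <= 9u;
}

/* Scan the integer part: digits 1-5 use nested ifs, 6+ fall back to a loop. */
static const char *demo_scan_integer(const char *p, const char *pend,
                                     uint64_t *i) {
    if (p != pend && demo_int_is_digit(*p)) {
        *i = 10 * *i + (uint64_t)(*p - '0'); ++p;              /* digit 1 */
        if (p != pend && demo_int_is_digit(*p)) {
            *i = 10 * *i + (uint64_t)(*p - '0'); ++p;          /* digit 2 */
            if (p != pend && demo_int_is_digit(*p)) {
                *i = 10 * *i + (uint64_t)(*p - '0'); ++p;      /* digit 3 */
                if (p != pend && demo_int_is_digit(*p)) {
                    *i = 10 * *i + (uint64_t)(*p - '0'); ++p;  /* digit 4 */
                    if (p != pend && demo_int_is_digit(*p)) {
                        *i = 10 * *i + (uint64_t)(*p - '0'); ++p; /* digit 5 */
                        /* Rare case: six or more integer digits. */
                        while (p != pend && demo_int_is_digit(*p)) {
                            *i = 10 * *i + (uint64_t)(*p - '0'); ++p;
                        }
                    }
                }
            }
        }
    }
    return p;
}
```

An input such as "12345.678" stops at the '.' after the fifth level without entering the loop.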

7. FFC_ROUNDS_TO_NEAREST compile-time macro — eliminate FCMP chain (EXP-030)

ffc_rounds_to_nearest() read the FP control word at runtime via a volatile double store/load round-trip. This 7-instruction sequence appeared on the hot path of every call to ffc_clinger_fast_path_impl.

FFC_ROUNDS_TO_NEAREST (default: 1) lets the compiler see a compile-time constant, reducing the entire check to a single true branch — zero instructions. Can be set to 0 to preserve runtime detection.
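
As a sketch, the compile-time switch could wrap a runtime probe like the one below. The probe is modeled on the volatile round-trip described above, but its details are assumed, and `demo_rounds_to_nearest` is a hypothetical name. The default of 1 follows this description; later comments in this thread suggest the macro may be opt-in, so treat the default as an assumption.

```c
#include <float.h>
#include <stdbool.h>

#ifndef FFC_ROUNDS_TO_NEAREST
#define FFC_ROUNDS_TO_NEAREST 1 /* assumed default, per the description */
#endif

static inline bool demo_rounds_to_nearest(void) {
#if FFC_ROUNDS_TO_NEAREST
    /* Compile-time constant: callers' branches on this fold away. */
    return true;
#else
    /* Runtime probe. The volatile read stops the compiler from folding the
       comparison. Under round-to-nearest both sides equal 1.0; under any
       directed rounding mode exactly one side moves away from 1.0. */
    static volatile double tiny = DBL_MIN;
    double t = tiny;
    return (t + 1.0) == (1.0 - t);
#endif
}
```

Code that changes the rounding mode with fesetround() would need to build with the macro set to 0 so the runtime probe stays in place.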

8. Early exit for exponent == 0 in ffc_from_chars_advanced (EXP-033)

~55% of mesh.txt lines are pure integers (no fractional part, exponent 0). Previously, each was routed through Clinger's full path: range check, pow10 table load, scvtf, and fmul × 1.0. With FFC_ROUNDS_TO_NEAREST eliminating the FCMP chain, a pre-Clinger guard is safe to add. The early exit converts the integer mantissa directly and returns, saving ~10–16 instructions per integer value.

9. Fix: zero sign under FE_DOWNWARD in exponent==0 fast path

Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() == FE_DOWNWARD. The fast exit for integers with exponent==0 did not have the guard that already exists in the non-nearest-rounding Clinger branch. Mirrored the existing #if defined(__clang__) || defined(FFC_32BIT) guard. Caught by the supplemental test suite.
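
Changes 8 and 9 together can be sketched as a single guard. This is a simplified illustration with hypothetical names: it handles only double, and it applies the zero guard unconditionally rather than under `#if defined(__clang__) || defined(FFC_32BIT)`.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_MANTISSA_FAST_PATH (UINT64_C(1) << 53) /* 2^53 */

/* Returns true and writes *out when the pure-integer fast exit applies. */
static bool demo_integer_fast_exit(uint64_t mantissa, int64_t exponent,
                                   bool negative, bool too_many_digits,
                                   double *out) {
    if (too_many_digits || exponent != 0 ||
        mantissa > DEMO_MAX_MANTISSA_FAST_PATH) {
        return false; /* fall through to the regular Clinger path */
    }
    if (mantissa == 0) {
        /* Set the sign of zero explicitly instead of relying on the
           integer-to-double conversion under directed rounding modes. */
        *out = negative ? -0.0 : 0.0;
        return true;
    }
    double v = (double)mantissa; /* exact for mantissa <= 2^53 */
    *out = negative ? -v : v;
    return true;
}
```

Integers up to 2^53 convert to double exactly, which is why no rounding step is needed on this path.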

Benchmark results

5-run averages, dedicated bare-metal servers, GCC -O3. Baseline = after EXP-001 (4-digit SWAR, merged as prior PR).

ARM — Graviton4 (AWS m8g.metal-24xl, 2.80 GHz)

| Dataset | baseline | this PR | Δ | vs fastfloat |
|---|---|---|---|---|
| random [0,1] | 1558 MB/s | 1820 MB/s | +17% | +67% ffc leads |
| canada.txt | 1331 MB/s | 1673 MB/s | +26% | +89% ffc leads |
| mesh.txt | 1019 MB/s | 1656 MB/s | +63% | +231% ffc leads |

x86 — Intel Xeon Platinum 8488C (AWS m7i.metal-24xl)

| Dataset | baseline | this PR | Δ | vs fastfloat |
|---|---|---|---|---|
| random [0,1] | 1736 MB/s | 2018 MB/s | +16% | ≈0% (tied) |
| canada.txt | 1412 MB/s | 1676 MB/s | +19% | +18% ffc leads |
| mesh.txt | 1073 MB/s | 1741 MB/s | +62% | +54% ffc leads |

No regressions on any dataset or architecture across all experiments.

Files changed

  • src/api.h: FFC_IMPL_INLINE macro + annotated declarations
  • src/ffc.h: extern FFC_IMPL_INLINE on double parse function definitions; FFC_ROUNDS_TO_NEAREST macro; early-exit for exponent == 0 with Clang zero guard; combined exponent range check; integer and fraction nested-if unrolls
  • src/parse.h: too_many_digits path uses locals; DSE-friendly restructure
  • test_src/test.c: guard double_rounds_to_nearest test under #ifndef FFC_ROUNDS_TO_NEAREST
  • ffc.h: regenerated amalgamation

Experiments designed and validated in ffc-agent-workspace — a profiled, evidence-based micro-optimization loop for ffc.h.

fcostaoliveira and others added 10 commits May 26, 2026 16:17
Numbers with 5–7 digit fractions (canada.txt, mesh.txt) never triggered
the 8-digit SWAR loop, falling back to 7 byte-by-byte iterations. A
4-digit SWAR follow-up converts those to 1×SWAR-4 + ≤3 byte-by-byte.

canada: +29% (gap vs fastfloat: -29% → -4%)
mesh:   +18% (gap vs fastfloat: -10% → ±0%)
random: no regression (high variance, within noise)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
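
For readers unfamiliar with the technique, a 4-digit SWAR step like the one named in the commit message above could look roughly like this. It is an illustrative sketch with hypothetical names, not the code from ffc.h. Bytes are assembled explicitly so the result does not depend on host endianness.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack four chars into a word, first char in the lowest byte. */
static inline uint32_t demo_read4(const char *p) {
    const unsigned char *u = (const unsigned char *)p;
    return (uint32_t)u[0] | ((uint32_t)u[1] << 8) |
           ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

/* True when all four bytes are ASCII digits: any byte above '9' or below
   '0' sets a high bit in one of the two terms. */
static inline bool demo_is_four_digits(uint32_t v) {
    return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

/* Convert four packed ASCII digits to their numeric value. */
static inline uint32_t demo_parse_four_digits(uint32_t v) {
    v -= 0x30303030u;      /* bytes are now d0, d1, d2, d3 */
    v = v * 10 + (v >> 8); /* byte0 = 10*d0 + d1, byte2 = 10*d2 + d3 */
    return (v & 0xFFu) * 100 + ((v >> 16) & 0xFFu);
}
```

A caller would check `demo_is_four_digits` first and only then fold `demo_parse_four_digits` into the running mantissa.
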
…uct stores

In ffc_parse_number_string, int_part_start/len and fraction_part_start/len
were stored to the ffc_parsed answer struct unconditionally. These fields were
then read back in the rare too_many_digits path — preventing GCC's DSE from
eliminating the stores in non-JSON callers where ISRA removes those fields
from the return ABI.

Restructure: hoist 'before' and 'frac_end_local' out of the has_decimal_point
block. In the too_many_digits path, use start_digits, end_of_integer_part,
before, and frac_end_local directly instead of round-tripping through the
answer struct. The stores to answer.int_part_*/fraction_part_* remain (needed
by ffc_parse_json_number callers), but are no longer read within the function,
so GCC ISRA + DSE eliminates them on non-JSON call sites.

Result: x86 canada +2.2% (1414 → 1445 MB/s, 5-run stable). ARM unchanged
(GCC's ARM backend already handled this via ISRA).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add conditional `__attribute__((always_inline)) inline` to the forward
declarations of `ffc_from_chars_double` and `ffc_from_chars_double_options`
via a new `FFC_IMPL_INLINE` macro.

When a translation unit defines `FFC_IMPL` before including `ffc.h` (the
documented pattern for the implementation TU), GCC's `ipa_early_inline`
pass sees the `always_inline` attribute on those declarations and inlines
both functions at every call site *before* IPA-CP runs. This prevents
GCC from creating a separate out-of-line constprop clone and eliminates
the per-call function-call overhead on every `ffc_from_chars_double` call.

The macro is defined as empty in non-FFC_IMPL TUs, so external linkage
and ABI compatibility for other users of the header are preserved (the
always_inline path does not emit an out-of-line symbol, but the existing
`ffc_parse_double` / `ffc_parse_float` wrappers remain as exported symbols).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace Apple Silicon numbers (2026-03-03) with 5-run averages from
dedicated bare-metal servers: Intel Xeon Platinum 8488C (m7i.metal-24xl)
and Graviton4 (m8g.metal-24xl), both GCC -O3.

Post-EXP-009 results: ffc leads or matches fastfloat on all datasets
on both architectures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…P-012)

Replace two signed comparisons (MIN <= e && e <= MAX) with a single unsigned
range check: (uint64_t)(e - MIN) <= (MAX - MIN). This collapses the two-branch
scattered layout into one branch with compact sequential code, matching
fast_float's approach.

ARM Graviton4: +4.6% mesh, +1.5% random, +1.0% canada.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After ffc_loop_parse_if_eight_digits returns, at most 3 consecutive digit bytes
remain. Replacing the while loop with straight-line nested ifs eliminates the
back-branch, yielding better IPC on out-of-order cores (+3.9% mesh, +2.1% canada
on ARM Graviton4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…1–4 digits

Eliminates back-branches for the common 1–4 digit integer case.
Falls back to while loop for 5+ digits.

ARM Graviton4: random +4.2% (1823→1900 MB/s), canada +7.4% (1562→1677 MB/s),
mesh +2.0% (1366→1394 MB/s). EXP-026.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the straight-line integer scan from 4 levels to 5 levels,
eliminating the while-loop back-branch for 5-digit integer parts
common in mesh.txt 3D vertex coordinates (e.g. "12345.678").

Results on ARM Graviton4 (m8g.metal-24xl, 2.80 GHz):
- mesh:   +7.9% (1394→1505 MB/s, 14.75→13.66 c/f)
- canada: +0.7% (1677→1688 MB/s, 29.02→28.84 c/f)
- random: +0.4% (1900→1908 MB/s, 30.87→30.75 c/f)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add compile-time macro FFC_ROUNDS_TO_NEAREST that makes
ffc_rounds_to_nearest() return a constant true, eliminating the 7-instruction
volatile-float FCMP chain entirely. Guard the rounding-mode test that
asserts ffc_rounds_to_nearest() == false under non-nearest modes.

Benchmark results on ARM Graviton4 (Neoverse V2):
  mesh.txt  +2.4% (1505 → 1541 MB/s, 100.86 → 93.86 i/f)
  canada    +1.8% (1688 → 1718 MB/s, 196.02 → 189.93 i/f)
  random    +0.8% (1908 → 1924 MB/s, 227.04 → 221.04 i/f)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a fast return path before the Clinger call for pure integers
(exponent==0, mantissa<=2^53). Skips the range check, mantissa check,
pow10 table load, and fmul with 1.0. Safe after EXP-030 eliminated
the volatile FCMP chain.

+12.9% mesh (83.92 i/f from 93.86), +1.1% canada, +0.4% random.
All unit and supplemental tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fcostaoliveira fcostaoliveira changed the title from "perf: force-inline ffc_from_chars_double via FFC_IMPL_INLINE + GCC DSE cleanup" to "Eight profiled micro-optimizations: +257% ARM mesh, +94% ARM canada, +80% ARM random" on May 27, 2026
Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() ==
FE_DOWNWARD. The fast path for exponent==0 did not have the same guard
that already existed in the non-nearest Clinger branch. Mirror the
existing #if __clang__ || FFC_32BIT guard.

Also revert readme.md: benchmark table update was not appropriate for
the upstream repository.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fcostaoliveira fcostaoliveira changed the title from "Eight profiled micro-optimizations: +257% ARM mesh, +94% ARM canada, +80% ARM random" to "Nine profiled micro-optimizations: +231% ARM mesh, +54% x86 mesh, +89% ARM canada" on May 27, 2026
@fcostaoliveira fcostaoliveira changed the title from "Nine profiled micro-optimizations: +231% ARM mesh, +54% x86 mesh, +89% ARM canada" to "Nine profiled micro-optimizations: +63% ARM mesh, +26% ARM canada, +17% ARM random vs baseline" on May 27, 2026
@kolemannix

Owner

👀

fcostaoliveira and others added 2 commits June 1, 2026 00:39
These accepted experiments (FFC_DIGIT_ACC10 shift-add asm for Clang/AArch64,
acc10 on the exponent accumulator, and the 2x unroll of
ffc_loop_parse_if_eight_digits) were applied to the working tree but never
committed to the submodule. Checkpoint them so the submodule tip reflects the
best accepted state and revert-on-reject is safe during the race.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b,
FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b,
vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the
ffc_value union and reads back the float member (4 bytes), producing a garbage
'theor' in the float negative-exponent digit-comparison (slow) path.

Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway
call and upstream fast_float (which templates the float type throughout).

Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS
FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow
paths are unchanged; only vk == FLOAT changes, and only toward correctness.
Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof
parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical.
Found via review of redis/hiredis#1328 (Cursor Bugbot).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
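
The kind mismatch described in the commit message above can be reproduced with a toy tagged union. The types and names below are hypothetical simplifications, not the actual ffc_value definition.

```c
typedef enum { DEMO_KIND_FLOAT, DEMO_KIND_DOUBLE } demo_kind;
typedef union { float f; double d; } demo_value;

/* Write through the member selected by the kind tag. */
static void demo_set(demo_value *v, demo_kind k, double x) {
    if (k == DEMO_KIND_FLOAT) {
        v->f = (float)x;
    } else {
        v->d = x;
    }
}

/* Read through the member selected by the kind tag. */
static double demo_get(const demo_value *v, demo_kind k) {
    return k == DEMO_KIND_FLOAT ? (double)v->f : v->d;
}
```

Writing with DEMO_KIND_DOUBLE and then reading with DEMO_KIND_FLOAT reinterprets four bytes of the double as a float, which is the garbage read the fix removes by passing the same kind to both calls.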

@kolemannix kolemannix left a comment

Owner

Looks great; requested some changes

Comment thread src/api.h Outdated
Comment thread ffc.h
Comment thread ffc.h Outdated
Comment thread ffc.h
*i = (*i * 100000000) +
     ffc_parse_eight_digits_unrolled_swar(ffc_read8_to_u64(*p));
// in rare cases, this will overflow, but that's ok
#if defined(__aarch64__) && defined(__clang__)

Owner

Can we just unconditionally do this? I hate the complexity of the compiler-specific flagging; if it's faster and more direct and correct, let's just do it this way. The branches can explode combinatorially and make things harder to optimize later.

Contributor Author

I hear you on the combinatorial branches, and I've tightened the comment so the gating is self-explanatory. It can't be made unconditional, though — the body is literal AArch64 asm. And on GCC/x86 the plain acc*10 + d already strength-reduces to add + lsl on its own; when I routed GCC through this asm it actually regressed canada (~-5.3%). So it's a single __aarch64__ && __clang__ guard with the portable expression as the default everywhere else — one axis, no real combinatorics. It's a profiled ~+9% on Clang/AArch64, but I'm happy to drop it for the plain macro if you'd rather not carry ARM asm in the tree — your call.
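
For context, the arithmetic identity behind the "add + lsl" form is shown below in portable C. This only illustrates the strength reduction; the PR's actual AArch64 inline asm is not reproduced here, and the function names are hypothetical.

```c
#include <stdint.h>

/* Plain form: compilers typically strength-reduce the multiply themselves. */
static inline uint64_t demo_acc10_plain(uint64_t acc, uint64_t d) {
    return acc * 10 + d;
}

/* Explicit shift-add form: acc + 4*acc = 5*acc, then doubled to 10*acc. */
static inline uint64_t demo_acc10_shift_add(uint64_t acc, uint64_t d) {
    return ((acc + (acc << 2)) << 1) + d;
}
```

Both forms wrap identically modulo 2^64, so they agree even when the accumulator overflows.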

@kolemannix kolemannix Jun 3, 2026

Owner

I mean ffc_loop_parse_if_eight_digits, the one that's conditional but seems to just be a platform-agnostic, unrolled version of the current implementation. Just move to that one.

Comment thread ffc.h Outdated
Comment thread ffc.h Outdated
Comment thread ffc.h Outdated
Comment thread ffc.h
Comment on lines +3156 to +3167
if (!pns.too_many_digits && pns.exponent == 0 &&
    pns.mantissa <= ffc_const(vk, MAX_MANTISSA_FAST_PATH)) {
#if defined(__clang__) || defined(FFC_32BIT)
  if (pns.mantissa == 0) {
    ffc_set_value(value, vk, pns.negative ? -0. : 0.);
    return answer;
  }
#endif
  ffc_set_value(value, vk, pns.mantissa);
  if (pns.negative) { ffc_set_value(value, vk, -ffc_read_value(value, vk)); }
  return answer;
}

Owner

Looks like a bugfix? Just making a note to review more closely.

Contributor Author

Good eye — two things are colocated here. The -0.0-under-FE_DOWNWARD guard is a genuine bugfix (under Clang/32-bit a zero mantissa was mapped to -0.0 for negative inputs in downward rounding mode; caught by the supplemental corpus) and is already its own commit, 43e22b3, separate from the exponent==0 fast-path early-exit (c88481a). You can review the bugfix hunk standalone there — and I'm happy to cherry-pick it into its own small precursor PR so it can merge ahead of the perf work if that eases review. Just say the word.

Comment thread ffc.h
fcostaoliveira and others added 2 commits June 3, 2026 12:52
…t cleanups

- Hoist the ffc_inline three-way ladder into api.h above FFC_IMPL_INLINE and
  collapse FFC_IMPL_INLINE to `#define FFC_IMPL_INLINE ffc_inline`; guard the
  duplicate definition in common.h with `#ifndef ffc_inline`. The two macros
  were byte-identical.
- readme: add a Configuration Macros section documenting FFC_IMPL and
  FFC_ROUNDS_TO_NEAREST (with the don't-define-if-you-change-rounding caveat).
- common.h: apply the suggested wording for the ffc_rounds_to_nearest() comment.
- parse.h: rewrite the integer-scan, fraction-unroll, hoisted-locals and
  too_many_digits comments to describe the current code rather than the diff
  that introduced them; tighten the AArch64/Clang acc10 inline-asm comment so
  the single-axis gating is self-explanatory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…/force-inline-ffc-impl

Resolves the ffc_loop_parse_if_eight_digits conflict by keeping both changes:
- our Clang/AArch64 manual 2x (16-digit) unroll of the SWAR loop, and
- upstream's new 4-digit follow-up block for sub-8-digit remainders.
The follow-up sits after the #if/#else digit loop, so it benefits both the
Clang/AArch64 unrolled path and the GCC/portable while-loop path. ffc.h
regenerated; unit + supplemental tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@fcostaoliveira fcostaoliveira requested a review from kolemannix June 3, 2026 16:01
@fcostaoliveira

Contributor Author

@kolemannix — review comments above are all addressed. Here's the validation on the current branch tip (after merging latest main, including the 4-digit SWAR follow-up), across 4 Intel generations + ARM Graviton4.

Correctness

Full 2³² binary32 exhaustive — all ok on every box:

| Box | Compiler | Path | Result |
|---|---|---|---|
| Cascade Lake / Ice Lake / Emerald Rapids / Granite Rapids | gcc 11 | x86 (#else) + 4-digit follow-up | all ok |
| ARM Graviton4 | clang 18.1.3 | AArch64 2× unroll + 4-digit follow-up | all ok |

Performance

ffc-only microbench, single-core pinned, best-of-9 trials. Baseline = current main, PR = this branch tip. Throughput in MB/s, with % delta:

| Env | dataset | base MB/s | PR MB/s | Δ |
|---|---|---|---|---|
| Cascade Lake (gcc) | random | 550.0 | 682.2 | +24.0% |
| Cascade Lake (gcc) | mesh | 356.8 | 577.7 | +61.9% |
| Cascade Lake (gcc) | canada | 533.6 | 702.8 | +31.7% |
| Ice Lake (gcc) | random | 848.7 | 1054.4 | +24.2% |
| Ice Lake (gcc) | mesh | 607.7 | 988.3 | +62.6% |
| Ice Lake (gcc) | canada | 871.7 | 1091.7 | +25.2% |
| Emerald Rapids (gcc) | random | 1226.8 | 1491.5 | +21.6% |
| Emerald Rapids (gcc) | mesh | 904.5 | 1517.4 | +67.7% |
| Emerald Rapids (gcc) | canada | 1363.0 | 1712.8 | +25.7% |
| Granite Rapids (gcc) | random | 1240.3 | 1539.3 | +24.1% |
| Granite Rapids (gcc) | mesh | 926.7 | 1532.9 | +65.4% |
| Granite Rapids (gcc) | canada | 1352.3 | 1762.5 | +30.3% |
| Graviton4 (clang) | random | 1252.4 | 1529.1 | +22.1% |
| Graviton4 (clang) | mesh | 829.6 | 1373.1 | +65.5% |
| Graviton4 (clang) | canada | 1063.6 | 1288.4 | +21.1% |

Faster on every environment and dataset, no regressions: random +21.6–24.2%, mesh +61.9–67.7%, canada +21.1–31.7%. This is built without FFC_ROUNDS_TO_NEAREST, so it's the conservative number — defining it picks up a bit more on the fast path.

Comment thread src/parse.h
Comment on lines 199 to +202
ffc_internal ffc_inline void
ffc_loop_parse_if_eight_digits(char const **p, char const *const pend,
                               uint64_t* i) {
  // optimizes better than parse_if_eight_digits_unrolled() for char.
#if defined(__aarch64__) && defined(__clang__)

Owner

This is the one that I'd like to make unconditional, if we can, just to have fewer paths to maintain.

Contributor Author

I tried this — unfortunately the gate turns out to be load-bearing rather than stylistic. I A/B'd the unconditional version (unrolled for all platforms) vs the current gated one: ffc-only microbench, single-core pinned, best-of-11, main of this branch.

| env | random | mesh | canada |
|---|---|---|---|
| Cascade Lake (gcc 11) | −4.9% | −7.7% | −4.6% |
| Ice Lake (gcc 11) | −0.3% | −3.6% | −2.3% |
| Granite Rapids (gcc 11) | −0.8% | −7.0% | −4.5% |
| Graviton4 (clang 18) | +0.4% | +0.2% | +0.1% |

(Δ = unconditional vs gated.) GCC auto-unrolls the plain while loop better than the hand-written 16-digit unroll, so forcing the unroll everywhere costs up to ~8% on GCC/x86 short floats; on Clang/AArch64 it's a wash since that path is already unrolled. The two branches exist because the two compilers genuinely want different code here.

Given the ~4–8% GCC/x86 regression I'd prefer to keep the gate — but I'm happy to take the hit and collapse to one path if you'd still rather have fewer branches to maintain. Your call.

fcostaoliveira added a commit to redis-performance/ffc.h that referenced this pull request Jun 9, 2026
Same change as perf/outline-slow-resolve but rebased on kolemannix/main. GCC
outlines ffc_resolve_slow (hot frame -24%); Clang keeps the inline form
(byte-identical baseline). Bit-identical (gcc+clang) + exhaustive all-ok.
NOTE: vs bare main this regresses canada -2.9% (gcc) because main lacks kolemannix#24's
exp==0 early-exit; clean only stacked on kolemannix#24. HELD — see HANDOFF.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kolemannix

Owner

Some conflicts to resolve now.

@fcostaoliveira

Contributor Author

@kolemannix — conflicts resolved. Merged latest main (the squashed #25/#26 + new test cases) into the branch; the only net change vs the prior tip is the 4 new test-case lines in test_src/float_cases.csv — the digit-loop logic is byte-identical to what you reviewed. src/parse.h resolutions: kept the integer-scan unroll and the 1-3 digit fraction unroll (this PR's content), took main for the already-merged acc10 bits, and regenerated ffc.h.

Green locally: Stage 1 unit tests + Stage 2 supplemental corpus all pass. PR shows MERGEABLE again. Ready for another look.

fcostaoliveira added a commit to redis-performance/ffc-agent-workspace that referenced this pull request Jun 15, 2026
Resolves the merge conflict on kolemannix/ffc.h#24. Pointer tracks
origin/perf/force-inline-ffc-impl; logic byte-identical to the prior
reviewed tip plus 4 upstream test-case lines. Stage 1 + Stage 2
correctness green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>