Nine profiled micro-optimizations: +63% ARM mesh, +26% ARM canada, +17% ARM random vs baseline #24
Numbers with 5–7 digit fractions (canada.txt, mesh.txt) never triggered the 8-digit SWAR loop, falling back to as many as 7 byte-by-byte iterations. A 4-digit SWAR follow-up converts those to 1×SWAR-4 + ≤3 byte-by-byte.
- canada: +29% (gap vs fastfloat: -29% → -4%)
- mesh: +18% (gap vs fastfloat: -10% → ±0%)
- random: no regression (high variance, within noise)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
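The 4-digit follow-up can be sketched as below. This is a minimal illustration of the SWAR idea, not the code from the commit: every `demo_` name is hypothetical, and the bytes are assembled explicitly so the sketch does not depend on host endianness.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: pack 4 bytes so that p[0] is the least significant
   byte, independent of host byte order. */
static uint32_t demo_read4_le(const char *p) {
    const unsigned char *u = (const unsigned char *)p;
    return (uint32_t)u[0] | ((uint32_t)u[1] << 8) |
           ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

/* True when every byte of v is an ASCII digit '0'..'9'. A byte above '9'
   sets its high bit in the sum; a byte below '0' sets it in the difference. */
static bool demo_is_four_digits(uint32_t v) {
    return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

/* Convert four validated ASCII digits to their value (0..9999). */
static uint32_t demo_parse_four_digits(uint32_t v) {
    v -= 0x30303030u;         /* bytes are now d0, d1, d2, d3 */
    v = (v * 10) + (v >> 8);  /* byte 0 = 10*d0 + d1, byte 2 = 10*d2 + d3 */
    return (v & 0xFFu) * 100 + ((v >> 16) & 0xFFu);
}

/* Consume one 4-digit group if present; returns the advanced pointer. */
static const char *demo_parse_if_four_digits(const char *p, const char *pend,
                                             uint64_t *acc) {
    if (pend - p >= 4) {
        uint32_t v = demo_read4_le(p);
        if (demo_is_four_digits(v)) {
            *acc = *acc * 10000 + demo_parse_four_digits(v);
            p += 4;
        }
    }
    return p;
}
```

After this step at most three digits can remain before the next non-digit, which is what makes a short byte-by-byte tail sufficient.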
…uct stores
In ffc_parse_number_string, int_part_start/len and fraction_part_start/len were stored to the ffc_parsed answer struct unconditionally. These fields were then read back in the rare too_many_digits path, preventing GCC's DSE from eliminating the stores in non-JSON callers where ISRA removes those fields from the return ABI.
Restructure: hoist 'before' and 'frac_end_local' out of the has_decimal_point block. In the too_many_digits path, use start_digits, end_of_integer_part, before, and frac_end_local directly instead of round-tripping through the answer struct. The stores to answer.int_part_*/fraction_part_* remain (needed by ffc_parse_json_number callers), but are no longer read within the function, so GCC ISRA + DSE eliminates them on non-JSON call sites.
Result: x86 canada +2.2% (1414 → 1445 MB/s, 5-run stable). ARM unchanged (GCC's ARM backend already handled this via ISRA).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add conditional `__attribute__((always_inline)) inline` to the forward declarations of `ffc_from_chars_double` and `ffc_from_chars_double_options` via a new `FFC_IMPL_INLINE` macro. When a translation unit defines `FFC_IMPL` before including `ffc.h` (the documented pattern for the implementation TU), GCC's `ipa_early_inline` pass sees the `always_inline` attribute on those declarations and inlines both functions at every call site *before* IPA-CP runs. This prevents GCC from creating a separate out-of-line constprop clone and eliminates the per-call function-call overhead on every `ffc_from_chars_double` call. The macro is defined as empty in non-FFC_IMPL TUs, so external linkage and ABI compatibility for other users of the header are preserved (the always_inline path does not emit an out-of-line symbol, but the existing `ffc_parse_double` / `ffc_parse_float` wrappers remain as exported symbols). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
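A rough sketch of the macro pattern described above. The expansion is taken from the commit message; the exact preprocessor condition and the `demo_parse_digit` function are assumptions made for illustration only.

```c
/* Assumption: the implementation TU defines FFC_IMPL before the header. */
#define FFC_IMPL

/* Assumed gating; the commit only states what the macro expands to. */
#if defined(FFC_IMPL) && (defined(__GNUC__) || defined(__clang__))
#define FFC_IMPL_INLINE __attribute__((always_inline)) inline
#else
#define FFC_IMPL_INLINE /* empty: ordinary external declaration */
#endif

/* Hypothetical stand-in for a parser entry point declared in the header. */
FFC_IMPL_INLINE int demo_parse_digit(char c, int *out);

/* Definition in the implementation section, carrying the same macro so the
   compiler sees always_inline on both the declaration and the body. */
extern FFC_IMPL_INLINE int demo_parse_digit(char c, int *out) {
    if (c < '0' || c > '9') {
        return 0; /* not a digit */
    }
    *out = c - '0';
    return 1;
}
```

In a TU that does not define `FFC_IMPL`, the macro is empty and the same declaration reads as a plain external function.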
Replace Apple Silicon numbers (2026-03-03) with 5-run averages from dedicated bare-metal servers: Intel Xeon Platinum 8488C (m7i.metal-24xl) and Graviton4 (m8g.metal-24xl), both GCC -O3. Post-EXP-009 results: ffc leads or matches fastfloat on all datasets on both architectures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…P-012) Replace two signed comparisons (MIN <= e && e <= MAX) with a single unsigned range check: (uint64_t)(e - MIN) <= (MAX - MIN). This collapses the two-branch scattered layout into one branch with compact sequential code, matching fast_float's approach. ARM Graviton4: +4.6% mesh, +1.5% random, +1.0% canada. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
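The equivalence of the two forms can be checked with a small sketch. The bounds -22 and 22 are assumed here (they match the `+ 22 <= 44` form quoted in the PR description); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bounds for illustration. */
#define DEMO_MIN_EXP (-22)
#define DEMO_MAX_EXP 22

/* Original shape: two signed comparisons. */
static bool demo_in_range_two_compares(int64_t e) {
    return DEMO_MIN_EXP <= e && e <= DEMO_MAX_EXP;
}

/* Combined shape: shift the range to start at zero, then compare once.
   The subtraction is done in unsigned arithmetic so that values below MIN
   wrap to huge numbers instead of overflowing a signed type. */
static bool demo_in_range_one_compare(int64_t e) {
    uint64_t shifted = (uint64_t)e - (uint64_t)(int64_t)DEMO_MIN_EXP;
    return shifted <= (uint64_t)(DEMO_MAX_EXP - DEMO_MIN_EXP);
}
```

Any `e` below the minimum wraps to a value far above 44, so the single unsigned comparison rejects both out-of-range sides.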
After ffc_loop_parse_if_eight_digits returns, at most 3 consecutive digit bytes remain. Replacing the while loop with straight-line nested ifs eliminates the back-branch, yielding better IPC on out-of-order cores (+3.9% mesh, +2.1% canada on ARM Graviton4). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
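A minimal sketch of the straight-line shape next to the loop it replaces. Names are hypothetical, and the unrolled variant is only equivalent under the precondition stated in the commit, that at most 3 digit bytes remain.

```c
#include <stdint.h>

static int demo_is_digit(char c) {
    return (unsigned char)(c - '0') <= 9;
}

/* Reference: the loop form, with a back-branch per digit. */
static const char *demo_scan_tail_loop(const char *p, const char *pend,
                                       uint64_t *acc) {
    while (p != pend && demo_is_digit(*p)) {
        *acc = *acc * 10 + (uint64_t)(*p - '0');
        ++p;
    }
    return p;
}

/* Straight-line form: three nested ifs, no back-branch.
   Precondition (assumed): no more than 3 digits can remain. */
static const char *demo_scan_tail_unrolled(const char *p, const char *pend,
                                           uint64_t *acc) {
    if (p != pend && demo_is_digit(*p)) {
        *acc = *acc * 10 + (uint64_t)(*p - '0');
        ++p;
        if (p != pend && demo_is_digit(*p)) {
            *acc = *acc * 10 + (uint64_t)(*p - '0');
            ++p;
            if (p != pend && demo_is_digit(*p)) {
                *acc = *acc * 10 + (uint64_t)(*p - '0');
                ++p;
            }
        }
    }
    return p;
}
```

The integer-part variants in the following commits use the same shape with more levels, plus a loop fallback for longer runs.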
…1–4 digits Eliminates back-branches for the common 1–4 digit integer case. Falls back to while loop for 5+ digits. ARM Graviton4: random +4.2% (1823→1900 MB/s), canada +7.4% (1562→1677 MB/s), mesh +2.0% (1366→1394 MB/s). EXP-026. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the straight-line integer scan from 4 levels to 5 levels, eliminating the while-loop back-branch for 5-digit integer parts common in mesh.txt 3D vertex coordinates (e.g. "12345.678"). Results on ARM Graviton4 (m8g.metal-24xl, 2.80 GHz):
- mesh: +7.9% (1394→1505 MB/s, 14.75→13.66 c/f)
- canada: +0.7% (1677→1688 MB/s, 29.02→28.84 c/f)
- random: +0.4% (1900→1908 MB/s, 30.87→30.75 c/f)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add compile-time macro FFC_ROUNDS_TO_NEAREST that makes ffc_rounds_to_nearest() return a constant true, eliminating the 7-instruction volatile-float FCMP chain entirely. Guard the rounding-mode test that asserts ffc_rounds_to_nearest() == false under non-nearest modes. Benchmark results on ARM Graviton4 (Neoverse V2):
- mesh.txt +2.4% (1505 → 1541 MB/s, 100.86 → 93.86 i/f)
- canada +1.8% (1688 → 1718 MB/s, 196.02 → 189.93 i/f)
- random +0.8% (1908 → 1924 MB/s, 227.04 → 221.04 i/f)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
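To make the trade-off concrete, here is a sketch of a runtime rounding probe next to a compile-time override. The probe follows the volatile-float comparison idiom used by fast_float; `DEMO_ROUNDS_TO_NEAREST` and both function names are hypothetical, and how the real macro is gated is an assumption.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical switch: 1 = assume round-to-nearest at compile time. */
#ifndef DEMO_ROUNDS_TO_NEAREST
#define DEMO_ROUNDS_TO_NEAREST 1
#endif

/* Runtime probe. The volatile load stops the compiler from folding the
   comparison, so the result reflects the rounding mode in effect.
   Under round-to-nearest both sides equal 1.0f; under any directed mode
   one side moves by one ulp and the comparison fails. */
static bool demo_probe_rounds_to_nearest(void) {
    static volatile float fmin_v = FLT_MIN;
    float fmini = fmin_v;
    return (fmini + 1.0f == 1.0f - fmini);
}

/* With the switch on, callers see a constant and the probe vanishes. */
static bool demo_rounds_to_nearest(void) {
#if DEMO_ROUNDS_TO_NEAREST
    return true;
#else
    return demo_probe_rounds_to_nearest();
#endif
}
```

The constant form is only correct for programs that never change the rounding mode, which is why the rounding-mode test has to be guarded when the macro is active.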
Adds a fast return path before the Clinger call for pure integers (exponent==0, mantissa<=2^53). Skips the range check, mantissa check, pow10 table load, and fmul with 1.0. Safe after EXP-030 eliminated the volatile FCMP chain. +12.9% mesh (83.92 i/f from 93.86), +1.1% canada, +0.4% random. All unit and supplemental tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() == FE_DOWNWARD. The fast path for exponent==0 did not have the same guard that already existed in the non-nearest Clinger branch. Mirror the existing #if __clang__ || FFC_32BIT guard. Also revert readme.md: benchmark table update was not appropriate for the upstream repository. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
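The early exit plus the zero guard can be sketched as a standalone function. This is an illustration with hypothetical names; the threshold 2^53 is the double fast-path limit mentioned in the previous commit, and the zero branch is applied unconditionally here, whereas the source compiles it only under `__clang__` or `FFC_32BIT`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest mantissa that converts to double exactly (2^53). */
#define DEMO_MAX_MANTISSA_FAST_PATH (UINT64_C(1) << 53)

/* Returns true and writes *out when the input is a plain integer that
   can be converted directly; returns false to defer to the general path. */
static bool demo_try_integer_fast_path(uint64_t mantissa, int64_t exponent,
                                       bool negative, bool too_many_digits,
                                       double *out) {
    if (too_many_digits || exponent != 0 ||
        mantissa > DEMO_MAX_MANTISSA_FAST_PATH) {
        return false;
    }
    if (mantissa == 0) {
        /* Write the signed zero as a literal instead of relying on the
           integer-to-double conversion under a directed rounding mode. */
        *out = negative ? -0.0 : 0.0;
        return true;
    }
    double v = (double)mantissa; /* exact: mantissa <= 2^53 */
    *out = negative ? -v : v;
    return true;
}
```

Usage: a caller would try this before the Clinger path and fall through whenever it returns false.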
These accepted experiments (FFC_DIGIT_ACC10 shift-add asm for Clang/AArch64, acc10 on the exponent accumulator, and the 2x unroll of ffc_loop_parse_if_eight_digits) were applied to the working tree but never committed to the submodule. Checkpoint them so the submodule tip reflects the best accepted state and revert-on-reject is safe during the race. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b, FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b, vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the ffc_value union and reads back the float member (4 bytes), producing a garbage 'theor' in the float negative-exponent digit-comparison (slow) path. Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway call and upstream fast_float (which templates the float type throughout). Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow paths are unchanged; only vk == FLOAT changes, and only toward correctness. Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical. Found via review of redis/hiredis#1328 (Cursor Bugbot). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
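The failure mode is easy to reproduce in miniature. The union and helpers below are hypothetical stand-ins for `ffc_value` and its accessors, meant only to show why the write and the read must agree on the value kind.

```c
typedef enum { DEMO_KIND_FLOAT, DEMO_KIND_DOUBLE } demo_kind;

/* Stand-in for a value union holding either a float or a double. */
typedef union {
    float f;
    double d;
} demo_value;

/* Store x into the member selected by vk. */
static void demo_set_value(demo_value *v, demo_kind vk, double x) {
    if (vk == DEMO_KIND_FLOAT) {
        v->f = (float)x;
    } else {
        v->d = x;
    }
}

/* Read back the member selected by vk, widened to double. */
static double demo_read_value(const demo_value *v, demo_kind vk) {
    return vk == DEMO_KIND_FLOAT ? (double)v->f : v->d;
}
```

Writing with `DEMO_KIND_DOUBLE` and reading with `DEMO_KIND_FLOAT` reinterprets four of the double's eight bytes as a float, so the value read back is unrelated to the value written; passing the same kind to both calls round-trips correctly.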
kolemannix
left a comment
Looks great; requested some changes
    *i = (*i * 100000000) +
         ffc_parse_eight_digits_unrolled_swar(ffc_read8_to_u64(*p));
    // in rare cases, this will overflow, but that's ok
    #if defined(__aarch64__) && defined(__clang__)
Can we just unconditionally do this? I hate the complexity of the compiler-specific flagging; if it's faster and more direct and correct, let's just do it this way. The branches can explode combinatorially and make things harder to optimize later.
I hear you on the combinatorial branches, and I've tightened the comment so the gating is self-explanatory. It can't be made unconditional, though — the body is literal AArch64 asm. And on GCC/x86 the plain acc*10 + d already strength-reduces to add + lsl on its own; when I routed GCC through this asm it actually regressed canada (~-5.3%). So it's a single __aarch64__ && __clang__ guard with the portable expression as the default everywhere else — one axis, no real combinatorics. It's a profiled ~+9% on Clang/AArch64, but I'm happy to drop it for the plain macro if you'd rather not carry ARM asm in the tree — your call.
I mean ffc_loop_parse_if_eight_digits, the one that's conditional but seems to just do a platform-agnostic, unrolled version of the current implementation. Just move to that one.
    if (!pns.too_many_digits && pns.exponent == 0 &&
        pns.mantissa <= ffc_const(vk, MAX_MANTISSA_FAST_PATH)) {
    #if defined(__clang__) || defined(FFC_32BIT)
      if (pns.mantissa == 0) {
        ffc_set_value(value, vk, pns.negative ? -0. : 0.);
        return answer;
      }
    #endif
      ffc_set_value(value, vk, pns.mantissa);
      if (pns.negative) { ffc_set_value(value, vk, -ffc_read_value(value, vk)); }
      return answer;
    }
Looks like a bugfix? Just making a note to review more closely.
Good eye — two things are colocated here. The -0.0-under-FE_DOWNWARD guard is a genuine bugfix (under Clang/32-bit a zero mantissa was mapped to -0.0 for negative inputs in downward rounding mode; caught by the supplemental corpus) and is already its own commit, 43e22b3, separate from the exponent==0 fast-path early-exit (c88481a). You can review the bugfix hunk standalone there — and I'm happy to cherry-pick it into its own small precursor PR so it can merge ahead of the perf work if that eases review. Just say the word.
…t cleanups - Hoist the ffc_inline three-way ladder into api.h above FFC_IMPL_INLINE and collapse FFC_IMPL_INLINE to `#define FFC_IMPL_INLINE ffc_inline`; guard the duplicate definition in common.h with `#ifndef ffc_inline`. The two macros were byte-identical. - readme: add a Configuration Macros section documenting FFC_IMPL and FFC_ROUNDS_TO_NEAREST (with the don't-define-if-you-change-rounding caveat). - common.h: apply the suggested wording for the ffc_rounds_to_nearest() comment. - parse.h: rewrite the integer-scan, fraction-unroll, hoisted-locals and too_many_digits comments to describe the current code rather than the diff that introduced them; tighten the AArch64/Clang acc10 inline-asm comment so the single-axis gating is self-explanatory. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…/force-inline-ffc-impl Resolves the ffc_loop_parse_if_eight_digits conflict by keeping both changes: - our Clang/AArch64 manual 2x (16-digit) unroll of the SWAR loop, and - upstream's new 4-digit follow-up block for sub-8-digit remainders. The follow-up sits after the #if/#else digit loop, so it benefits both the Clang/AArch64 unrolled path and the GCC/portable while-loop path. ffc.h regenerated; unit + supplemental tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kolemannix, review comments above are all addressed. Here's the validation on the current branch tip (after merging latest …)
Correctness
Full 2³² binary32 exhaustive …
Performance
ffc-only microbench, single-core pinned, best-of-9 trials. Baseline = current …
Faster on every environment and dataset, no regressions: random +21.6–24.2%, mesh +61.9–67.7%, canada +21.1–31.7%. This is built without …
    ffc_internal ffc_inline void
    ffc_loop_parse_if_eight_digits(char const **p, char const *const pend,
                                   uint64_t* i) {
      // optimizes better than parse_if_eight_digits_unrolled() for char.
    #if defined(__aarch64__) && defined(__clang__)
This is the one that I'd like to make unconditional, if we can, just to have fewer paths to maintain.
I tried this — unfortunately the gate turns out to be load-bearing rather than stylistic. I A/B'd the unconditional version (unrolled for all platforms) vs the current gated one: ffc-only microbench, single-core pinned, best-of-11, main of this branch.
| env | random | mesh | canada |
|---|---|---|---|
| Cascade Lake (gcc 11) | −4.9% | −7.7% | −4.6% |
| Ice Lake (gcc 11) | −0.3% | −3.6% | −2.3% |
| Granite Rapids (gcc 11) | −0.8% | −7.0% | −4.5% |
| Graviton4 (clang 18) | +0.4% | +0.2% | +0.1% |
(Δ = unconditional vs gated.) GCC auto-unrolls the plain while loop better than the hand-written 16-digit unroll, so forcing the unroll everywhere costs up to ~8% on GCC/x86 short floats; on Clang/AArch64 it's a wash since that path already unrolled. The two branches exist because the two compilers genuinely want different code here.
Given the ~4–8% GCC/x86 regression I'd prefer to keep the gate — but I'm happy to take the hit and collapse to one path if you'd still rather have fewer branches to maintain. Your call.
Same change as perf/outline-slow-resolve but rebased on kolemannix/main. GCC outlines ffc_resolve_slow (hot frame -24%); Clang keeps the inline form (byte-identical baseline). Bit-identical (gcc+clang) + exhaustive all-ok. NOTE: vs bare main this regresses canada -2.9% (gcc) because main lacks kolemannix#24's exp==0 early-exit; clean only stacked on kolemannix#24. HELD — see HANDOFF.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Some conflicts to resolve now.
…fc-impl # Conflicts: # ffc.h # src/parse.h
@kolemannix, conflicts resolved. Merged latest … Green locally: Stage 1 unit tests + Stage 2 supplemental corpus all pass. PR shows …
Resolves the merge conflict on kolemannix/ffc.h#24. Pointer tracks origin/perf/force-inline-ffc-impl; logic byte-identical to the prior reviewed tip plus 4 upstream test-case lines. Stage 1 + Stage 2 correctness green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Nine changes from two days of profiled, evidence-based optimization on bare-metal ARM Graviton4 (m8g.metal-24xl) and x86 Xeon (m7i.metal-24xl). Each change was validated with `perf record -g` before acceptance.

Changes

1. `FFC_IMPL_INLINE`: force-inline `ffc_from_chars_double` at call sites (EXP-009)

GCC's IPA-CP pass was creating an out-of-line constprop clone (`ffc_from_chars_double_options.constprop.0.isra.0`) and calling it as a regular function on every float. The C++ `fast_float` reference gets full inlining via templates; `ffc_from_chars_double`'s external linkage prevented the equivalent.

`FFC_IMPL_INLINE` expands to `__attribute__((always_inline)) inline` in FFC_IMPL translation units, and to nothing otherwise. Applying it to the forward declarations in `api.h` causes GCC's `ipa_early_inline` pass to inline both functions at call sites, eliminating per-float call overhead.

ABI note: `always_inline` suppresses the out-of-line symbol in FFC_IMPL TUs; the `ffc_parse_double` / `ffc_parse_float` wrappers remain exported for users who don't define `FFC_IMPL`.

2. Local vars in `too_many_digits` path: GCC DSE of struct stores (EXP-006)

Four fields (`int_part_*`, `fraction_part_*`) were stored to the `ffc_parsed` struct unconditionally then read back in the rare `too_many_digits` recovery path. This prevented GCC from dead-store-eliminating those struct stores on the non-JSON constprop clone. Restructuring to use locals (which GCC sees as registers) enables DSE. ARM GCC already handled this via more aggressive ISRA.

3. Combined exponent range check in `ffc_clinger_fast_path_impl` (EXP-012)

Clinger's fast path checked `exponent < 0` and `exponent > power_of_ten_count` with two separate branches. Combining into a single unsigned comparison (`(uint64_t)(exponent + 22) <= 44`) saves one branch and one comparison instruction on the hot path.

4. Unroll fraction digit loop: 3 nested ifs replace while loop (EXP-015)

The fraction part scanner used a while loop. For inputs where the fraction is ≤3 bytes (common in real-world coordinates), a 3-level nested-if eliminates the loop back-branch and allows GCC to hoist the loop-exit condition out.

5. Straight-line integer scan: nested-ifs replace while loop for 1–4 digits (EXP-026)

Same pattern applied to the integer part. Most real-world floats have 1–4 integer digits; 4-level nested-if eliminates the loop back-branch and exposes more ILP.

6. Extend integer nested-ifs to 5 levels (EXP-028)

Profile showed 5-digit integer parts are common in `mesh.txt` (3D coordinates like `12345.678`). Extending the nested-if from 4→5 levels eliminates the while-loop back-branch for those cases.

7. `FFC_ROUNDS_TO_NEAREST` compile-time macro: eliminate FCMP chain (EXP-030)

`ffc_rounds_to_nearest()` read the FP control word at runtime via a volatile `double` store/load round-trip. This 7-instruction sequence appeared on the hot path of every call to `ffc_clinger_fast_path_impl`. `FFC_ROUNDS_TO_NEAREST` (default: 1) lets the compiler see a compile-time constant, reducing the entire check to a single `true` branch with zero instructions. Can be set to 0 to preserve runtime detection.

8. Early exit for `exponent == 0` in `ffc_from_chars_advanced` (EXP-033)

~55% of `mesh.txt` lines are pure integers (no fractional part, exponent 0). Previously, each was routed through Clinger's full path: range check, `pow10` table load, `scvtf`, and `fmul × 1.0`. With `FFC_ROUNDS_TO_NEAREST` eliminating the FCMP chain, a pre-Clinger guard is safe to add. The early exit converts the integer mantissa directly and returns, saving ~10–16 instructions per integer value.

9. Fix: zero sign under `FE_DOWNWARD` in exponent==0 fast path

Clang may convert `(double)(uint64_t)0` to `-0.0` when `fegetround() == FE_DOWNWARD`. The fast exit for integers with exponent==0 did not have the guard that already exists in the non-nearest-rounding Clinger branch. Mirrored the existing `#if defined(__clang__) || defined(FFC_32BIT)` guard. Caught by the supplemental test suite.

Benchmark results

5-run averages, dedicated bare-metal servers, GCC `-O3`. Baseline = after EXP-001 (4-digit SWAR, merged as prior PR).

ARM: Graviton4 (AWS m8g.metal-24xl, 2.80 GHz)

x86: Intel Xeon Platinum 8488C (AWS m7i.metal-24xl)

No regressions on any dataset or architecture across all experiments.

Files changed

- `src/api.h`: `FFC_IMPL_INLINE` macro + annotated declarations
- `src/ffc.h`: `extern FFC_IMPL_INLINE` on double parse function definitions; `FFC_ROUNDS_TO_NEAREST` macro; early-exit for `exponent == 0` with Clang zero guard; combined exponent range check; integer and fraction nested-if unrolls
- `src/parse.h`: `too_many_digits` path uses locals; DSE-friendly restructure
- `test_src/test.c`: guard `double_rounds_to_nearest` test under `#ifndef FFC_ROUNDS_TO_NEAREST`
- `ffc.h`: regenerated amalgamation

Experiments designed and validated in ffc-agent-workspace, a profiled, evidence-based micro-optimization loop for ffc.h.