Nine profiled micro-optimizations: +63% ARM mesh, +26% ARM canada, +17% ARM random vs baseline by fcostaoliveira · Pull Request #24 · kolemannix/ffc.h

fcostaoliveira · 2026-05-26T20:22:05Z

Nine changes from two days of profiled, evidence-based optimization on bare-metal ARM Graviton4 (m8g.metal-24xl) and x86 Xeon (m7i.metal-24xl). Each change was validated with perf record -g before acceptance.

Changes

1. `FFC_IMPL_INLINE` — force-inline `ffc_from_chars_double` at call sites (EXP-009)

GCC's IPA-CP pass was creating an out-of-line constprop clone (ffc_from_chars_double_options.constprop.0.isra.0) and calling it as a regular function on every float. The C++ fast_float reference gets full inlining via templates; ffc_from_chars_double's external linkage prevented the equivalent.

FFC_IMPL_INLINE expands to __attribute__((always_inline)) inline in FFC_IMPL translation units, and to nothing otherwise. Applying it to the forward declarations in api.h causes GCC's ipa_early_inline pass to inline both ffc_from_chars_double and ffc_from_chars_double_options at call sites, eliminating per-float call overhead.

ABI note: always_inline suppresses the out-of-line symbol in FFC_IMPL TUs; the ffc_parse_double / ffc_parse_float wrappers remain exported for users who don't define FFC_IMPL.
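A minimal sketch of the macro shape described above, assuming a GCC or Clang toolchain. The `demo_*` functions are hypothetical and not part of ffc.h, and the `static` qualifier is only there so the sketch links as a single file; the PR itself applies the macro to externally declared functions.

```c
/* Hypothetical stand-in for the implementation TU, which defines FFC_IMPL
   before including ffc.h. */
#define FFC_IMPL

#if defined(FFC_IMPL) && (defined(__GNUC__) || defined(__clang__))
#define FFC_IMPL_INLINE __attribute__((always_inline)) inline
#else
#define FFC_IMPL_INLINE /* expands to nothing: ordinary linkage is kept */
#endif

/* demo_parse_digit is illustrative only; it returns -1 for non-digits. */
static FFC_IMPL_INLINE int demo_parse_digit(char c) {
  return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* With FFC_IMPL_INLINE active, the call in this loop is inlined
   instead of being emitted as a function call per character. */
static int demo_sum_digits(const char *s) {
  int sum = 0;
  for (; *s != '\0'; ++s) {
    int d = demo_parse_digit(*s);
    if (d >= 0) {
      sum += d;
    }
  }
  return sum;
}
```

Compiling this with and without `FFC_IMPL` defined and comparing the disassembly of `demo_sum_digits` is one way to see the difference the attribute makes.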

2. Local vars in `too_many_digits` path — GCC DSE of struct stores (EXP-006)

Four fields (int_part_*, fraction_part_*) were stored to the ffc_parsed struct unconditionally then read back in the rare too_many_digits recovery path. This prevented GCC from dead-store-eliminating those struct stores on the non-JSON constprop clone. Restructuring to use locals (which GCC sees as registers) enables DSE. ARM GCC already handled this via more aggressive ISRA.
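The shape of this restructuring can be sketched as follows. This is an illustration under assumptions, not the code in parse.h: `demo_parsed`, `demo_parse`, and `DEMO_MAX_DIGITS` are hypothetical names, and the parser is reduced to an integer scan. The point is that the rare recovery path reads the locals `start` and `len`, so nothing inside the function reads the span fields back from the struct.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DIGITS 19 /* hypothetical limit for this sketch */

typedef struct {
  const char *int_start; /* span fields: only some callers read these */
  size_t int_len;
  uint64_t mantissa;
  int too_many_digits;
} demo_parsed;

static demo_parsed demo_parse(const char *p, const char *pend) {
  demo_parsed out = {0};
  const char *start = p; /* local copy, can live in a register */
  uint64_t m = 0;
  while (p != pend && *p >= '0' && *p <= '9') {
    m = m * 10 + (uint64_t)(*p - '0'); /* may wrap for very long inputs */
    ++p;
  }
  size_t len = (size_t)(p - start);

  /* The stores stay for callers that want the span. */
  out.int_start = start;
  out.int_len = len;

  if (len > DEMO_MAX_DIGITS) {
    /* Rare path: recompute from the locals, not from out.int_start. */
    out.too_many_digits = 1;
    m = 0;
    for (size_t k = 0; k < DEMO_MAX_DIGITS; ++k) {
      m = m * 10 + (uint64_t)(start[k] - '0');
    }
  }
  out.mantissa = m;
  return out;
}
```

Whether a compiler actually drops the span stores depends on the caller and on passes such as ISRA, so the benefit should be confirmed in the generated code.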

3. Combined exponent range check in `ffc_clinger_fast_path_impl` (EXP-012)

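Clinger's fast path checked the exponent against its lower and upper bounds (MIN <= exponent && exponent <= MAX) with two separate signed comparisons and branches. Combining them into a single unsigned comparison, (uint64_t)(exponent - MIN) <= (MAX - MIN), which for double (bounds -22 and 22) is (uint64_t)(exponent + 22) <= 44, saves one branch and one comparison instruction on the hot path.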
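The equivalence of the two forms can be sketched as below, assuming the double-precision bounds of -22 and 22. The function names are hypothetical. The addition is performed in unsigned arithmetic here so that extreme exponents wrap instead of overflowing a signed type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two signed comparisons, typically two branches. */
static bool demo_in_range_two_checks(int64_t e) {
  return e >= -22 && e <= 22;
}

/* One unsigned comparison: values below -22 wrap around to huge
   unsigned numbers and fail the same test as values above 22. */
static bool demo_in_range_one_check(int64_t e) {
  return ((uint64_t)e + 22u) <= 44u;
}
```

For any `e`, both functions return the same result; the second simply folds the lower-bound test into the upper-bound test.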

4. Unroll fraction digit loop: 3 nested ifs replace while loop (EXP-015)

The fraction part scanner used a while loop. For inputs where the fraction is ≤3 bytes (common in real-world coordinates), a 3-level nested-if eliminates the loop back-branch and allows GCC to hoist the loop-exit condition out.

5. Straight-line integer scan: nested-ifs replace while loop for 1–4 digits (EXP-026)

Same pattern applied to the integer part. Most real-world floats have 1–4 integer digits; 4-level nested-if eliminates the loop back-branch and exposes more ILP.

6. Extend integer nested-ifs to 5 levels (EXP-028)

Profile showed 5-digit integer parts are common in mesh.txt (3D coordinates like 12345.678). Extending the nested-if from 4→5 levels eliminates the while-loop back-branch for those cases.
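The nested-if pattern used in changes 4 to 6 can be sketched as below. For brevity this is a three-level version; the PR uses three levels for the fraction and five for the integer part. `demo_scan_int` and `demo_is_digit` are hypothetical names, and the fallback `while` loop for longer runs is an assumption about the overall shape.

```c
#include <stdint.h>

static int demo_is_digit(char c) { return c >= '0' && c <= '9'; }

/* Accumulates the leading decimal digits of [p, pend) into *out and
   returns a pointer to the first byte that is not a digit. */
static const char *demo_scan_int(const char *p, const char *pend,
                                 uint64_t *out) {
  uint64_t v = 0;
  if (p != pend && demo_is_digit(*p)) {
    v = (uint64_t)(*p++ - '0');
    if (p != pend && demo_is_digit(*p)) {
      v = v * 10 + (uint64_t)(*p++ - '0');
      if (p != pend && demo_is_digit(*p)) {
        v = v * 10 + (uint64_t)(*p++ - '0');
        /* Runs longer than three digits fall back to the plain loop. */
        while (p != pend && demo_is_digit(*p)) {
          v = v * 10 + (uint64_t)(*p++ - '0');
        }
      }
    }
  }
  *out = v;
  return p;
}
```

Short inputs exit through one of the forward branches and never execute a loop back-branch, which is the effect the change descriptions attribute the speedup to.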

7. `FFC_ROUNDS_TO_NEAREST` compile-time macro — eliminate FCMP chain (EXP-030)

ffc_rounds_to_nearest() read the FP control word at runtime via a volatile double store/load round-trip. This 7-instruction sequence appeared on the hot path of every call to ffc_clinger_fast_path_impl.

FFC_ROUNDS_TO_NEAREST (default: 1) lets the compiler see a compile-time constant, reducing the entire check to a single true branch — zero instructions. Can be set to 0 to preserve runtime detection.
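A minimal sketch of the two variants, assuming the runtime probe follows the usual trick of adding and subtracting a tiny volatile value. The `demo_*` names are hypothetical, and the exact probe in ffc.h may differ. The sketch deliberately does not set a default for the macro.

```c
#include <float.h>
#include <stdbool.h>

/* Runtime probe: under round-to-nearest both sides evaluate to 1.0.
   Under a directed rounding mode at least one side moves away from 1.0.
   The volatile load keeps the compiler from folding the comparison. */
static bool demo_rounds_to_nearest_runtime(void) {
  static volatile double tiny = DBL_MIN;
  double t = tiny;
  return (t + 1.0) == (1.0 - t);
}

/* Compile-time variant: when the macro is defined to a nonzero value,
   the check becomes a constant and the probe disappears. */
static bool demo_rounds_to_nearest(void) {
#if defined(FFC_ROUNDS_TO_NEAREST) && FFC_ROUNDS_TO_NEAREST
  return true;
#else
  return demo_rounds_to_nearest_runtime();
#endif
}
```

The constant is only valid if the program never changes the floating-point rounding mode, which matches the caveat documented for the macro later in this PR.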

8. Early exit for `exponent == 0` in `ffc_from_chars_advanced` (EXP-033)

~55% of mesh.txt lines are pure integers (no fractional part, exponent 0). Previously, each was routed through Clinger's full path: range check, pow10 table load, scvtf, and fmul × 1.0. With FFC_ROUNDS_TO_NEAREST eliminating the FCMP chain, a pre-Clinger guard is safe to add. The early exit converts the integer mantissa directly and returns, saving ~10–16 instructions per integer value.

9. Fix: zero sign under `FE_DOWNWARD` in exponent==0 fast path

Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() == FE_DOWNWARD. The fast exit for integers with exponent==0 did not have the guard that already exists in the non-nearest-rounding Clinger branch. Mirrored the existing #if defined(__clang__) || defined(FFC_32BIT) guard. Caught by the supplemental test suite.
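Changes 8 and 9 together can be sketched as below for the double case. `demo_int_fast_path` is a hypothetical function, and unlike the PR, which gates the zero check behind `#if defined(__clang__) || defined(FFC_32BIT)`, this sketch applies it unconditionally for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true and writes *out when the value is a plain integer that
   converts exactly to double; returns false to signal that the caller
   should continue with the general path. */
static bool demo_int_fast_path(uint64_t mantissa, int64_t exponent,
                               bool negative, double *out) {
  if (exponent != 0 || mantissa > (UINT64_C(1) << 53)) {
    return false;
  }
  if (mantissa == 0) {
    /* Set the sign of zero explicitly instead of relying on the
       conversion, which is the guard change 9 adds. */
    *out = negative ? -0.0 : 0.0;
    return true;
  }
  double v = (double)mantissa; /* exact for mantissa <= 2^53 */
  *out = negative ? -v : v;
  return true;
}
```

Integers up to 2^53 are exactly representable in a double, so the conversion needs no power-of-ten multiply and no rounding decision.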

Benchmark results

5-run averages, dedicated bare-metal servers, GCC -O3. Baseline = after EXP-001 (4-digit SWAR, merged as prior PR).

ARM — Graviton4 (AWS m8g.metal-24xl, 2.80 GHz)

| Dataset | baseline | this PR | Δ | vs fastfloat |
|---|---|---|---|---|
| random [0,1] | 1558 MB/s | 1820 MB/s | +17% | +67% ffc leads |
| canada.txt | 1331 MB/s | 1673 MB/s | +26% | +89% ffc leads |
| mesh.txt | 1019 MB/s | 1656 MB/s | +63% | +231% ffc leads |

x86 — Intel Xeon Platinum 8488C (AWS m7i.metal-24xl)

| Dataset | baseline | this PR | Δ | vs fastfloat |
|---|---|---|---|---|
| random [0,1] | 1736 MB/s | 2018 MB/s | +16% | ≈0% (tied) |
| canada.txt | 1412 MB/s | 1676 MB/s | +19% | +18% ffc leads |
| mesh.txt | 1073 MB/s | 1741 MB/s | +62% | +54% ffc leads |

No regressions on any dataset or architecture across all experiments.

Files changed

src/api.h — FFC_IMPL_INLINE macro + annotated declarations
src/ffc.h — extern FFC_IMPL_INLINE on double parse function definitions; FFC_ROUNDS_TO_NEAREST macro; early-exit for exponent == 0 with Clang zero guard; combined exponent range check; integer and fraction nested-if unrolls
src/parse.h — too_many_digits path uses locals; DSE-friendly restructure
test_src/test.c — guard double_rounds_to_nearest test under #ifndef FFC_ROUNDS_TO_NEAREST
ffc.h — regenerated amalgamation

Experiments designed and validated in ffc-agent-workspace — a profiled, evidence-based micro-optimization loop for ffc.h.

Numbers with 5–7 digit fractions (canada.txt, mesh.txt) never triggered the 8-digit SWAR loop, falling back to up to 7 byte-by-byte iterations. A 4-digit SWAR follow-up converts those to 1×SWAR-4 + ≤3 byte-by-byte.

- canada: +29% (gap vs fastfloat: -29% → -4%)
- mesh: +18% (gap vs fastfloat: -10% → ±0%)
- random: no regression (high variance, within noise)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
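For readers unfamiliar with the technique, a generic 4-digit SWAR step can be sketched as below. This illustrates the general idea and is not necessarily how ffc.h implements it; all `demo_*` names are hypothetical. The four bytes are assembled in little-endian order by hand so the sketch does not depend on host endianness.

```c
#include <stdint.h>

/* Packs 4 bytes so that the first character ends up in the low byte. */
static uint32_t demo_read4(const char *p) {
  const unsigned char *u = (const unsigned char *)p;
  return (uint32_t)u[0] | ((uint32_t)u[1] << 8) |
         ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

/* True when every byte is an ASCII digit: a byte above '9' sets its high
   bit after adding 0x46, a byte below '0' sets it after subtracting 0x30. */
static int demo_is_four_digits(uint32_t val) {
  return (((val + 0x46464646u) | (val - 0x30303030u)) & 0x80808080u) == 0;
}

/* Converts 4 packed ASCII digits to their numeric value. */
static uint32_t demo_parse_four_digits(uint32_t val) {
  val -= 0x30303030u;            /* each byte becomes 0..9 */
  val = (val * 10) + (val >> 8); /* byte 0 = 10*d0 + d1, byte 2 = 10*d2 + d3 */
  return ((val & 0xFF) * 100) + ((val >> 16) & 0xFF);
}
```

A caller would check `demo_is_four_digits` first and, on success, fold the result into the accumulator as `acc * 10000 + demo_parse_four_digits(val)`.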

…uct stores

In ffc_parse_number_string, int_part_start/len and fraction_part_start/len were stored to the ffc_parsed answer struct unconditionally. These fields were then read back in the rare too_many_digits path, preventing GCC's DSE from eliminating the stores in non-JSON callers where ISRA removes those fields from the return ABI.

Restructure: hoist 'before' and 'frac_end_local' out of the has_decimal_point block. In the too_many_digits path, use start_digits, end_of_integer_part, before, and frac_end_local directly instead of round-tripping through the answer struct. The stores to answer.int_part_*/fraction_part_* remain (needed by ffc_parse_json_number callers), but are no longer read within the function, so GCC ISRA + DSE eliminates them on non-JSON call sites.

Result: x86 canada +2.2% (1414 → 1445 MB/s, 5-run stable). ARM unchanged (GCC's ARM backend already handled this via ISRA).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add conditional `__attribute__((always_inline)) inline` to the forward declarations of `ffc_from_chars_double` and `ffc_from_chars_double_options` via a new `FFC_IMPL_INLINE` macro. When a translation unit defines `FFC_IMPL` before including `ffc.h` (the documented pattern for the implementation TU), GCC's `ipa_early_inline` pass sees the `always_inline` attribute on those declarations and inlines both functions at every call site *before* IPA-CP runs. This prevents GCC from creating a separate out-of-line constprop clone and eliminates the per-call function-call overhead on every `ffc_from_chars_double` call. The macro is defined as empty in non-FFC_IMPL TUs, so external linkage and ABI compatibility for other users of the header are preserved (the always_inline path does not emit an out-of-line symbol, but the existing `ffc_parse_double` / `ffc_parse_float` wrappers remain as exported symbols). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace Apple Silicon numbers (2026-03-03) with 5-run averages from dedicated bare-metal servers: Intel Xeon Platinum 8488C (m7i.metal-24xl) and Graviton4 (m8g.metal-24xl), both GCC -O3. Post-EXP-009 results: ffc leads or matches fastfloat on all datasets on both architectures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…P-012) Replace two signed comparisons (MIN <= e && e <= MAX) with a single unsigned range check: (uint64_t)(e - MIN) <= (MAX - MIN). This collapses the two-branch scattered layout into one branch with compact sequential code, matching fast_float's approach. ARM Graviton4: +4.6% mesh, +1.5% random, +1.0% canada. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

After ffc_loop_parse_if_eight_digits returns, at most 3 consecutive digit bytes remain. Replacing the while loop with straight-line nested ifs eliminates the back-branch, yielding better IPC on out-of-order cores (+3.9% mesh, +2.1% canada on ARM Graviton4). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…1–4 digits Eliminates back-branches for the common 1–4 digit integer case. Falls back to while loop for 5+ digits. ARM Graviton4: random +4.2% (1823→1900 MB/s), canada +7.4% (1562→1677 MB/s), mesh +2.0% (1366→1394 MB/s). EXP-026. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extends the straight-line integer scan from 4 levels to 5 levels, eliminating the while-loop back-branch for 5-digit integer parts common in mesh.txt 3D vertex coordinates (e.g. "12345.678").

Results on ARM Graviton4 (m8g.metal-24xl, 2.80 GHz):
- mesh: +7.9% (1394→1505 MB/s, 14.75→13.66 c/f)
- canada: +0.7% (1677→1688 MB/s, 29.02→28.84 c/f)
- random: +0.4% (1900→1908 MB/s, 30.87→30.75 c/f)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add compile-time macro FFC_ROUNDS_TO_NEAREST that makes ffc_rounds_to_nearest() return a constant true, eliminating the 7-instruction volatile-float FCMP chain entirely. Guard the rounding-mode test that asserts ffc_rounds_to_nearest() == false under non-nearest modes.

Benchmark results on ARM Graviton4 (Neoverse V2):
- mesh.txt: +2.4% (1505 → 1541 MB/s, 100.86 → 93.86 i/f)
- canada: +1.8% (1688 → 1718 MB/s, 196.02 → 189.93 i/f)
- random: +0.8% (1908 → 1924 MB/s, 227.04 → 221.04 i/f)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a fast return path before the Clinger call for pure integers (exponent==0, mantissa<=2^53). Skips the range check, mantissa check, pow10 table load, and fmul with 1.0. Safe after EXP-030 eliminated the volatile FCMP chain. +12.9% mesh (83.92 i/f from 93.86), +1.1% canada, +0.4% random. All unit and supplemental tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Clang may convert (double)(uint64_t)0 to -0.0 when fegetround() == FE_DOWNWARD. The fast path for exponent==0 did not have the same guard that already existed in the non-nearest Clinger branch. Mirror the existing #if __clang__ || FFC_32BIT guard. Also revert readme.md: benchmark table update was not appropriate for the upstream repository. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kolemannix · 2026-05-29T13:19:40Z

👀

These accepted experiments (FFC_DIGIT_ACC10 shift-add asm for Clang/AArch64, acc10 on the exponent accumulator, and the 2x unroll of ffc_loop_parse_if_eight_digits) were applied to the working tree but never committed to the submodule. Checkpoint them so the submodule tip reflects the best accepted state and revert-on-reject is safe during the race. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ffc_negative_digit_comp called ffc_am_to_float(false, am_b, &b, FFC_VALUE_KIND_DOUBLE) but read the result back with ffc_to_extended_halfway(b, vk) on the next line. For vk == FLOAT this writes a double (8 bytes) into the ffc_value union and reads back the float member (4 bytes), producing a garbage 'theor' in the float negative-exponent digit-comparison (slow) path. Pass the caller's vk consistently, matching the adjacent ffc_to_extended_halfway call and upstream fast_float (which templates the float type throughout). Safe by construction: when vk == DOUBLE this is byte-identical to before (vk IS FFC_VALUE_KIND_DOUBLE), so the double path and the double-parsing fast/slow paths are unchanged; only vk == FLOAT changes, and only toward correctness. Validated: unit + 4M-value supplemental tests pass; a 20M-value ffc-vs-strtof parity sweep over high-digit-count (digit_comp-forcing) inputs is bit-identical. Found via review of redis/hiredis#1328 (Cursor Bugbot). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
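The class of bug described in this commit message can be reproduced with a small stand-in. The types and functions below are hypothetical simplifications of a kind-tagged value union, not the actual ffc definitions: writing with one kind and reading with another reinterprets part of the double's bytes as a float.

```c
typedef enum { DEMO_KIND_FLOAT, DEMO_KIND_DOUBLE } demo_kind;

typedef union {
  float f;
  double d;
} demo_value;

/* Stores x into the member selected by kind. */
static void demo_set(demo_value *v, demo_kind kind, double x) {
  if (kind == DEMO_KIND_DOUBLE) {
    v->d = x;
  } else {
    v->f = (float)x;
  }
}

/* Reads the member selected by kind. */
static double demo_get(const demo_value *v, demo_kind kind) {
  return kind == DEMO_KIND_DOUBLE ? v->d : (double)v->f;
}
```

Calling `demo_set(&v, DEMO_KIND_DOUBLE, 1.5)` followed by `demo_get(&v, DEMO_KIND_FLOAT)` does not return 1.5, whereas passing the same kind to both calls does, which mirrors the fix of passing the caller's vk consistently.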

kolemannix

Looks great; requested some changes

kolemannix · 2026-06-02T16:26:43Z

-    *i = (*i * 100000000) +
-        ffc_parse_eight_digits_unrolled_swar(ffc_read8_to_u64(*p)); 
-        // in rare cases, this will overflow, but that's ok
+#if defined(__aarch64__) && defined(__clang__)


Can we just unconditionally do this? I hate the complexity of the compiler-specific flagging; if it's faster and more direct and correct, let's just do it this way. The branches can explode combinatorially and make things harder to optimize later.

I hear you on the combinatorial branches, and I've tightened the comment so the gating is self-explanatory. It can't be made unconditional, though — the body is literal AArch64 asm. And on GCC/x86 the plain acc*10 + d already strength-reduces to add + lsl on its own; when I routed GCC through this asm it actually regressed canada (~-5.3%). So it's a single __aarch64__ && __clang__ guard with the portable expression as the default everywhere else — one axis, no real combinatorics. It's a profiled ~+9% on Clang/AArch64, but I'm happy to drop it for the plain macro if you'd rather not carry ARM asm in the tree — your call.

I mean ffc_loop_parse_if_eight_digits, the one that's conditional but seems to just do a platform-agnostic, unrolled version of the current implementation. Just move to that one.

kolemannix · 2026-06-02T16:30:50Z

+  if (!pns.too_many_digits && pns.exponent == 0 &&
+      pns.mantissa <= ffc_const(vk, MAX_MANTISSA_FAST_PATH)) {
+#if defined(__clang__) || defined(FFC_32BIT)
+    if (pns.mantissa == 0) {
+      ffc_set_value(value, vk, pns.negative ? -0. : 0.);
+      return answer;
+    }
+#endif
+    ffc_set_value(value, vk, pns.mantissa);
+    if (pns.negative) { ffc_set_value(value, vk, -ffc_read_value(value, vk)); }
+    return answer;
+  }


Looks like a bugfix? just making a note to review more closely

Good eye — two things are colocated here. The -0.0-under-FE_DOWNWARD guard is a genuine bugfix (under Clang/32-bit a zero mantissa was mapped to -0.0 for negative inputs in downward rounding mode; caught by the supplemental corpus) and is already its own commit, 43e22b3, separate from the exponent==0 fast-path early-exit (c88481a). You can review the bugfix hunk standalone there — and I'm happy to cherry-pick it into its own small precursor PR so it can merge ahead of the perf work if that eases review. Just say the word.

…t cleanups

- Hoist the ffc_inline three-way ladder into api.h above FFC_IMPL_INLINE and collapse FFC_IMPL_INLINE to `#define FFC_IMPL_INLINE ffc_inline`; guard the duplicate definition in common.h with `#ifndef ffc_inline`. The two macros were byte-identical.
- readme: add a Configuration Macros section documenting FFC_IMPL and FFC_ROUNDS_TO_NEAREST (with the don't-define-if-you-change-rounding caveat).
- common.h: apply the suggested wording for the ffc_rounds_to_nearest() comment.
- parse.h: rewrite the integer-scan, fraction-unroll, hoisted-locals and too_many_digits comments to describe the current code rather than the diff that introduced them; tighten the AArch64/Clang acc10 inline-asm comment so the single-axis gating is self-explanatory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…/force-inline-ffc-impl

Resolves the ffc_loop_parse_if_eight_digits conflict by keeping both changes:
- our Clang/AArch64 manual 2x (16-digit) unroll of the SWAR loop, and
- upstream's new 4-digit follow-up block for sub-8-digit remainders.

The follow-up sits after the #if/#else digit loop, so it benefits both the Clang/AArch64 unrolled path and the GCC/portable while-loop path. ffc.h regenerated; unit + supplemental tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fcostaoliveira · 2026-06-03T16:47:10Z

@kolemannix — review comments above are all addressed. Here's the validation on the current branch tip (after merging latest main, including the 4-digit SWAR follow-up), across 4 Intel generations + ARM Graviton4.

Correctness

Full 2³² binary32 exhaustive — all ok on every box:

| Box | Compiler | Path | Result |
|---|---|---|---|
| Cascade Lake / Ice Lake / Emerald Rapids / Granite Rapids | gcc 11 | x86 (`#else`) + 4-digit follow-up | all ok |
| ARM Graviton4 | clang 18.1.3 | AArch64 2× unroll + 4-digit follow-up | all ok |

Performance

ffc-only microbench, single-core pinned, best-of-9 trials. Baseline = current main, PR = this branch tip. Throughput in MB/s, with % delta:

| Env | dataset | base MB/s | PR MB/s | Δ |
|---|---|---|---|---|
| Cascade Lake (gcc) | random | 550.0 | 682.2 | +24.0% |
| Cascade Lake (gcc) | mesh | 356.8 | 577.7 | +61.9% |
| Cascade Lake (gcc) | canada | 533.6 | 702.8 | +31.7% |
| Ice Lake (gcc) | random | 848.7 | 1054.4 | +24.2% |
| Ice Lake (gcc) | mesh | 607.7 | 988.3 | +62.6% |
| Ice Lake (gcc) | canada | 871.7 | 1091.7 | +25.2% |
| Emerald Rapids (gcc) | random | 1226.8 | 1491.5 | +21.6% |
| Emerald Rapids (gcc) | mesh | 904.5 | 1517.4 | +67.7% |
| Emerald Rapids (gcc) | canada | 1363.0 | 1712.8 | +25.7% |
| Granite Rapids (gcc) | random | 1240.3 | 1539.3 | +24.1% |
| Granite Rapids (gcc) | mesh | 926.7 | 1532.9 | +65.4% |
| Granite Rapids (gcc) | canada | 1352.3 | 1762.5 | +30.3% |
| Graviton4 (clang) | random | 1252.4 | 1529.1 | +22.1% |
| Graviton4 (clang) | mesh | 829.6 | 1373.1 | +65.5% |
| Graviton4 (clang) | canada | 1063.6 | 1288.4 | +21.1% |

Faster on every environment and dataset, no regressions: random +21.6–24.2%, mesh +61.9–67.7%, canada +21.1–31.7%. This is built without FFC_ROUNDS_TO_NEAREST, so it's the conservative number — defining it picks up a bit more on the fast path.

kolemannix · 2026-06-04T00:04:45Z

 ffc_internal ffc_inline void
 ffc_loop_parse_if_eight_digits(char const **p, char const *const pend,
                           uint64_t* i) {
-  // optimizes better than parse_if_eight_digits_unrolled() for char.
+#if defined(__aarch64__) && defined(__clang__)


This is the one that I'd like to make unconditional, if we can, just to have fewer paths to maintain.

I tried this — unfortunately the gate turns out to be load-bearing rather than stylistic. I A/B'd the unconditional version (unrolled for all platforms) vs the current gated one: ffc-only microbench, single-core pinned, best-of-11, main of this branch.

| env | random | mesh | canada |
|---|---|---|---|
| Cascade Lake (gcc 11) | −4.9% | −7.7% | −4.6% |
| Ice Lake (gcc 11) | −0.3% | −3.6% | −2.3% |
| Granite Rapids (gcc 11) | −0.8% | −7.0% | −4.5% |
| Graviton4 (clang 18) | +0.4% | +0.2% | +0.1% |

(Δ = unconditional vs gated.) GCC auto-unrolls the plain while loop better than the hand-written 16-digit unroll, so forcing the unroll everywhere costs up to ~8% on GCC/x86 short floats; on Clang/AArch64 it's a wash since that path already unrolled. The two branches exist because the two compilers genuinely want different code here.

Given the ~4–8% GCC/x86 regression I'd prefer to keep the gate — but I'm happy to take the hit and collapse to one path if you'd still rather have fewer branches to maintain. Your call.

Same change as perf/outline-slow-resolve but rebased on kolemannix/main. GCC outlines ffc_resolve_slow (hot frame -24%); Clang keeps the inline form (byte-identical baseline). Bit-identical (gcc+clang) + exhaustive all-ok. NOTE: vs bare main this regresses canada -2.9% (gcc) because main lacks kolemannix#24's exp==0 early-exit; clean only stacked on kolemannix#24. HELD — see HANDOFF.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kolemannix · 2026-06-15T03:02:33Z

Some conflicts to resolve now.

…fc-impl # Conflicts: # ffc.h # src/parse.h

fcostaoliveira · 2026-06-15T16:10:20Z

@kolemannix — conflicts resolved. Merged latest main (the squashed #25/#26 + new test cases) into the branch; the only net change vs the prior tip is the 4 new test-case lines in test_src/float_cases.csv — the digit-loop logic is byte-identical to what you reviewed. src/parse.h resolutions: kept the integer-scan unroll and the 1-3 digit fraction unroll (this PR's content), took main for the already-merged acc10 bits, and regenerated ffc.h.

Green locally: Stage 1 unit tests + Stage 2 supplemental corpus all pass. PR shows MERGEABLE again. Ready for another look.

Resolves the merge conflict on kolemannix/ffc.h#24. Pointer tracks origin/perf/force-inline-ffc-impl; logic byte-identical to the prior reviewed tip plus 4 upstream test-case lines. Stage 1 + Stage 2 correctness green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fcostaoliveira and others added 10 commits May 26, 2026 16:17

fcostaoliveira changed the title ~~perf: force-inline ffc_from_chars_double via FFC_IMPL_INLINE + GCC DSE cleanup~~ Eight profiled micro-optimizations: +257% ARM mesh, +94% ARM canada, +80% ARM random May 27, 2026

fcostaoliveira changed the title ~~Eight profiled micro-optimizations: +257% ARM mesh, +94% ARM canada, +80% ARM random~~ Nine profiled micro-optimizations: +231% ARM mesh, +54% x86 mesh, +89% ARM canada May 27, 2026

fcostaoliveira changed the title ~~Nine profiled micro-optimizations: +231% ARM mesh, +54% x86 mesh, +89% ARM canada~~ Nine profiled micro-optimizations: +63% ARM mesh, +26% ARM canada, +17% ARM random vs baseline May 27, 2026

fcostaoliveira and others added 2 commits June 1, 2026 00:39

fcostaoliveira mentioned this pull request Jun 2, 2026

Use ffc (pure-C99) as the RESP3 double parser instead of strtod redis/hiredis#1328

Merged

kolemannix requested changes Jun 2, 2026

fcostaoliveira and others added 2 commits June 3, 2026 12:52

fcostaoliveira requested a review from kolemannix June 3, 2026 16:01

kolemannix reviewed Jun 4, 2026

Merge remote-tracking branch 'upstream/main' into perf/force-inline-f…

c901d3e

