Make the collator a lot faster by hsivonen · Pull Request #7600 · unicode-org/icu4x

hsivonen · 2026-02-06T09:51:19Z

#7528 needs to land first, but opening a PR to let reviewers take a look before that.

The key insight here is that the vast majority of comparisons are decided by the first primary difference, which occurs at the position of the first code unit difference, and that this comparison is most often between simple collation unit types.
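The fast path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `toy_primary` is a hypothetical stand-in for a trie lookup that yields a primary weight only for "simple" characters, and `full_compare` is a hypothetical stand-in for the full collation algorithm. A real collator also has to deal with contractions, combining marks, and other context around the point of difference, which this toy ignores.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a trie lookup: returns a primary weight for
/// "simple" characters (here: ASCII letters, ignoring case) and `None`
/// for anything that would need the full algorithm.
fn toy_primary(c: char) -> Option<u32> {
    if c.is_ascii_alphabetic() {
        Some(c.to_ascii_lowercase() as u32)
    } else {
        None
    }
}

/// Hypothetical slow path standing in for the full collation algorithm:
/// compares toy primary keys first, then falls back to binary order.
fn full_compare(a: &str, b: &str) -> Ordering {
    let key = |s: &str| -> Vec<u32> {
        s.chars()
            .map(|c| toy_primary(c).unwrap_or(c as u32))
            .collect()
    };
    key(a).cmp(&key(b)).then_with(|| a.cmp(b))
}

fn compare(a: &str, b: &str) -> Ordering {
    // Find the first code unit (a byte, for UTF-8) at which the inputs differ.
    let mut i = a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count();
    // Back up to the start of the code point containing that code unit.
    while !a.is_char_boundary(i) {
        i -= 1;
    }
    // Quick check: if both code points at that position are simple and
    // their primary weights differ, the comparison is already decided.
    if let (Some(ca), Some(cb)) = (a[i..].chars().next(), b[i..].chars().next()) {
        if let (Some(pa), Some(pb)) = (toy_primary(ca), toy_primary(cb)) {
            if pa != pb {
                return pa.cmp(&pb);
            }
        }
    }
    // Otherwise, fall back to the general algorithm.
    full_compare(a, b)
}
```

In this toy, `compare("apple", "apricot")` is decided by the quick check at the third character, while `compare("Apple", "apple")` has no primary difference at the first differing code unit and falls through to `full_compare`.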

* AbstractCodePointTrie allows code to be generic over both typed and untyped tries.
* UTF-8 accessors allow optimal access from within a UTF-8 decoder.
* Latin1 accessor allows optimal access with Latin1.
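As a rough illustration of what being generic over typed and untyped tries, with a dedicated Latin1 accessor, can look like, here is a self-contained sketch. The trait `TrieLike`, its methods, and both toy table types are hypothetical names invented for this example; they are not the `AbstractCodePointTrie` API introduced by the PR.

```rust
use std::marker::PhantomData;

/// Hypothetical abstraction: anything that maps a code point to a u32 value.
trait TrieLike {
    fn get32(&self, code_point: u32) -> u32;

    /// Latin1 accessor: the argument type guarantees the input is at most
    /// U+00FF, so implementations can skip range checks.
    fn get_latin1(&self, byte: u8) -> u32 {
        self.get32(u32::from(byte))
    }
}

/// Toy "untyped" table: raw u32 values for U+0000..=U+00FF and a default
/// value for everything else.
struct FlatTable {
    latin1: [u32; 256],
    default: u32,
}

impl TrieLike for FlatTable {
    fn get32(&self, code_point: u32) -> u32 {
        self.latin1
            .get(code_point as usize)
            .copied()
            .unwrap_or(self.default)
    }

    fn get_latin1(&self, byte: u8) -> u32 {
        // A u8 index is always in bounds for a 256-entry array.
        self.latin1[usize::from(byte)]
    }
}

/// Toy "typed" wrapper: the same storage, but lookups yield a value type `V`.
struct TypedTable<V> {
    inner: FlatTable,
    _marker: PhantomData<V>,
}

impl<V: From<u32>> TypedTable<V> {
    fn get(&self, code_point: u32) -> V {
        V::from(self.inner.get32(code_point))
    }
}

impl<V> TrieLike for TypedTable<V> {
    fn get32(&self, code_point: u32) -> u32 {
        self.inner.get32(code_point)
    }
}

/// Generic caller: monomorphized per trie type, so there is no dynamic
/// dispatch in the lookup loop.
fn sum_latin1<T: TrieLike>(trie: &T, input: &[u8]) -> u32 {
    input
        .iter()
        .map(|&b| trie.get_latin1(b))
        .fold(0, u32::wrapping_add)
}
```

The point of the design in this sketch is that `sum_latin1` accepts either table type, while each implementation remains free to specialize the Latin1 path.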

hsivonen · 2026-02-12T12:47:50Z

This now statically assumes small tries in the collator, which is around a 5% perf win, give or take a percentage point. Our data pipeline has never supported generating fast tries. Putting this on the discussion agenda for this point.

hsivonen · 2026-02-13T08:13:25Z

> This now statically assumes small tries in the collator, which is around a 5% perf win, give or take a percentage point. Our data pipeline has never supported generating fast tries. Putting this on the discussion agenda for this point.

We discussed this and concluded that statically assuming small-mode tries is not a semver break given what the data tooling has been like.
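To make the trade-off concrete, here is a minimal sketch of the difference between branching on the trie mode at run time and assuming small mode statically. All names are hypothetical and this is not the collator's code. The limits mirror the CodePointTrie design, where the direct-index fast path covers up to U+FFFF in fast mode and up to U+0FFF in small mode.

```rust
/// Hypothetical representation of the trie mode stored in the data.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TrieMode {
    Fast,
    Small,
}

/// Highest code point served by the direct-index fast path in each mode.
const fn fast_path_max(mode: TrieMode) -> u32 {
    match mode {
        TrieMode::Fast => 0xFFFF,
        TrieMode::Small => 0x0FFF,
    }
}

/// Dynamic variant: the mode is read from the data, so every lookup
/// compares against a limit that is only known at run time.
fn uses_fast_path_dynamic(mode: TrieMode, code_point: u32) -> bool {
    code_point <= fast_path_max(mode)
}

/// Static variant: the limit is a compile-time constant, so the compiler
/// can fold it into the comparison and drop the mode check entirely.
fn uses_fast_path_small(code_point: u32) -> bool {
    const MAX: u32 = fast_path_max(TrieMode::Small);
    code_point <= MAX
}

/// With a static assumption, data in the other mode has to be rejected
/// when it is loaded (an assumption of this sketch, shown for illustration).
fn check_mode(mode: TrieMode) -> Result<(), &'static str> {
    match mode {
        TrieMode::Small => Ok(()),
        TrieMode::Fast => Err("fast-mode trie not supported here"),
    }
}
```

In this sketch, rejecting fast-mode data at load time is what would make the change observable to a client, which is why it matters whether any supported tooling could have produced such data.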

robertbastian · 2026-02-16T16:23:21Z

provider/source/src/collator/mod.rs

        Ok(self
            .icuexport()?
            .list(&format!("collation/{}", self.collation_root_han()))?
+            .filter(|name| !name.contains("POSIX")) // No known use cases


I'd like to know more about this. If it's actually not needed, then ICU shouldn't support and export it, and CLDR shouldn't have data for it.

I'd prefer removing it from CLDR, or from ICU export data. We should not remove things here, because then there's no way to include them, even if the data is there and the caller selects this.

I thought we were asked about it recently? Maybe I'm misremembering...

All I can remember was a recent discussion about posix in host_info: #6574

There's an issue for collation data: #6511

I think there was a comment from someone external. @sffc might remember? But yeah, this is something CLDR can figure out.

sffc · 2026-02-17T16:04:01Z

WG discussion:

@Manishearth It's hard to review the PR without normalizer landing.
@hsivonen I experimented with fast data for the root and for Chinese/Japanese, and looked at data size for Myanmar/Khmer. ICU4C has small tries for all of these. Considering how much data fast tries add, and that assuming small tries at compile time gives about a 5% boost across the board, I think we should assume small tries at compile time and design a better data structure. Given that we never shipped a data pipeline that produces fast-mode tries for collation, can we say that it's not a semver break, since we've never shipped tooling for anything else? Clients would have had to patch more than is reasonable.
@Manishearth I find that compelling
@sffc Does the datagen flag for trie small/fast not impact collator tries?
@hsivonen It impacts normalizer and properties, not these.
@Manishearth If a client mucked with the bytes in unsupported ways, we can break them.
@sffc I agree. If a client couldn't get to this code path with a public icu4x-datagen invocation, then we can delete the code path.

hsivonen added 30 commits January 22, 2026 12:01

Introduce AbstractCodePointTrie, Latin1 getter, and UTF-8 getters

8fbb1f1

* AbstractCodePointTrie allows code to be generic over both typed and untyped tries.
* UTF-8 accessors allow optimal access from within a UTF-8 decoder.
* Latin1 accessor allows optimal access with Latin1.

Decouple UAX 15 and UTS 46 trie types

73356ab

Add iterators that also do trie lookups

9a4983a

Use trie-aware iterators in the normalizer

ed51573

Use fused trie lookup and UTF decoding in the collator

da07c72

Add functions for normalizing Latin1 to UTF-16

da4183c

Prepare for Gecko

9db4973

Implement new data layout for canonical composition data

db21369

Optimize NFD

7e7b618

Optimize Latin1

67c7783

Stay on NFD fast track for single combining mark

2e2611c

Rework Latin1 norm

a481c01

Likely/unlikely now used in non-UTF-16

5063f39

Merge remote-tracking branch 'origin/latin1chunk' into nfdsinglemark

e291315

Tweak passthrough bounds

9b6b8ce

Prepare for Gecko landing

6110491

if outside loop

3480e1e

Fix clippy lints

e6ecb7c

Removed stale conditional compilation

be60b32

Merge branch 'main' into normreview

2e9c82c

Merge branch 'main' into normreview

338a42c

Collator perf notes

212299e

Move Hangul syllable handling

6f76a5a

Make init faster

34524c4

Add remark about invariant relaxation

2937a74

Fast primary check after identical prefix

3680b2d

Add a quick primary check when there is no identical prefix

7f8cbc7

Optimize the first difference being a comparison between Hangul syllables

bd8b5bb

Revise comments

e04a69d

Restore the less useful collator benches

2ca0dac

hsivonen added 15 commits February 10, 2026 15:01

Merge branch 'normreview' into collatoropt

57e605d

Prepare for trie customization

5d16ee8

cargo fmt

6b25873

Merge branch 'normreview' into collatoropt

19ae6f4

Tailoring-specific trie type

bf4f38f

Make more languages fast

afb446b

Prepare to hoist CE32s

045abf8

Back to small tries for now

00a1d2b

Copy from root to tailoring

629fefd

Replace missing tailoring with root earlier

4198d6a

Hoist Hiragana

932983f

No fallback to root in the quick primary check at the very start

5c0ab1d

Serialization support for typed tries

dabee88

Statically assume small trie in the collator

86f85a9

Hoist performance-sensitive ranges

db74341

hsivonen added the discuss-priority label (Discuss at the next ICU4X meeting) Feb 12, 2026

hsivonen removed the discuss-priority label (Discuss at the next ICU4X meeting) Feb 13, 2026

hsivonen mentioned this pull request Feb 13, 2026

Ensure that non-Latin collation tailoring tries optimize common letters #7614

Open

hsivonen added 4 commits February 13, 2026 16:56

Rearrange fast primary check branches again

7f63971

Optimize kana, bench Cyrillic tailorings

3f9da98

Use correct data after identical prefix

c433ab7

Do not optimize two-jamo Hangul syllables over everything else

a78ff8a

hsivonen mentioned this pull request Feb 16, 2026

Known problems with collator performance #7655

Open

robertbastian reviewed Feb 16, 2026

hsivonen added 2 commits February 19, 2026 13:52

Add enumeration of collator locales behind unstable flag

0848ea6

Use write16 from crates.io

3579f23

hsivonen mentioned this pull request Mar 11, 2026

CodePointTrie support for normalizer and collator perf improvements #7768

Open

hsivonen wants to merge 56 commits into unicode-org:main from hsivonen:collatoropt

4 participants
