Make the collator a lot faster #7600

Open
hsivonen wants to merge 56 commits into unicode-org:main from hsivonen:collatoropt
@hsivonen
Member

hsivonen commented Feb 6, 2026

#7528 needs to land first, but opening a PR to let reviewers take a look before that.

The key insight here is that the vast majority of comparisons stop at the first primary difference, which sits at the position of the first code unit difference, and that this comparison is most often between simple collation unit types.
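
As a rough illustration of that insight (not the code in this PR), a fast path can skip the identical code unit prefix, look only at the first differing characters, and return early when both have a simple primary weight that differs. Everything here is hypothetical: `toy_primary` stands in for real collation data, and the sketch ignores contractions, expansions, combining marks and ignorables, all of which a real collator has to account for before trusting such a shortcut.

```rust
use std::cmp::Ordering;

/// Toy primary weights (hypothetical, not real collation data): ASCII letters
/// compare case-insensitively at the primary level, digits sort before
/// letters, and everything else is treated as "not simple".
fn toy_primary(c: char) -> Option<u32> {
    match c {
        '0'..='9' => Some(0x10 + (c as u32 - '0' as u32)),
        'a'..='z' => Some(0x100 + (c as u32 - 'a' as u32)),
        'A'..='Z' => Some(0x100 + (c as u32 - 'A' as u32)),
        _ => None,
    }
}

/// Returns `Some(ordering)` when the fast path can decide the comparison and
/// `None` when the caller must fall back to the full algorithm.
fn fast_path_compare(left: &str, right: &str) -> Option<Ordering> {
    // Length of the identical prefix, counted in UTF-8 code units.
    let mut i = left
        .bytes()
        .zip(right.bytes())
        .take_while(|(l, r)| l == r)
        .count();
    // The first differing byte may be a trail byte; back up to the start of
    // the character. The prefix is identical, so this boundary is valid for
    // both strings.
    while !left.is_char_boundary(i) {
        i -= 1;
    }
    // If either string ends here or either character is not simple, give up.
    let l = toy_primary(left[i..].chars().next()?)?;
    let r = toy_primary(right[i..].chars().next()?)?;
    match l.cmp(&r) {
        // Same primary weight: later levels decide, so use the slow path.
        Ordering::Equal => None,
        decided => Some(decided),
    }
}
```

In this toy model, comparing "apple" with "apricot" is decided at the third character, while "abc" versus "abC" returns `None` because the difference is not a primary one.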

* AbstractCodePointTrie allows code to be generic over both typed and untyped tries.
* UTF-8 accessors allow optimal access from within a UTF-8 decoder.
* Latin1 accessor allows optimal access with Latin1.
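
To make the shape of these three changes concrete, here is a minimal sketch of a trait that lets lookup code be generic over different trie representations and offers encoding-specific accessors. The trait, its method names and both toy "tries" (plain vectors) are assumptions made for illustration; they are not the `AbstractCodePointTrie` API from this PR.

```rust
/// Hypothetical trait in the spirit of `AbstractCodePointTrie`; all names
/// and signatures here are illustrative assumptions.
trait ToyAbstractTrie {
    type Value: Copy;

    /// General lookup by code point.
    fn get32(&self, cp: u32) -> Self::Value;

    /// Latin1 accessor: each byte is already the code point.
    fn get_latin1(&self, byte: u8) -> Self::Value {
        self.get32(u32::from(byte))
    }

    /// UTF-8 accessor for a two-byte sequence that the decoder has already
    /// validated. A real trie could feed these bits straight into its index
    /// instead of going through the general path.
    fn get_utf8_two_byte(&self, lead: u8, trail: u8) -> Self::Value {
        let cp = (u32::from(lead & 0x1F) << 6) | u32::from(trail & 0x3F);
        self.get32(cp)
    }
}

/// Toy "typed" trie: the value type is known statically.
struct TypedToy {
    values: Vec<u16>,
}

/// Toy "untyped" trie: values are stored as raw `u32`.
struct UntypedToy {
    raw: Vec<u32>,
}

impl ToyAbstractTrie for TypedToy {
    type Value = u16;
    fn get32(&self, cp: u32) -> u16 {
        // Out-of-range code points map to a default value of 0.
        self.values.get(cp as usize).copied().unwrap_or_default()
    }
}

impl ToyAbstractTrie for UntypedToy {
    type Value = u32;
    fn get32(&self, cp: u32) -> u32 {
        self.raw.get(cp as usize).copied().unwrap_or_default()
    }
}

/// Generic over both representations: looks up the first character of
/// `text`, picking an accessor from within a (very partial) UTF-8 decoder.
fn first_value<T: ToyAbstractTrie>(trie: &T, text: &str) -> Option<T::Value> {
    match text.as_bytes() {
        [] => None,
        // ASCII bytes equal their code points, so the byte accessor applies.
        [b, ..] if *b < 0x80 => Some(trie.get_latin1(*b)),
        // `text` is valid UTF-8, so a two-byte lead always has its trail.
        [lead, trail, ..] if (0xC2..=0xDF).contains(lead) => {
            Some(trie.get_utf8_two_byte(*lead, *trail))
        }
        // Longer sequences take the general path in this sketch.
        _ => text.chars().next().map(|c| trie.get32(u32::from(c))),
    }
}
```

The point of the design is that `first_value` is written once and monomorphized per trie type, so the accessor calls can be inlined for each representation.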
@hsivonen added the discuss-priority label (Discuss at the next ICU4X meeting) on Feb 12, 2026
@hsivonen
Member Author

This now statically assumes small tries in the collator, which is around a 5% perf win, give or take a percentage point. Our data pipeline has never supported generating fast tries for collation. Putting this on the discussion agenda for this point.

@hsivonen
Member Author

This now statically assumes small tries in the collator, which is around a 5% perf win, give or take a percentage point. Our data pipeline has never supported generating fast tries for collation. Putting this on the discussion agenda for this point.

We discussed this and concluded that statically assuming small-mode tries is not a semver break given what the data tooling has been like.
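
A simplified picture of what "statically assuming small tries" removes: with the trie type known only at run time, every lookup first has to pick the limit below which direct indexing applies, whereas a static assumption turns that limit into a constant. The limits below (0xFFFF for fast, 0x0FFF for small) are my understanding of ICU's code point trie design and should be treated as assumptions, as should every name in the sketch; the real lookup does far more than choose a path.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TrieType {
    Fast,
    Small,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LookupPath {
    /// Direct indexing, available below the type-specific limit.
    FastIndex,
    /// Multi-stage lookup for everything above the limit.
    MultiStage,
}

// Assumed limits, believed to mirror ICU's code point trie constants.
const FAST_TYPE_MAX: u32 = 0xFFFF;
const SMALL_TYPE_MAX: u32 = 0x0FFF;

/// Run-time dispatch: the trie type is data, so each lookup branches on it.
fn path_dynamic(trie_type: TrieType, cp: u32) -> LookupPath {
    let limit = match trie_type {
        TrieType::Fast => FAST_TYPE_MAX,
        TrieType::Small => SMALL_TYPE_MAX,
    };
    if cp <= limit {
        LookupPath::FastIndex
    } else {
        LookupPath::MultiStage
    }
}

/// Static assumption: the limit is a compile-time constant and the branch
/// on the trie type disappears.
fn path_assuming_small(cp: u32) -> LookupPath {
    if cp <= SMALL_TYPE_MAX {
        LookupPath::FastIndex
    } else {
        LookupPath::MultiStage
    }
}
```

For data that only ever contains small tries, both functions agree on every code point, which is why dropping the dynamic variant changes behavior only for data that the shipped tooling could not have produced.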

Ok(self
.icuexport()?
.list(&format!("collation/{}", self.collation_root_han()))?
.filter(|name| !name.contains("POSIX")) // No known use cases
Member

I'd like to know more about this. If it's actually not needed, then ICU shouldn't support and export it, and CLDR shouldn't have data for it.

I'd prefer removing it from CLDR, or from ICU export data. We should not remove things here, because then there's no way to include them, even if the data is there and the caller selects this.

Member

I thought we were asked about it recently? Maybe I'm misremembering...

Member

All I can remember was a recent discussion about posix in host_info: #6574

There's an issue for collation data: #6511

Member

I think there was a comment from someone external. @sffc might remember? But yeah, this is something CLDR can figure out.

@sffc
Member

sffc commented Feb 17, 2026

WG discussion:

  • @Manishearth It's hard to review the PR without normalizer landing.
  • @hsivonen I did experiments with fast data for the root and for Chinese/Japanese, and looked at data size for Myanmar/Khmer. ICU4C has small tries for all of these. Considering how much data it adds to do fast tries, and that we get like a 5% boost across the board if we assume small tries at compile time, I think we should be assuming small tries at compile time, and design a better data structure. Given that we never shipped a data pipeline that produces fast-mode tries for collation, can we say that it's not a semver break, since we've never shipped tooling for anything else? They would have had to patch more stuff than is reasonable.
  • @Manishearth I find that compelling
  • @sffc Does the datagen flag for trie small/fast not impact collator tries?
  • @hsivonen It impacts normalizer and properties, not these.
  • @Manishearth If a client mucked with the bytes in unsupported ways, we can break them.
  • @sffc I agree. If a client couldn't get to this code path with a public icu4x-datagen invocation, then we can delete the code path.

Labels

A-performance Area: Performance (CPU, Memory)
C-collator Component: Collation, normalization

4 participants