Skip to content

Data struct with codepoint builder for radicals#7805

Open
opnuub wants to merge 2 commits intounicode-org:mainfrom
opnuub:radicals-struct
Open

Data struct with codepoint builder for radicals#7805
opnuub wants to merge 2 commits intounicode-org:mainfrom
opnuub:radicals-struct

Conversation

@opnuub
Copy link
Copy Markdown
Contributor

@opnuub opnuub commented Mar 24, 2026

Issue: #6941

Modified from #7722 and #7646

Changelog

  • icu_segmenter: Add unstable Unihan radical provider data and baked support
    • New types: icu_segmenter::provider::UnihanIrgData<'data>, icu_segmenter::provider::SegmenterUnihanRadicalV1
    • New associated const: icu_segmenter::provider::Baked::SINGLETON_SEGMENTER_UNIHAN_RADICAL_V1
    • The experimental_segmenter example now uses radaboost for the Chinese radical model and adds thadaboost for Thai
  • icu_provider_source: Add Unihan radical trie generation for icu_segmenter::provider::SegmenterUnihanRadicalV1
    • SourceDataProvider can now load this marker from Unihan IRG data
  • icu_provider_registry: Export icu_segmenter::provider::SegmenterUnihanRadicalV1

@robertbastian
Copy link
Copy Markdown
Member

This is the third PR doing the same thing. There is significant discussion in #7646, why do you keep creating new PRs?

@opnuub
Copy link
Copy Markdown
Contributor Author

opnuub commented Apr 7, 2026

This is the third PR doing the same thing. There is significant discussion in #7646, why do you keep creating new PRs?

because there's a merge conflict and I don't think I'm supposed to git rebase

@robertbastian
Copy link
Copy Markdown
Member

Ideally you'd resolve merge conflicts using a merge commit, but rebasing is better than creating a new PR that drops all previous context.

@opnuub
Copy link
Copy Markdown
Contributor Author

opnuub commented Apr 7, 2026

Ideally you'd resolve merge conflicts using a merge commit, but rebasing is better than creating a new PR that drops all previous context.

Got it, thank you. I'll follow this practice in the future.

Comment on lines 18 to +21
mod lstm;
mod radical;
pub use lstm::*;
pub use radical::*;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mod lstm;
mod radical;
pub use lstm::*;
pub use radical::*;
mod lstm;
#[cfg(feature = "unstable")]
pub mod radical;
pub use lstm::*;

);

icu_provider::data_marker!(
/// `SegmenterUnihanRadicalV1`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertbastian commented on the old PR about these docs. Please write what the marker represents.

use std::collections::HashMap;

fn load_irg_from_baked() -> &'static UnihanIrgData<'static> {
Baked::SINGLETON_SEGMENTER_UNIHAN_RADICAL_V1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline. you can even inline this all the way into Predictor::for_test


let unihan_cache = self.unihan()?;
let ucd = self.ucd()?;
let irg_map = unihan_cache.irg_sources(ucd)?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still disagree with the meat of this implementation living outside this module, with the need of this having to be cached, and with the overhead of creating an intermediate LiteMap


let unihan_cache = self.unihan()?;
let ucd = self.ucd()?;
let irg_map = unihan_cache.irg_sources(ucd)?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you still have irg naming everywhere, these variables/functions should be called radicals now

}

#[test]
fn test_chinese_irg_values_trie() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good test, because it tests the DataProvider<SegmenterUnihanRadicalV1> implementation. the test above test the same, however, but through accessing the internal cache; why?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants