Add unicode support by adag24 · Pull Request #592 · ocaml/ocaml-re

adag24 · 2026-01-29T23:39:54Z

First proposal to add Unicode support, continuing #48 to address #24.
Currently UTF-8, UTF-16BE, UTF-16LE and Latin1 (for backward compatibility) are supported, but other encodings can be built against the Uucodecs.T interface.

Note that:

- This is not a byte comparison but a decoding of the strings (byte comparison would not work, for well-founded reasons).
- String.t is used, but it could easily be generalized if necessary with a very small API (Bytes, in_channel, ...). Strings are assumed to be correctly encoded (there is no fallback or replacement character at the moment).
- The regular expressions and strings are normalized to NFC with the help of uunf when processed (completeness of this still has to be checked in the code). Different normalization forms in the regular expressions or strings should therefore have no impact on a match. But does this slow down processing?
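To illustrate why byte comparison is not enough, here is a minimal sketch using only the standard library (OCaml >= 4.14). It is not code from this PR: `code_points` is a hypothetical helper, and unlike the PR it silently maps invalid bytes to U+FFFD. The two strings both render as "é" but differ byte-wise; only normalization (done with uunf in the PR, not shown here) makes them equal.

```ocaml
(* Decode a UTF-8 string into its list of code points (stdlib only). *)
let code_points (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      (* Invalid bytes decode to U+FFFD here; the PR assumes valid input. *)
      let u = Uchar.utf_decode_uchar d in
      go (i + Uchar.utf_decode_length d) (Uchar.to_int u :: acc)
  in
  go 0 []

(* "é" as a single precomposed code point (NFC form). *)
let precomposed = "\u{00E9}"

(* "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form). *)
let decomposed = "e\u{0301}"

(* Byte-wise the two strings differ, although they are canonically equivalent. *)
let bytes_differ = precomposed <> decomposed
```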

Basically everything is kept from the original except that this is heavily functor based:

- the variable-length encoding should match the Uucodecs.T interface
- Cset is a functor taking as arguments a CodePage.T (see lib/unicode/cset.mli) and a Uucodecs.T
- a Color_map.T is created (functor) with a Cset.T argument
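The layering described above can be sketched roughly as follows. This is only an illustration of the functor-over-codec idea: `CODEC`, `Latin1`, `Utf8` and `Make` are hypothetical names, not the actual Uucodecs.T, Cset or Color_map signatures of this PR.

```ocaml
(* Hypothetical, simplified stand-in for a codec interface. *)
module type CODEC = sig
  val name : string

  (* [decode s i] returns the code point starting at byte [i]
     and the index of the next code point. *)
  val decode : string -> int -> Uchar.t * int
end

module Latin1 : CODEC = struct
  let name = "latin1"
  let decode s i = (Uchar.of_int (Char.code s.[i]), i + 1)
end

module Utf8 : CODEC = struct
  let name = "utf8"

  let decode s i =
    let d = String.get_utf_8_uchar s i in
    (Uchar.utf_decode_uchar d, i + Uchar.utf_decode_length d)
end

(* Everything built on top is parameterized by the codec. *)
module Make (C : CODEC) = struct
  (* Fold over the code points of [s] as decoded by [C]. *)
  let fold f acc s =
    let rec go i acc =
      if i >= String.length s then acc
      else
        let u, next = C.decode s i in
        go next (f acc u)
    in
    go 0 acc

  (* Length in code points rather than bytes. *)
  let length s = fold (fun n _ -> n + 1) 0 s
end

module L = Make (Latin1)
module U = Make (Utf8)
```

With this shape, the same two bytes count as one character under `U` and as two under `L`, which is the behavior difference the codec argument is meant to capture.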

The Unicode categories (digit, alpha, alphanum, xdigit, ...) are generated at build time with the help of uucd and uucp (this is a first attempt and still needs to be thoroughly checked and corrected). ~~But~~ the library itself ~~does not rely~~ relies on uucp:

WARNING: the case-insensitive implementation in unicode/cset.ml is currently wrong. I left it aside at the beginning, but it looks more complex to implement than I thought: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G33992 and https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt With the current type Cset.t, I think we can only implement simple case folding (C + S in the latter link) but not full case folding (C + F), which would require mapping one Uchar.t to a sequence of 1, 2 or 3 Uchar.t. Case insensitivity is implemented with the simple_case_folding property of the uucd lib. As per https://www.unicode.org/Public/17.0.0/ucd/CaseFolding.txt:

Usage:
A. To do a simple case folding, use the mappings with status C + S.
B. To do a full case folding, use the mappings with status C + F.
The mappings with status T can be used or omitted depending on the desired case-folding
behavior. (The default option is to exclude them.)
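The difference between the two foldings can be sketched with a few entries copied by hand from CaseFolding.txt. This is an illustration only: the function names are hypothetical, the tables are tiny excerpts, and the PR itself takes the data from uucd. The point is the result type: simple folding stays code point to code point, so it fits a character-set representation, while full folding needs a sequence.

```ocaml
(* Simple folding (status C + S): one code point maps to one code point. *)
let simple_fold (c : int) : int =
  match c with
  | 0x0041 -> 0x0061 (* C: A -> a *)
  | 0x004B -> 0x006B (* C: K -> k *)
  | 0x212A -> 0x006B (* C: KELVIN SIGN -> k *)
  | 0x1E9E -> 0x00DF (* S: LATIN CAPITAL LETTER SHARP S -> U+00DF *)
  | c -> c

(* Full folding (status C + F): the result may be several code points. *)
let full_fold (c : int) : int list =
  match c with
  | 0x00DF | 0x1E9E -> [ 0x0073; 0x0073 ] (* F: sharp s -> "ss" *)
  | c -> [ simple_fold c ]

(* All members of [universe] that share the simple folding of [c]:
   the set a case-insensitive match on [c] would have to accept. *)
let same_simple_fold (universe : int list) (c : int) : int list =
  List.filter (fun d -> simple_fold d = simple_fold c) universe
```

Under simple folding, U+00DF folds to itself and can never match "ss"; that case is only reachable with the full (C + F) mappings.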

Any comment is welcome!

Below are the benchmark results for the original lib and the modified one.

Latin1 with the original lib
$ _build/default/benchmarks/benchmark.exe

Name	Time/Run	mWd/Run	mjWd/Run	Prom/Run	Percentage
20 zeroes/exec/00000000000000000000	41.46us	21.77kw	400.08w	400.08w	0.03%
20 zeroes/exec/00000000000000000000 (compiled)	42.63us	21.78kw	418.03w	418.03w	0.03%
20 zeroes/execp/00000000000000000000	41.26us	21.69kw	395.27w	395.27w	0.03%
20 zeroes/execp/00000000000000000000 (compiled)	41.38us	21.70kw	412.92w	412.92w	0.03%
20 zeroes/exec_opt/00000000000000000000	41.64us	21.77kw	400.00w	400.00w	0.03%
20 zeroes/exec_opt/00000000000000000000 (compiled)	41.48us	21.78kw	417.27w	417.27w	0.03%
lots of a's/exec/aaaaaaaaaa .. (101)	6.16us	2.30kw	4.80w	4.80w
lots of a's/exec/aaaaaaaaaa .. (101) (compiled)	6.25us	2.31kw	5.73w	5.73w
lots of a's/execp/aaaaaaaaaa .. (101)	5.74us	2.29kw	3.30w	3.30w
lots of a's/execp/aaaaaaaaaa .. (101) (compiled)	5.86us	2.30kw	5.63w	5.63w
lots of a's/exec_opt/aaaaaaaaaa .. (101)	6.14us	2.31kw	4.52w	4.52w
lots of a's/exec_opt/aaaaaaaaaa .. (101) (compiled)	6.28us	2.31kw	5.73w	5.73w
media type match/exec/ foo/bar ; charset=UTF-8	5.41us	2.48kw	5.30w	5.30w
media type match/exec/ foo/bar ; charset=UTF-8 (compiled)	5.44us	2.48kw	6.56w	6.56w
media type match/execp/ foo/bar ; charset=UTF-8	5.25us	2.46kw	5.46w	5.46w
media type match/execp/ foo/bar ; charset=UTF-8 (compiled)	5.30us	2.46kw	6.83w	6.83w
media type match/exec_opt/ foo/bar ; charset=UTF-8	5.39us	2.48kw	5.18w	5.18w
media type match/exec_opt/ foo/bar ; charset=UTF-8 (compiled)	5.49us	2.49kw	6.34w	6.34w
uri/exec/https://google.com	20.81us	12.99kw	96.65w	96.65w	0.01%
uri/exec/https://google.com (compiled)	21.04us	13.00kw	112.86w	112.86w	0.01%
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two	34.07us	19.80kw	219.60w	219.60w	0.02%
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	34.39us	19.81kw	245.03w	245.03w	0.02%
uri/exec/file:/random_crap	12.45us	7.22kw	34.98w	34.98w
uri/exec/file:/random_crap (compiled)	12.67us	7.23kw	45.20w	45.20w
uri/execp/https://google.com	20.40us	12.96kw	96.16w	96.16w	0.01%
uri/execp/https://google.com (compiled)	20.54us	12.97kw	112.21w	112.21w	0.01%
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two	33.53us	19.77kw	219.07w	219.07w	0.02%
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	33.71us	19.78kw	242.97w	242.97w	0.02%
uri/execp/file:/random_crap	12.20us	7.20kw	34.73w	34.73w
uri/execp/file:/random_crap (compiled)	12.35us	7.21kw	45.42w	45.42w
uri/exec_opt/https://google.com	20.56us	12.99kw	97.28w	97.28w	0.01%
uri/exec_opt/https://google.com (compiled)	21.02us	13.00kw	112.95w	112.95w	0.01%
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two	34.54us	19.80kw	217.06w	217.06w	0.02%
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	35.08us	19.81kw	241.45w	241.45w	0.02%
uri/exec_opt/file:/random_crap	12.44us	7.22kw	34.90w	34.90w
uri/exec_opt/file:/random_crap (compiled)	12.64us	7.23kw	45.64w	45.64w
tex gitignore/execp	2_301.81us	885.08kw	37_535.37w	37_535.37w	1.63%
tex gitignore/execp (compiled)	2_281.65us	885.09kw	37_512.69w	37_512.69w	1.61%
tex gitignore/exec_opt	2_414.22us	895.45kw	37_654.49w	37_654.49w	1.70%
tex gitignore/exec_opt (compiled)	2_419.66us	895.46kw	37_632.94w	37_632.94w	1.71%
http/manual/no group	2_503.97us	763.96kw	84_051.23w	83_025.23w	1.77%
http/manual/no group (compiled)	2_513.46us	763.97kw	84_053.34w	83_027.34w	1.77%
http/manual/group	363.36us	81.59kw	2_082.90w	2_082.90w	0.26%
http/manual/group (compiled)	370.84us	81.60kw	2_135.82w	2_135.82w	0.26%
http/auto/execp no group	4_393.49us	1_569.95kw	165_111.49w	164_085.49w	3.10%
http/auto/execp no group (compiled)	4_350.63us	1_569.95kw	165_119.63w	164_093.63w	3.07%
http/auto/all_gen	482.83us	142.05kw	6_646.74w	6_646.74w	0.34%
http/auto/all_gen (compiled)	481.66us	142.06kw	6_767.41w	6_767.41w	0.34%
string traversal from #210	6_669.15us	38.26kw	1_181.19w	1_181.19w	4.71%
string traversal from #210 (compiled)	6_832.59us	38.27kw	1_211.34w	1_211.34w	4.82%
kleene star compilation	68.41us	1.25kw	0.39w	0.39w	0.05%
kleene star compilation (compiled)	68.64us	1.26kw	2.13w	2.13w	0.05%
memory 1:10	25.81us	16.20kw	459.31w	459.31w	0.02%
memory 1:20	49.20us	25.39kw	969.06w	969.06w	0.03%
memory 1:40	135.01us	54.86kw	3_626.63w	3_626.63w	0.10%
memory 1:80	538.39us	157.93kw	25_296.31w	25_038.31w	0.38%
memory 1:100	930.88us	231.60kw	53_578.12w	53_320.12w	0.66%
memory 1:1000	85_911.83us	18_867.08kw	4_813_038.41w	4_808_164.41w	60.66%
memory 2:10	27.84us	17.84kw	533.94w	533.94w	0.02%
memory 2:20	58.13us	32.99kw	1_477.58w	1_477.58w	0.04%
memory 2:40	186.34us	87.57kw	7_847.35w	7_847.35w	0.13%
memory 2:80	934.79us	293.68kw	66_448.95w	66_190.95w	0.66%
memory 2:100	1_384.22us	445.27kw	96_782.61w	96_524.61w	0.98%
memory 2:1000	132_747.46us	40_803.16kw	9_250_358.54w	9_245_484.54w	93.72%
repeated sequence re	141_639.90us	42_854.30kw	11_452_525.43w	8_118_113.43w	100.00%
repeated sequence re (compiled)	140_852.12us	42_854.31kw	11_451_911.58w	8_117_499.58w	99.44%
split on whitespace	27.30us	7.64kw	45.05w	45.05w	0.02%
split on whitespace (compiled)	25.86us	7.64kw	46.75w	46.75w	0.02%
shared prefixes	113.11us	31.27kw	332.18w	332.18w	0.08%
shared prefixes (compiled)	113.44us	31.28kw	474.93w	474.93w	0.08%

Latin1 with the re_unicode lib

$ _build/default/benchmarks/unicode/benchmark_unicode.exe

Name	Time/Run	mWd/Run	mjWd/Run	Prom/Run	Percentage
20 zeroes/exec/00000000000000000000	43.64us	21.96kw	403.78w	403.78w
20 zeroes/exec/00000000000000000000 (compiled)	43.62us	21.97kw	423.49w	423.49w
20 zeroes/execp/00000000000000000000	43.04us	21.88kw	398.54w	398.54w
20 zeroes/execp/00000000000000000000 (compiled)	43.43us	21.89kw	419.03w	419.03w
20 zeroes/exec_opt/00000000000000000000	43.28us	21.96kw	398.37w	398.37w
20 zeroes/exec_opt/00000000000000000000 (compiled)	43.39us	21.97kw	422.08w	422.08w
lots of a's/exec/aaaaaaaaaa .. (101)	80.86us	36.37kw	1_046.21w	1_046.21w
lots of a's/exec/aaaaaaaaaa .. (101) (compiled)	82.13us	36.38kw	1_064.89w	1_064.89w
lots of a's/execp/aaaaaaaaaa .. (101)	80.86us	36.36kw	1_047.11w	1_047.11w
lots of a's/execp/aaaaaaaaaa .. (101) (compiled)	81.17us	36.36kw	1_063.95w	1_063.95w
lots of a's/exec_opt/aaaaaaaaaa .. (101)	81.86us	36.37kw	1_048.18w	1_048.18w
lots of a's/exec_opt/aaaaaaaaaa .. (101) (compiled)	80.64us	36.38kw	1_065.40w	1_065.40w
media type match/exec/ foo/bar ; charset=UTF-8	4.79us	1.96kw	3.78w	3.78w
media type match/exec/ foo/bar ; charset=UTF-8 (compiled)	4.90us	1.97kw	5.13w	5.13w
media type match/execp/ foo/bar ; charset=UTF-8	4.75us	1.94kw	3.53w	3.53w
media type match/execp/ foo/bar ; charset=UTF-8 (compiled)	4.80us	1.95kw	5.05w	5.05w
media type match/exec_opt/ foo/bar ; charset=UTF-8	4.85us	1.96kw	3.80w	3.80w
media type match/exec_opt/ foo/bar ; charset=UTF-8 (compiled)	4.90us	1.97kw	5.16w	5.16w
uri/exec/https://google.com	32.63us	20.29kw	268.22w	268.22w
uri/exec/https://google.com (compiled)	33.11us	20.29kw	298.33w	298.33w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two	62.87us	34.81kw	827.79w	827.79w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	64.41us	34.82kw	936.98w	936.98w
uri/exec/file:/random_crap	22.38us	12.35kw	108.13w	108.13w
uri/exec/file:/random_crap (compiled)	22.82us	12.35kw	129.79w	129.79w
uri/execp/https://google.com	32.34us	20.25kw	265.75w	265.75w
uri/execp/https://google.com (compiled)	32.64us	20.26kw	297.24w	297.24w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two	62.82us	34.79kw	876.39w	876.39w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	64.02us	34.79kw	934.21w	934.21w
uri/execp/file:/random_crap	22.35us	12.33kw	107.56w	107.56w
uri/execp/file:/random_crap (compiled)	22.11us	12.33kw	129.12w	129.12w
uri/exec_opt/https://google.com	32.53us	20.29kw	266.72w	266.72w
uri/exec_opt/https://google.com (compiled)	33.20us	20.29kw	298.63w	298.63w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two	63.20us	34.82kw	878.71w	878.71w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	64.36us	34.82kw	936.53w	936.53w
uri/exec_opt/file:/random_crap	22.26us	12.35kw	107.54w	107.54w
uri/exec_opt/file:/random_crap (compiled)	22.65us	12.36kw	129.53w	129.53w
tex gitignore/execp	145_827.46us	50_257.93kw	7_351_814.62w	7_351_814.62w	2.91%
tex gitignore/execp (compiled)	146_297.92us	50_257.93kw	7_352_371.27w	7_352_371.27w	2.92%
tex gitignore/exec_opt	145_323.45us	50_268.29kw	7_353_503.86w	7_353_503.86w	2.90%
tex gitignore/exec_opt (compiled)	145_683.31us	50_268.30kw	7_353_840.54w	7_353_840.54w	2.91%
http/manual/no group	31_846.39us	8_868.11kw	1_922_220.84w	1_922_220.84w	0.64%
http/manual/no group (compiled)	32_142.42us	8_868.11kw	1_922_503.25w	1_922_503.25w	0.64%
http/manual/group	24_493.21us	7_178.55kw	1_735_082.74w	1_735_082.74w	0.49%
http/manual/group (compiled)	24_319.71us	7_178.56kw	1_735_309.09w	1_735_309.09w	0.49%
http/auto/execp no group	34_073.90us	9_822.85kw	2_074_315.76w	2_074_315.76w	0.68%
http/auto/execp no group (compiled)	33_702.31us	9_822.86kw	2_074_047.17w	2_074_047.17w	0.67%
http/auto/all_gen	26_157.87us	8_130.88kw	1_886_891.12w	1_886_891.12w	0.52%
http/auto/all_gen (compiled)	26_212.25us	8_130.89kw	1_887_129.83w	1_887_129.83w	0.52%
string traversal from #210	5_008_271.36us	1_465_991.56kw	297_886_563.00w	297_886_563.00w	100.00%
string traversal from #210 (compiled)	5_002_785.92us	1_465_991.57kw	297_886_572.50w	297_886_572.50w	99.89%
kleene star compilation	7_786.82us	1_920.74kw	500_504.42w	500_504.42w	0.16%
kleene star compilation (compiled)	7_798.07us	1_920.75kw	500_581.15w	500_581.15w	0.16%
memory 1:10	28.15us	16.33kw	451.20w	451.20w
memory 1:20	53.32us	25.54kw	965.31w	965.31w
memory 1:40	146.79us	55.04kw	3_598.92w	3_598.92w
memory 1:80	582.69us	158.19kw	25_990.42w	25_732.42w	0.01%
memory 1:100	974.97us	231.90kw	52_641.13w	52_383.13w	0.02%
memory 1:1000	90_230.10us	18_868.65kw	4_806_495.85w	4_802_647.85w	1.80%
memory 2:10	30.16us	17.98kw	526.55w	526.55w
memory 2:20	62.40us	33.15kw	1_472.76w	1_472.76w
memory 2:40	199.52us	87.77kw	7_673.42w	7_673.42w
memory 2:80	946.12us	293.96kw	66_114.00w	65_856.00w	0.02%
memory 2:100	1_396.75us	445.59kw	96_410.57w	96_152.57w	0.03%
memory 2:1000	135_157.72us	40_804.74kw	9_243_417.96w	9_239_569.96w	2.70%
repeated sequence re	147_852.66us	42_884.98kw	11_369_642.72w	8_066_984.72w	2.95%
repeated sequence re (compiled)	138_846.95us	42_884.99kw	11_369_639.91w	8_066_981.91w	2.77%
split on whitespace	593.79us	166.74kw	22_780.53w	22_780.53w	0.01%
split on whitespace (compiled)	601.46us	166.75kw	22_779.86w	22_779.86w	0.01%
shared prefixes	250.28us	75.93kw	3_236.20w	3_236.20w
shared prefixes (compiled)	263.02us	75.94kw	4_473.54w	4_473.54w

Utf8 with the re_unicode lib

$ _build/default/benchmarks/unicode/benchmark_unicode.exe

Name	Time/Run	mWd/Run	mjWd/Run	Prom/Run	Percentage
20 zeroes/exec/00000000000000000000	45.59us	23.10kw	420.12w	420.12w
20 zeroes/exec/00000000000000000000 (compiled)	45.57us	23.11kw	450.63w	450.63w
20 zeroes/execp/00000000000000000000	45.05us	23.02kw	414.99w	414.99w
20 zeroes/execp/00000000000000000000 (compiled)	45.52us	23.03kw	445.32w	445.32w
20 zeroes/exec_opt/00000000000000000000	45.77us	23.11kw	419.55w	419.55w
20 zeroes/exec_opt/00000000000000000000 (compiled)	45.62us	23.11kw	450.59w	450.59w
lots of a's/exec/aaaaaaaaaa .. (101)	89.31us	39.50kw	1_163.90w	1_163.90w
lots of a's/exec/aaaaaaaaaa .. (101) (compiled)	88.81us	39.50kw	1_190.95w	1_190.95w
lots of a's/execp/aaaaaaaaaa .. (101)	88.80us	39.48kw	1_165.94w	1_165.94w
lots of a's/execp/aaaaaaaaaa .. (101) (compiled)	88.18us	39.49kw	1_189.40w	1_189.40w
lots of a's/exec_opt/aaaaaaaaaa .. (101)	88.26us	39.50kw	1_164.33w	1_164.33w
lots of a's/exec_opt/aaaaaaaaaa .. (101) (compiled)	88.41us	39.51kw	1_190.48w	1_190.48w
media type match/exec/ foo/bar ; charset=UTF-8	8.88us	2.38kw	4.57w	4.57w
media type match/exec/ foo/bar ; charset=UTF-8 (compiled)	8.95us	2.38kw	6.69w	6.69w
media type match/execp/ foo/bar ; charset=UTF-8	8.63us	2.36kw	3.98w	3.98w
media type match/execp/ foo/bar ; charset=UTF-8 (compiled)	8.62us	2.36kw	6.60w	6.60w
media type match/exec_opt/ foo/bar ; charset=UTF-8	8.84us	2.38kw	4.56w	4.56w
media type match/exec_opt/ foo/bar ; charset=UTF-8 (compiled)	8.90us	2.39kw	6.70w	6.70w
uri/exec/https://google.com	55.29us	21.82kw	287.95w	287.95w
uri/exec/https://google.com (compiled)	55.75us	21.82kw	332.61w	332.61w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two	112.42us	37.10kw	943.84w	943.84w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	113.08us	37.11kw	1_024.63w	1_024.63w
uri/exec/file:/random_crap	45.14us	13.85kw	121.20w	121.20w
uri/exec/file:/random_crap (compiled)	45.23us	13.86kw	153.24w	153.24w
uri/execp/https://google.com	53.97us	21.79kw	284.82w	284.82w
uri/execp/https://google.com (compiled)	54.47us	21.79kw	329.93w	329.93w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two	109.59us	37.07kw	919.01w	919.01w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	108.30us	37.08kw	1_022.51w	1_022.51w
uri/execp/file:/random_crap	43.07us	13.83kw	121.19w	121.19w
uri/execp/file:/random_crap (compiled)	43.60us	13.84kw	152.53w	152.53w
uri/exec_opt/https://google.com	55.05us	21.82kw	287.06w	287.06w
uri/exec_opt/https://google.com (compiled)	55.81us	21.83kw	332.36w	332.36w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two	112.28us	37.10kw	939.32w	939.32w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two (compiled)	113.90us	37.11kw	1_024.15w	1_024.15w
uri/exec_opt/file:/random_crap	45.52us	13.85kw	121.15w	121.15w
uri/exec_opt/file:/random_crap (compiled)	45.93us	13.86kw	152.90w	152.90w
tex gitignore/execp	152_088.76us	51_020.68kw	7_390_355.13w	7_390_355.13w	2.37%
tex gitignore/execp (compiled)	152_883.30us	51_020.69kw	7_390_821.17w	7_390_821.17w	2.38%
tex gitignore/exec_opt	152_834.77us	51_031.05kw	7_392_046.09w	7_392_046.09w	2.38%
tex gitignore/exec_opt (compiled)	152_869.32us	51_031.06kw	7_391_694.46w	7_391_694.46w	2.38%
http/manual/no group	50_357.13us	12_545.87kw	5_003_479.38w	5_003_479.38w	0.78%
http/manual/no group (compiled)	50_880.42us	12_545.88kw	5_002_508.56w	5_002_508.56w	0.79%
http/manual/group	41_717.42us	10_856.32kw	4_815_267.66w	4_815_267.66w	0.65%
http/manual/group (compiled)	42_226.32us	10_856.32kw	4_816_119.34w	4_816_119.34w	0.66%
http/auto/execp no group	51_148.32us	13_500.62kw	5_155_333.28w	5_155_333.28w	0.80%
http/auto/execp no group (compiled)	58_799.59us	13_500.62kw	5_154_934.07w	5_154_934.07w	0.92%
http/auto/all_gen	58_334.85us	11_808.81kw	4_967_357.38w	4_967_357.38w	0.91%
http/auto/all_gen (compiled)	53_023.88us	11_808.81kw	4_968_237.30w	4_968_237.30w	0.83%
string traversal from #210	6_425_725.27us	1_494_992.11kw	299_897_071.50w	299_897_071.50w	100.00%
string traversal from #210 (compiled)	5_361_362.50us	1_494_992.12kw	299_897_170.00w	299_897_170.00w	83.44%
kleene star compilation	8_778.17us	2_210.87kw	520_495.73w	520_495.73w	0.14%
kleene star compilation (compiled)	8_668.49us	2_210.87kw	520_497.11w	520_497.11w	0.13%
memory 1:10	30.07us	17.30kw	469.45w	469.45w
memory 1:20	55.10us	26.79kw	1_001.07w	1_001.07w
memory 1:40	150.32us	56.88kw	3_689.84w	3_689.84w
memory 1:80	596.55us	161.19kw	26_007.37w	25_749.37w
memory 1:100	1_011.82us	235.48kw	55_199.60w	54_941.60w	0.02%
memory 1:1000	91_130.32us	18_898.32kw	4_808_407.05w	4_804_559.05w	1.42%
memory 2:10	30.72us	19.02kw	548.28w	548.28w
memory 2:20	64.00us	34.48kw	1_514.77w	1_514.77w
memory 2:40	200.63us	89.68kw	7_950.73w	7_950.73w
memory 2:80	988.76us	297.03kw	66_320.46w	66_062.46w	0.02%
memory 2:100	1_426.93us	449.24kw	96_652.97w	96_394.97w	0.02%
memory 2:1000	136_153.52us	40_834.49kw	9_245_760.41w	9_241_912.41w	2.12%
repeated sequence re	148_580.83us	43_286.58kw	11_410_293.58w	8_067_411.58w	2.31%
repeated sequence re (compiled)	148_058.63us	43_286.59kw	11_409_715.62w	8_066_833.62w	2.30%
split on whitespace	144.38us	43.30kw	1_247.60w	1_247.60w
split on whitespace (compiled)	144.20us	43.31kw	1_286.35w	1_286.35w
shared prefixes	453.48us	129.38kw	5_016.28w	5_016.28w
shared prefixes (compiled)	476.86us	129.38kw	7_992.90w	7_992.90w
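To read the tables side by side, it can help to compute the slowdown factor of re_unicode (Latin1) relative to the original lib. The sketch below does this for a handful of non-compiled rows; the timings are copied from the first two tables, and the choice of rows is arbitrary.

```ocaml
(* (benchmark name, original Re time in us, re_unicode Latin1 time in us),
   copied from the "Time/Run" column of the tables above. *)
let rows =
  [ ("20 zeroes/exec", 41.46, 43.64);
    ("lots of a's/exec", 6.16, 80.86);
    ("tex gitignore/execp", 2_301.81, 145_827.46);
    ("string traversal from #210", 6_669.15, 5_008_271.36) ]

(* Ratio of the re_unicode time to the original time. *)
let slowdown (_, original, unicode) = unicode /. original

let () =
  List.iter
    (fun ((name, _, _) as row) ->
      Printf.printf "%-28s x%.1f\n" name (slowdown row))
    rows
```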

rgrinberg · 2026-02-01T18:16:51Z

Thanks. Looks very interesting!

Did you accidentally commit some duplicated files?

adag24 · 2026-02-01T19:41:51Z

The Unicode part (lib/unicode) is currently separate from the original Re code (lib/), but I didn't see any duplicated files.
To try it in an existing program that uses Re, just define module Re = Re_unicode.Utf8.Re (or Utf16be, Utf16le, Latin1).
I saw that the build failed due to missing dependencies. I guess they were not updated in the dune-project file and the dune files (mainly uucp, plus uucd and zip during the build stage).

# File "lib/unicode/gen/dune", line 15, characters 12-16:
# 15 |  (libraries uucd))
#                  ^^^^
# Error: Library "uucd" not found.

# File "lib/unicode/gen/dune", line 10, characters 12-15:
# 10 |  (libraries zip))
#                  ^^^
# Error: Library "zip" not found.

# File "lib/unicode/gen/dune", line 5, characters 12-16:
# 5 |  (libraries uucp))
#                 ^^^^
# Error: Library "uucp" not found.

adag24 · 2026-02-04T11:46:21Z

I have tried to assess the gap to compliance with Unicode® Technical Standard #18, Unicode Regular Expressions.
The following are the requirements for reaching Level 1: Basic Unicode Support.
I will try to update the code to fill the gaps.

Requirements	Mandatory	Status	Remarks
1.1 Hex Notation The character set used by the regular expression writer may not be Unicode, or may not have the ability to input all Unicode code points from a keyboard.
RL1.1 Hex Notation To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.	Yes	✅	available through the "\u{1D11E}" escape in OCaml string literals (since OCaml 4.06)
1.1.1 Hex Notation and Normalization A regular expression engine may also enforce a single, uniform interpretation of regular expressions by always normalizing input text to Normalization Form NFC before interpreting that text. For more information, see UAX #15 - Unicode Normalization Forms.	No	❓	to check complete implementation for all inputs (regexp and strings)
1.2 Properties Because Unicode is a large character set that is regularly extended, a regular expression engine needs to provide for the recognition of whole categories of characters as well as simply literal sets of characters and strings; otherwise the listing of characters becomes impractical, out of date, and error-prone. This is done by providing syntax for sets of characters based on the Unicode character properties, as well as related properties and functions.
RL1.2 Properties To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following: - General_Category and Core Properties - Script and Script_Extensions - Alphabetic - Uppercase - Lowercase - White_Space - Noncharacter_Code_Point - Default_Ignorable_Code_Point - ANY - ASCII - ASSIGNED The values for these properties must follow the Unicode definitions, and include the property and property value aliases from the UCD. Matching of Binary, Enumerated, Catalog, and Name values must follow the Matching Rules from UAX44 - Unicode Character Database with one exception: implementations are not required to ignore an initial prefix string of "is" in property values.	Yes	❌	to implement in the unicode.ml file generated at building stage with the help of uucd (or uucp)
RL1.2a Compatibility Properties To meet this requirement, an implementation shall provide the properties listed in Annex C: Compatibility Properties, with the property values as listed there. Such an implementation shall document whether it is using the Standard Recommendation or POSIX-compatible properties.	Yes	❓	check completeness of compatibility
1.3 Subtraction and Intersection character properties are essential with a large character set. In addition, there needs to be a way to "subtract" characters from what is already in the list. For example, one may want to include all non-ASCII letters without having to list every character in \p{letter} that is not one of those 52.
RL1.3 Subtraction and Intersection To meet this requirement, an implementation shall supply mechanisms for union, intersection and set-difference of sets of characters within regular expression character class expressions.	Yes	✅	done in cset, …
1.4 Simple Word Boundaries Most regular expression engines allow a test for word boundaries (such as by "\b" in Perl). They generally use a very simple mechanism for determining word boundaries: one example of that would be having word boundaries between any pair of characters where one is a <word_character> and the other is not, or at the start and end of a string. This is not adequate for Unicode regular expressions.
RL1.4 Simple Word Boundaries To meet this requirement, an implementation shall extend the word boundary mechanism so that: The class of <word_character> includes all the Alphabetic values from the Unicode character database, from UnicodeData.txt, plus the decimals (General_Category=Decimal_Number, or equivalently Numeric_Type=Decimal), and the U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER (Join_Control=True). See also Annex C: Compatibility Properties. Nonspacing marks are never divided from their base characters, and otherwise ignored in locating boundaries.	Yes	❌	In the current implementation, we use cword that matches the general categories `Lu, Ll, Lt, Lm, Lo, Nd, Nl, No and character 0x005f = '_'`. Not compliant
1.5 Simple Loose Matches Most regular expression engines offer caseless matching as the only loose matching. If the engine does offer this, then it needs to account for the large range of cased Unicode characters outside of ASCII.
RL1.5 Simple Loose Matches To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not. To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding.	Yes	✅	implemented in unicode.ml with get_simple_case_folding. If case insensitivity is specified, we replace each character with all the characters that have the exact same case folding.
1.6 Line Boundaries Most regular expression engines also allow a test for line boundaries: end-of-line or start-of-line. This presumes that lines of text are separated by line (or paragraph) separators.
RL1.6 Line Boundaries To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028). Formfeed (U+000C) also normally indicates an end-of-line. For more information, see Chapter 3 of [Unicode].	Yes	✅	We use Cset.nl in category.ml that is cset_new_line in unicode.ml that is the line_break property with the values `CR, LF, NL and BK` : LF, VT, FF, CR, NEL, LSEP, PSEP ([ (0x000A, 0x000D); (0x0085, 0x0085); (0x2028, 0x2029) ])
1.7 Code Points A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units.
RL1.7 Supplementary Code Points To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.	Yes	✅	UTF-8, UTF-16BE and UTF-16LE encoded strings are properly decoded into code points. See uucodecs.ml
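Two of the remarks in the table can be checked with the standard library alone (OCaml >= 4.14). The sketch below is not the PR's code: `is_new_line` and `decode_first_utf16be` are hypothetical helpers. The ranges are the ones listed in the RL1.6 remark, and the UTF-16BE decoding uses `String.get_utf_16be_uchar` rather than uucodecs.ml.

```ocaml
(* RL1.6: line-boundary set as inclusive code point ranges,
   taken from the remark in the table. *)
let new_line_ranges = [ (0x000A, 0x000D); (0x0085, 0x0085); (0x2028, 0x2029) ]

let is_new_line (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  List.exists (fun (lo, hi) -> lo <= c && c <= hi) new_line_ranges

(* RL1.7: a leading surrogate followed by a trailing surrogate must decode
   to a single code point. Returns the first code point of [s] and the
   number of bytes it occupies. *)
let decode_first_utf16be (s : string) : Uchar.t * int =
  let d = String.get_utf_16be_uchar s 0 in
  (Uchar.utf_decode_uchar d, Uchar.utf_decode_length d)

(* U+1D11E MUSICAL SYMBOL G CLEF as the surrogate pair D834 DD1E. *)
let g_clef_utf16be = "\xD8\x34\xDD\x1E"
```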

adag added 5 commits January 29, 2026 23:38

first unicode commit

4a2555b

delete useless files

ed2c58b

delete useless files

fb204c0

minor change

74386cc

case insentivity implemented in unicode

57b49c2

adag added 2 commits February 1, 2026 19:37

fix dependencies in dune project and dune files

aba4216

fix zip dependency during build

878ea57

Add Utf16be, Utf16le to the Re_unicode interface

faff54c

unicode: make word boundaries compliant with RL1.4 Simple Word Bounda…

b18835c

…ries as per UTS ocaml#18 - Unicode Regular Expressions

adag24 wants to merge 9 commits into ocaml:master from adag24:master
