Add unicode support #592

Open
adag24 wants to merge 9 commits into ocaml:master from adag24:master


@adag24

adag24 commented Jan 29, 2026

First proposal to add Unicode support, continuing #48 and addressing #24.
Currently utf8, utf16be, utf16le and Latin1 (for backward compatibility) are supported, and other encodings can be built against the Uucodecs.T interface.

Note that:

  • this is not a byte comparison: the strings are decoded (byte comparison would not work, for very well-founded reasons).
  • String.t is used, but it can easily be generalized if necessary with a very small API (Bytes, in_channel, ...). Inputs are assumed to be correctly encoded (there is no fallback or replacement character at the moment).
  • the regular expressions and strings are normalized to NFC with the help of uunf when processed (completeness within the code is still to be checked). Different normalization forms in the regular expressions or strings should therefore have no impact on a match. But does this slow down the process?
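
To illustrate the first and third notes, here is a minimal sketch that uses only the OCaml standard library (it assumes OCaml 4.14 or later for String.get_utf_8_uchar). The helper name code_points is hypothetical and is not part of this PR, which goes through Uucodecs and uunf instead.

```ocaml
(* Decode a UTF-8 string into its list of code points.
   Hypothetical helper for illustration only; requires OCaml >= 4.14. *)
let code_points (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then invalid_arg "malformed UTF-8"
      else
        go
          (i + Uchar.utf_decode_length d)
          (Uchar.to_int (Uchar.utf_decode_uchar d) :: acc)
  in
  go 0 []

(* The same user-perceived character in two normalization forms. *)
let precomposed = "\u{00E9}"  (* U+00E9, NFC form *)
let decomposed = "e\u{0301}"  (* U+0065 followed by U+0301, NFD form *)

let () =
  (* Byte lengths differ, so a byte comparison cannot treat them as equal. *)
  Printf.printf "%d bytes vs %d bytes\n"
    (String.length precomposed) (String.length decomposed);
  (* The code point sequences differ too, hence the normalization step. *)
  Printf.printf "%d code point(s) vs %d code point(s)\n"
    (List.length (code_points precomposed))
    (List.length (code_points decomposed))
```

The two strings render identically but differ both as bytes and as code point sequences, so they only compare equal once both sides are normalized (under NFC both become U+00E9).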

Essentially everything is kept from the original, except that the code is now heavily functor-based:

  • the variable-length encoding should match the Uucodecs.T interface
  • Cset is a functor taking as argument a CodePage.T (see lib/unicode/cset.mli) and a Uucodecs.T
  • a Color_map.T is created (functor) with a Cset.T argument
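
As a rough picture of this layering, here is a small self-contained sketch. The names CODEC, Make_cset and count_members are hypothetical and much simpler than the real Uucodecs.T, CodePage.T and Cset signatures in lib/unicode; the sketch only shows how a character-set module can be parameterized by the encoding.

```ocaml
(* Hypothetical, simplified stand-in for a decoder interface. *)
module type CODEC = sig
  (* [decode s i] returns the code point starting at byte [i]
     and the byte index of the next code point. *)
  val decode : string -> int -> int * int
end

module Latin1 : CODEC = struct
  let decode s i = (Char.code s.[i], i + 1)
end

module Utf8 : CODEC = struct
  (* Uses the OCaml >= 4.14 standard library decoder. *)
  let decode s i =
    let d = String.get_utf_8_uchar s i in
    if not (Uchar.utf_decode_is_valid d) then invalid_arg "malformed UTF-8";
    (Uchar.to_int (Uchar.utf_decode_uchar d), i + Uchar.utf_decode_length d)
end

(* A character-set module parameterized by the encoding. *)
module Make_cset (C : CODEC) = struct
  type t = (int * int) list (* inclusive code point ranges *)

  let mem c (set : t) = List.exists (fun (lo, hi) -> lo <= c && c <= hi) set

  (* Count the decoded code points of [s] that belong to [set]. *)
  let count_members set s =
    let rec go i n =
      if i >= String.length s then n
      else
        let c, next = C.decode s i in
        go next (if mem c set then n + 1 else n)
    in
    go 0 0
end

module L = Make_cset (Latin1)
module U = Make_cset (Utf8)

let upper_half = [ (0x80, 0xFF) ]

let () =
  (* The same UTF-8 bytes yield one member when decoded as UTF-8,
     and two when misread as Latin1. *)
  Printf.printf "utf8: %d, latin1: %d\n"
    (U.count_members upper_half "caf\u{00E9}")
    (L.count_members upper_half "caf\u{00E9}")
```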

The Unicode categories (digit, alpha, alphanum, xdigit, ...) are generated at build time with the help of uucd and uucp. This is a first attempt and still needs to be thoroughly checked and corrected. The library itself relies on uucp:

Usage:
A. To do a simple case folding, use the mappings with status C + S.
B. To do a full case folding, use the mappings with status C + F.
The mappings with status T can be used or omitted depending on the desired case-folding
behavior. (The default option is to exclude them.)
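
The status rules quoted above can be sketched as follows. This is not the code from this PR: the table holds only a handful of entries as I recall them from CaseFolding.txt (to be checked against the actual file), and the names status, table and fold are hypothetical.

```ocaml
(* Mapping status as used in CaseFolding.txt. *)
type status = C | F | S | T

(* Tiny illustrative excerpt: (source, status, folded sequence). *)
let table =
  [ (0x0041, C, [ 0x0061 ]);          (* A -> a *)
    (0x0049, C, [ 0x0069 ]);          (* I -> i *)
    (0x0049, T, [ 0x0131 ]);          (* I -> dotless i, Turkic only *)
    (0x00DF, F, [ 0x0073; 0x0073 ]);  (* sharp s -> "ss", full folding *)
    (0x1E9E, F, [ 0x0073; 0x0073 ]);  (* capital sharp s -> "ss", full *)
    (0x1E9E, S, [ 0x00DF ]) ]         (* capital sharp s -> sharp s, simple *)

(* Simple folding uses C + S, full folding uses C + F.
   T mappings are excluded, which is the default option. *)
let fold ~full cp =
  let wanted st = st = C || (if full then st = F else st = S) in
  match List.find_opt (fun (src, st, _) -> src = cp && wanted st) table with
  | Some (_, _, target) -> target
  | None -> [ cp ]

let () =
  let show l = String.concat " " (List.map (Printf.sprintf "%04X") l) in
  Printf.printf "simple: %s\n" (show (fold ~full:false 0x1E9E));
  Printf.printf "full:   %s\n" (show (fold ~full:true 0x1E9E))
```

Simple folding always maps one code point to one code point, which is what keeps it compatible with a character-set based matcher; full folding can expand a code point into a sequence.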

Any comment is welcome!

Below are the benchmark results for the original library and for the modified one.

  1. Latin1 with the original lib

$ _build/default/benchmarks/benchmark.exe

Name Time/Run mWd/Run mjWd/Run Prom/Run Percentage
20 zeroes/exec/00000000000000000000 41.46us 21.77kw 400.08w 400.08w 0.03%
20 zeroes/exec/00000000000000000000 (compiled) 42.63us 21.78kw 418.03w 418.03w 0.03%
20 zeroes/execp/00000000000000000000 41.26us 21.69kw 395.27w 395.27w 0.03%
20 zeroes/execp/00000000000000000000 (compiled) 41.38us 21.70kw 412.92w 412.92w 0.03%
20 zeroes/exec_opt/00000000000000000000 41.64us 21.77kw 400.00w 400.00w 0.03%
20 zeroes/exec_opt/00000000000000000000 (compiled) 41.48us 21.78kw 417.27w 417.27w 0.03%
lots of a's/exec/aaaaaaaaaa .. (101) 6.16us 2.30kw 4.80w 4.80w
lots of a's/exec/aaaaaaaaaa .. (101) (compiled) 6.25us 2.31kw 5.73w 5.73w
lots of a's/execp/aaaaaaaaaa .. (101) 5.74us 2.29kw 3.30w 3.30w
lots of a's/execp/aaaaaaaaaa .. (101) (compiled) 5.86us 2.30kw 5.63w 5.63w
lots of a's/exec_opt/aaaaaaaaaa .. (101) 6.14us 2.31kw 4.52w 4.52w
lots of a's/exec_opt/aaaaaaaaaa .. (101) (compiled) 6.28us 2.31kw 5.73w 5.73w
media type match/exec/ foo/bar ; charset=UTF-8 5.41us 2.48kw 5.30w 5.30w
media type match/exec/ foo/bar ; charset=UTF-8 (compiled) 5.44us 2.48kw 6.56w 6.56w
media type match/execp/ foo/bar ; charset=UTF-8 5.25us 2.46kw 5.46w 5.46w
media type match/execp/ foo/bar ; charset=UTF-8 (compiled) 5.30us 2.46kw 6.83w 6.83w
media type match/exec_opt/ foo/bar ; charset=UTF-8 5.39us 2.48kw 5.18w 5.18w
media type match/exec_opt/ foo/bar ; charset=UTF-8 (compiled) 5.49us 2.49kw 6.34w 6.34w
uri/exec/https://google.com 20.81us 12.99kw 96.65w 96.65w 0.01%
uri/exec/https://google.com (compiled) 21.04us 13.00kw 112.86w 112.86w 0.01%
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two 34.07us 19.80kw 219.60w 219.60w 0.02%
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 34.39us 19.81kw 245.03w 245.03w 0.02%
uri/exec/file:/random_crap 12.45us 7.22kw 34.98w 34.98w
uri/exec/file:/random_crap (compiled) 12.67us 7.23kw 45.20w 45.20w
uri/execp/https://google.com 20.40us 12.96kw 96.16w 96.16w 0.01%
uri/execp/https://google.com (compiled) 20.54us 12.97kw 112.21w 112.21w 0.01%
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two 33.53us 19.77kw 219.07w 219.07w 0.02%
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 33.71us 19.78kw 242.97w 242.97w 0.02%
uri/execp/file:/random_crap 12.20us 7.20kw 34.73w 34.73w
uri/execp/file:/random_crap (compiled) 12.35us 7.21kw 45.42w 45.42w
uri/exec_opt/https://google.com 20.56us 12.99kw 97.28w 97.28w 0.01%
uri/exec_opt/https://google.com (compiled) 21.02us 13.00kw 112.95w 112.95w 0.01%
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two 34.54us 19.80kw 217.06w 217.06w 0.02%
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 35.08us 19.81kw 241.45w 241.45w 0.02%
uri/exec_opt/file:/random_crap 12.44us 7.22kw 34.90w 34.90w
uri/exec_opt/file:/random_crap (compiled) 12.64us 7.23kw 45.64w 45.64w
tex gitignore/execp 2_301.81us 885.08kw 37_535.37w 37_535.37w 1.63%
tex gitignore/execp (compiled) 2_281.65us 885.09kw 37_512.69w 37_512.69w 1.61%
tex gitignore/exec_opt 2_414.22us 895.45kw 37_654.49w 37_654.49w 1.70%
tex gitignore/exec_opt (compiled) 2_419.66us 895.46kw 37_632.94w 37_632.94w 1.71%
http/manual/no group 2_503.97us 763.96kw 84_051.23w 83_025.23w 1.77%
http/manual/no group (compiled) 2_513.46us 763.97kw 84_053.34w 83_027.34w 1.77%
http/manual/group 363.36us 81.59kw 2_082.90w 2_082.90w 0.26%
http/manual/group (compiled) 370.84us 81.60kw 2_135.82w 2_135.82w 0.26%
http/auto/execp no group 4_393.49us 1_569.95kw 165_111.49w 164_085.49w 3.10%
http/auto/execp no group (compiled) 4_350.63us 1_569.95kw 165_119.63w 164_093.63w 3.07%
http/auto/all_gen 482.83us 142.05kw 6_646.74w 6_646.74w 0.34%
http/auto/all_gen (compiled) 481.66us 142.06kw 6_767.41w 6_767.41w 0.34%
string traversal from #210 6_669.15us 38.26kw 1_181.19w 1_181.19w 4.71%
string traversal from #210 (compiled) 6_832.59us 38.27kw 1_211.34w 1_211.34w 4.82%
kleene star compilation 68.41us 1.25kw 0.39w 0.39w 0.05%
kleene star compilation (compiled) 68.64us 1.26kw 2.13w 2.13w 0.05%
memory 1:10 25.81us 16.20kw 459.31w 459.31w 0.02%
memory 1:20 49.20us 25.39kw 969.06w 969.06w 0.03%
memory 1:40 135.01us 54.86kw 3_626.63w 3_626.63w 0.10%
memory 1:80 538.39us 157.93kw 25_296.31w 25_038.31w 0.38%
memory 1:100 930.88us 231.60kw 53_578.12w 53_320.12w 0.66%
memory 1:1000 85_911.83us 18_867.08kw 4_813_038.41w 4_808_164.41w 60.66%
memory 2:10 27.84us 17.84kw 533.94w 533.94w 0.02%
memory 2:20 58.13us 32.99kw 1_477.58w 1_477.58w 0.04%
memory 2:40 186.34us 87.57kw 7_847.35w 7_847.35w 0.13%
memory 2:80 934.79us 293.68kw 66_448.95w 66_190.95w 0.66%
memory 2:100 1_384.22us 445.27kw 96_782.61w 96_524.61w 0.98%
memory 2:1000 132_747.46us 40_803.16kw 9_250_358.54w 9_245_484.54w 93.72%
repeated sequence re 141_639.90us 42_854.30kw 11_452_525.43w 8_118_113.43w 100.00%
repeated sequence re (compiled) 140_852.12us 42_854.31kw 11_451_911.58w 8_117_499.58w 99.44%
split on whitespace 27.30us 7.64kw 45.05w 45.05w 0.02%
split on whitespace (compiled) 25.86us 7.64kw 46.75w 46.75w 0.02%
shared prefixes 113.11us 31.27kw 332.18w 332.18w 0.08%
shared prefixes (compiled) 113.44us 31.28kw 474.93w 474.93w 0.08%
  2. Latin1 with the re_unicode lib

$ _build/default/benchmarks/unicode/benchmark_unicode.exe

Name Time/Run mWd/Run mjWd/Run Prom/Run Percentage
20 zeroes/exec/00000000000000000000 43.64us 21.96kw 403.78w 403.78w
20 zeroes/exec/00000000000000000000 (compiled) 43.62us 21.97kw 423.49w 423.49w
20 zeroes/execp/00000000000000000000 43.04us 21.88kw 398.54w 398.54w
20 zeroes/execp/00000000000000000000 (compiled) 43.43us 21.89kw 419.03w 419.03w
20 zeroes/exec_opt/00000000000000000000 43.28us 21.96kw 398.37w 398.37w
20 zeroes/exec_opt/00000000000000000000 (compiled) 43.39us 21.97kw 422.08w 422.08w
lots of a's/exec/aaaaaaaaaa .. (101) 80.86us 36.37kw 1_046.21w 1_046.21w
lots of a's/exec/aaaaaaaaaa .. (101) (compiled) 82.13us 36.38kw 1_064.89w 1_064.89w
lots of a's/execp/aaaaaaaaaa .. (101) 80.86us 36.36kw 1_047.11w 1_047.11w
lots of a's/execp/aaaaaaaaaa .. (101) (compiled) 81.17us 36.36kw 1_063.95w 1_063.95w
lots of a's/exec_opt/aaaaaaaaaa .. (101) 81.86us 36.37kw 1_048.18w 1_048.18w
lots of a's/exec_opt/aaaaaaaaaa .. (101) (compiled) 80.64us 36.38kw 1_065.40w 1_065.40w
media type match/exec/ foo/bar ; charset=UTF-8 4.79us 1.96kw 3.78w 3.78w
media type match/exec/ foo/bar ; charset=UTF-8 (compiled) 4.90us 1.97kw 5.13w 5.13w
media type match/execp/ foo/bar ; charset=UTF-8 4.75us 1.94kw 3.53w 3.53w
media type match/execp/ foo/bar ; charset=UTF-8 (compiled) 4.80us 1.95kw 5.05w 5.05w
media type match/exec_opt/ foo/bar ; charset=UTF-8 4.85us 1.96kw 3.80w 3.80w
media type match/exec_opt/ foo/bar ; charset=UTF-8 (compiled) 4.90us 1.97kw 5.16w 5.16w
uri/exec/https://google.com 32.63us 20.29kw 268.22w 268.22w
uri/exec/https://google.com (compiled) 33.11us 20.29kw 298.33w 298.33w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two 62.87us 34.81kw 827.79w 827.79w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 64.41us 34.82kw 936.98w 936.98w
uri/exec/file:/random_crap 22.38us 12.35kw 108.13w 108.13w
uri/exec/file:/random_crap (compiled) 22.82us 12.35kw 129.79w 129.79w
uri/execp/https://google.com 32.34us 20.25kw 265.75w 265.75w
uri/execp/https://google.com (compiled) 32.64us 20.26kw 297.24w 297.24w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two 62.82us 34.79kw 876.39w 876.39w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 64.02us 34.79kw 934.21w 934.21w
uri/execp/file:/random_crap 22.35us 12.33kw 107.56w 107.56w
uri/execp/file:/random_crap (compiled) 22.11us 12.33kw 129.12w 129.12w
uri/exec_opt/https://google.com 32.53us 20.29kw 266.72w 266.72w
uri/exec_opt/https://google.com (compiled) 33.20us 20.29kw 298.63w 298.63w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two 63.20us 34.82kw 878.71w 878.71w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 64.36us 34.82kw 936.53w 936.53w
uri/exec_opt/file:/random_crap 22.26us 12.35kw 107.54w 107.54w
uri/exec_opt/file:/random_crap (compiled) 22.65us 12.36kw 129.53w 129.53w
tex gitignore/execp 145_827.46us 50_257.93kw 7_351_814.62w 7_351_814.62w 2.91%
tex gitignore/execp (compiled) 146_297.92us 50_257.93kw 7_352_371.27w 7_352_371.27w 2.92%
tex gitignore/exec_opt 145_323.45us 50_268.29kw 7_353_503.86w 7_353_503.86w 2.90%
tex gitignore/exec_opt (compiled) 145_683.31us 50_268.30kw 7_353_840.54w 7_353_840.54w 2.91%
http/manual/no group 31_846.39us 8_868.11kw 1_922_220.84w 1_922_220.84w 0.64%
http/manual/no group (compiled) 32_142.42us 8_868.11kw 1_922_503.25w 1_922_503.25w 0.64%
http/manual/group 24_493.21us 7_178.55kw 1_735_082.74w 1_735_082.74w 0.49%
http/manual/group (compiled) 24_319.71us 7_178.56kw 1_735_309.09w 1_735_309.09w 0.49%
http/auto/execp no group 34_073.90us 9_822.85kw 2_074_315.76w 2_074_315.76w 0.68%
http/auto/execp no group (compiled) 33_702.31us 9_822.86kw 2_074_047.17w 2_074_047.17w 0.67%
http/auto/all_gen 26_157.87us 8_130.88kw 1_886_891.12w 1_886_891.12w 0.52%
http/auto/all_gen (compiled) 26_212.25us 8_130.89kw 1_887_129.83w 1_887_129.83w 0.52%
string traversal from #210 5_008_271.36us 1_465_991.56kw 297_886_563.00w 297_886_563.00w 100.00%
string traversal from #210 (compiled) 5_002_785.92us 1_465_991.57kw 297_886_572.50w 297_886_572.50w 99.89%
kleene star compilation 7_786.82us 1_920.74kw 500_504.42w 500_504.42w 0.16%
kleene star compilation (compiled) 7_798.07us 1_920.75kw 500_581.15w 500_581.15w 0.16%
memory 1:10 28.15us 16.33kw 451.20w 451.20w
memory 1:20 53.32us 25.54kw 965.31w 965.31w
memory 1:40 146.79us 55.04kw 3_598.92w 3_598.92w
memory 1:80 582.69us 158.19kw 25_990.42w 25_732.42w 0.01%
memory 1:100 974.97us 231.90kw 52_641.13w 52_383.13w 0.02%
memory 1:1000 90_230.10us 18_868.65kw 4_806_495.85w 4_802_647.85w 1.80%
memory 2:10 30.16us 17.98kw 526.55w 526.55w
memory 2:20 62.40us 33.15kw 1_472.76w 1_472.76w
memory 2:40 199.52us 87.77kw 7_673.42w 7_673.42w
memory 2:80 946.12us 293.96kw 66_114.00w 65_856.00w 0.02%
memory 2:100 1_396.75us 445.59kw 96_410.57w 96_152.57w 0.03%
memory 2:1000 135_157.72us 40_804.74kw 9_243_417.96w 9_239_569.96w 2.70%
repeated sequence re 147_852.66us 42_884.98kw 11_369_642.72w 8_066_984.72w 2.95%
repeated sequence re (compiled) 138_846.95us 42_884.99kw 11_369_639.91w 8_066_981.91w 2.77%
split on whitespace 593.79us 166.74kw 22_780.53w 22_780.53w 0.01%
split on whitespace (compiled) 601.46us 166.75kw 22_779.86w 22_779.86w 0.01%
shared prefixes 250.28us 75.93kw 3_236.20w 3_236.20w
shared prefixes (compiled) 263.02us 75.94kw 4_473.54w 4_473.54w
  3. Utf8 with the re_unicode lib

$ _build/default/benchmarks/unicode/benchmark_unicode.exe

Name Time/Run mWd/Run mjWd/Run Prom/Run Percentage
20 zeroes/exec/00000000000000000000 45.59us 23.10kw 420.12w 420.12w
20 zeroes/exec/00000000000000000000 (compiled) 45.57us 23.11kw 450.63w 450.63w
20 zeroes/execp/00000000000000000000 45.05us 23.02kw 414.99w 414.99w
20 zeroes/execp/00000000000000000000 (compiled) 45.52us 23.03kw 445.32w 445.32w
20 zeroes/exec_opt/00000000000000000000 45.77us 23.11kw 419.55w 419.55w
20 zeroes/exec_opt/00000000000000000000 (compiled) 45.62us 23.11kw 450.59w 450.59w
lots of a's/exec/aaaaaaaaaa .. (101) 89.31us 39.50kw 1_163.90w 1_163.90w
lots of a's/exec/aaaaaaaaaa .. (101) (compiled) 88.81us 39.50kw 1_190.95w 1_190.95w
lots of a's/execp/aaaaaaaaaa .. (101) 88.80us 39.48kw 1_165.94w 1_165.94w
lots of a's/execp/aaaaaaaaaa .. (101) (compiled) 88.18us 39.49kw 1_189.40w 1_189.40w
lots of a's/exec_opt/aaaaaaaaaa .. (101) 88.26us 39.50kw 1_164.33w 1_164.33w
lots of a's/exec_opt/aaaaaaaaaa .. (101) (compiled) 88.41us 39.51kw 1_190.48w 1_190.48w
media type match/exec/ foo/bar ; charset=UTF-8 8.88us 2.38kw 4.57w 4.57w
media type match/exec/ foo/bar ; charset=UTF-8 (compiled) 8.95us 2.38kw 6.69w 6.69w
media type match/execp/ foo/bar ; charset=UTF-8 8.63us 2.36kw 3.98w 3.98w
media type match/execp/ foo/bar ; charset=UTF-8 (compiled) 8.62us 2.36kw 6.60w 6.60w
media type match/exec_opt/ foo/bar ; charset=UTF-8 8.84us 2.38kw 4.56w 4.56w
media type match/exec_opt/ foo/bar ; charset=UTF-8 (compiled) 8.90us 2.39kw 6.70w 6.70w
uri/exec/https://google.com 55.29us 21.82kw 287.95w 287.95w
uri/exec/https://google.com (compiled) 55.75us 21.82kw 332.61w 332.61w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two 112.42us 37.10kw 943.84w 943.84w
uri/exec/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 113.08us 37.11kw 1_024.63w 1_024.63w
uri/exec/file:/random_crap 45.14us 13.85kw 121.20w 121.20w
uri/exec/file:/random_crap (compiled) 45.23us 13.86kw 153.24w 153.24w
uri/execp/https://google.com 53.97us 21.79kw 284.82w 284.82w
uri/execp/https://google.com (compiled) 54.47us 21.79kw 329.93w 329.93w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two 109.59us 37.07kw 919.01w 919.01w
uri/execp/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 108.30us 37.08kw 1_022.51w 1_022.51w
uri/execp/file:/random_crap 43.07us 13.83kw 121.19w 121.19w
uri/execp/file:/random_crap (compiled) 43.60us 13.84kw 152.53w 152.53w
uri/exec_opt/https://google.com 55.05us 21.82kw 287.06w 287.06w
uri/exec_opt/https://google.com (compiled) 55.81us 21.83kw 332.36w 332.36w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two 112.28us 37.10kw 939.32w 939.32w
uri/exec_opt/http://yahoo.com/xxx/yyy?query=param&one=two (compiled) 113.90us 37.11kw 1_024.15w 1_024.15w
uri/exec_opt/file:/random_crap 45.52us 13.85kw 121.15w 121.15w
uri/exec_opt/file:/random_crap (compiled) 45.93us 13.86kw 152.90w 152.90w
tex gitignore/execp 152_088.76us 51_020.68kw 7_390_355.13w 7_390_355.13w 2.37%
tex gitignore/execp (compiled) 152_883.30us 51_020.69kw 7_390_821.17w 7_390_821.17w 2.38%
tex gitignore/exec_opt 152_834.77us 51_031.05kw 7_392_046.09w 7_392_046.09w 2.38%
tex gitignore/exec_opt (compiled) 152_869.32us 51_031.06kw 7_391_694.46w 7_391_694.46w 2.38%
http/manual/no group 50_357.13us 12_545.87kw 5_003_479.38w 5_003_479.38w 0.78%
http/manual/no group (compiled) 50_880.42us 12_545.88kw 5_002_508.56w 5_002_508.56w 0.79%
http/manual/group 41_717.42us 10_856.32kw 4_815_267.66w 4_815_267.66w 0.65%
http/manual/group (compiled) 42_226.32us 10_856.32kw 4_816_119.34w 4_816_119.34w 0.66%
http/auto/execp no group 51_148.32us 13_500.62kw 5_155_333.28w 5_155_333.28w 0.80%
http/auto/execp no group (compiled) 58_799.59us 13_500.62kw 5_154_934.07w 5_154_934.07w 0.92%
http/auto/all_gen 58_334.85us 11_808.81kw 4_967_357.38w 4_967_357.38w 0.91%
http/auto/all_gen (compiled) 53_023.88us 11_808.81kw 4_968_237.30w 4_968_237.30w 0.83%
string traversal from #210 6_425_725.27us 1_494_992.11kw 299_897_071.50w 299_897_071.50w 100.00%
string traversal from #210 (compiled) 5_361_362.50us 1_494_992.12kw 299_897_170.00w 299_897_170.00w 83.44%
kleene star compilation 8_778.17us 2_210.87kw 520_495.73w 520_495.73w 0.14%
kleene star compilation (compiled) 8_668.49us 2_210.87kw 520_497.11w 520_497.11w 0.13%
memory 1:10 30.07us 17.30kw 469.45w 469.45w
memory 1:20 55.10us 26.79kw 1_001.07w 1_001.07w
memory 1:40 150.32us 56.88kw 3_689.84w 3_689.84w
memory 1:80 596.55us 161.19kw 26_007.37w 25_749.37w
memory 1:100 1_011.82us 235.48kw 55_199.60w 54_941.60w 0.02%
memory 1:1000 91_130.32us 18_898.32kw 4_808_407.05w 4_804_559.05w 1.42%
memory 2:10 30.72us 19.02kw 548.28w 548.28w
memory 2:20 64.00us 34.48kw 1_514.77w 1_514.77w
memory 2:40 200.63us 89.68kw 7_950.73w 7_950.73w
memory 2:80 988.76us 297.03kw 66_320.46w 66_062.46w 0.02%
memory 2:100 1_426.93us 449.24kw 96_652.97w 96_394.97w 0.02%
memory 2:1000 136_153.52us 40_834.49kw 9_245_760.41w 9_241_912.41w 2.12%
repeated sequence re 148_580.83us 43_286.58kw 11_410_293.58w 8_067_411.58w 2.31%
repeated sequence re (compiled) 148_058.63us 43_286.59kw 11_409_715.62w 8_066_833.62w 2.30%
split on whitespace 144.38us 43.30kw 1_247.60w 1_247.60w
split on whitespace (compiled) 144.20us 43.31kw 1_286.35w 1_286.35w
shared prefixes 453.48us 129.38kw 5_016.28w 5_016.28w
shared prefixes (compiled) 476.86us 129.38kw 7_992.90w 7_992.90w

@rgrinberg
Member

Thanks. Looks very interesting!

Did you accidentally commit some duplicated files?

@adag24
Author

adag24 commented Feb 1, 2026

The Unicode part (lib/unicode) is currently separate from the original Re code (lib/).
But I didn't see any duplicated files.
So, to try it in an existing program that uses Re, just write module Re = Re_unicode.Utf8.Re (or Utf16be, Utf16le, Latin1).
I saw that the build failed because of missing dependencies. I suspect they were not declared in the dune-project file and dune files (mainly uucp, and uucd and zip during the build stage).

# File "lib/unicode/gen/dune", line 15, characters 12-16:
# 15 |  (libraries uucd))
#                  ^^^^
# Error: Library "uucd" not found.
# File "lib/unicode/gen/dune", line 10, characters 12-15:
# 10 |  (libraries zip))
#                  ^^^
# Error: Library "zip" not found.
# File "lib/unicode/gen/dune", line 5, characters 12-16:
# 5 |  (libraries uucp))
#                 ^^^^
# Error: Library "uucp" not found.

@adag24
Author

adag24 commented Feb 4, 2026

I have tried to assess the gap to compliance with Unicode® Technical Standard #18, Unicode Regular Expressions.
The following are the requirements for reaching Level 1: Basic Unicode Support.
I will try to update the code to fill the gaps.

Requirements Mandatory Status Remarks
1.1 Hex Notation
The character set used by the regular expression writer may not be Unicode, or may not have the ability to input all Unicode code points from a keyboard.
RL1.1 Hex Notation
To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.
Mandatory: Yes. Remarks: present in Stdlib.String, e.g. "\u{1D11E}", since 4.12.0 (version to be confirmed).

1.1.1 Hex Notation and Normalization
A regular expression engine may also enforce a single, uniform interpretation of regular expressions by always normalizing input text to Normalization Form NFC before interpreting that text. For more information, see UAX #15 - Unicode Normalization Forms.
Mandatory: No. Remarks: a complete implementation for all inputs (regular expressions and strings) is still to be checked.
1.2 Properties
Because Unicode is a large character set that is regularly extended, a regular expression engine needs to provide for the recognition of whole categories of characters as well as simply literal sets of characters and strings; otherwise the listing of characters becomes impractical, out of date, and error-prone. This is done by providing syntax for sets of characters based on the Unicode character properties, as well as related properties and functions.
RL1.2 Properties
To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following:
- General_Category and Core Properties
- Script and Script_Extensions
- Alphabetic
- Uppercase
- Lowercase
- White_Space
- Noncharacter_Code_Point
- Default_Ignorable_Code_Point
- ANY
- ASCII
- ASSIGNED

The values for these properties must follow the Unicode definitions, and include the property and property value aliases from the UCD. Matching of Binary, Enumerated, Catalog, and Name values must follow the Matching Rules from UAX44 - Unicode Character Database with one exception: implementations are not required to ignore an initial prefix string of "is" in property values.
Mandatory: Yes. Remarks: to be implemented in the unicode.ml file generated at the build stage with the help of uucd (or uucp).
RL1.2a Compatibility Properties
To meet this requirement, an implementation shall provide the properties listed in Annex C: Compatibility Properties, with the property values as listed there. Such an implementation shall document whether it is using the Standard Recommendation or POSIX-compatible properties.
Mandatory: Yes. Remarks: completeness of the compatibility properties is still to be checked.
1.3 Subtraction and Intersection
Character properties are essential with a large character set. In addition, there needs to be a way to "subtract" characters from what is already in the list. For example, one may want to include all non-ASCII letters without having to list every character in \p{letter} that is not one of those 52.
RL1.3 Subtraction and Intersection
To meet this requirement, an implementation shall supply mechanisms for union, intersection and set-difference of sets of characters within regular expression character class expressions.
Mandatory: Yes. Remarks: done in cset, …
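
For reference, the three operations required by RL1.3 can be sketched on sorted lists of disjoint, inclusive code point ranges. This is an independent illustration, not the Cset code from this PR, and the letters set below is only a tiny hypothetical excerpt of \p{letter}.

```ocaml
(* Sets of code points as sorted, disjoint, inclusive ranges. *)
type cset = (int * int) list

let rec union (a : cset) (b : cset) : cset =
  match (a, b) with
  | [], s | s, [] -> s
  | (l1, h1) :: r1, (l2, h2) :: r2 ->
      if h1 + 1 < l2 then (l1, h1) :: union r1 b
      else if h2 + 1 < l1 then (l2, h2) :: union a r2
      else if h1 < h2 then union r1 ((min l1 l2, h2) :: r2)
      else union ((min l1 l2, h1) :: r1) r2

let rec inter (a : cset) (b : cset) : cset =
  match (a, b) with
  | [], _ | _, [] -> []
  | (l1, h1) :: r1, (l2, h2) :: r2 ->
      if h1 < l2 then inter r1 b
      else if h2 < l1 then inter a r2
      else if h1 < h2 then (max l1 l2, h1) :: inter r1 b
      else (max l1 l2, h2) :: inter a r2

let rec diff (a : cset) (b : cset) : cset =
  match (a, b) with
  | [], _ -> []
  | s, [] -> s
  | (l1, h1) :: r1, (l2, h2) :: r2 ->
      if h1 < l2 then (l1, h1) :: diff r1 b
      else if h2 < l1 then diff a r2
      else
        (* The heads overlap: keep what lies before l2, then continue
           with whatever part of the first range extends past h2. *)
        let rest =
          if h2 < h1 then diff ((h2 + 1, h1) :: r1) r2 else diff r1 b
        in
        if l1 < l2 then (l1, l2 - 1) :: rest else rest

(* Hypothetical excerpt of the letters: ASCII plus part of Latin-1. *)
let letters = [ (0x41, 0x5A); (0x61, 0x7A); (0xC0, 0xD6); (0xD8, 0xF6) ]
let ascii = [ (0x00, 0x7F) ]

(* "All non-ASCII letters", as in the example from the standard. *)
let non_ascii_letters = diff letters ascii
```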
1.4 Simple Word Boundaries
Most regular expression engines allow a test for word boundaries (such as by "\b" in Perl). They generally use a very simple mechanism for determining word boundaries: one example of that would be having word boundaries between any pair of characters where one is a <word_character> and the other is not, or at the start and end of a string. This is not adequate for Unicode regular expressions.
RL1.4 Simple Word Boundaries
To meet this requirement, an implementation shall extend the word boundary mechanism so that:
The class of <word_character> includes all the Alphabetic values from the Unicode character database, from UnicodeData.txt, plus the decimals (General_Category=Decimal_Number, or equivalently Numeric_Type=Decimal), and the U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER (Join_Control=True). See also Annex C: Compatibility Properties.
Nonspacing marks are never divided from their base characters, and otherwise ignored in locating boundaries.
Mandatory: Yes. Remarks: the current implementation uses cword, which matches the general categories Lu, Ll, Lt, Lm, Lo, Nd, Nl, No and the character 0x005F ('_'). Not compliant.
1.5 Simple Loose Matches
Most regular expression engines offer caseless matching as the only loose matching. If the engine does offer this, then it needs to account for the large range of cased Unicode characters outside of ASCII.
RL1.5 Simple Loose Matches
To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not.
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding.
Mandatory: Yes. Remarks: implemented in unicode.ml with get_simple_case_folding.
If case insensitivity is specified, each character is replaced with the set of all characters that have exactly the same case folding.
1.6 Line Boundaries
Most regular expression engines also allow a test for line boundaries: end-of-line or start-of-line. This presumes that lines of text are separated by line (or paragraph) separators.
RL1.6 Line Boundaries
To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).
Formfeed (U+000C) also normally indicates an end-of-line. For more information, see Chapter 3 of [Unicode].
Mandatory: Yes. Remarks: we use Cset.nl in category.ml, which is cset_new_line in unicode.ml, i.e. the Line_Break property with the values CR, LF, NL and BK: LF, VT, FF, CR, NEL, LSEP, PSEP ([ (0x000A, 0x000D); (0x0085, 0x0085); (0x2028, 0x2029) ]).
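
A minimal sketch of a line-break test over the ranges quoted in this remark. The names newline_ranges, is_newline and count_line_breaks are hypothetical and are not taken from category.ml or unicode.ml.

```ocaml
(* Ranges from the remark: LF..CR (including VT and FF), NEL, LSEP..PSEP. *)
let newline_ranges = [ (0x000A, 0x000D); (0x0085, 0x0085); (0x2028, 0x2029) ]

let is_newline cp =
  List.exists (fun (lo, hi) -> lo <= cp && cp <= hi) newline_ranges

(* Count line breaks in a list of code points, treating CR LF as one break. *)
let count_line_breaks cps =
  let rec go n = function
    | 0x000D :: 0x000A :: rest -> go (n + 1) rest
    | cp :: rest -> go (if is_newline cp then n + 1 else n) rest
    | [] -> n
  in
  go 0 cps
```

Treating CR LF as a single break is the part a plain character-set test cannot express on its own, since it depends on two consecutive code points.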
1.7 Code Points
A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units.
RL1.7 Supplementary Code Points
To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
Mandatory: Yes. Remarks: the utf8/utf16be/utf16le encoded strings are properly decoded into code points.
See uucodecs.ml.
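
To make the surrogate rule of RL1.7 concrete, here is a small sketch that combines UTF-16 code units into code points. It is independent of uucodecs.ml, and the function names are hypothetical.

```ocaml
let is_high_surrogate u = 0xD800 <= u && u <= 0xDBFF
let is_low_surrogate u = 0xDC00 <= u && u <= 0xDFFF

(* Decode a list of 16-bit code units into code points.
   Returns [None] if a surrogate is not part of a well-formed pair. *)
let decode_utf16 units =
  let rec go acc = function
    | [] -> Some (List.rev acc)
    | hi :: lo :: rest when is_high_surrogate hi && is_low_surrogate lo ->
        (* A leading + trailing surrogate pair is a single code point. *)
        let cp = 0x10000 + ((hi - 0xD800) lsl 10) + (lo - 0xDC00) in
        go (cp :: acc) rest
    | u :: _ when is_high_surrogate u || is_low_surrogate u -> None
    | u :: rest -> go (u :: acc) rest
  in
  go [] units

let () =
  (* U+1D11E is encoded in UTF-16 as the pair D834 DD1E. *)
  match decode_utf16 [ 0xD834; 0xDD1E ] with
  | Some [ cp ] -> Printf.printf "U+%04X\n" cp
  | _ -> print_endline "unexpected"
```

Since OCaml 4.14 the standard library offers String.get_utf_16be_uchar and String.get_utf_16le_uchar, which perform the same combination directly on byte strings.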
