Lokad.Utf8Regex is a net10.0 library whose semantic oracle is System.Text.RegularExpressions.Regex on .NET 10, while its primary I/O surface is UTF-8 ReadOnlySpan<byte> / Span<byte>. That choice is justified because the official regex APIs and span-based matching APIs are currently defined on UTF-16 string / ReadOnlySpan<char>, not bytes, and the open runtime request for first-class UTF-8 regex support is still in Future / No due date. .NET strings themselves are UTF-16 sequences, so this project must preserve UTF-16 regex semantics while changing the execution/storage model to UTF-8 bytes. (Microsoft Learn)
Lokad.Utf8Regex must be a managed, dependency-minimal, high-performance UTF-8-native regex engine that stays aligned with the .NET regex language and behavior: same pattern language, same option model, same replacement language, same timeout model, same right-to-left behavior, and the same default backtracking semantics unless the caller explicitly requests RegexOptions.NonBacktracking. For the first released version, the semantic target is restricted to RegexOptions.CultureInvariant behavior; broader culture-sensitive parity is explicitly deferred. The .NET regex language is documented as Perl-5-compatible with additional .NET features such as right-to-left matching, and inline options are part of that language. (Microsoft Learn)
The project goal is not “a different regex flavor that happens to be fast.” It is “the fastest possible managed UTF-8 implementation that behaves like .NET regex.” That means the default engine must preserve constructs and behaviors that RegexOptions.NonBacktracking intentionally does not: backreferences, balancing groups, conditionals, lookarounds, \G, and full capture history inside loops. The public NonBacktracking option should still be supported, but it is a separate mode, not the foundation of the default engine. (Microsoft Learn)
Implementation priority may still be performance-first. In practice, that means the engine should optimize first for the overwhelmingly common aligned region where UTF-8 byte boundaries and the relevant UTF-16-observable boundaries coincide, especially ASCII and BMP-dominant workloads. Rare supplementary / surrogate-splitting edge cases may initially use slower native machinery, or explicit compatibility fallback during development, but the destination remains full .NET-semantic behavior rather than a weakened UTF-8-specific regex flavor.
This performance-first stance must not turn into parser accretion. Lokad.Utf8Regex should have exactly one semantic front-end, and native fast paths must be derived from that front-end rather than from independent pattern-string parsers. Temporary bootstrap heuristics are acceptable only as a short-lived development tactic before the vendored front-end is integrated; they are not the long-term semantic basis of the project.
- Target framework:
net10.0only. - Dependencies: none beyond the base/standard libraries, xUnit, and BenchmarkDotNet.
- Implementation model: fully managed. No native code, no P/Invoke, no external regex engines, no Roslyn/source-generator package dependency.
- Options model: use
RegexOptionsdirectly. - Timeout model: use
TimeSpanand throwRegexMatchTimeoutExceptionfor timeout. - Public primary surface: UTF-8 spans and byte buffers.
- Public semantic oracle: .NET 10
Regex. - Performance objective: beat
UTF8 decode -> Regex -> UTF8 encodeon the intended workloads, especially ASCII and ASCII-dominant workloads, without changing semantics. The built-in regex engine is powerful and often fast, but Microsoft explicitly documents that backtracking can still become pathological, which is why timeout parity is mandatory. (Microsoft Learn)
The accepted pattern language must match .NET regex behavior on .NET 10. That includes the normal grammar, inline options, RightToLeft, ECMAScript, CultureInvariant behavior, backtracking behavior, captures, backreferences, balancing groups, conditionals, lookarounds, and the documented replacement syntax. Broader culture-sensitive parity is out of scope for the first version. (Microsoft Learn)
The fixed replacement-string path must behave like .NET Regex.Replace, including the substitution language ($1, ${name}, $&, $```, $', $+, $_, $$) and the numeric-group parsing rules. The evaluator path must behave like MatchEvaluator`: the replacement callback is invoked once per match, and its result is substituted for that match. Right-to-left replacement must search from the end, and the segment assembly logic must preserve the same observable output order as the runtime. (Microsoft Learn)
For the first version, Lokad.Utf8Regex supports only RegexOptions.CultureInvariant semantics. In practice, that means IgnoreCase must match the runtime when CultureInvariant is in effect, and construction or execution for culture-sensitive modes should fail explicitly or route through an explicit compatibility fallback during development rather than silently approximating behavior. Do not hand-roll “close enough” case folding. Broader culture-sensitive parity, including CurrentCulture-dependent cases such as Turkish-I behavior, is a later expansion item. (GitHub)
Timeout behavior must mirror .NET’s model: the timeout is approximate, and the engine throws RegexMatchTimeoutException at its next timing check after the interval elapses. Timeouts exist to prevent excessive backtracking from monopolizing the process. (Microsoft Learn)
The public UTF-8 APIs accept well-formed UTF-8 only. Invalid UTF-8 must not be silently repaired or replacement-decoded on the main API. The low-level .NET UTF-8 transcoder exposes invalid-data signaling when replacement is disabled; Lokad.Utf8Regex should follow that spirit and reject malformed input. Public failure mode: throw ArgumentException with a precise message and the offending byte offset when the input is ill-formed. (Microsoft Learn)
This is the one place where the spec must be explicit: .NET regex semantics are defined over UTF-16 code units, and .NET strings are sequences of char, with supplementary Unicode characters occupying two UTF-16 code units. Because of that, a .NET-semantic match, capture, or replacement can theoretically begin or end between the two surrogates of one supplementary scalar. A byte-oriented UTF-8 API cannot represent that boundary as a valid byte slice. Therefore the engine must keep exact internal UTF-16 semantics, but if a public byte-oriented result is not representable as valid UTF-8, the byte API must fail explicitly rather than silently diverge. Provide a slow-path compatibility API that can still return the exact UTF-16 result as string. This is a design consequence of the documented UTF-16 string model and char-based regex surface. (Microsoft Learn)
This requirement should not be misread as an instruction to optimize the rarest edge cases first. It is a statement of the final semantic invariant. The implementation should prioritize the common aligned cases first, while keeping these non-byte-aligned edge cases correct, measurable, and progressively moved from fallback/slower paths into the native engine.
The public API is split into:
- the baseline .NET-mirroring surface
- intentional
Lokad.Utf8Regexextensions
Use this public surface.
namespace Lokad.Utf8Regex;
using System.Buffers;
using System.Text;
using System.Text.RegularExpressions;
public sealed class Utf8Regex
{
public Utf8Regex(string pattern);
public Utf8Regex(string pattern, RegexOptions options);
public Utf8Regex(string pattern, RegexOptions options, TimeSpan matchTimeout);
public static TimeSpan DefaultMatchTimeout { get; set; }
public string Pattern { get; }
public RegexOptions Options { get; }
public TimeSpan MatchTimeout { get; }
public bool IsMatch(ReadOnlySpan<byte> input);
public int Count(ReadOnlySpan<byte> input);
public Utf8ValueMatch Match(ReadOnlySpan<byte> input);
public Utf8MatchContext MatchDetailed(ReadOnlySpan<byte> input);
public Utf8ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<byte> input);
public Utf8ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<byte> input, int count = int.MaxValue);
public byte[] Replace(ReadOnlySpan<byte> input, string replacement);
public byte[] Replace(ReadOnlySpan<byte> input, ReadOnlySpan<byte> replacementPatternUtf8);
public byte[] Replace<TState>(ReadOnlySpan<byte> input, TState state, Utf8MatchEvaluator<TState> evaluator);
public string ReplaceToString(ReadOnlySpan<byte> input, string replacement);
public string ReplaceToString<TState>(ReadOnlySpan<byte> input, TState state, Utf16MatchEvaluator<TState> evaluator);
public OperationStatus TryReplace(
ReadOnlySpan<byte> input,
string replacement,
Span<byte> destination,
out int bytesWritten);
public OperationStatus TryReplace(
ReadOnlySpan<byte> input,
ReadOnlySpan<byte> replacementPatternUtf8,
Span<byte> destination,
out int bytesWritten);
public int GroupNumberFromName(string name);
public string GroupNameFromNumber(int i);
public string[] GetGroupNames();
public int[] GetGroupNumbers();
public static bool IsMatch(
ReadOnlySpan<byte> input,
string pattern,
RegexOptions options = RegexOptions.None,
TimeSpan matchTimeout = default);
public static int Count(
ReadOnlySpan<byte> input,
string pattern,
RegexOptions options = RegexOptions.None,
TimeSpan matchTimeout = default);
public static Utf8ValueMatchEnumerator EnumerateMatches(
ReadOnlySpan<byte> input,
string pattern,
RegexOptions options = RegexOptions.None,
TimeSpan matchTimeout = default);
public static Utf8MatchContext MatchDetailed(
ReadOnlySpan<byte> input,
string pattern,
RegexOptions options = RegexOptions.None,
TimeSpan matchTimeout = default);
public static Utf8ValueSplitEnumerator EnumerateSplits(
ReadOnlySpan<byte> input,
string pattern,
int count = int.MaxValue,
RegexOptions options = RegexOptions.None,
TimeSpan matchTimeout = default);
public static byte[] Replace(
ReadOnlySpan<byte> input,
string pattern,
string replacement,
RegexOptions options = RegexOptions.None,
TimeSpan matchTimeout = default);
}
public readonly struct Utf8ValueMatch
{
public bool Success { get; }
public bool IsByteAligned { get; }
public int IndexInBytes { get; }
public int LengthInBytes { get; }
public int IndexInUtf16 { get; }
public int LengthInUtf16 { get; }
}
public readonly ref struct Utf8MatchContext
{
public bool Success { get; }
public int IndexInUtf16 { get; }
public int LengthInUtf16 { get; }
public bool IsByteAligned { get; }
public int IndexInBytes { get; } // throws if not byte-aligned
public int LengthInBytes { get; } // throws if not byte-aligned
public int GroupCount { get; }
public Utf8GroupContext GetGroup(int number);
public Utf8GroupContext GetGroup(string name);
public bool TryGetGroup(int number, out Utf8GroupContext group);
public bool TryGetGroup(string name, out Utf8GroupContext group);
public string GetValueString();
}
public readonly ref struct Utf8GroupContext
{
public bool Success { get; }
public int IndexInUtf16 { get; }
public int LengthInUtf16 { get; }
public int CaptureCount { get; }
public bool IsByteAligned { get; }
public int IndexInBytes { get; } // throws if not byte-aligned
public int LengthInBytes { get; } // throws if not byte-aligned
public ReadOnlySpan<byte> GetValueBytes(); // throws if not byte-aligned
public string GetValueString();
public Utf8CaptureContext GetCapture(int index);
}
public readonly ref struct Utf8CaptureContext
{
public bool Success { get; }
public int IndexInUtf16 { get; }
public int LengthInUtf16 { get; }
public bool IsByteAligned { get; }
public int IndexInBytes { get; } // throws if not byte-aligned
public int LengthInBytes { get; } // throws if not byte-aligned
public ReadOnlySpan<byte> GetValueBytes(); // throws if not byte-aligned
public string GetValueString();
}
public readonly ref struct Utf8ValueSplit
{
public bool IsByteAligned { get; }
public int IndexInBytes { get; } // throws if not byte-aligned
public int LengthInBytes { get; } // throws if not byte-aligned
public int IndexInUtf16 { get; }
public int LengthInUtf16 { get; }
public ReadOnlySpan<byte> GetValueBytes(); // throws if not byte-aligned
public string GetValueString();
}
public delegate void Utf8MatchEvaluator<TState>(
in Utf8MatchContext match,
ref Utf8ReplacementWriter writer,
ref TState state);
public delegate string Utf16MatchEvaluator<TState>(
in Utf8MatchContext match,
ref TState state);
public ref struct Utf8ReplacementWriter
{
public void Append(ReadOnlySpan<byte> utf8);
public void Append(ReadOnlySpan<char> utf16);
public void AppendAsciiByte(byte value);
public void Append(Rune value);
}Baseline API notes:
- This surface mirrors span-based
.NET RegexAPIs where possible, but withReadOnlySpan<byte>as the primary input. MaterializedMatches/Splitcollections are intentionally deferred.TryReplaceusesOperationStatus. (Microsoft Learn) - Options whose semantics depend on runtime compilation strategy rather than regex behavior are rejected explicitly. In particular,
RegexOptions.Compiledis not accepted. - Overloads without an explicit timeout use
Utf8Regex.DefaultMatchTimeout, whose initial value must beRegex.InfiniteMatchTimeout. Utf8ValueMatchis the cheap, storable coordinate result. Rich group/capture inspection is provided byMatchDetailed(...)and by evaluator contexts.- Successful matches and captures may be non-byte-aligned. In that case
IsByteAligned == false, UTF-16 coordinates remain exact, and byte-oriented accessors throwInvalidOperationException. Utf8GroupContext.CaptureCountandGetCapture(int index)must match .NET capture history ordering.Utf8ValueSplitEnumeratormust mirror.NET 10 Regex.EnumerateSplitsexactly:counthas the same stopping semantics, empty elements are preserved, andRightToLeftyields the same observable ordering. Match captures are not surfaced as separate split elements unless the baseline span API does so.replacementPatternUtf8overloads interpret bytes as UTF-8 encoded .NET replacement patterns, not as literal bytes.- All
ref structmatch/group/capture/split views are ephemeral over the original input span and cannot outlive the calling frame or input buffer. TryReplacereturnsDoneorDestinationTooSmallonly for destination sizing. OnDestinationTooSmall,bytesWritten == 0and callers must treatdestinationas unchanged. Invalid UTF-8, invalid replacement text, invalid regex construction, timeout, and non-representable byte results still throw.- Evaluator-based replacement is currently the allocating replacement path; allocation control is provided only for fixed replacement-pattern overloads.
These APIs are explicit compatibility affordances for .NET Regex
startat semantics. They take UTF-16 code-unit offsets, not byte
offsets. They are not the preferred hot-path surface for native UTF-8
callers; native callers should slice the input span instead.
public sealed class Utf8Regex
{
public bool IsMatchFromUtf16Offset(ReadOnlySpan<byte> input, int utf16Offset);
public int CountFromUtf16Offset(ReadOnlySpan<byte> input, int utf16Offset);
public Utf8ValueMatch MatchFromUtf16Offset(ReadOnlySpan<byte> input, int utf16Offset);
public Utf8MatchContext MatchDetailedFromUtf16Offset(ReadOnlySpan<byte> input, int utf16Offset);
public Utf8ValueMatchEnumerator EnumerateMatchesFromUtf16Offset(ReadOnlySpan<byte> input, int utf16Offset);
}Rules for these compatibility APIs:
- The
utf16Offsetparameter is measured in UTF-16 code units, exactly like.NET Regex. - These methods exist to preserve
.NETsemantic compatibility across UTF-8 input, not to define the preferred native execution model. - Method names must keep the UTF-16 coordinate system explicit; do not expose
these as plain
startatoverloads on otherwise byte-native methods.
public sealed class Utf8Regex
{
public int MatchMany(
ReadOnlySpan<byte> input,
Span<Utf8ValueMatch> destination,
out bool isMore);
public Utf8RegexAnalysis Analyze();
}
public readonly struct Utf8RegexAnalysis
{
public bool IsFullyNative { get; }
public bool MayFallbackOnNonAsciiInput { get; }
public bool IsExactLiteral { get; }
public int MinRequiredLength { get; }
public int CopyRequiredLiterals(
Span<Utf8RequiredLiteral> destination,
out bool isMore);
}
public readonly struct Utf8RequiredLiteral
{
public bool IsCaseInsensitive { get; }
public int FixedOffsetInBytes { get; } // -1 when not fixed
public ReadOnlyMemory<byte> Utf8Bytes { get; }
}These additions are intentional:
MatchMany(...)is the allocation-control multi-match API: it writes up todestination.Lengthmatches, returnswritten, and setsisMorewhen additional matches exist beyond what fit.Analyze()exposes a small stable search/planning surface for higher-level engines: native eligibility, minimum required length, exact-literal detection, and required literals suitable for prechecks and indexing.
Rules for these extensions:
MatchMany(...)does not imply a continuation protocol.isMoreis only a clean termination signal for callers that intentionally cap result count.CopyRequiredLiterals(...)exposes only stable prefilter facts. It is not a public dump of the internal execution plan.Utf8RequiredLiteral.FixedOffsetInBytes == -1means the literal is required but not at a fixed byte offset.Analyze()is advisory for planning and prechecks; matching semantics are still defined only by the regex APIs themselves.
The architecture must keep these invariants:
- one runtime-compatible semantic front-end
- separate lowerings for candidate search and full execution
- internal UTF-16-exact semantics over UTF-8 input
- native
RightToLeft; never emulate it by reversing the input RegexOptions.NonBacktrackingremains a separate official mode, not a hidden default optimizationUtf8Regexinstances are immutable and thread-safe
Implementation details may evolve, but fast paths must be derived from the same semantic front-end rather than from separate pattern parsers.
Performance rules:
- production paths must stay UTF-8-native; do not route normal execution through
Encoding.UTF8.GetString(...)+ built-inRegex - search and full matching are distinct concerns; search-first execution is expected
- prefer BCL vectorized primitives first, then
SearchValues<T>, then manual intrinsics only when measured - avoid whole-input string materialization except for explicit compatibility APIs such as
ReplaceToString
Testing and benchmarks:
- differential behavior is measured against .NET 10
Regex - tests must cover parser parity, matching, replacement, right-to-left behavior, timeout behavior, and non-byte-aligned edge cases
- benchmark methodology must use warm throughput, not cold-start
Drynumbers, for parity decisions - benchmark suite must report both baselines:
- Baseline A:
Encoding.UTF8.GetString(input)-> built-inRegex-> UTF-8 re-encoding where relevant - Baseline B: built-in
Regexon a predecodedstring
- Baseline A:
- benchmark catalog must include both native-target families and explicit fallback families
The implementation is done when all of the following are true:
- For supported features on valid UTF-8 input, the engine matches .NET 10
Regexbehavior. ReplaceToStringremains the exact compatibility escape hatch.- Byte-oriented APIs either return the correct UTF-8 result or fail explicitly when the exact UTF-16-semantic result is not representable as UTF-8.
RegexOptions.NonBacktrackingbehaves like the official mode, including its documented restrictions and capture differences.- Hot native paths are allocation-light, and the benchmark suite meets explicit per-scenario thresholds against both baselines.