Implement Thompson NFA-based Regular Expressions #1172
JAi-SATHVIK wants to merge 113 commits into fortran-lang:master from
Fix CMakeLists.txt for the addition of stdlib_storting_pca
Master cpy
optimized for performance and stability
…ster with upstream
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1172 +/- ##
==========================================
+ Coverage 68.66% 69.21% +0.55%
==========================================
Files 408 412 +4
Lines 13619 14056 +437
Branches 1537 1603 +66
==========================================
+ Hits 9351 9729 +378
- Misses 4268 4327 +59
Update
I have finalized the core implementation of the pure Fortran regex engine. Here is a summary of what I've completed:
The engine is now stable, zero-dependency, and ready for your feedback! @arjenmarkus @jvdp1 @jalvesz
+call regcomp(re, "(cat|dog)s?", stat)
+if (stat /= 0) error stop "Invalid regex pattern"
+```
This should ideally be an executable example program in the examples folder.
Done. I have removed the inline Fortran code blocks in doc/specs/stdlib_regex.md and created a standalone executable example: example/regex/example_regex_regcomp.f90.
+### Example
+
+```fortran
Same as before, this should be an executable program in the examples folder.
Done. I have removed the inline Fortran code blocks in doc/specs/stdlib_regex.md and created a standalone executable example: example/regex/example_regex_regmatch.f90.
src/regex/stdlib_regex.f90
Outdated
+contains
+
+ logical function is_term_ender(tag)
Can this be made pure or, even better, elemental?
Added the pure prefix to is_term_ender, is_term_starter, and prec.
src/regex/stdlib_regex.f90
Outdated
+ tag == TOK_START)
+ end function is_term_ender
+
+ logical function is_term_starter(tag)
Can this be made pure or, even better, elemental?
Done. Added the pure prefix to is_term_ender, is_term_starter, and prec.
src/regex/stdlib_regex.f90
Outdated
+ tag == TOK_START .or. tag == TOK_END)
+ end function is_term_starter
+
+ integer function prec(tag)
Can this be made pure or, even better, elemental?
Done. Added the pure prefix to is_term_ender, is_term_starter, and prec.
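The difference between the pure prefix that was added and the elemental prefix the reviewer suggested can be sketched as follows. This is a minimal illustration, not the PR's code: the module name and the tag values are hypothetical, and only a reduced version of is_term_ender is shown.

```fortran
module term_tags_sketch
    implicit none
    private
    public :: is_term_ender, TOK_CHAR, TOK_STAR, TOK_LPAREN

    ! Hypothetical tag values; the PR's real constants are not shown in this thread.
    integer, parameter :: TOK_CHAR = 1
    integer, parameter :: TOK_STAR = 2
    integer, parameter :: TOK_LPAREN = 3

contains

    ! elemental implies pure, and additionally lets the function be
    ! applied element by element to a whole array of tags.
    elemental logical function is_term_ender(tag)
        integer, intent(in) :: tag
        is_term_ender = (tag == TOK_CHAR .or. tag == TOK_STAR)
    end function is_term_ender

end module term_tags_sketch

program demo_elemental
    use term_tags_sketch
    implicit none
    integer :: tags(3)
    logical :: enders(3)

    tags = [TOK_CHAR, TOK_LPAREN, TOK_STAR]
    ! One call classifies every element of the array.
    enders = is_term_ender(tags)

    if (.not. enders(1)) error stop "TOK_CHAR should end a term"
    if (enders(2)) error stop "TOK_LPAREN should not end a term"
    if (.not. enders(3)) error stop "TOK_STAR should end a term"
    print *, enders
end program demo_elemental
```

A scalar call such as is_term_ender(TOK_CHAR) keeps working unchanged, so switching from pure to elemental would not affect existing call sites.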
src/regex/stdlib_regex.f90
Outdated
+ end do
+ end subroutine parse_to_postfix
+
+ integer function new_out(s, o, pool, p_size)
This function has side effects on pool; I fully agree with this guideline: https://github.com/JorgeG94/fortran_programmer_llm/blob/main/Fortran_programmer.md#functions-should-have-no-side-effects
Would it be possible to use subroutines whenever an input derived type is mutated?
Converted integer function new_out(...), which was mutating its pool argument, into a pure subroutine new_out(..., return_idx).
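A minimal sketch of that conversion, assuming the body of new_out stays as quoted in the review and that return_idx is simply appended as the last argument. The exact argument order in the updated PR is not shown in this thread, and the default component values of out_node are an assumption.

```fortran
module out_pool_sketch
    implicit none
    private
    public :: out_node, new_out

    ! Assumed layout of a pool node, based on the fields used in the review.
    type :: out_node
        integer :: s = 0
        integer :: o = 0
        integer :: next = 0
    end type out_node

contains

    ! A pure subroutine may modify its intent(inout) arguments, so the
    ! mutation of pool and p_size is now visible in the interface.
    pure subroutine new_out(s, o, pool, p_size, return_idx)
        integer, intent(in) :: s, o
        type(out_node), intent(inout) :: pool(:)
        integer, intent(inout) :: p_size
        integer, intent(out) :: return_idx
        p_size = p_size + 1
        pool(p_size)%s = s
        pool(p_size)%o = o
        pool(p_size)%next = 0
        return_idx = p_size
    end subroutine new_out

end module out_pool_sketch

program demo_new_out
    use out_pool_sketch
    implicit none
    type(out_node) :: pool(4)
    integer :: p_size, idx

    p_size = 0
    call new_out(3, 1, pool, p_size, idx)
    call new_out(5, 2, pool, p_size, idx)

    ! The second call should have filled slot 2 of the pool.
    if (idx /= 2 .or. p_size /= 2) error stop "unexpected pool size"
    if (pool(2)%s /= 5 .or. pool(2)%o /= 2) error stop "unexpected node content"
    print *, "nodes allocated:", p_size
end program demo_new_out
```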
+ new_out = p_size
+ end function new_out
+
+ subroutine merge_lists(l1, l2, res, pool)
src/regex/stdlib_regex.f90
Outdated
+ end subroutine merge_lists
+
+ subroutine do_patch(states, list, target, pool)
+ type(state_type), intent(inout) :: states(:)
Another style comment: with subroutines, it is customary to list the non-mutable inputs first (strict intent(in)), then the intent(out) or intent(inout) arguments, and finally the optional ones.
Would it be possible to follow this recommendation?
Standardized the argument order of all subroutines (such as do_patch, merge_lists, and add_thread) so that intent(in) arguments always appear first, followed by intent(out) and intent(inout) arguments.
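For illustration, do_patch with the recommended ordering might look like the sketch below. The loop body is the one quoted in the review; the reordered signature and the simplified type definitions are assumptions, since the updated PR code is not shown here.

```fortran
module patch_sketch
    implicit none
    private
    public :: state_type, out_node, out_list_type, do_patch

    ! Simplified, assumed versions of the PR's derived types.
    type :: state_type
        integer :: out1 = 0
        integer :: out2 = 0
    end type state_type

    type :: out_node
        integer :: s = 0
        integer :: o = 0
        integer :: next = 0
    end type out_node

    type :: out_list_type
        integer :: head = 0
        integer :: tail = 0
    end type out_list_type

contains

    ! intent(in) arguments come first; the mutated states array comes last.
    pure subroutine do_patch(list, target, pool, states)
        type(out_list_type), intent(in) :: list
        integer, intent(in) :: target
        type(out_node), intent(in) :: pool(:)
        type(state_type), intent(inout) :: states(:)
        integer :: curr
        curr = list%head
        do while (curr /= 0)
            if (pool(curr)%o == 1) then
                states(pool(curr)%s)%out1 = target
            else
                states(pool(curr)%s)%out2 = target
            end if
            curr = pool(curr)%next
        end do
    end subroutine do_patch

end module patch_sketch

program demo_patch
    use patch_sketch
    implicit none
    type(state_type) :: states(3)
    type(out_node) :: pool(2)
    type(out_list_type) :: list

    ! Two dangling arrows: out1 of state 1 and out2 of state 2.
    pool(1) = out_node(1, 1, 2)
    pool(2) = out_node(2, 2, 0)
    list = out_list_type(1, 2)

    ! Point both dangling arrows at state 3.
    call do_patch(list, 3, pool, states)

    if (states(1)%out1 /= 3 .or. states(2)%out2 /= 3) error stop "patch failed"
    print *, states(1)%out1, states(2)%out2
end program demo_patch
```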
+ end subroutine do_patch
+
+ subroutine build_nfa(postfix, num_postfix, states, n_states, start_state, stat)
+ type(token_type), intent(in) :: postfix(:)
src/regex/stdlib_regex.f90
Outdated
+ if (present(status)) status = stat
+ end subroutine regcomp
+
+ recursive subroutine add_thread(list, count, state_idx, start_pos, step_index, states, str_len, visited)
Same as https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
At the implementation level, is recursion absolutely necessary, or could this be implemented without it?
Good point. Deep NFAs could in principle exceed recursion limits. I have rewritten add_thread to use an explicit local stack array to track epsilon transitions, so NFA compilation and execution are now fully iterative, with no risk of crashes from recursion depth.
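The idea of following epsilon transitions with an explicit stack instead of recursion can be sketched as below. This is a simplified stand-in, not the rewritten add_thread from the PR: the state layout, the ST_* constants, and the reduced argument list are all hypothetical.

```fortran
module closure_sketch
    implicit none
    private
    public :: state_type, add_thread, ST_CHAR, ST_SPLIT, ST_MATCH

    ! Hypothetical state tags: a split state has two epsilon arrows.
    integer, parameter :: ST_CHAR = 1
    integer, parameter :: ST_SPLIT = 2
    integer, parameter :: ST_MATCH = 3

    type :: state_type
        integer :: tag = ST_MATCH
        integer :: out1 = 0
        integer :: out2 = 0
    end type state_type

contains

    ! Collects every non-split state reachable from state_idx through
    ! epsilon arrows, using a local stack rather than recursive calls.
    pure subroutine add_thread(states, state_idx, list, count, visited)
        type(state_type), intent(in) :: states(:)
        integer, intent(in) :: state_idx
        integer, intent(inout) :: list(:)
        integer, intent(inout) :: count
        logical, intent(inout) :: visited(:)
        ! Each state is expanded at most once and pushes at most two
        ! successors, so 2*n + 1 slots are always enough.
        integer :: stack(2*size(states) + 1)
        integer :: top, s

        top = 1
        stack(top) = state_idx
        do while (top > 0)
            s = stack(top)
            top = top - 1
            if (s == 0) cycle
            if (visited(s)) cycle
            visited(s) = .true.
            if (states(s)%tag == ST_SPLIT) then
                ! Push out2 first so that out1 is explored first.
                stack(top + 1) = states(s)%out2
                stack(top + 2) = states(s)%out1
                top = top + 2
            else
                count = count + 1
                list(count) = s
            end if
        end do
    end subroutine add_thread

end module closure_sketch

program demo_closure
    use closure_sketch
    implicit none
    type(state_type) :: states(3)
    integer :: list(3), count
    logical :: visited(3)

    ! NFA for a*: state 1 splits to the 'a' state (2) or to the match state (3).
    states(1) = state_type(ST_SPLIT, 2, 3)
    states(2) = state_type(ST_CHAR, 1, 0)
    states(3) = state_type(ST_MATCH, 0, 0)

    list = 0
    count = 0
    visited = .false.
    call add_thread(states, 1, list, count, visited)

    if (count /= 2) error stop "expected two threads"
    if (list(1) /= 2 .or. list(2) /= 3) error stop "unexpected thread order"
    print *, list(1:count)
end program demo_closure
```

The visited array is what keeps loops such as the one between states 1 and 2 from being followed forever, independently of whether the traversal is recursive or iterative.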
I had a first look at the test program you provided some days ago. I noticed that the indices are off by one:
=== Testing Fortran Regex (Thompson NFA) ===
regcomp 'abc': status = 0
Match 'xyz_abc_def' -> T 4 7
The substring "abc" starts at 5, not 4. This off-by-one error occurs in another test as well.
Another one:
Match 'aaaab' with 'a*b' -> T 4 5
The match starts at 1, not 4.
foo123bar: the matching substring is too short; the reported substring runs from 3 to 4 instead of 4 to 6.
cats: the matching substring is reported as 7 to 11 (five characters), whereas the matching substring is "cats", so only four characters.
So, some work to be done, unless you have already fixed these bugs ;), but in any case a good start.
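The expected indices for the first case can be cross-checked with Fortran intrinsics alone. The sketch below does not call the regex engine; it only shows which 1-based, inclusive start and end positions a literal search for "abc" should yield.

```fortran
program demo_expected_indices
    implicit none
    character(len=*), parameter :: text = "xyz_abc_def"
    character(len=*), parameter :: needle = "abc"
    integer :: match_start, match_end

    ! index() returns the 1-based position of the first occurrence.
    match_start = index(text, needle)
    ! The end index is inclusive, hence the -1.
    match_end = match_start + len(needle) - 1

    if (match_start /= 5 .or. match_end /= 7) error stop "unexpected indices"
    if (text(match_start:match_end) /= needle) error stop "slice does not match"
    print *, match_start, match_end, text(match_start:match_end)
end program demo_expected_indices
```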
On Sat 4 Apr 2026 at 22:43, José Alves ***@***.*** wrote:
… ***@***.**** requested changes on this pull request.
------------------------------
In doc/specs/stdlib_regex.md
<#1172 (comment)>:
> +The regular expression pattern string to compile.
+
+`status` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
+Returns 0 on success, or a non-zero value if the pattern is invalid
+(e.g., mismatched parentheses or brackets).
+
+### Example
+
+```fortran
+use stdlib_regex, only: regex_type, regcomp
+type(regex_type) :: re
+integer :: stat
+
+call regcomp(re, "(cat|dog)s?", stat)
+if (stat /= 0) error stop "Invalid regex pattern"
+```
This should ideally be an executable example program in the examples folder
------------------------------
In doc/specs/stdlib_regex.md
<#1172 (comment)>:
> +
+`string`: Shall be of type `character(len=*)`. It is an `intent(in)` argument.
+The input string to search for a match.
+
+`is_match`: Shall be of type `logical`. It is an `intent(out)` argument.
+Set to `.true.` if a match is found, `.false.` otherwise.
+
+`match_start` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
+The 1-based index of the first character of the match.
+
+`match_end` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
+The 1-based index of the last character of the match.
+
+### Example
+
+```fortran
same as before, this should be an executable program in the examples folder
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer :: tail
+ end type out_list_type
+
+ type :: frag_type
+ integer :: start
+ type(out_list_type) :: out_list
+ end type frag_type
+
+ type :: thread
+ integer :: state
+ integer :: start_pos
+ end type thread
+
+contains
+
+ logical function is_term_ender(tag)
Can this be made pure or, even better, elemental?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer :: state
+ integer :: start_pos
+ end type thread
+
+contains
+
+ logical function is_term_ender(tag)
+ integer, intent(in) :: tag
+ is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
+ tag == TOK_CLASS .or. tag == TOK_STAR .or. &
+ tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
+ tag == TOK_RPAREN .or. tag == TOK_END .or. &
+ tag == TOK_START)
+ end function is_term_ender
+
+ logical function is_term_starter(tag)
Can this be made pure or, even better, elemental?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer, intent(in) :: tag
+ is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
+ tag == TOK_CLASS .or. tag == TOK_STAR .or. &
+ tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
+ tag == TOK_RPAREN .or. tag == TOK_END .or. &
+ tag == TOK_START)
+ end function is_term_ender
+
+ logical function is_term_starter(tag)
+ integer, intent(in) :: tag
+ is_term_starter = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
+ tag == TOK_CLASS .or. tag == TOK_LPAREN .or. &
+ tag == TOK_START .or. tag == TOK_END)
+ end function is_term_starter
+
+ integer function prec(tag)
Can this be made pure or, even better, elemental?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + c = pattern(i:i)
+ t%tag = TOK_CHAR
+ t%c = ' '
+ t%bmap = .false.
+ t%invert = .false.
+
+ if (c == '\') then
+ if (i < len_p) then
+ i = i + 1
+ c = pattern(i:i)
+ end if
+ t%tag = TOK_CHAR
+ t%c = c
+ if (c == 'd') then
+ t%tag = TOK_CLASS
+ do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do
Two things:
1. Could you assign the results of all iachar( ) calls to module-level parameters that are then usable across the module? You can check out https://github.com/fortran-lang/stdlib/blob/master/src/core/stdlib_ascii.fypp for inspiration. Maybe some of those constants would be worth storing there?
2. Please avoid one-liners for anything that is not a scalar constant:
- one = 1; zero = 0 is tolerable
- do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do is not very "debugger friendly", as one cannot easily set breakpoints to follow the iterations.
For this specific case, t%bmap(iachar('0'):iachar('9')) = .true. would be equivalent, Fortranic, and needs no semicolons.
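Both suggestions can be combined in a few lines. This is only a sketch: the constant names ASCII_ZERO and ASCII_NINE and the 0:127 bounds of the bitmap are assumptions, not names or sizes taken from the PR or from stdlib_ascii.

```fortran
program demo_digit_bitmap
    implicit none
    ! Hypothetical named constants, evaluated once at compile time
    ! instead of calling iachar() at every use site.
    integer, parameter :: ASCII_ZERO = iachar('0')
    integer, parameter :: ASCII_NINE = iachar('9')
    ! Assumed bitmap layout: one entry per 7-bit ASCII code.
    logical :: bmap(0:127)

    bmap = .false.
    ! Array-section assignment replaces the one-line do loop.
    bmap(ASCII_ZERO:ASCII_NINE) = .true.

    if (count(bmap) /= 10) error stop "expected ten digit entries"
    if (.not. bmap(iachar('7'))) error stop "'7' should be in the class"
    if (bmap(iachar('a'))) error stop "'a' should not be in the class"
    print *, "digits in class:", count(bmap)
end program demo_digit_bitmap
```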
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + stack(top) = tokens(i)
+ end if
+ end do
+
+ do while (top > 0)
+ if (stack(top)%tag == TOK_LPAREN) then
+ stat = 1
+ return
+ end if
+ num_postfix = num_postfix + 1
+ postfix(num_postfix) = stack(top)
+ top = top - 1
+ end do
+ end subroutine parse_to_postfix
+
+ integer function new_out(s, o, pool, p_size)
This function has side effects on pool; I fully agree with this guideline:
https://github.com/JorgeG94/fortran_programmer_llm/blob/main/Fortran_programmer.md#functions-should-have-no-side-effects
Would it be possible to use subroutines whenever an input derived type is mutated?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + top = top - 1
+ end do
+ end subroutine parse_to_postfix
+
+ integer function new_out(s, o, pool, p_size)
+ integer, intent(in) :: s, o
+ type(out_node), intent(inout) :: pool(:)
+ integer, intent(inout) :: p_size
+ p_size = p_size + 1
+ pool(p_size)%s = s
+ pool(p_size)%o = o
+ pool(p_size)%next = 0
+ new_out = p_size
+ end function new_out
+
+ subroutine merge_lists(l1, l2, res, pool)
same https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036020902
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + type(out_list_type), intent(in) :: l1, l2
+ type(out_list_type), intent(out) :: res
+ type(out_node), intent(inout) :: pool(:)
+ if (l1%head == 0) then
+ res = l2
+ else if (l2%head == 0) then
+ res = l1
+ else
+ pool(l1%tail)%next = l2%head
+ res%head = l1%head
+ res%tail = l2%tail
+ end if
+ end subroutine merge_lists
+
+ subroutine do_patch(states, list, target, pool)
+ type(state_type), intent(inout) :: states(:)
Another style comment: with subroutines, it is customary to list the non-mutable inputs first (strict intent(in)), then the intent(out) or intent(inout) arguments, and finally the optional ones.
Would it be possible to follow this recommendation?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer, intent(in) :: target
+ type(out_node), intent(in) :: pool(:)
+ integer :: curr
+ curr = list%head
+ do while (curr /= 0)
+ if (pool(curr)%o == 1) then
+ states(pool(curr)%s)%out1 = target
+ else
+ states(pool(curr)%s)%out2 = target
+ end if
+ curr = pool(curr)%next
+ end do
+ end subroutine do_patch
+
+ subroutine build_nfa(postfix, num_postfix, states, n_states, start_state, stat)
+ type(token_type), intent(in) :: postfix(:)
same as
https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + if (stat /= 0) then
+ if (present(status)) status = stat
+ return
+ end if
+
+ call parse_to_postfix(tokens, n_tok, postfix, n_post, stat)
+ if (stat /= 0) then
+ if (present(status)) status = stat
+ return
+ end if
+
+ call build_nfa(postfix, n_post, re%states, re%n_states, re%start_state, stat)
+ if (present(status)) status = stat
+ end subroutine regcomp
+
+ recursive subroutine add_thread(list, count, state_idx, start_pos, step_index, states, str_len, visited)
Same as
https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
At the implementation level, is recursion absolutely necessary, or could this be implemented without it?
I also tried a test:
call regcomp(re, "A+", stat)
call regmatch(re, "A AAAAAAAAAAAAAAA", is_match, match_start, match_end)
print *, "Match 'A AAAAAAAAAAAA' with 'A+' -> ", is_match, match_start, match_end
The result was a match with the first substring, not the longest. That is not the classical behaviour of a regular expression matcher. I will read the documentation to see if this was expected 😇
Ah, that was my mistake. The engine should look for the longest matching
substring when a start has been found, unless non-greedy expressions are
selected. Always a tricky part of regular expressions: to know precisely
what is to be matched and what not.
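The rule described here (take the leftmost start, then extend the match as far as possible) can be illustrated for the pattern A+ with intrinsics only. The sketch below is a stand-in for that one pattern and does not use the PR's regcomp or regmatch.

```fortran
program demo_leftmost_longest
    implicit none
    character(len=*), parameter :: text = "A AAAAAAAAAAAAAAA"
    integer :: match_start, match_end, first_other

    ! Leftmost: the first position at which an 'A' occurs.
    match_start = scan(text, "A")
    if (match_start == 0) error stop "no match"

    ! Longest from that start: find the first character that is not 'A'.
    ! verify() returns 0 when the rest of the string is all 'A'.
    first_other = verify(text(match_start:), "A")
    if (first_other == 0) then
        match_end = len(text)
    else
        match_end = match_start + first_other - 2
    end if

    ! The blank at position 2 ends the first run, so the match is 1..1,
    ! even though a longer run of 'A' starts at position 3.
    if (match_start /= 1 .or. match_end /= 1) error stop "unexpected span"
    print *, match_start, match_end
end program demo_leftmost_longest
```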
Op zo 5 apr 2026 om 15:11 schreef Arjen Markus ***@***.***>:
… I also tried a test
call regcomp(re, "A+", stat)
call regmatch(re, "A AAAAAAAAAAAAAAA", is_match, match_start, match_end)
print *, "Match 'A AAAAAAAAAAAA' with 'A+' -> ", is_match, match_start,
match_end
The result was a match with the first subststring, not the longest. That
is not the classical behaviour of a regular expression matcher. I will read
the documentation to see if this was expected 😇
Op zo 5 apr 2026 om 15:04 schreef Arjen Markus ***@***.***
>:
> I had a first look at the test program you provided some days ago. I
> noticed that the indices are off by one:
>
> === Testing Fortran Regex (Thompson NFA) ===
> regcomp 'abc': status = 0
> Match 'xyz_abc_def' -> T 4 7
>
> The substring "abc" starts at 5, not 4. This off by one error occurs in
> another test as well.
>
> Another one:
>
> Match 'aaaab' with 'a*b' -> T 4 5
>
> The match starts at 1, not 4..
>
> foo123bar: the matching substring is too short - the reported substring
> is from 3 to 4, instead of 4 to 6.
>
> cats: the matching subststring is reported as 7 to 11 (five characters)
> whereas the matching substring is "cats", so four characters only.
>
> So, some work to be done, unless you have already fixed these bugs ;),
> but in any case a good start.
>
> Op za 4 apr 2026 om 22:43 schreef José Alves ***@***.***>:
>
>> ***@***.**** requested changes on this pull request.
>> ------------------------------
>>
>> In doc/specs/stdlib_regex.md
>> <#1172 (comment)>
>> :
>>
>> > +The regular expression pattern string to compile.
>> +
>> +`status` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
>> +Returns 0 on success, or a non-zero value if the pattern is invalid
>> +(e.g., mismatched parentheses or brackets).
>> +
>> +### Example
>> +
>> +```fortran
>> +use stdlib_regex, only: regex_type, regcomp
>> +type(regex_type) :: re
>> +integer :: stat
>> +
>> +call regcomp(re, "(cat|dog)s?", stat)
>> +if (stat /= 0) error stop "Invalid regex pattern"
>> +```
>>
>> This should ideally be an executable example program in the examples
>> folder
>> ------------------------------
>>
>> In doc/specs/stdlib_regex.md
>> <#1172 (comment)>
>> :
>>
>> > +
>> +`string`: Shall be of type `character(len=*)`. It is an `intent(in)` argument.
>> +The input string to search for a match.
>> +
>> +`is_match`: Shall be of type `logical`. It is an `intent(out)` argument.
>> +Set to `.true.` if a match is found, `.false.` otherwise.
>> +
>> +`match_start` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
>> +The 1-based index of the first character of the match.
>> +
>> +`match_end` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
>> +The 1-based index of the last character of the match.
>> +
>> +### Example
>> +
>> +```fortran
>>
>> same as before, this should be an executable program in the examples
>> folder
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + integer :: tail
>> + end type out_list_type
>> +
>> + type :: frag_type
>> + integer :: start
>> + type(out_list_type) :: out_list
>> + end type frag_type
>> +
>> + type :: thread
>> + integer :: state
>> + integer :: start_pos
>> + end type thread
>> +
>> +contains
>> +
>> + logical function is_term_ender(tag)
>>
>> can this be made pure or even better elemental ?
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + integer :: state
>> + integer :: start_pos
>> + end type thread
>> +
>> +contains
>> +
>> + logical function is_term_ender(tag)
>> + integer, intent(in) :: tag
>> + is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
>> + tag == TOK_CLASS .or. tag == TOK_STAR .or. &
>> + tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
>> + tag == TOK_RPAREN .or. tag == TOK_END .or. &
>> + tag == TOK_START)
>> + end function is_term_ender
>> +
>> + logical function is_term_starter(tag)
>>
>> can this be made pure or even better elemental ?
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + integer, intent(in) :: tag
>> + is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
>> + tag == TOK_CLASS .or. tag == TOK_STAR .or. &
>> + tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
>> + tag == TOK_RPAREN .or. tag == TOK_END .or. &
>> + tag == TOK_START)
>> + end function is_term_ender
>> +
>> + logical function is_term_starter(tag)
>> + integer, intent(in) :: tag
>> + is_term_starter = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
>> + tag == TOK_CLASS .or. tag == TOK_LPAREN .or. &
>> + tag == TOK_START .or. tag == TOK_END)
>> + end function is_term_starter
>> +
>> + integer function prec(tag)
>>
>> can this be made pure or even better elemental ?
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + c = pattern(i:i)
>> + t%tag = TOK_CHAR
>> + t%c = ' '
>> + t%bmap = .false.
>> + t%invert = .false.
>> +
>> + if (c == '\') then
>> + if (i < len_p) then
>> + i = i + 1
>> + c = pattern(i:i)
>> + end if
>> + t%tag = TOK_CHAR
>> + t%c = c
>> + if (c == 'd') then
>> + t%tag = TOK_CLASS
>> + do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do
>>
>> Two things:
>>
>> 1.
>>
>> Could you make all calls to iachar( ) affect module level parameters
>> that then are usable accross the module ? you can check out
>> https://github.com/fortran-lang/stdlib/blob/master/src/core/stdlib_ascii.fypp
>> for inspiration. Maybe some of those constants would be worth to be stored
>> there ?
>> 2.
>>
>> Please avoid one-liners for anything that is not a scalar constant:
>>
>>
>> -
>>
>> one = 1; zero =0 is tolerable
>> -
>>
>> do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do is not
>> very "debugger friendly" as one can not easily set break points to follow
>> iterations.
>> For this specific case, t%bmap(iachar('0'):iachar('9')) = .true.
>> would be equivalent, fortranic, and no need for semicolons.
>>
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + stack(top) = tokens(i)
>> + end if
>> + end do
>> +
>> + do while (top > 0)
>> + if (stack(top)%tag == TOK_LPAREN) then
>> + stat = 1
>> + return
>> + end if
>> + num_postfix = num_postfix + 1
>> + postfix(num_postfix) = stack(top)
>> + top = top - 1
>> + end do
>> + end subroutine parse_to_postfix
>> +
>> + integer function new_out(s, o, pool, p_size)
>>
>> This function has side-effects on pool, I totally agree with this
>> https://github.com/JorgeG94/fortran_programmer_llm/blob/main/Fortran_programmer.md#functions-should-have-no-side-effects
>> .
>>
>> Is it possible to consider subroutines when facing mutation on input
>> derived types?
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + top = top - 1
>> + end do
>> + end subroutine parse_to_postfix
>> +
>> + integer function new_out(s, o, pool, p_size)
>> + integer, intent(in) :: s, o
>> + type(out_node), intent(inout) :: pool(:)
>> + integer, intent(inout) :: p_size
>> + p_size = p_size + 1
>> + pool(p_size)%s = s
>> + pool(p_size)%o = o
>> + pool(p_size)%next = 0
>> + new_out = p_size
>> + end function new_out
>> +
>> + subroutine merge_lists(l1, l2, res, pool)
>>
>> same
>> https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036020902
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + type(out_list_type), intent(in) :: l1, l2
>> + type(out_list_type), intent(out) :: res
>> + type(out_node), intent(inout) :: pool(:)
>> + if (l1%head == 0) then
>> + res = l2
>> + else if (l2%head == 0) then
>> + res = l1
>> + else
>> + pool(l1%tail)%next = l2%head
>> + res%head = l1%head
>> + res%tail = l2%tail
>> + end if
>> + end subroutine merge_lists
>> +
>> + subroutine do_patch(states, list, target, pool)
>> + type(state_type), intent(inout) :: states(:)
>>
>> Another style comment: with subroutines, it is customary to list the
>> non-mutable inputs first (strictly intent(in)), then the intent(out) or
>> intent(inout) arguments, and finally the optional ones.
>>
>> Would it be possible to follow this recommendation?
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + integer, intent(in) :: target
>> + type(out_node), intent(in) :: pool(:)
>> + integer :: curr
>> + curr = list%head
>> + do while (curr /= 0)
>> + if (pool(curr)%o == 1) then
>> + states(pool(curr)%s)%out1 = target
>> + else
>> + states(pool(curr)%s)%out2 = target
>> + end if
>> + curr = pool(curr)%next
>> + end do
>> + end subroutine do_patch
>> +
>> + subroutine build_nfa(postfix, num_postfix, states, n_states, start_state, stat)
>> + type(token_type), intent(in) :: postfix(:)
>>
>> same as
>> https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
>> ------------------------------
>>
>> In src/regex/stdlib_regex.f90
>> <#1172 (comment)>
>> :
>>
>> > + if (stat /= 0) then
>> + if (present(status)) status = stat
>> + return
>> + end if
>> +
>> + call parse_to_postfix(tokens, n_tok, postfix, n_post, stat)
>> + if (stat /= 0) then
>> + if (present(status)) status = stat
>> + return
>> + end if
>> +
>> + call build_nfa(postfix, n_post, re%states, re%n_states, re%start_state, stat)
>> + if (present(status)) status = stat
>> + end subroutine regcomp
>> +
>> + recursive subroutine add_thread(list, count, state_idx, start_pos, step_index, states, str_len, visited)
>>
>> same as
>> https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
>>
>> On the implementation level, is recursion absolutely necessary, or
>> could this be implemented without it?
>>
|
|
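Thanks @jalvesz @arjenmarkus ! I’ve updated the code and addressed those issues: Off-by-one and match lengths: Fixed! abc now correctly returns ms=5, me=7, and aaaab with a*b correctly returns ms=1, me=5. Leftmost-Longest Priority: The engine follows the standard where the leftmost start always wins first. Because an "A" matches starting at index 1, it is chosen over any subsequent matches elsewhere in the string. Among all matches starting at that same leftmost position, the engine will strictly select the longest one before concluding. |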
|
I have written a small "interpreter" that will allow you to easily extend
the set of tests. See the attachments. The sample tests are just a start,
of course, but I already found one incompleteness in the checking for a
proper regular expression. You are welcome to use it (or just to ignore it,
if you think it is not useful).
On Sun 5 Apr 2026 at 21:56, JAYA SATHVIK TANGA <
***@***.***> wrote:
*JAi-SATHVIK* left a comment (fortran-lang/stdlib#1172)
<#1172 (comment)>
Thanks @jalvesz <https://github.com/jalvesz> @arjenmarkus
<https://github.com/arjenmarkus> ! I’ve updated the code and addressed
those issues:
*Off-by-one and match lengths:* Fixed! abc now correctly returns ms=5,
me=7, and aaaab with a*b correctly returns ms=1, me=5.
*Leftmost-Longest Priority:* The engine follows the standard where the
leftmost start always wins first. Because an "A" matches starting at
index 1, it is chosen over any subsequent matches elsewhere in the string.
Among all matches starting at that same leftmost position, the engine will
strictly select the longest one before concluding.
! catalogue_regex.f90 --
!
! By Arjen Markus, dated 7 April 2026.
!
! A catalogue of tests for the regular expression module
!
! The program reads a file with test cases:
!     expression "expression" - the expression to be compiled and used
!     input "string"          - the string to be matched against the last regular expression
!     expected "string"       - the string that is expected to match (hence output from the match routine)
! error-exp - expecting an error from the compilation of the last regular expression
! no-match - expecting a "no match" result
!
! These lines instruct the program to compile and use the regular expression.
! The results are reported in the output file.
! The lines in the file should be no more than 100 characters long.
! Also: the double quotes should surround the strings, so that they are properly delimited.
!
! The order of the lines is expected to be:
! the expected output comes after the expression and the input
!
program catalogue_regex
    use stdlib_regex

    implicit none

    type(regex_type)              :: re
    character(len=100)            :: line
    character(len=20)             :: keyword
    character(len=:), allocatable :: value
    character(len=:), allocatable :: expression
    character(len=:), allocatable :: string
    character(len=:), allocatable :: expected
    integer                       :: match_start, match_end, status, ierr
    integer                       :: mismatches
    logical                       :: matched

    open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )

    if ( ierr /= 0 ) then
        write( *, '(a)' ) 'Could not open the file "catalogue_regex.inp"'
        write( *, '(a)' ) 'It should exist - please check'
        error stop
    endif

    open( 20, file = 'catalogue_regex.report' )

    mismatches = 0

    do
        read( 10, '(a)', iostat = ierr ) line

        if ( ierr /= 0 ) then
            exit
        endif

        call extract_information( line, keyword, value )

        select case( keyword )
            case( 'expression' )
                expression = value

            case( 'input' )
                string = value

            case( 'expected' )
                write( 20, '(a)' ) ''
                expected = value

                call regcomp( re, expression, status )

                if ( status /= 0 ) then
                    mismatches = mismatches + 1
                    write( 20, '(a,i0)' ) 'Error compiling the expression: status = ', status
                    write( 20, '(a,2a)' ) '    Expression: "', expression, '"'
                else
                    call regmatch( re, string, matched, match_start, match_end )

                    if ( matched ) then
                        write( 20, '(a,2a)' ) 'Match found:'
                        write( 20, '(a,2a)' ) '    Expression:   "', expression, '"'
                        write( 20, '(a,2a)' ) '    Input string: "', string, '"'
                        write( 20, '(a,2a)' ) '    Substring:    "', string(match_start:match_end), '"'
                        write( 20, '(a,2a)' ) '    Expected:     "', expected, '"'

                        if ( expected == string(match_start:match_end) ) then
                            write( 20, '(a,2a)' ) '    Success!'
                        else
                            mismatches = mismatches + 1
                            write( 20, '(a,2a)' ) '    MISMATCH!'
                        endif
                    else
                        mismatches = mismatches + 1
                        write( 20, '(a,2a)' ) 'NO match found:'
                        write( 20, '(a,2a)' ) '    Expression:   "', expression, '"'
                        write( 20, '(a,2a)' ) '    Input string: "', string, '"'
                        write( 20, '(a,2a)' ) '    Substring:    (none)'
                        write( 20, '(a,2a)' ) '    Expected:     "', expected, '"'
                    endif
                endif

            case( 'error-exp' )
                write( 20, '(a)' ) ''

                call regcomp( re, expression, status )

                if ( status /= 0 ) then
                    write( 20, '(a)' ) 'Error detected as expected:'
                    write( 20, '(a,2a)' ) '    Expression: "', expression, '"'
                else
                    mismatches = mismatches + 1
                    write( 20, '(a)' ) 'An error was expected but not detected:'
                    write( 20, '(a,2a)' ) '    Expression: "', expression, '"'
                endif

            case( 'no-match' )
                write( 20, '(a)' ) ''

                call regcomp( re, expression, status )

                if ( status /= 0 ) then
                    mismatches = mismatches + 1
                    write( 20, '(a,i0)' ) 'Error compiling the expression: status = ', status
                    write( 20, '(a,2a)' ) '    Expression: "', expression, '"'
                else
                    call regmatch( re, string, matched, match_start, match_end )

                    if ( matched ) then
                        mismatches = mismatches + 1
                        write( 20, '(a,2a)' ) 'Match found where none expected:'
                        write( 20, '(a,2a)' ) '    Expression:   "', expression, '"'
                        write( 20, '(a,2a)' ) '    Input string: "', string, '"'
                        write( 20, '(a,2a)' ) '    Substring:    "', string(match_start:match_end), '"'
                        write( 20, '(a,2a)' ) '    Expected:     (none)'
                    else
                        write( 20, '(a,2a)' ) 'No match found, as expected:'
                        write( 20, '(a,2a)' ) '    Expression:   "', expression, '"'
                        write( 20, '(a,2a)' ) '    Input string: "', string, '"'
                        write( 20, '(a,2a)' ) '    Expected:     (none)'
                    endif
                endif

            case default
                ! Treat any other keyword as comment
        end select
    enddo

    write( 20, '(/,a,i0)' ) 'Number of mismatches or other errors: ', mismatches
    write( *, '(a)' ) 'Program completed'
contains
    subroutine extract_information( line, keyword, value )
        character(len=*), intent(in)               :: line
        character(len=*), intent(out)              :: keyword
        character(len=:), intent(out), allocatable :: value

        ! The type-spec in the constructor gives every element the same length
        character(len=20), dimension(5), parameter :: known_keywords = &
            [ character(len=20) :: 'expression', 'input', 'expected', 'error-exp', 'no-match' ]
        integer :: k1, k2

        ! Make sure both outputs are defined, even if the read below fails
        keyword = ""
        value   = ""

        if ( line == " " ) then
            return
        endif

        read( line, *, iostat = ierr ) keyword

        if ( keyword == 'error-exp' .or. keyword == 'no-match' ) then
            return
        endif

        if ( any( keyword == known_keywords ) ) then
            k1 = index( line, '"' )
            if ( k1 > 0 ) then
                ! index() returns 0 if there is no closing quote, so k2 == k1 in that case
                k2 = k1 + index( line(k1+1:), '"' )
                if ( k2 > k1 ) then
                    value = line(k1+1:k2-1)
                else
                    write( 20, '(a)' )  'Error interpreting the input line:'
                    write( 20, '(3a)' ) '    "', trim(line), '"'
                    write( 20, '(a)' )  'Program stopped'
                    write( *, '(a)' )   'Program stopped - error reading input. Please check'
                    error stop
                endif
            endif
        endif
    end subroutine extract_information
end program catalogue_regex
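
For illustration, a hypothetical `catalogue_regex.inp` using the keywords that the program's `select case` recognises might look like the sketch below. The patterns and strings are examples only, not the attached sample tests. The expected substrings assume the leftmost-longest behaviour described earlier in this thread, and the last case assumes that the unbalanced parenthesis in `(abc` is rejected by the check in `parse_to_postfix`.

```text
# lines that start with an unknown keyword are treated as comments
expression "a*b"
input "aaaab"
expected "aaaab"

expression "(cat|dog)s?"
input "two dogs"
expected "dogs"

expression "abc"
input "xyz"
no-match

expression "(abc"
error-exp
```

Each `expected`, `no-match` or `error-exp` line triggers a test against the most recently read `expression` (and `input`), so the order of the lines within a case matters.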
|
issue #1163