Implement Thompson NFA-based Regular Expressions #1172

Open
JAi-SATHVIK wants to merge 113 commits into fortran-lang:master from JAi-SATHVIK:regex
Conversation

@JAi-SATHVIK
Contributor

issue #1163

@JAi-SATHVIK JAi-SATHVIK marked this pull request as draft March 31, 2026 19:48
@codecov
codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 86.49886% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.21%. Comparing base (3b447b9) to head (29f598b).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/regex/stdlib_regex.f90 | 87.88% | 35 Missing ⚠️ |
| example/regex/example_regex_regmatch.f90 | 0.00% | 15 Missing ⚠️ |
| example/regex/example_regex_regcomp.f90 | 0.00% | 7 Missing ⚠️ |
| test/regex/test_regex.f90 | 98.41% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1172      +/-   ##
==========================================
+ Coverage   68.66%   69.21%   +0.55%     
==========================================
  Files         408      412       +4     
  Lines       13619    14056     +437     
  Branches     1537     1603      +66     
==========================================
+ Hits         9351     9729     +378     
- Misses       4268     4327      +59     


@JAi-SATHVIK JAi-SATHVIK marked this pull request as ready for review April 3, 2026 21:24
@JAi-SATHVIK
Contributor Author

Hi @jalvesz @jvdp1, the "cmake-3.14" CI job is failing because pip install cmake==3.14.3 requests a version that is no longer available on the PyPI index. Can we update .github/workflows/CI.yml to use cmake==3.14.3.post1?

@JAi-SATHVIK
Contributor Author

JAi-SATHVIK commented Apr 4, 2026

Update

I have finalized the core implementation of the pure Fortran regex engine. Here is a summary of what I've completed:

  • Correct Shunting-Yard Parsing: Fixed logic bugs in the parser to properly handle parentheses, "(" and ")", and operator precedence during postfix conversion.
  • Lexer Anchor Handling: Updated the lexer to accurately handle start (^) and end ($) anchors with correct implicit concatenation logic.
  • Accurate Match Reporting: Fixed an off-by-one error in regmatch to ensure correct 1-based match_start indices.
  • Safety and Stability: Hardened the engine against out-of-bounds access and memory issues by refactoring eager logical evaluations and utilizing local state management for thread safety.
  • Unit Test Integration: Migrated the test suite to the repository's standard test-drive framework, with 10 comprehensive test cases covering literals, character classes, anchors, and alternation.

The engine is now stable, zero-dependency, and ready for your feedback! @arjenmarkus @jvdp1 @jalvesz
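
For readers unfamiliar with the "eager logical evaluations" item above: the sketch below assumes it refers to the fact that the Fortran standard does not guarantee short-circuit evaluation of `.and.` / `.or.`, so a bounds test and an indexed access in the same expression may both be evaluated. The program and variable names are illustrative only and are not taken from the PR.

```fortran
program short_circuit_sketch
    implicit none
    character(len=*), parameter :: pattern = "ab"   ! illustrative input
    integer :: i

    i = len(pattern) + 1   ! one position past the end of the pattern

    ! Risky form: the processor may evaluate both operands, so
    ! pattern(i:i) can be referenced even when i > len(pattern).
    !   if (i <= len(pattern) .and. pattern(i:i) == "*") ...

    ! Safer form: nest the tests so the bounds check always runs first.
    if (i <= len(pattern)) then
        if (pattern(i:i) == "*") print *, "star operator"
    else
        print *, "end of pattern"
    end if
end program short_circuit_sketch
```

Nesting costs an extra level of indentation but keeps the access well defined regardless of how the compiler orders the operands.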


call regcomp(re, "(cat|dog)s?", stat)
if (stat /= 0) error stop "Invalid regex pattern"
```
Contributor

This should ideally be an executable example program in the examples folder

Contributor Author

Done. I have removed the inline Fortran code blocks in doc/specs/stdlib_regex.md and created a standalone executable example: example/regex/example_regex_regcomp.f90.


### Example

```fortran
Contributor

Same as before: this should be an executable program in the examples folder.

Contributor Author

Done. I have removed the inline Fortran code blocks in doc/specs/stdlib_regex.md and created a standalone executable example: example/regex/example_regex_regmatch.f90.


contains

logical function is_term_ender(tag)
Contributor

Can this be made pure or, even better, elemental?

Contributor Author

Added the pure prefix to is_term_ender, is_term_starter, and prec.
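
As a rough illustration of the reviewer's "even better, elemental" suggestion, here is a minimal sketch of a tag predicate declared `elemental`. The token constants and the set of tags that end a term are hypothetical placeholders, not the values used in the PR.

```fortran
module token_sketch
    implicit none
    ! Hypothetical token tags; the PR's actual constants are not shown here.
    integer, parameter :: TOK_CHAR = 1, TOK_STAR = 2
    integer, parameter :: TOK_LPAREN = 3, TOK_RPAREN = 4
contains
    ! An elemental function is pure by default and also applies
    ! element-wise when called with an array of tags.
    elemental logical function is_term_ender(tag)
        integer, intent(in) :: tag
        is_term_ender = (tag == TOK_CHAR .or. tag == TOK_STAR .or. &
                         tag == TOK_RPAREN)
    end function is_term_ender
end module token_sketch

program demo_elemental
    use token_sketch
    implicit none
    ! One call classifies a whole array of tags.
    print *, is_term_ender([TOK_CHAR, TOK_LPAREN, TOK_RPAREN])
end program demo_elemental
```

Since the predicate takes a scalar integer and has no side effects, `elemental` costs nothing over `pure` and allows array calls like the one in the demo.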

tag == TOK_START)
end function is_term_ender

logical function is_term_starter(tag)
Contributor

Can this be made pure or, even better, elemental?

Contributor Author

Done. Added the pure prefix to is_term_ender, is_term_starter, and prec.

tag == TOK_START .or. tag == TOK_END)
end function is_term_starter

integer function prec(tag)
Contributor

Can this be made pure or, even better, elemental?

Contributor Author

Done. Added the pure prefix to is_term_ender, is_term_starter, and prec.

end do
end subroutine parse_to_postfix

integer function new_out(s, o, pool, p_size)
Contributor

This function has side effects on pool. I fully agree with this guideline: https://github.com/JorgeG94/fortran_programmer_llm/blob/main/Fortran_programmer.md#functions-should-have-no-side-effects.

Could you use subroutines whenever a procedure mutates a derived-type argument?

Contributor Author

Converted integer function new_out(...), which was mutating its pool argument, into a pure subroutine new_out(..., return_idx).
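
A minimal sketch of what such a conversion could look like. The `out_node` type, its components, and the bounds handling are assumptions made for illustration; only the procedure name and the `pool`, `p_size`, and `return_idx` argument names come from this thread.

```fortran
module pool_sketch
    implicit none
    type :: out_node              ! hypothetical pool entry
        integer :: state = 0
        integer :: slot  = 0
        integer :: next  = 0
    end type out_node
contains
    ! The mutation of pool and p_size is now visible in the interface,
    ! and the new index comes back through an explicit intent(out) argument.
    pure subroutine new_out(s, o, pool, p_size, return_idx)
        integer, intent(in) :: s, o
        type(out_node), intent(inout) :: pool(:)
        integer, intent(inout) :: p_size
        integer, intent(out) :: return_idx
        if (p_size >= size(pool)) then
            return_idx = 0        ! assumed convention: 0 means "pool full"
            return
        end if
        p_size = p_size + 1
        pool(p_size) = out_node(s, o, 0)
        return_idx = p_size
    end subroutine new_out
end module pool_sketch

program demo_new_out
    use pool_sketch
    implicit none
    type(out_node) :: pool(8)
    integer :: p_size, idx
    p_size = 0
    call new_out(3, 1, pool, p_size, idx)
    print *, "new entry stored at index", idx
end program demo_new_out
```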

new_out = p_size
end function new_out

subroutine merge_lists(l1, l2, res, pool)
Contributor

end subroutine merge_lists

subroutine do_patch(states, list, target, pool)
type(state_type), intent(inout) :: states(:)
Contributor

Another style comment: in subroutines it is customary to list the non-mutable inputs first (strictly intent(in)), then the intent(out) or intent(inout) arguments, and finally the optional ones.

Could you follow this recommendation?

Contributor Author

Standardized the argument order on all subroutines (such as do_patch, merge_lists, and add_thread) so that intent(in) arguments always appear first, followed by intent(out) and intent(inout).
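
To make the convention concrete, here is a small sketch with a hypothetical `patch_all` routine (not the PR's do_patch, whose full signature is not shown in this thread). Arguments are ordered as inputs, then mutated arguments, then optionals.

```fortran
module arg_order_sketch
    implicit none
contains
    ! Order: intent(in) first, then intent(inout)/intent(out), then optional.
    pure subroutine patch_all(list, target_state, outs, stat)
        integer, intent(in)    :: list(:)        ! indices to patch
        integer, intent(in)    :: target_state   ! value written at each index
        integer, intent(inout) :: outs(:)        ! mutated in place
        integer, intent(out), optional :: stat   ! 0 = ok, 1 = bad index
        integer :: i
        if (present(stat)) stat = 0
        do i = 1, size(list)
            if (list(i) < 1 .or. list(i) > size(outs)) then
                if (present(stat)) stat = 1
                return
            end if
            outs(list(i)) = target_state
        end do
    end subroutine patch_all
end module arg_order_sketch

program demo_arg_order
    use arg_order_sketch
    implicit none
    integer :: outs(4), stat
    outs = 0
    call patch_all([2, 4], 7, outs, stat)
    print *, outs, stat
end program demo_arg_order
```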

end subroutine do_patch

subroutine build_nfa(postfix, num_postfix, states, n_states, start_state, stat)
type(token_type), intent(in) :: postfix(:)
Contributor

if (present(status)) status = stat
end subroutine regcomp

recursive subroutine add_thread(list, count, state_idx, start_pos, step_index, states, str_len, visited)
Contributor

same as https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863

On the implementation level, is recursion absolutely necessary, or is there a way to implement this without it?

Contributor Author

Good point. Deep NFAs could in theory exceed the recursion limit. I've rewritten add_thread to use an explicit local stack array to follow epsilon transitions, so NFA compilation and execution are now fully iterative, with no risk of crashes from excessive recursion depth.
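
A simplified sketch of following epsilon transitions with an explicit stack instead of recursion. The state layout, the opcode constants, and the reduced argument list are assumptions for illustration; the PR's add_thread takes more arguments than shown here.

```fortran
module closure_sketch
    implicit none
    ! Hypothetical opcodes and state layout, not the PR's state_type.
    integer, parameter :: OP_CHAR = 1, OP_SPLIT = 2, OP_MATCH = 3
    type :: state_t
        integer :: op = OP_MATCH
        integer :: out1 = 0, out2 = 0
    end type state_t
contains
    pure subroutine add_thread(states, start_idx, list, count)
        type(state_t), intent(in) :: states(:)
        integer, intent(in)  :: start_idx
        integer, intent(out) :: list(:)   ! assumed size(list) >= size(states)
        integer, intent(out) :: count
        ! Each split pushes two entries, so 2*n + 1 slots always suffice.
        integer :: stack(2*size(states) + 1), top, s
        logical :: visited(size(states))

        visited = .false.
        list = 0
        count = 0
        top = 1
        stack(1) = start_idx
        do while (top > 0)
            s = stack(top)
            top = top - 1
            if (s < 1 .or. s > size(states)) cycle
            if (visited(s)) cycle
            visited(s) = .true.
            if (states(s)%op == OP_SPLIT) then
                ! Push out2 first so that out1 is explored first.
                top = top + 1
                stack(top) = states(s)%out2
                top = top + 1
                stack(top) = states(s)%out1
            else
                count = count + 1
                list(count) = s
            end if
        end do
    end subroutine add_thread
end module closure_sketch

program demo_closure
    use closure_sketch
    implicit none
    type(state_t) :: states(4)
    integer :: list(4), count
    ! NFA shape for a*b: 1 = split(2,3), 2 = 'a' -> 1, 3 = 'b' -> 4, 4 = match
    states = [state_t(OP_SPLIT, 2, 3), state_t(OP_CHAR, 1, 0), &
              state_t(OP_CHAR, 4, 0), state_t(OP_MATCH, 0, 0)]
    call add_thread(states, 1, list, count)
    if (count /= 2 .or. any(list(1:2) /= [2, 3])) error stop "unexpected closure"
    print *, "active states:", list(1:count)
end program demo_closure
```

The `visited` array bounds the work per step to the number of states, which is what keeps the simulation linear in the input length.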

@arjenmarkus
Member

arjenmarkus commented Apr 5, 2026 via email

@arjenmarkus
Member

arjenmarkus commented Apr 5, 2026 via email

@arjenmarkus
Member

arjenmarkus commented Apr 5, 2026 via email

@JAi-SATHVIK
Contributor Author

Thanks @jalvesz @arjenmarkus! I’ve updated the code and addressed those issues:

  • Off-by-one and match lengths: Fixed. abc now correctly returns ms=5, me=7, and aaaab with a*b correctly returns ms=1, me=5.
  • Leftmost-longest priority: The engine follows the standard rule that the leftmost starting position always wins. Because an "A" matches starting at index 1, that match is chosen over any later match in the string. Among all matches that start at that leftmost position, the engine then selects the longest one.
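
The selection rule can be sketched independently of the engine. The candidate spans below are made-up illustrative data, not output from the PR's regmatch; the loop simply prefers the smallest start and, among equal starts, the largest end.

```fortran
program leftmost_longest_sketch
    implicit none
    ! Hypothetical candidate matches as (start, end) pairs.
    integer, parameter :: ms(4) = [3, 1, 1, 2]
    integer, parameter :: me(4) = [9, 2, 5, 8]
    integer :: i, best

    best = 1
    do i = 2, size(ms)
        if (ms(i) < ms(best)) then
            best = i                      ! an earlier start always wins
        else if (ms(i) == ms(best) .and. me(i) > me(best)) then
            best = i                      ! same start: prefer the longer match
        end if
    end do

    print '(a,i0,a,i0)', "ms=", ms(best), ", me=", me(best)
end program leftmost_longest_sketch
```

With these candidates the program prints `ms=1, me=5`: the span starting at 3 loses despite being the longest, because start position is compared first.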

@JAi-SATHVIK JAi-SATHVIK requested a review from jalvesz April 5, 2026 19:59
@arjenmarkus
Member

arjenmarkus commented Apr 7, 2026 via email
