
Fix REGEXP_EXTRACT to return NULL instead of empty string on no match #31

Open
ggjh-159 wants to merge 1 commit into bigo-sg:gluten-0530 from ggjh-159:fix/regexp-extract-null-on-no-match

Conversation


@ggjh-159 ggjh-159 commented Jun 9, 2026

What changes are proposed in this pull request?

Override regexp_extract so that it returns NULL, rather than an empty string, when the pattern does not match. The Spark version passes emptyNoMatch=true, which yields an empty string on no match, whereas Flink's REGEXP_EXTRACT expects NULL. The difference matters for IS NOT NULL filtering: an empty string is not NULL, so rows that fail to match pass the filter when they should be dropped.
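To make the two behaviors concrete, here is a minimal Python sketch of the semantics only. It is not the code in this patch: the function name, the empty_no_match parameter, and the sample URLs are hypothetical stand-ins for the emptyNoMatch flag described above, and Python's re module is used in place of the engine's regex library. Treating a group that did not participate in the match the same as no match is an assumption of the sketch.

```python
import re
from typing import Optional


def regexp_extract_sketch(text: Optional[str], pattern: str, group: int,
                          empty_no_match: bool) -> Optional[str]:
    """Illustrative model of the two no-match behaviors (hypothetical helper)."""
    if text is None:
        return None  # NULL input yields NULL in both variants
    match = re.search(pattern, text)
    # Assumption: a group that did not participate is handled like "no match".
    value = match.group(group) if match else None
    if value is None:
        return "" if empty_no_match else None
    return value


PATTERN = r"(&|^)channel_id=([^&]*)"
with_channel = "https://www.nexmark.com/a/b/item.htm?query=1&channel_id=42"
without_channel = "https://www.nexmark.com/a/b/item.htm?query=1"

# Spark-style: empty string on no match.
spark_style = regexp_extract_sketch(without_channel, PATTERN, 2, empty_no_match=True)
# Flink-style: NULL (None) on no match.
flink_style = regexp_extract_sketch(without_channel, PATTERN, 2, empty_no_match=False)
# Both variants agree when the pattern matches.
matched = regexp_extract_sketch(with_channel, PATTERN, 2, empty_no_match=False)
```

With empty_no_match=True the non-matching URL produces "", which is a non-NULL value; with empty_no_match=False it produces None, which is what an IS NOT NULL predicate needs in order to reject the row.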

How was this patch tested?

Tested by running Nexmark Q21 on Gluten-Flink with 10000 events. The query filters with WHERE REGEXP_EXTRACT(url, '(&|^)channel_id=([^&]*)', 2) IS NOT NULL. Before this fix, rows whose URL has no channel_id were incorrectly included, because REGEXP_EXTRACT returned "" and the IS NOT NULL predicate was therefore always true. After this fix, those rows are correctly filtered out.
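The effect on the Q21 filter can be reproduced in miniature with the Python sketch below. It only models the predicate; it does not run Nexmark or Gluten-Flink. The three URLs and all helper names are made up for illustration, and Python's re module stands in for the engine's regex implementation.

```python
import re

# Pattern taken from the WHERE clause quoted above.
PATTERN = r"(&|^)channel_id=([^&]*)"


def extract_empty_on_no_match(url):
    """Models the old behavior: "" when the pattern does not match."""
    m = re.search(PATTERN, url)
    return m.group(2) if m else ""


def extract_null_on_no_match(url):
    """Models the fixed behavior: None (NULL) when the pattern does not match."""
    m = re.search(PATTERN, url)
    return m.group(2) if m else None


def where_is_not_null(urls, extract):
    """Keeps rows for which extract(url) IS NOT NULL."""
    return [u for u in urls if extract(u) is not None]


# Hypothetical sample rows; only the first one carries a channel_id.
urls = [
    "https://www.nexmark.com/a/b/item.htm?query=1&channel_id=42",
    "https://www.nexmark.com/a/b/item.htm?query=1",
    "https://www.nexmark.com/c/d/item.htm",
]

kept_before = where_is_not_null(urls, extract_empty_on_no_match)
kept_after = where_is_not_null(urls, extract_null_on_no_match)
```

In this sketch kept_before contains all three rows, since "" is never NULL, while kept_after contains only the row that actually has a channel_id.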

Note: A companion fix in gluten-flink is also needed. See gluten-pr-12267.

Related issue: gluten-issue-12266.

@ggjh-159 ggjh-159 force-pushed the fix/regexp-extract-null-on-no-match branch from 89dfc96 to ded03e3 Compare June 9, 2026 08:05
@ggjh-159 (Author)

@lgbo-ustc @zhanglistar @KevinyhZou Would you mind reviewing this PR?
