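Commit cc9f869
Optimize regexp_replace by stripping trailing .* from anchored patterns. 2.4x improvement (ClickBench Q28) (#21379)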
## Which issue does this PR close?
- Closes: #21382
## Rationale for this change
`regexp_replace` with anchored patterns like
`^https?://(?:www\.)?([^/]+)/.*$` spends time scanning the trailing
`.*$`, and it calls `captures()` + `expand()`, allocating a `String` on
every row.
The query `SELECT regexp_replace(url,
'^https?://(?:www\.)?([^/]+)/.*$', '\1')` happens to fit this shape and
runs 2.4x faster with this optimization.
## What changes are included in this PR?
- Strip trailing `.*$` from the pattern string for anchored patterns
where the replacement is `\1`
- Use `captures_read` with pre-allocated `CaptureLocations` for direct
byte-slice extraction
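The pattern-stripping step can be sketched with the standard library alone. The helper name `strip_trailing_dot_star` and its exact guards are assumptions for illustration; the PR's real conditions may differ. In particular, `.` does not match `\n` by default, so a real implementation presumably has to handle rows containing newlines, which this sketch ignores.

```rust
/// Hypothetical helper (not DataFusion's actual function): returns the pattern
/// with a trailing `.*$` removed when the shortcut looks applicable.
fn strip_trailing_dot_star<'a>(pattern: &'a str, replacement: &str) -> Option<&'a str> {
    // Only anchored patterns whose replacement is exactly the first capture group.
    if !pattern.starts_with('^') || replacement != r"\1" {
        return None;
    }
    let stripped = pattern.strip_suffix(".*$")?;
    // An odd number of trailing backslashes means the `.` was escaped (`\.*$`),
    // so the suffix was not a "match anything" tail.
    let trailing_backslashes = stripped
        .chars()
        .rev()
        .take_while(|&c| c == '\\')
        .count();
    if trailing_backslashes % 2 == 1 {
        return None;
    }
    Some(stripped)
}
```

For the pattern in the rationale, the helper returns `^https?://(?:www\.)?([^/]+)/`. It returns `None` for an unanchored pattern or for any replacement other than `\1`, leaving those cases on the general path.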
## Are these changes tested?
Yes, covered by existing `regexp_replace` unit tests, ClickBench
sqllogictests, and the new URL domain extraction sqllogictest.
## Are there any user-facing changes?
No.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

1 parent e1ad871 commit cc9f869
File tree
2 files changed: +121 -13 lines
- datafusion
  - functions/src/regex (92 additions & 13 deletions)
  - sqllogictest/test_files/regexp (29 additions & 0 deletions)