Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow user configurable regex library for queries #13005

Merged
merged 2 commits into from
May 17, 2024

Conversation

itschrispeck
Copy link
Collaborator

@itschrispeck itschrispeck commented Apr 25, 2024

This PR addresses #12628

This provides functionality to configure a regex library that will be used during query execution.

  • PatternFactory.compile(...) is used in place of Pattern.compile(...)
  • Pattern interface added maintains the same semantics as java.util.Pattern
  • Matcher interface added maintains the same semantics as java.util.Matcher

A new config is added: pinot.server.query.regex.class
Valid values for this config are: JAVA_UTIL or RE2J (it might be better to accept class names instead?).

Some testing showed regex libraries have very different strengths/weaknesses, so it seemed best to allow users to choose an implementation that works well for them.

Default behavior is unchanged.

tag: feature

@codecov-commenter
Copy link

codecov-commenter commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 62.20%. Comparing base (59551e4) to head (72cff23).
Report is 431 commits behind head on master.

Files Patch % Lines
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% 2 Missing ⚠️
...rg/apache/pinot/common/utils/regex/RegexClass.java 83.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13005      +/-   ##
============================================
+ Coverage     61.75%   62.20%   +0.44%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2521      +85     
  Lines        133233   137901    +4668     
  Branches      20636    21327     +691     
============================================
+ Hits          82274    85776    +3502     
- Misses        44911    45734     +823     
- Partials       6048     6391     +343     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.15% <93.75%> (+0.44%) ⬆️
java-21 62.07% <93.75%> (+0.45%) ⬆️
skip-bytebuffers-false 62.18% <93.75%> (+0.43%) ⬆️
skip-bytebuffers-true 62.04% <93.75%> (+34.31%) ⬆️
temurin 62.20% <93.75%> (+0.44%) ⬆️
unittests 62.19% <93.75%> (+0.44%) ⬆️
unittests1 46.84% <97.82%> (-0.05%) ⬇️
unittests2 27.77% <0.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@richardstartin
Copy link
Member

Why not just replace the implementation to keep the interface simple and easy to use? If Java's built in regular expressions have the edge somewhere, it's probably a pretty small edge, and it would be nice to avoid backtracking by default.

@itschrispeck
Copy link
Collaborator Author

itschrispeck commented Apr 27, 2024

Why not just replace the implementation to keep the interface simple and easy to use? If Java's built in regular expressions have the edge somewhere, it's probably a pretty small edge, and it would be nice to avoid backtracking by default.

I don't have a good understanding of how Pinot's users use regex, but from searching it seems that the median latency/memory usage of re2j is higher than Java's built in library. It'd also become a backwards incompatible change due to feature gaps. In general I don't think users would be primarily using Pinot for regex filtering (i.e. should we have an optimized path for the average case or focus on worst case performance?), but that's my conjecture.

Some references:
https://github.com/DaniilRoman/re2j_test
google/re2j#162
google/re2j#12

If there is a consensus that an outright replacement is acceptable I'm happy to go that route.

@deemoliu
Copy link
Contributor

+1 on making this configurable in the first version. And this change, it will be easier for us to benchmark with different library.

There are some known cases re2j doesn't outperform the current implementation, so it's a good idea to on hold the replacement of the existing library.

We can do another thorough analysis on when re2j is more efficient and when it's not, and make a proposal on the default behavior of regex.

Copy link
Contributor

@chenboat chenboat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great feature to add to Pinot.

@chenboat chenboat merged commit b86d1a0 into apache:master May 17, 2024
22 checks passed
gortiz pushed a commit to gortiz/pinot that referenced this pull request Jun 14, 2024
* allow user configurable regex library

* fix import ordering
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants