Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add config to skip record ingestion on string column length exceeding configured max schema length #13103

Merged
merged 10 commits into from
Jun 4, 2024

Conversation

tibrewalpratik17
Copy link
Contributor

@tibrewalpratik17 tibrewalpratik17 commented May 7, 2024

label:
ingestion
feature
release-notes

  • At present, SanitizationTransformer just trims the string length based on configured schema max length. The process is very silent with no observability if the field got trimmed or an option to skip rather than trimming. This becomes critical if the field even if string is treated for json-indexing as post-trimming it can lead to trimmed json / malformed json eventually stopping ingestion.
  • Here, we add a new table-schema config in FieldSpec maxLengthExceedStrategy. Default value is to TRIM_LENGTH to ensure backward compatibility. If configured to be ERROR, in case of incoming-string-value exceeding the configured max length, it will skip the record, return error and log accordingly. If set as SUBSTITUTE_DEFAULT_VALUE, it will substitute the configured default value in place of the huge string. There is another strategy as NO_ACTION which justs ingests the string.
  • We have extended similar functionality to JSON and BYTES column type with default being NO_ACTION to preserve backward incompatibility.

Sample FieldSpec:

{
        "name": "Attr",
        "dataType": "STRING",
        "maxLength": 10000,
        "maxLengthExceedStrategy": "FAIL_INGESTION"
}

@codecov-commenter
Copy link

codecov-commenter commented May 7, 2024

Codecov Report

Attention: Patch coverage is 67.64706% with 33 lines in your changes are missing coverage. Please review.

Project coverage is 62.14%. Comparing base (59551e4) to head (e2a63f8).
Report is 523 commits behind head on master.

Files Patch % Lines
...cal/recordtransformer/SanitizationTransformer.java 67.94% 21 Missing and 4 partials ⚠️
...gment/local/segment/creator/TransformPipeline.java 16.66% 4 Missing and 1 partial ⚠️
.../org/apache/pinot/spi/data/DimensionFieldSpec.java 0.00% 2 Missing ⚠️
...a/manager/realtime/RealtimeSegmentDataManager.java 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13103      +/-   ##
============================================
+ Coverage     61.75%   62.14%   +0.39%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2536     +100     
  Lines        133233   139488    +6255     
  Branches      20636    21566     +930     
============================================
+ Hits          82274    86681    +4407     
- Misses        44911    46311    +1400     
- Partials       6048     6496     +448     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.11% <67.64%> (+0.40%) ⬆️
java-21 62.02% <67.64%> (+0.40%) ⬆️
skip-bytebuffers-false 62.13% <67.64%> (+0.38%) ⬆️
skip-bytebuffers-true 62.00% <67.64%> (+34.27%) ⬆️
temurin 62.14% <67.64%> (+0.39%) ⬆️
unittests 62.13% <67.64%> (+0.39%) ⬆️
unittests1 46.68% <14.70%> (-0.21%) ⬇️
unittests2 27.75% <52.94%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snleee @swaminathanmanish Can you help also review this?

Copy link
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tibrewalpratik17 for the PR, overall LGTM. Left a minor comment.

Copy link
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tibrewalpratik17 for addressing the review comments. LGTM.

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest extending this to make it applicable to BYTES and JSON as well

@tibrewalpratik17
Copy link
Contributor Author

I'd suggest extending this to make it applicable to BYTES and JSON as well

I see #13181 already covering JSON scenario.

@tibrewalpratik17 tibrewalpratik17 force-pushed the fail_on_sanitization branch 3 times, most recently from 0bbd7de to 9183264 Compare May 22, 2024 10:07
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall direction looks good!
If we allow maxLengthExceedStrategy to be null, a lot of test changes can be simplified

@tibrewalpratik17
Copy link
Contributor Author

If we allow maxLengthExceedStrategy to be null, a lot of test changes can be simplified

Yes this simplifies the PR a lot! Updated accordingly! Thanks.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good. Let's also add support for BYTES and JSON since we have all the required strategy supported

@Jackie-Jiang Jackie-Jiang added documentation release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) labels May 24, 2024
@tibrewalpratik17 tibrewalpratik17 force-pushed the fail_on_sanitization branch 2 times, most recently from 3b24d12 to 2c3c923 Compare May 27, 2024 10:15
case BYTES:
maxLengthExceedStrategy = fieldSpec.getMaxLengthExceedStrategy() == null
? FieldSpec.MaxLengthExceedStrategy.NO_ACTION : fieldSpec.getMaxLengthExceedStrategy();
_columnToColumnInfoMap.put(fieldSpec.getName(), new SanitizedColumnInfo(fieldSpec.getName(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to include the column when it is NO_ACTION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of string do we want to ingest NULL_CHARACTER in case of NO_ACTION ?

@tibrewalpratik17
Copy link
Contributor Author

hey @Jackie-Jiang addressed comments! Can you please help with the review?

@Jackie-Jiang Jackie-Jiang merged commit 9822100 into apache:master Jun 4, 2024
19 checks passed
gortiz pushed a commit to gortiz/pinot that referenced this pull request Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) documentation feature ingestion observability release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants