Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow using 'serverReturnFinalResult' to optimize server partitioned table #13208

Merged

Conversation

Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented May 23, 2024

When a column is partitioned on each server (i.e. the same value always show up on the same server), the following queries can be optimized by asking server to directly return final aggregate result instead of intermediate aggregate result.
1: SELECT DISTINCT_COUNT(partitionedCol) FROM myTable
2: SELECT DISTINCT_COUNT(partitionedCol) FROM myTable GROUP BY col
3: SELECT AGG(col) FROM myTable GROUP BY partitionedCol

For all 3 queries, we can ask server to return final aggregate result, but there are some difference between 2 and 3. For 2, server can return final aggregate result, but should still keep enough groups because the aggregate result is not global final result, but only the final result for a partition; For 3, server only needs to keep LIMIT groups because the aggregate result is global final result for the group.

In this PR, user can SET serverReturnFinalResult = true; to accelerate 1 and 3; user can SET serverReturnFinalResultKeyUnpartitioned = true; to accelerate 2

This PR also fixes the following bugs:

  • Exception is swallowed in GroupByDataTableReducer
  • Null handing for MinMaxRangeAggregationFunction

@codecov-commenter
Copy link

codecov-commenter commented May 23, 2024

Codecov Report

Attention: Patch coverage is 44.60094% with 118 lines in your changes are missing coverage. Please review.

Project coverage is 62.16%. Comparing base (59551e4) to head (78fee19).
Report is 497 commits behind head on master.

Files Patch % Lines
...core/query/reduce/AggregationDataTableReducer.java 11.76% 29 Missing and 1 partial ⚠️
...not/core/query/reduce/GroupByDataTableReducer.java 65.33% 21 Missing and 5 partials ⚠️
...aggregation/function/AggregationFunctionUtils.java 0.00% 14 Missing and 1 partial ⚠️
...org/apache/pinot/core/data/table/IndexedTable.java 70.58% 2 Missing and 3 partials ⚠️
...unction/funnel/FunnelCountAggregationFunction.java 28.57% 5 Missing ⚠️
...tor/streaming/StreamingGroupByCombineOperator.java 0.00% 4 Missing ⚠️
...org/apache/pinot/core/data/table/TableResizer.java 75.00% 0 Missing and 3 partials ⚠️
.../core/operator/combine/GroupByCombineOperator.java 25.00% 1 Missing and 2 partials ⚠️
...ation/function/MinMaxRangeAggregationFunction.java 60.00% 1 Missing and 1 partial ⚠️
...pache/pinot/common/utils/grpc/GrpcQueryClient.java 0.00% 1 Missing ⚠️
... and 24 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13208      +/-   ##
============================================
+ Coverage     61.75%   62.16%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2534      +98     
  Lines        133233   139008    +5775     
  Branches      20636    21535     +899     
============================================
+ Hits          82274    86420    +4146     
- Misses        44911    46126    +1215     
- Partials       6048     6462     +414     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 35.25% <44.60%> (-26.46%) ⬇️
java-21 62.04% <44.60%> (+0.41%) ⬆️
skip-bytebuffers-false 62.15% <44.60%> (+0.40%) ⬆️
skip-bytebuffers-true 62.02% <44.60%> (+34.29%) ⬆️
temurin 62.16% <44.60%> (+0.41%) ⬆️
unittests 62.16% <44.60%> (+0.41%) ⬆️
unittests1 46.69% <44.60%> (-0.20%) ⬇️
unittests2 27.81% <0.00%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Jackie-Jiang Jackie-Jiang force-pushed the server_partitioned_optimization branch 2 times, most recently from af4159f to 0eefa89 Compare May 23, 2024 22:47
public ColumnDataType getFinalResultColumnType() {
return ColumnDataType.LONG_ARRAY;
public LongArrayList mergeFinalResult(LongArrayList finalResult1, LongArrayList finalResult2) {
long[] result = new long[_numSteps];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not directly add finalResult2 to finalResult1 then return finalResult1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

@xiangfu0
Copy link
Contributor

xiangfu0 commented May 25, 2024

Overall LGTM.
I feel these are good function level annotations, maybe we can try to annotate aggregation functions to allow return Final if aggregate column or groupBy key is server partitioned automatically if possible.

@@ -38,6 +38,7 @@
@SuppressWarnings({"rawtypes", "unchecked"})
public abstract class IndexedTable extends BaseTable {
protected final Map<Key, Record> _lookupMap;
protected final boolean _hasFinalInput;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this IndexedTable has multiple aggregations, do you need this _hasFinalInput also to be per agg function or it's just an indicate ?
E.g. SELECT distinctCount(pk), avg(other_column) from myTable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In V1 engine we don't have a way to make per function hint. We can make it more smart in V2 which can carry more information.

@Jackie-Jiang Jackie-Jiang force-pushed the server_partitioned_optimization branch from 0eefa89 to ce47def Compare May 25, 2024 17:49
@Jackie-Jiang Jackie-Jiang force-pushed the server_partitioned_optimization branch from ce47def to 78fee19 Compare May 25, 2024 17:54
@Jackie-Jiang Jackie-Jiang merged commit b6e8135 into apache:master May 25, 2024
19 checks passed
@Jackie-Jiang Jackie-Jiang deleted the server_partitioned_optimization branch May 25, 2024 19:08
gortiz pushed a commit to gortiz/pinot that referenced this pull request Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants