Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: Sketch value aggregator performance #13020

Merged

Conversation

davecromberge
Copy link
Member

@davecromberge davecromberge commented Apr 29, 2024

This addresses excessive resource consumption and performance problems with the following Apache Datasketches value aggregators:

  • Theta
  • CPC
  • Tuple (Integer)

When merging sketches via union, it is often a costly operation that creates intermediate book-keeping structures. In addition, the operation performs better when the cost is amortized over many sketch inputs instead of pairwise merges on two sketches at a time.

When tested in a production environment, I measured a median reduction of 10x in the time spent creating StarTrees.

Related to issue:
#13019

release-notes

  • Signature changes to Theta, Tuple and CPC value aggregators; now use Object as the aggregated value type

Use union as intermediate state instead of Sketch.
Use union as intermediate state instead of Sketch.
Use union as intermediate state instead of Sketch.
@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2024

Codecov Report

Attention: Patch coverage is 0% with 129 lines in your changes are missing coverage. Please review.

Project coverage is 35.05%. Comparing base (59551e4) to head (a504757).
Report is 375 commits behind head on master.

Files Patch % Lines
...gator/DistinctCountThetaSketchValueAggregator.java 0.00% 52 Missing ⚠️
...regator/DistinctCountCPCSketchValueAggregator.java 0.00% 40 Missing ⚠️
.../aggregator/IntegerTupleSketchValueAggregator.java 0.00% 37 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #13020       +/-   ##
=============================================
- Coverage     61.75%   35.05%   -26.70%     
+ Complexity      207        6      -201     
=============================================
  Files          2436     2427        -9     
  Lines        133233   133089      -144     
  Branches      20636    20631        -5     
=============================================
- Hits          82274    46653    -35621     
- Misses        44911    83044    +38133     
+ Partials       6048     3392     -2656     
Flag Coverage Δ
custom-integration1 ?
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 0.00% <0.00%> (-61.71%) ⬇️
java-21 35.05% <0.00%> (-26.58%) ⬇️
skip-bytebuffers-false 35.04% <0.00%> (-26.71%) ⬇️
skip-bytebuffers-true 35.01% <0.00%> (+7.28%) ⬆️
temurin 35.05% <0.00%> (-26.70%) ⬇️
unittests 46.57% <0.00%> (-15.18%) ⬇️
unittests1 46.57% <0.00%> (-0.32%) ⬇️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@davecromberge davecromberge marked this pull request as draft April 29, 2024 11:10
@davecromberge davecromberge marked this pull request as ready for review April 29, 2024 13:02
@Jackie-Jiang Jackie-Jiang added release-notes Referenced by PRs that need attention when compiling the next release notes performance labels Apr 29, 2024
CpcSketch input = new CpcSketch();
IntStream.range(0, 1000).forEach(input::update);
CpcSketch input = new CpcSketch(12);
IntStream.range(0, 100).forEach(input::update);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the impact of this change? decreasing test initialization time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sketch estimate was off by one on the previous configuration. Rather than introducing a delta for the variable error, I decreased the number of inputs making the test more deterministic.

@gortiz
Copy link
Contributor

gortiz commented Apr 30, 2024

The PR looks fine to me, but it is a bit strange to approve a performance wise PR without an actual benchmark we can reproduce. Could you include a JMH benchmark as part of the PR?

@davecromberge
Copy link
Member Author

davecromberge commented May 1, 2024

The PR looks fine to me, but it is a bit strange to approve a performance wise PR without an actual benchmark we can reproduce. Could you include a JMH benchmark as part of the PR?

Sure, that is a fair request. I would propose running the StarTree test under a JMH benchmark under previous and current configuration and including the results in this PR. How does that sound? Or, are you looking for a JMH test as part of the repo please could you point me to an existing test that would serve as a good example.

@davecromberge
Copy link
Member Author

@gortiz as a starting point, here are some empirical measurements that I took from a run in our production environment. The build time refers to the StarTree build time for a segment and the number of records refers to the number of records to aggregate.

Screenshot 2024-05-01 at 08 53 20 Screenshot 2024-05-01 at 08 53 15

The median differs by a factor of 10.

@Jackie-Jiang Jackie-Jiang merged commit ad70686 into apache:master May 1, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants