-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement: Sketch value aggregator performance #13020
Enhancement: Sketch value aggregator performance #13020
Conversation
Use union as intermediate state instead of Sketch.
Use union as intermediate state instead of Sketch.
Use union as intermediate state instead of Sketch.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13020 +/- ##
=============================================
- Coverage 61.75% 35.05% -26.70%
+ Complexity 207 6 -201
=============================================
Files 2436 2427 -9
Lines 133233 133089 -144
Branches 20636 20631 -5
=============================================
- Hits 82274 46653 -35621
- Misses 44911 83044 +38133
+ Partials 6048 3392 -2656
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
CpcSketch input = new CpcSketch(); | ||
IntStream.range(0, 1000).forEach(input::update); | ||
CpcSketch input = new CpcSketch(12); | ||
IntStream.range(0, 100).forEach(input::update); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the impact of this change? decreasing test initialization time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sketch estimate was off by one on the previous configuration. Rather than introducing a delta for the variable error, I decreased the number of inputs making the test more deterministic.
The PR looks fine to me, but it is a bit strange to approve a performance wise PR without an actual benchmark we can reproduce. Could you include a JMH benchmark as part of the PR? |
Sure, that is a fair request. I would propose running the StarTree test under a JMH benchmark under previous and current configuration and including the results in this PR. How does that sound? Or, are you looking for a JMH test as part of the repo please could you point me to an existing test that would serve as a good example. |
@gortiz as a starting point, here are some empirical measurements that I took from a run in our production environment. The build time refers to the StarTree build time for a segment and the number of records refers to the number of records to aggregate. The median differs by a factor of 10. |
This addresses excessive resource consumption and performance problems with the following Apache Datasketches value aggregators:
When merging sketches via union, it is often a costly operation that creates intermediate book-keeping structures. In addition, the operation performs better when the cost is amortized over many sketch inputs instead of pairwise merges on two sketches at a time.
When tested in a production environment, I measured a median reduction of 10x in the time spent creating StarTrees.
Related to issue:
#13019
release-notes