Add offset based lag metrics #13298
Conversation
force-pushed from 76229e5 to ea59d41
force-pushed from ea59d41 to 3396af7
Codecov Report
Attention: Patch coverage is

@@             Coverage Diff              @@
##            master    #13298      +/-  ##
=============================================
- Coverage    61.75%    35.21%   -26.54%
=============================================
  Files         2436      2469       +33
  Lines       133233    136105     +2872
  Branches     20636     21059      +423
=============================================
- Hits         82274     47933    -34341
- Misses       44911     84618    +39707
+ Partials      6048      3554     -2494
lgtm otherwise.
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
@@ -174,6 +188,22 @@ private long getIngestionDelayMs(long ingestionTimeMs) {
    return agedIngestionDelayMs;
  }

  private long getPartitionOffsetLag(IngestionOffsets offset) {
    if (offset == null) {
Curious, can this even be null?
Yeah, we pass null when we want to reset the metric.
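For illustration, a minimal sketch of that "null means reset" convention. The holder and field names below are invented for illustration; only the getPartitionOffsetLag name and the null check appear in the diff.

  // Hedged sketch, not the PR's actual code.
  final class OffsetLagResetSketch {
    static final class IngestionOffsets {
      final long _currentOffset;
      final long _latestUpstreamOffset;
      IngestionOffsets(long currentOffset, long latestUpstreamOffset) {
        _currentOffset = currentOffset;
        _latestUpstreamOffset = latestUpstreamOffset;
      }
    }

    long getPartitionOffsetLag(IngestionOffsets offset) {
      if (offset == null) {
        // Reset path: callers pass null when a partition stops being tracked, so the gauge
        // goes back to 0 instead of freezing at the last observed lag.
        return 0;
      }
      return Math.max(0, offset._latestUpstreamOffset - offset._currentOffset);
    }
  }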
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
    // Compute aged delay for current partition
    // TODO: Support other types of offsets
    if (!(msgOffset instanceof LongMsgOffset && latestOffset instanceof LongMsgOffset)) {
      return 0;
What are other types that can fall here? Basically we need to be able to explain this metric.
Other than this, Kinesis returns a BigInteger offset, and Pulsar returns a string offset composed of 3 longs.
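For context, a hedged sketch of what the guarded long-offset computation could look like, assuming Pinot's LongMsgOffset exposes the raw long via getOffset(); the helper class itself is an assumption, not the PR's code. Kinesis sequence numbers (arbitrary-precision integers) and Pulsar message ids (composites of ledger id, entry id, and partition) fail the instanceof check and fall back to 0 until dedicated support is added.

  import org.apache.pinot.spi.stream.LongMsgOffset;
  import org.apache.pinot.spi.stream.StreamPartitionMsgOffset;

  final class OffsetLagComputationSketch {
    static long computeOffsetLag(StreamPartitionMsgOffset msgOffset, StreamPartitionMsgOffset latestOffset) {
      if (!(msgOffset instanceof LongMsgOffset && latestOffset instanceof LongMsgOffset)) {
        return 0; // non-numeric offset types cannot be compared by simple subtraction
      }
      long lag = ((LongMsgOffset) latestOffset).getOffset() - ((LongMsgOffset) msgOffset).getOffset();
      return Math.max(0, lag); // guard against a slightly stale "latest offset" snapshot
    }
  }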
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
* Add offset based lag metrics
* Add tests
* Refactoring

Co-authored-by: Kartik Khare <[email protected]>
We are making one extra metadata read per message batch. Have you verified the overhead? Should we add a config to turn this on/off?
We're upgrading to 1.2 now, and this increased our Kafka offset requests in our QA environment by 3 orders of magnitude. I think at minimum we need a way to turn this off (internally we're just commenting the metric out). But we likely need a way to ensure offset requests to Kafka don't scale with the number of clusters/tables/partitions being consumed.
The other big issue I see is that if we fail to request the latest offset, we stop publishing both the time-based and offset-based ingestion lag.
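One way to avoid that coupling, sketched under hypothetical helper names (none of this is the PR's actual code): publish the time-based delay first, then treat the latest-offset fetch as best-effort.

  import java.util.function.LongSupplier;

  // Hedged sketch: the time-based gauge is published unconditionally, and a failed
  // latest-offset fetch only skips the offset-based gauge. All names are illustrative.
  final class DecoupledLagSketch {
    void updateIngestionMetrics(long ingestionDelayMs, long currentOffset, LongSupplier latestOffsetFetcher) {
      publishTimeBasedLag(ingestionDelayMs); // always published, independent of the fetch below
      try {
        long latestOffset = latestOffsetFetcher.getAsLong();
        publishOffsetLag(Math.max(0, latestOffset - currentOffset));
      } catch (RuntimeException e) {
        // Metadata read failed: skip only the offset-based gauge for this update
        // rather than dropping both metrics for the partition.
      }
    }

    private void publishTimeBasedLag(long delayMs) { /* emit time-based gauge */ }
    private void publishOffsetLag(long lag) { /* emit offset-based gauge */ }
  }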
Hi @jadami10, thanks for bringing this up. The earlier metric was really noisy since it relies on the time column value instead of ingestion time, which led to false positives. Would making this metric configurable help? That way you'd be able to disable it without code changes. Let me also see if there's a way to reduce the frequency.
That sounds like a misuse of the
Only if we don't have a way to cap frequency. And it should be off by default.
I think a key part here is we need to cap the frequency. For large Pinot deployments, you may have thousands of tables and hundreds of thousands of partitions consumed. So the baseline is O(100k) calls. But adding a new table consuming N partitions shouldn't add N more calls. We effectively need a global throttle, though I don't think there's a way to prevent starvation with large enough scale.
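A sketch of the kind of guard being asked for, combining an on/off flag with a server-wide Guava RateLimiter shared across all tables and partitions; the class and setting names are hypothetical, not Pinot configs.

  import com.google.common.util.concurrent.RateLimiter;
  import java.util.OptionalLong;
  import java.util.function.LongSupplier;

  // Hedged sketch: one limiter per server caps the total rate of upstream metadata reads,
  // and a flag disables them entirely. All names are illustrative.
  final class ThrottledOffsetFetcherSketch {
    private final boolean _offsetLagMetricEnabled;
    private final RateLimiter _metadataFetchLimiter; // shared, not per-partition

    ThrottledOffsetFetcherSketch(boolean offsetLagMetricEnabled, double maxMetadataFetchesPerSecond) {
      _offsetLagMetricEnabled = offsetLagMetricEnabled;
      _metadataFetchLimiter = RateLimiter.create(maxMetadataFetchesPerSecond);
    }

    // Returns the latest upstream offset, or empty when the metric is disabled or the
    // server-wide budget is already used up, so callers can skip the gauge update.
    OptionalLong maybeFetchLatestOffset(LongSupplier fetcher) {
      if (!_offsetLagMetricEnabled || !_metadataFetchLimiter.tryAcquire()) {
        return OptionalLong.empty();
      }
      return OptionalLong.of(fetcher.getAsLong());
    }
  }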
Adds new methods and a metric that reports ingestion delay as the difference in offsets.
This only works for KafkaConsumer because Kinesis would need methods to fetch the latest offset from upstream, which are not present.