[Adaptive Server Selector] Add metrics for Stats Manager Queue Size #12340
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:
@@            Coverage Diff            @@
##             master   #12340   +/-   ##
=========================================
  Coverage     61.74%   61.74%
  Complexity      207      207
=========================================
  Files          2436     2436
  Lines        133219   133225     +6
  Branches      20635    20635
=========================================
+ Hits          82251    82261    +10
+ Misses        44924    44914    -10
- Partials       6044     6050     +6
Thanks for making this useful change. lgtm overall.
Curious to understand what prompted this change. Have you observed any routing delays with the executor queue?
@@ -102,7 +102,8 @@ public enum BrokerMeter implements AbstractMetrics.Meter {
   NETTY_CONNECTION_BYTES_RECEIVED("nettyConnection", true),

   PROACTIVE_CLUSTER_CHANGE_CHECK("proactiveClusterChangeCheck", true),
-  DIRECT_MEMORY_OOM("directMemoryOOMCount", true);
+  DIRECT_MEMORY_OOM("directMemoryOOMCount", true),
+  STATS_MANAGER_DELAY_UPDATE("statsManagerDelayUpdate", true);
Is this tracking the number of times the queue size exceeded the set threshold? Can we give it a more meaningful name, e.g. ROUTING_STATS_MANAGER_Q_LIMIT_REACHED?
Thanks for the suggestion. Addressed.
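For context, a sketch of what the addressed change might look like in BrokerMeter.java. The constant name ROUTING_STATS_MANAGER_Q_LIMIT_REACHED is taken from the later snippet in this thread; the meter's string name is an assumption following the existing enum pattern.

  // BrokerMeter.java (sketch): renamed from STATS_MANAGER_DELAY_UPDATE per the review comment.
  // Counts how often the stats-manager executor queue size exceeds the configured warn threshold.
  DIRECT_MEMORY_OOM("directMemoryOOMCount", true),
  ROUTING_STATS_MANAGER_Q_LIMIT_REACHED("routingStatsManagerQLimitReached", true);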
@@ -377,4 +390,15 @@ public Double fetchHybridScoreForServer(String server) {
       stats.getServerReadLock().unlock();
     }
   }

+  private void alertIfQueueSizeIsAboveWarnThreshold() {
nit: rename alertIfQueueSizeIsAboveWarnThreshold() -> recordQueueSizeMetrics()
Addressed
@@ -171,6 +183,7 @@ public void recordStatsUponResponseArrival(long requestId, String serverInstance

     _executorService.execute(() -> {
       try {
         alertIfQueueSizeIsAboveWarnThreshold();
Might not be necessary to collect stats at both points (query submission and query response). Only submission should be sufficient.
Addressed. Only emitting metrics on submission now.
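A hedged sketch of what emitting the metric only on query submission could look like in the stats manager. recordStatsAfterQuerySubmission is assumed as the submission-side counterpart of recordStatsUponResponseArrival; its name and abbreviated parameter list may not match the actual code.

  // Sketch: queue-size metrics are recorded once per query submission only.
  public void recordStatsAfterQuerySubmission(long requestId, String serverInstanceId) {
    _executorService.execute(() -> {
      recordQueueSizeMetrics();  // the renamed alertIfQueueSizeIsAboveWarnThreshold()
      // ... update per-server routing stats for the submitted query ...
    });
  }
  // The recordStatsUponResponseArrival() path shown above would drop the metrics call.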
pinot-common/src/main/java/org/apache/pinot/common/metrics/BrokerGauge.java (outdated review thread, resolved)
@vvivekiyer Thanks for the feedback. Based on our performance tests (up to 2k QPS per broker in disaster scenarios), we didn't observe significant routing delays. But we have concerns about delayed/outdated stats, so we want to have the metrics/alerts in place before rolling this feature out to production.
lgtm. Please address the UT failures.
Force-pushed from b66806a to 69664d3
LOGGER.warn(String.format("Stats Manager queue size exceeds warn threshold = %d. "
    + "Current queue size = %d, completed task count = %d.",
    _executorQueueSizeWarnThreshold, queueSize, getCompletedTaskCount()));
This can lead to a lot of logs for high-QPS use cases, can't it?
It could. For high QPS, we can increase the warn threshold as needed, or disable the logging if it causes issues.
This diff mainly focuses on gaining visibility into stats updates.
Good call. I think we should remove this log line. We already have metrics for this, and a high stats manager queue size is not really a major issue.
Makes sense, removed the logging.
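For reference, the warn threshold discussed here appears to correspond to the broker config key listed at the bottom of this page. A hedged example of raising it for a high-QPS broker follows; the value is illustrative, not the actual default.

# broker conf sketch: raise the stats-manager queue-size warn threshold
pinot.broker.adaptive.server.selector.stats.manager.queue.size.warn.threshold=10000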
Force-pushed from 69664d3 to 2a4aa08
Thanks, UTs are fixed. @vvivekiyer can you help merge the PR?
@Jackie-Jiang can you help merge the PR? Thanks!
    int queueSize = getQueueSize();
    _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, queueSize);
    if (queueSize > _executorQueueSizeWarnThreshold) {
      _brokerMetrics.addMeteredGlobalValue(BrokerMeter.ROUTING_STATS_MANAGER_Q_LIMIT_REACHED, 1L);
I think emitting the warning metric is not needed. We already have a gauge; users should set warning/critical alerts on their end. Although the gauge value will get overwritten frequently and we will only have one value per bucket, that should be good enough since this is only measuring the stats.
Thanks Ankit, addressed.
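Putting the two review comments together, a sketch of what the final recordQueueSizeMetrics() might look like once the log line and the warn meter are dropped and only the gauge remains; the method and field names are taken from the snippets above, the exact final body is an assumption.

  // Sketch of the metric-recording method after review: only the gauge is emitted;
  // threshold alerting is left to the user's metrics backend.
  private void recordQueueSizeMetrics() {
    int queueSize = getQueueSize();
    _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, queueSize);
  }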
Force-pushed from 2a4aa08 to 504c5b5
New broker config: pinot.broker.adaptive.server.selector.stats.manager.queue.size.warn.threshold
Label: observability