[Adaptive Server Selector] Add metrics for Stats Manager Queue Size #12340

Merged: 5 commits merged into apache:master on Feb 23, 2024

Conversation

MeihanLi (Contributor)

  1. Add a metric for the ServerRoutingStatsManager executor service queue size.
  2. Add a metric to alert when the ServerRoutingStatsManager executor service queue size exceeds a warn threshold.
  3. Make the warn threshold configurable via pinot.broker.adaptive.server.selector.stats.manager.queue.size.warn.threshold (see the sketch below for roughly how this fits together).
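
For context, below is a minimal, illustrative sketch of how such a queue-size metric could be wired up in ServerRoutingStatsManager. Everything not quoted in the review threads below (the class wrapper, constructor, and field names) is an assumption for readability, and the merged code was simplified further during review, where the warn meter was eventually dropped in favor of the gauge alone:

```java
// Illustrative sketch only -- assembled from the snippets quoted in the review
// threads below, not the exact merged code.
import java.util.concurrent.ThreadPoolExecutor;

import org.apache.pinot.common.metrics.BrokerGauge;
import org.apache.pinot.common.metrics.BrokerMeter;
import org.apache.pinot.common.metrics.BrokerMetrics;

public class ServerRoutingStatsManagerSketch {
  private final ThreadPoolExecutor _executorService;   // runs async stats updates
  private final BrokerMetrics _brokerMetrics;
  // Read from pinot.broker.adaptive.server.selector.stats.manager.queue.size.warn.threshold
  private final int _executorQueueSizeWarnThreshold;

  public ServerRoutingStatsManagerSketch(ThreadPoolExecutor executorService, BrokerMetrics brokerMetrics,
      int executorQueueSizeWarnThreshold) {
    _executorService = executorService;
    _brokerMetrics = brokerMetrics;
    _executorQueueSizeWarnThreshold = executorQueueSizeWarnThreshold;
  }

  private int getQueueSize() {
    // ThreadPoolExecutor exposes its pending-task queue directly.
    return _executorService.getQueue().size();
  }

  private void recordQueueSizeMetrics() {
    int queueSize = getQueueSize();
    // Gauge: current number of queued stats-update tasks.
    _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, queueSize);
    if (queueSize > _executorQueueSizeWarnThreshold) {
      // Meter: counts how often the queue exceeded the configured warn threshold.
      // (Per the review below, this meter was ultimately dropped in favor of the gauge alone.)
      _brokerMetrics.addMeteredGlobalValue(BrokerMeter.ROUTING_STATS_MANAGER_Q_LIMIT_REACHED, 1L);
    }
  }
}
```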

Labels: observability

codecov-commenter commented Jan 30, 2024

Codecov Report

Attention: Patch coverage is 90.00% with 1 line in your changes missing coverage. Please review.

Project coverage is 61.74%. Comparing base (3d7bc6f) to head (949d29b).

| Files | Patch % | Lines |
|---|---|---|
| ...r/spark/common/reader/PinotServerDataFetcher.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files:
@@            Coverage Diff            @@
##             master   #12340   +/-   ##
=========================================
  Coverage     61.74%   61.74%           
  Complexity      207      207           
=========================================
  Files          2436     2436           
  Lines        133219   133225    +6     
  Branches      20635    20635           
=========================================
+ Hits          82251    82261   +10     
+ Misses        44924    44914   -10     
- Partials       6044     6050    +6     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (ø) |
| integration | <0.01% <0.00%> (ø) |
| integration1 | ? |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 61.71% <90.00%> (+0.02%) ⬆️ |
| java-21 | 34.86% <88.88%> (-26.75%) ⬇️ |
| skip-bytebuffers-false | 61.71% <90.00%> (-0.02%) ⬇️ |
| skip-bytebuffers-true | 34.86% <88.88%> (-26.72%) ⬇️ |
| temurin | 61.74% <90.00%> (+<0.01%) ⬆️ |
| unittests | 61.74% <90.00%> (+<0.01%) ⬆️ |
| unittests1 | 46.90% <100.00%> (+0.04%) ⬆️ |
| unittests2 | 27.71% <10.00%> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown.


vvivekiyer (Contributor) left a comment:

Thanks for making this useful change. lgtm overall.

Curious to understand what prompted this change? Have you observed any routing delays with the executor queue?

@@ -102,7 +102,8 @@ public enum BrokerMeter implements AbstractMetrics.Meter {
   NETTY_CONNECTION_BYTES_RECEIVED("nettyConnection", true),

   PROACTIVE_CLUSTER_CHANGE_CHECK("proactiveClusterChangeCheck", true),
-  DIRECT_MEMORY_OOM("directMemoryOOMCount", true);
+  DIRECT_MEMORY_OOM("directMemoryOOMCount", true),
+  STATS_MANAGER_DELAY_UPDATE("statsManagerDelayUpdate", true);
Contributor:
Is this tracking the number of times the queue size exceeded the set threshold? Can we give it a more meaningful name, e.g. ROUTING_STATS_MANAGER_Q_LIMIT_REACHED?

Contributor Author:
Thanks for the suggestion. Addressed
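
Presumably the renamed enum entry then looks something like the following; only the constant name is confirmed by the later snippet, and the metered-value string is a guess:

```java
// Sketch of the renamed BrokerMeter entry; the string name is assumed.
ROUTING_STATS_MANAGER_Q_LIMIT_REACHED("routingStatsManagerQueueLimitReached", true),
```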

@@ -377,4 +390,15 @@ public Double fetchHybridScoreForServer(String server) {
       stats.getServerReadLock().unlock();
     }
   }
+
+  private void alertIfQueueSizeIsAboveWarnThreshold() {
Contributor:
nit: rename alertIfQueueSizeIsAboveWarnThreshold -> recordQueueSizeMetrics()

Contributor Author:
Addressed

@@ -171,6 +183,7 @@ public void recordStatsUponResponseArrival(long requestId, String serverInstance

     _executorService.execute(() -> {
       try {
+        alertIfQueueSizeIsAboveWarnThreshold();
Contributor:
It might not be necessary to collect stats at both points (query submission and query response); submission alone should be sufficient.

Contributor Author:
Addressed. Metrics are now emitted only on query submission.
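
Roughly, that means the metric call now lives only on the submission path, along the lines of the sketch below. The method name and exact placement are illustrative; the point is that recordQueueSizeMetrics() is invoked once per query when the stats-update task is enqueued, not again when the response arrives:

```java
// Sketch: record queue-size metrics on the query-submission path only.
public void recordStatsUponQuerySubmission(long requestId, String serverInstanceId) {
  _executorService.execute(() -> {
    // ... update per-server routing stats for the submitted query ...
  });
  // Emit the gauge (and, at this stage of the review, the warn meter) right after
  // the task is enqueued, instead of on both submission and response arrival.
  recordQueueSizeMetrics();
}
```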

MeihanLi commented Jan 31, 2024

> Thanks for making this useful change. lgtm overall.
>
> Curious to understand what prompted this change? Have you observed any routing delays with the executor queue?

@vvivekiyer Thanks for the feedback. Based on our performance tests (up to 2k QPS per broker in disaster scenarios), we didn't observe significant routing delays. However, we have concerns about delayed/outdated stats, so we want to have the metrics/alerts in place before rolling this feature out to production.

vvivekiyer (Contributor) left a comment:

lgtm. Please address the UT failures.

Comment on lines 398 to 400
LOGGER.warn(String.format("Stats Manager queue size exceeds warn threshold = %d. "
+ "Current queue size = %d, completed task count = %d.",
_executorQueueSizeWarnThreshold, queueSize, getCompletedTaskCount()));
Contributor:
This could lead to a lot of logs in high-QPS use cases, couldn't it?

MeihanLi (Contributor Author) commented Feb 20, 2024:
It could. For high QPS, we can increase the warn threshold as needed or disable the logging if it causes issues.

This diff mainly focuses on gaining visibility into stats updates.

Contributor:
Good call. I think we should remove this log line; we already have metrics for this, and a high stats manager queue size is not really a major issue.

MeihanLi (Contributor Author) commented Feb 23, 2024:
Makes sense, removed the logging.

MeihanLi commented Feb 20, 2024

> lgtm. Please address the UT failures.

Thanks, UTs are fixed. @vvivekiyer can you help merge the PR?

MeihanLi (Contributor Author):

@Jackie-Jiang can you help merge the PR? Thanks!


    int queueSize = getQueueSize();
    _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, queueSize);
    if (queueSize > _executorQueueSizeWarnThreshold) {
      _brokerMetrics.addMeteredGlobalValue(BrokerMeter.ROUTING_STATS_MANAGER_Q_LIMIT_REACHED, 1L);
    }
Contributor:
I think emitting the warning metric is not needed. We already have a gauge. Users should set warning/critical alerts on their end. Though the gauge value will get overwritten frequently and we will only have 1 value per bucket, I think it should be good enough since this is only measuring the stats.

Contributor Author:
Thanks Ankit, addressed
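
With both the warn log line and the warn meter removed, the helper presumably reduces to just setting the gauge, roughly:

```java
// Sketch of the simplified helper after review: gauge only; threshold alerting
// is left to the user's monitoring system.
private void recordQueueSizeMetrics() {
  _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, getQueueSize());
}
```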

ankitsultana merged commit 7f09cc8 into apache:master on Feb 23, 2024
19 checks passed
Jackie-Jiang added the observability and Configuration (Config changes: addition/deletion/change in behavior) labels on Feb 23, 2024
Labels: Configuration (Config changes: addition/deletion/change in behavior), metrics, observability

7 participants