[Adaptive Server Selector] Add metrics for Stats Manager Queue Size #12340

Merged: 5 commits merged into apache:master on Feb 23, 2024

Conversation

MeihanLi (Contributor)

  1. Add a metric for the ServerRoutingStatsManager executor service queue size.
  2. Add a metric to alert when the ServerRoutingStatsManager executor service queue size exceeds a warn threshold.
  3. Make the warn threshold configurable via pinot.broker.adaptive.server.selector.stats.manager.queue.size.warn.threshold (see the sketch below for roughly how this fits together).
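
For context, below is a minimal, illustrative sketch of how such a queue-size metric could be wired up in ServerRoutingStatsManager. Everything not quoted in the review threads below (the class wrapper, constructor, and field names) is an assumption for readability, and the merged code was simplified further during review, where the warn meter was eventually dropped in favor of the gauge alone:

```java
// Illustrative sketch only -- assembled from the snippets quoted in the review
// threads below, not the exact merged code.
import java.util.concurrent.ThreadPoolExecutor;

import org.apache.pinot.common.metrics.BrokerGauge;
import org.apache.pinot.common.metrics.BrokerMeter;
import org.apache.pinot.common.metrics.BrokerMetrics;

public class ServerRoutingStatsManagerSketch {
  private final ThreadPoolExecutor _executorService;   // runs async stats updates
  private final BrokerMetrics _brokerMetrics;
  // Read from pinot.broker.adaptive.server.selector.stats.manager.queue.size.warn.threshold
  private final int _executorQueueSizeWarnThreshold;

  public ServerRoutingStatsManagerSketch(ThreadPoolExecutor executorService, BrokerMetrics brokerMetrics,
      int executorQueueSizeWarnThreshold) {
    _executorService = executorService;
    _brokerMetrics = brokerMetrics;
    _executorQueueSizeWarnThreshold = executorQueueSizeWarnThreshold;
  }

  private int getQueueSize() {
    // ThreadPoolExecutor exposes its pending-task queue directly.
    return _executorService.getQueue().size();
  }

  private void recordQueueSizeMetrics() {
    int queueSize = getQueueSize();
    // Gauge: current number of queued stats-update tasks.
    _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, queueSize);
    if (queueSize > _executorQueueSizeWarnThreshold) {
      // Meter: counts how often the queue exceeded the configured warn threshold.
      // (Per the review below, this meter was ultimately dropped in favor of the gauge alone.)
      _brokerMetrics.addMeteredGlobalValue(BrokerMeter.ROUTING_STATS_MANAGER_Q_LIMIT_REACHED, 1L);
    }
  }
}
```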

Labels: observability

codecov-commenter commented Jan 30, 2024

Codecov Report

Attention: Patch coverage is 90.00% with 1 line in your changes missing coverage. Please review.

Project coverage is 61.74%. Comparing base (3d7bc6f) to head (949d29b).

| Files | Patch % | Lines |
|---|---|---|
| ...r/spark/common/reader/PinotServerDataFetcher.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files:
@@            Coverage Diff            @@
##             master   #12340   +/-   ##
=========================================
  Coverage     61.74%   61.74%           
  Complexity      207      207           
=========================================
  Files          2436     2436           
  Lines        133219   133225    +6     
  Branches      20635    20635           
=========================================
+ Hits          82251    82261   +10     
+ Misses        44924    44914   -10     
- Partials       6044     6050    +6     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (ø) |
| integration | <0.01% <0.00%> (ø) |
| integration1 | ? |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 61.71% <90.00%> (+0.02%) ⬆️ |
| java-21 | 34.86% <88.88%> (-26.75%) ⬇️ |
| skip-bytebuffers-false | 61.71% <90.00%> (-0.02%) ⬇️ |
| skip-bytebuffers-true | 34.86% <88.88%> (-26.72%) ⬇️ |
| temurin | 61.74% <90.00%> (+<0.01%) ⬆️ |
| unittests | 61.74% <90.00%> (+<0.01%) ⬆️ |
| unittests1 | 46.90% <100.00%> (+0.04%) ⬆️ |
| unittests2 | 27.71% <10.00%> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown.


vvivekiyer (Contributor) left a comment:

Thanks for making this useful change. lgtm overall.

Curious to understand what prompted this change? Have you observed any routing delays with the executor queue?

@@ -102,7 +102,8 @@ public enum BrokerMeter implements AbstractMetrics.Meter {
   NETTY_CONNECTION_BYTES_RECEIVED("nettyConnection", true),

   PROACTIVE_CLUSTER_CHANGE_CHECK("proactiveClusterChangeCheck", true),
-  DIRECT_MEMORY_OOM("directMemoryOOMCount", true);
+  DIRECT_MEMORY_OOM("directMemoryOOMCount", true),
+  STATS_MANAGER_DELAY_UPDATE("statsManagerDelayUpdate", true);
Contributor:
Is this tracking the number of times the queue size exceeded the set threshold? Can we give it a more meaningful name, e.g. ROUTING_STATS_MANAGER_Q_LIMIT_REACHED?

Contributor Author:
Thanks for the suggestion. Addressed
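
Presumably the renamed enum entry then looks something like the following; only the constant name is confirmed by the later snippet, and the metered-value string is a guess:

```java
// Sketch of the renamed BrokerMeter entry; the string name is assumed.
ROUTING_STATS_MANAGER_Q_LIMIT_REACHED("routingStatsManagerQueueLimitReached", true),
```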

@@ -377,4 +390,15 @@ public Double fetchHybridScoreForServer(String server) {
       stats.getServerReadLock().unlock();
     }
   }
+
+  private void alertIfQueueSizeIsAboveWarnThreshold() {
Contributor:
nit: rename alertIfQueueSizeIsAboveWarnThreshold -> recordQueueSizeMetrics()

Contributor Author:
Addressed

@@ -171,6 +183,7 @@ public void recordStatsUponResponseArrival(long requestId, String serverInstance

     _executorService.execute(() -> {
       try {
+        alertIfQueueSizeIsAboveWarnThreshold();
Contributor:
It might not be necessary to collect stats at both points (query submission and query response); submission alone should be sufficient.

Contributor Author:
Addressed. Metrics are now emitted only on query submission.
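
Roughly, that means the metric call now lives only on the submission path, along the lines of the sketch below. The method name and exact placement are illustrative; the point is that recordQueueSizeMetrics() is invoked once per query when the stats-update task is enqueued, not again when the response arrives:

```java
// Sketch: record queue-size metrics on the query-submission path only.
public void recordStatsUponQuerySubmission(long requestId, String serverInstanceId) {
  _executorService.execute(() -> {
    // ... update per-server routing stats for the submitted query ...
  });
  // Emit the gauge (and, at this stage of the review, the warn meter) right after
  // the task is enqueued, instead of on both submission and response arrival.
  recordQueueSizeMetrics();
}
```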

MeihanLi commented Jan 31, 2024

> Thanks for making this useful change. lgtm overall.
>
> Curious to understand what prompted this change? Have you observed any routing delays with the executor queue?

@vvivekiyer Thanks for the feedback. Based on our performance tests (up to 2k QPS per broker in disaster scenarios), we didn't observe significant routing delays. However, we have concerns about delayed/outdated stats, so we want to have the metrics/alerts in place before rolling this feature out to production.

vvivekiyer (Contributor) left a comment:

lgtm. Please address the UT failures.

Comment on lines 398 to 400
LOGGER.warn(String.format("Stats Manager queue size exceeds warn threshold = %d. "
+ "Current queue size = %d, completed task count = %d.",
_executorQueueSizeWarnThreshold, queueSize, getCompletedTaskCount()));
Contributor:
This could lead to a lot of logs in high-QPS use cases, couldn't it?

MeihanLi (Contributor Author) commented Feb 20, 2024:
It could. For high QPS, we can increase the warn threshold as needed or disable the logging if it causes issues.

This diff mainly focuses on gaining visibility into stats updates.

Contributor:
Good call. I think we should remove this log line; we already have metrics for this, and a high stats manager queue size is not really a major issue.

MeihanLi (Contributor Author) commented Feb 23, 2024:
Makes sense, removed the logging.

MeihanLi commented Feb 20, 2024

> lgtm. Please address the UT failures.

Thanks, UTs are fixed. @vvivekiyer can you help merge the PR?

MeihanLi (Contributor Author):

@Jackie-Jiang can you help merge the PR? Thanks!


    int queueSize = getQueueSize();
    _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, queueSize);
    if (queueSize > _executorQueueSizeWarnThreshold) {
      _brokerMetrics.addMeteredGlobalValue(BrokerMeter.ROUTING_STATS_MANAGER_Q_LIMIT_REACHED, 1L);
    }
Contributor:
I think emitting the warning metric is not needed. We already have a gauge. Users should set warning/critical alerts on their end. Though the gauge value will get overwritten frequently and we will only have 1 value per bucket, I think it should be good enough since this is only measuring the stats.

Contributor Author:
Thanks Ankit, addressed
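
With both the warn log line and the warn meter removed, the helper presumably reduces to just setting the gauge, roughly:

```java
// Sketch of the simplified helper after review: gauge only; threshold alerting
// is left to the user's monitoring system.
private void recordQueueSizeMetrics() {
  _brokerMetrics.setValueOfGlobalGauge(BrokerGauge.ROUTING_STATS_MANAGER_QUEUE_SIZE, getQueueSize());
}
```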

ankitsultana merged commit 7f09cc8 into apache:master on Feb 23, 2024
19 checks passed
Jackie-Jiang added the observability and Configuration (Config changes: addition/deletion/change in behavior) labels on Feb 23, 2024
Labels: Configuration (Config changes: addition/deletion/change in behavior), metrics, observability

7 participants