Metrics for Realtime Rows Fetched and Stream Consumer Create Exceptions #12522

Merged · 13 commits · Mar 4, 2024

Conversation

@suddendust (Contributor) commented Feb 29, 2024

This PR adds two metrics in RealtimeSegmentDataManager:

1. REALTIME_ROWS_FETCHED: A new meter that tracks the number of rows fetched from the stream partition, i.e. how many rows are in each batch before any processing is done.

An earlier revision of this PR exposed this as a gauge (REALTIME_LAST_FETCHED_BATCH_SIZE) that could take the following values:

- 0: there were no messages in the last batch. This can happen when the stream partition has no messages or a very low volume of messages.
- > 0: messages were fetched in the last batch.
- -1: an exception occurred while fetching the message batch from the stream partition.

The no-messages case could then be detected with max_over_time(pinot_server_realtimeLastFetchedBatchSize_Value{}[15m]), which checks the maximum batch size consumed over the last 15 minutes; if it is 0, no ingestion has happened in that window, and it can be used as a source alert to silence other alerts such as RealtimeIngestionStopped. Based on the review discussion below, the gauge was replaced with the meter.

The meter can be used to detect whether there is no data in the stream partition:

rate(pinot_server_realtimeRowsFetched_Count{}[1m]) == 0

Why can't we reuse pinot_server_realtimeRowsConsumed?

pinot_server_realtimeRowsConsumed tracks the number of rows that were successfully indexed. Rows that fail transformation or indexing are not counted, which makes it hard to compute the total number of rows fetched from the partition stream (see the sketch after this list).

  2. STREAM_CONSUMER_CREATE_EXCEPTIONS: Incremented when we run into exceptions while trying to create a stream consumer.
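
To make the fetched-vs-consumed distinction concrete, here is a small, self-contained Java sketch (an editorial illustration, not Pinot code; the Row type and the sample rows are hypothetical): every row in a batch counts as fetched, but only rows that transform and index successfully count as consumed, so the consumed meter alone cannot distinguish an empty stream from rows being dropped.

import java.util.List;

public class FetchedVsConsumedDemo {
  // Hypothetical stand-in for a fetched batch: every row is "fetched",
  // but only rows that transform/index successfully are "consumed".
  record Row(String payload, boolean indexable) {}

  public static void main(String[] args) {
    List<Row> batch = List.of(
        new Row("ok-1", true),
        new Row("bad-schema", false),   // fails transformation
        new Row("ok-2", true),
        new Row("corrupt", false));     // fails indexing

    long rowsFetched = batch.size();                                   // analogous to REALTIME_ROWS_FETCHED
    long rowsConsumed = batch.stream().filter(Row::indexable).count(); // analogous to realtimeRowsConsumed

    // Prints fetched=4 consumed=2: an alert on the consumed meter alone cannot
    // tell "stream is empty" apart from "rows are being dropped".
    System.out.println("fetched=" + rowsFetched + " consumed=" + rowsConsumed);
  }
}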

Testing:

rate(pinot_server_realtimeRowsFetched_Count{}[1m]):

  • When there were messages in the last fetched batch: [screenshot]
  • When there is no data in the input stream: [screenshot upload did not complete]
  • Exposed metric: [screenshot]

STREAM_CONSUMER_CREATE_EXCEPTIONS: [screenshots]

@codecov-commenter commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 33.33333%, with 16 lines in your changes missing coverage. Please review.

Project coverage is 61.60%. Comparing base (59551e4) to head (8068a1b).
Report is 42 commits behind head on master.

Files Patch % Lines
...a/manager/realtime/RealtimeSegmentDataManager.java 27.27% 16 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12522      +/-   ##
============================================
- Coverage     61.75%   61.60%   -0.15%     
  Complexity      207      207              
============================================
  Files          2436     2451      +15     
  Lines        133233   133646     +413     
  Branches      20636    20689      +53     
============================================
+ Hits          82274    82331      +57     
- Misses        44911    45226     +315     
- Partials       6048     6089      +41     
Flag Coverage Δ
custom-integration1 ?
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 0.00% <0.00%> (-61.71%) ⬇️
java-21 61.60% <33.33%> (-0.03%) ⬇️
skip-bytebuffers-false 34.89% <33.33%> (-26.85%) ⬇️
skip-bytebuffers-true 61.59% <33.33%> (+33.86%) ⬆️
temurin 61.60% <33.33%> (-0.15%) ⬇️
unittests 61.60% <33.33%> (-0.15%) ⬇️
unittests1 46.77% <33.33%> (-0.12%) ⬇️
unittests2 27.64% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.


@gortiz (Contributor) commented Feb 29, 2024

I like the idea, but I'm not sure the current implementation is correct. Specifically, the scenario that concerns me is one where:

  • we fetch batches far more often than the Prometheus polling rate, and
  • the source mostly emits non-empty batches.

In that scenario it seems possible that REALTIME_LAST_FETCHED_BATCH_SIZE would be 0 each time Prometheus polls it, even though we are actually ingesting.

Instead of a gauge, we could have something like REALTIME_FETCHED_ROWS, which would be a counter. This metric would be non-decreasing, and we could just apply rate (or some other operation) to it. If it doesn't increase for a while, we can be sure that there is no data in the source and can therefore skip some alerts.
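
To illustrate the concern, here is a small, self-contained Java simulation (an editorial sketch with hypothetical names, not Pinot code): batches are fetched far more often than the metrics endpoint is polled and most batches are non-empty, yet a last-batch-size gauge can read 0 at every poll while a cumulative counter keeps growing.

import java.util.Random;

public class GaugeVsCounterSampling {
  public static void main(String[] args) {
    Random random = new Random(42);
    long lastBatchSizeGauge = 0;   // gauge: overwritten on every fetch
    long rowsFetchedCounter = 0;   // counter: only ever increases

    // 10 fetches happen between consecutive polls of the metrics endpoint.
    int fetchesPerPoll = 10;
    for (int poll = 1; poll <= 5; poll++) {
      for (int fetch = 0; fetch < fetchesPerPoll; fetch++) {
        // Most batches are non-empty, but in this run the batch fetched right
        // before each poll happens to be empty.
        int batchSize = (fetch == fetchesPerPoll - 1) ? 0 : 100 + random.nextInt(50);
        lastBatchSizeGauge = batchSize;
        rowsFetchedCounter += batchSize;
      }
      // At poll time the gauge looks like nothing is being ingested, while
      // rate()/increase() over the counter would clearly show progress.
      System.out.printf("poll %d: gauge=%d counter=%d%n", poll, lastBatchSizeGauge, rowsFetchedCounter);
    }
  }
}

With the counter, rate() or increase() over any reasonable window stays above zero as long as ingestion is actually happening.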

@suddendust (Contributor, PR author)

That's a valid case.

> If it doesn't increase for a while, we can be sure that there is no data in the source and can therefore skip some alerts.

We can't differentiate between this (valid) case and the case where there are exceptions fetching a batch, although such exceptions can be tracked using REALTIME_CONSUMPTION_EXCEPTIONS.

@ege-st (Contributor) commented Feb 29, 2024

This seems like a situation where a histogram would be the right choice: it would let us do percentile breakdowns of rows read, and I believe it also includes the counter that @gortiz suggested.

@gortiz (Contributor) commented Feb 29, 2024

Small tip: instead of rate you can also use increase. I don't think the performance difference would be important enough to justify changing it, but it may express what you want more cleanly.

Anyway, I think both functions are fine.
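
For reference, a tiny self-contained Java illustration of the relationship between the two functions (simplified: one counter, two samples, no counter resets or extrapolation): increase is the raw growth over the window, while rate is that growth divided by the window length in seconds.

public class RateVsIncrease {
  public static void main(String[] args) {
    // Two samples of a monotonically increasing counter, 60 seconds apart.
    double windowSeconds = 60.0;
    double counterAtStart = 10_000;
    double counterAtEnd = 10_240;

    // increase(m[1m]) ~ total growth over the window.
    double increase = counterAtEnd - counterAtStart;
    // rate(m[1m]) ~ per-second average growth over the window.
    double rate = increase / windowSeconds;

    // Either form works for "did anything get ingested?": both are 0 only
    // when the counter did not move at all during the window.
    System.out.println("increase=" + increase + " rows, rate=" + rate + " rows/s");
  }
}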

@gortiz (Contributor) commented Feb 29, 2024

LGTM. Unless there is some urgency, I'll wait til tomorrow to merge it just in case other people want to review it.

@zhtaoxiang (Contributor) left a comment

LGTM

REALTIME_ROWS_FILTERED("rows", false),
INVALID_REALTIME_ROWS_DROPPED("rows", false),
INCOMPLETE_REALTIME_ROWS_CONSUMED("rows", false),
REALTIME_CONSUMPTION_EXCEPTIONS("exceptions", true),
REALTIME_OFFSET_COMMITS("commits", true),
REALTIME_OFFSET_COMMIT_EXCEPTIONS("exceptions", false),
STREAM_CONSUMER_CREATE_EXCEPTIONS("exceptions", false),
Contributor

this newly created metric is not used anywhere

Contributor (PR author)

That code wasn't checked-in, sorry about that!

Comment on lines +1688 to +1689
_currentOffset = _partitionGroupConsumer.checkpoint(_currentOffset);
closePartitionGroupConsumer();
Contributor

Is it possible that those two lines throw exceptions? If so, then the exception is not related to consumer creation.

Should we only try/catch lines 1690 and 1691?

Contributor (PR author)

If they throw an exception, the consumer wouldn't get created, right? And _currentOffset = _partitionGroupConsumer.checkpoint(_currentOffset); does throw a runtime exception (at least in the code).

Contributor

If they throw an exception, is the old consumer closed? If not, the old consumer is still alive, right?

Contributor

I guess to be sure we should re-throw the exception in the catch part.

Contributor

Can you please help me understand how re-throwing the exception addresses the potential issue I mentioned above?

@suddendust (Contributor, PR author) commented Mar 4, 2024

In:

     1. _currentOffset = _partitionGroupConsumer.checkpoint(_currentOffset);
     2. closePartitionGroupConsumer();
     3. _partitionGroupConsumer =
            _streamConsumerFactory.createPartitionGroupConsumer(_clientId, _partitionGroupConsumptionStatus);
        _partitionGroupConsumer.start(_currentOffset);

if the 1st or 2nd line throws an exception, the 3rd won't be executed. This is a valid case for incrementing this metric, no?

> then the exception is not related to consumer creation.

This metric tracks whether a consumer could not be created for any reason. In this case as well, the consumer isn't created, so we should bump it?

Earlier, I was swallowing this exception, which meant no runtime exception would be bubbled up the call stack. Re-throwing addressed that.

Please correct me if I am wrong.

Contributor

If the 1st or 2nd line throws an exception:

  1. the old consumer is still running, so maybe it can still be used for consuming data?
  2. the error is not really about creating the consumer, so in this case maybe we don't want to increase the metric? WDYT?

Contributor (PR author)

> the old consumer is still running, so maybe it can still be used for consuming data?

I am not sure from the control flow here whether the old consumer can be reused. If it can, then yes, we shouldn't bump the metric. If it can't, then we should.

But look at it this way: we want to be alerted in both cases, because the code tried to create a new consumer and intended to use it, but couldn't for some reason. We definitely want to look into why we couldn't even close the old consumer.
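
For illustration, a self-contained Java sketch of the pattern being discussed (the interfaces, field names, and the recreate method are simplified stand-ins, not the actual RealtimeSegmentDataManager code): any failure while checkpointing, closing the old consumer, or creating and starting the new one bumps the exception meter and is re-thrown rather than swallowed.

import java.util.function.Supplier;

public class ConsumerRecreateSketch {
  // Simplified stand-ins for the stream consumer and the metrics registry.
  interface PartitionGroupConsumer extends AutoCloseable {
    long checkpoint(long currentOffset) throws Exception;
    void start(long offset) throws Exception;
  }
  interface Metrics {
    void add(String meter, long count);
  }

  private PartitionGroupConsumer _consumer;
  private long _currentOffset;
  private final Metrics _metrics;
  private final Supplier<PartitionGroupConsumer> _factory;

  ConsumerRecreateSketch(Metrics metrics, Supplier<PartitionGroupConsumer> factory,
      PartitionGroupConsumer initialConsumer) {
    _metrics = metrics;
    _factory = factory;
    _consumer = initialConsumer;
  }

  void recreateStreamConsumer() {
    try {
      // If checkpointing or closing the old consumer fails, the new consumer
      // is never created either, so the failure is still counted below.
      _currentOffset = _consumer.checkpoint(_currentOffset);
      _consumer.close();
      _consumer = _factory.get();
      _consumer.start(_currentOffset);
    } catch (Exception e) {
      _metrics.add("STREAM_CONSUMER_CREATE_EXCEPTIONS", 1);
      // Re-throw instead of swallowing so the caller sees that recreation failed.
      throw new RuntimeException("Failed to recreate stream consumer", e);
    }
  }
}

The catch deliberately covers the whole path, matching the argument above that the consumer ends up not created in every failure case; whether the metric should also count checkpoint/close failures is exactly the judgment call debated in this thread.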

@@ -77,7 +77,7 @@ rules:
- pattern: "\"org.apache.pinot.common.metrics\"<type=\"ServerMetrics\", name=\"pinot.server.realtimeConsumptionExceptions\"><>(\\w+)"
name: "pinot_server_realtime_consumptionExceptions_$1"
cache: true
- pattern: "\"org.apache.pinot.common.metrics\"<type=\"ServerMetrics\", name=\"pinot.server.([^\\.]*?)_(OFFLINE|REALTIME)\\-(.+)\\-(\\w+).(invalidRealtimeRowsDropped|incompleteRealtimeRowsConsumed|rowsWithErrors|realtimeRowsConsumed)\"><>(\\w+)"
- pattern: "\"org.apache.pinot.common.metrics\"<type=\"ServerMetrics\", name=\"pinot.server.([^\\.]*?)_(OFFLINE|REALTIME)\\-(.+)\\-(\\w+).(invalidRealtimeRowsDropped|incompleteRealtimeRowsConsumed|rowsWithErrors|realtimeRowsConsumed|realtimeRowsFetched|streamConsumerCreateExceptions)\"><>(\\w+)"
Contributor

Not related to this PR, but why do we need to add an enumeration here? I think we could just match any word here to fetch everything.

Contributor (PR author)

Yeah, I didn't want to add another change to this PR that would've increased the regression surface. This will be fixed in another PR.

@@ -427,6 +427,9 @@ protected boolean consumeLoop()
try {
messageBatch =
_partitionGroupConsumer.fetchMessages(_currentOffset, null, _streamConfig.getFetchTimeoutMillis());
// Track realtime rows fetched at the table level. This includes valid + invalid rows.
_serverMetrics.addMeteredTableValue(_clientId, ServerMeter.REALTIME_ROWS_FETCHED,
messageBatch.getMessageCount());
Contributor

I think this is not correct. It should be messageBatch.getUnfilteredMessageCount().

Contributor

From the interface:

  /**
   * @return number of available messages
   */
  int getMessageCount();

  /**
   * @return number of messages returned from the stream
   */
  default int getUnfilteredMessageCount() {
    return getMessageCount();
  }

Contributor (PR author)

Thanks for pointing this out
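
To illustrate the difference, a small self-contained Java example (the FilteringBatch class is hypothetical; the two interface methods mirror the MessageBatch snippet quoted above): a consumer that filters messages during the fetch reports only the surviving rows from getMessageCount(), so metering getUnfilteredMessageCount() is what captures everything pulled from the stream.

public class UnfilteredCountDemo {
  // Mirrors the two methods quoted from the interface above.
  interface MessageBatch {
    int getMessageCount();
    default int getUnfilteredMessageCount() {
      return getMessageCount();
    }
  }

  // Hypothetical batch from a consumer that drops some messages while fetching.
  static final class FilteringBatch implements MessageBatch {
    private final int _returnedFromStream;
    private final int _afterFiltering;

    FilteringBatch(int returnedFromStream, int afterFiltering) {
      _returnedFromStream = returnedFromStream;
      _afterFiltering = afterFiltering;
    }

    @Override
    public int getMessageCount() {
      return _afterFiltering;
    }

    @Override
    public int getUnfilteredMessageCount() {
      return _returnedFromStream;
    }
  }

  public static void main(String[] args) {
    MessageBatch batch = new FilteringBatch(100, 60);
    // Metering getMessageCount() would under-count by the 40 filtered rows;
    // getUnfilteredMessageCount() reflects everything fetched from the stream.
    System.out.println("available=" + batch.getMessageCount()
        + " fetched=" + batch.getUnfilteredMessageCount());
  }
}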

@gortiz (Contributor) left a comment

I think we are not using the correct metric, and we are ignoring the filtered documents.

@gortiz (Contributor) commented Mar 4, 2024

LGTM (this time for sure! :D). I'll wait a bit in case some other committer can take a look

@zhtaoxiang (Contributor) left a comment

LGTM

@xiangfu0 merged commit dbf55b3 into apache:master on Mar 4, 2024
21 checks passed