Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add round-robin logic during downloadSegmentFromPeer #12353

Merged

Conversation

tibrewalpratik17
Copy link
Contributor

@tibrewalpratik17 tibrewalpratik17 commented Feb 2, 2024

label: bugfix

During downloadSegmentFromPeer :

if (isPeerSegmentDownloadEnabled(tableConfig)) {
downloadSegmentFromPeer(segmentName, tableConfig.getValidationConfig().getPeerSegmentDownloadScheme(),
indexLoadingConfig);

We saw that in some scenarios, the download segment request went to the same peer even if the segment had multiple replicas.

Logs:

host1	WARN	Download and move segment tableName__3__7940__20240201T1322Z from peer with scheme http failed.	2024-02-02T04:40:43.510+00:00
host1	WARN	Caught exception while fetching segment from: http://host2:25602/segments/tableName_REALTIME/tableName__3__7940__20240201T1322Z to: /data/upinot/stream-pinot-server/dataDir/tableName_REALTIME/tmp/tmp-tableName__3__7940__20240201T1322Z.1706848841111/tableName__3__7940__20240201T1322Z.tar.gz	2024-02-02T04:40:43.504+00:00
host1	WARN	Caught exception while downloading segment from: http://host2:25602/segments/tableName_REALTIME/tableName__3__7940__20240201T1322Z to: /data/upinot/stream-pinot-server/dataDir/tableName_REALTIME/tmp/tmp-tableName__3__7940__20240201T1322Z.1706848841111/tableName__3__7940__20240201T1322Z.tar.gz	2024-02-02T04:40:43.503+00:00
host2	INFO	Received a request to download segment tableName__3__7940__20240201T1322Z for table tableName_REALTIME	2024-02-02T04:40:43.502+00:00
host1	WARN	Caught exception while downloading segment from: http://host2:25602/segments/tableName_REALTIME/tableName__3__7940__20240201T1322Z to: /data/upinot/stream-pinot-server/dataDir/tableName_REALTIME/tmp/tmp-tableName__3__7940__20240201T1322Z.1706848841111/tableName__3__7940__20240201T1322Z.tar.gz	2024-02-02T04:40:41.583+00:00
host1	WARN	Caught exception while fetching segment from: http://host2:25602/segments/tableName_REALTIME/tableName__3__7940__20240201T1322Z to: /data/upinot/stream-pinot-server/dataDir/tableName_REALTIME/tmp/tmp-tableName__3__7940__20240201T1322Z.1706848841111/tableName__3__7940__20240201T1322Z.tar.gz	2024-02-02T04:40:41.583+00:00
host2	INFO	Received a request to download segment tableName__3__7940__20240201T1322Z for table tableName_REALTIME	2024-02-02T04:40:41.581+00:00
host1	WARN	Caught exception while downloading segment from: http://host2:25602/segments/tableName_REALTIME/tableName__3__7940__20240201T1322Z to: /data/upinot/stream-pinot-server/dataDir/tableName_REALTIME/tmp/tmp-tableName__3__7940__20240201T1322Z.1706848841111/tableName__3__7940__20240201T1322Z.tar.gz	2024-02-02T04:40:41.273+00:00
host1	WARN	Caught exception while fetching segment from: http://host2:25602/segments/tableName_REALTIME/tableName__3__7940__20240201T1322Z to: /data/upinot/stream-pinot-server/dataDir/tableName_REALTIME/tmp/tmp-tableName__3__7940__20240201T1322Z.1706848841111/tableName__3__7940__20240201T1322Z.tar.gz	2024-02-02T04:40:41.273+00:00

This patch adds a round-robin logic to ensure we hit other peers as well to download replica.

@codecov-commenter
Copy link

codecov-commenter commented Feb 2, 2024

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (e1e22bd) 61.74% compared to head (91025ec) 61.78%.

Files Patch % Lines
...ache/pinot/common/utils/RoundRobinURIProvider.java 85.00% 2 Missing and 1 partial ⚠️
...pinot/common/utils/fetcher/BaseSegmentFetcher.java 0.00% 2 Missing ⚠️
...pinot/common/utils/fetcher/HttpSegmentFetcher.java 0.00% 2 Missing ⚠️
...ot/plugin/minion/tasks/SegmentConversionUtils.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12353      +/-   ##
============================================
+ Coverage     61.74%   61.78%   +0.03%     
  Complexity      207      207              
============================================
  Files          2436     2436              
  Lines        133089   133098       +9     
  Branches      20615    20617       +2     
============================================
+ Hits          82172    82228      +56     
+ Misses        44878    44837      -41     
+ Partials       6039     6033       -6     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.72% <68.00%> (+0.03%) ⬆️
java-21 61.63% <68.00%> (+0.01%) ⬆️
skip-bytebuffers-false 61.76% <68.00%> (+0.02%) ⬆️
skip-bytebuffers-true 61.59% <68.00%> (+0.01%) ⬆️
temurin 61.78% <68.00%> (+0.03%) ⬆️
unittests 61.77% <68.00%> (+0.03%) ⬆️
unittests1 46.92% <70.83%> (+0.03%) ⬆️
unittests2 27.74% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tibrewalpratik17
Copy link
Contributor Author

cc @Jackie-Jiang

@ankitsultana
Copy link
Contributor

@Jackie-Jiang : can you also take a look? Thanks

@ankitsultana
Copy link
Contributor

@tibrewalpratik17 : one of the tests is failing here across re-runs: https://github.com/apache/pinot/blob/master/pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/PartialUpsertTableRebalanceIntegrationTest.java#L199

Looks like rebalance is not converging, and the rebalance may be using Segment Fetcher which in turn may be using the RoundRobinURIProvider. Can you take a look?

@tibrewalpratik17
Copy link
Contributor Author

@tibrewalpratik17 : one of the tests is failing here across re-runs: https://github.com/apache/pinot/blob/master/pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/PartialUpsertTableRebalanceIntegrationTest.java#L199
Looks like rebalance is not converging, and the rebalance may be using Segment Fetcher which in turn may be using the RoundRobinURIProvider. Can you take a look?

Looks flaky to me! I rebased and re-ran the workflows. In the current run: https://github.com/apache/pinot/actions/runs/7891406566/job/21535679431?pr=12353, some other integration test failed with this unrelated reason where server-starter failed:

PeerDownloadLLCRealtimeClusterIntegrationTest.setUp:101->BaseRealtimeClusterIntegrationTest.setUp:51->startServer:115->ClusterTest.startServers:270->ClusterTest.startOneServer:278 » Runtime java.io.IOException: Failed to bind to address 0.0.0.0/0.0.0.0:8475

@ankitsultana ankitsultana merged commit 9321eab into apache:master Feb 13, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants