Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: prevent background merges on the realtime lucene index #13050

Merged
merged 1 commit into from
May 3, 2024

Conversation

itschrispeck
Copy link
Collaborator

#12744 changed the Lucene index build process to commit the realtime Lucene index, and reuse it in the immutable segment.

.commit() is called at the beginning of segment conversion, and the Lucene index directory is copied to another location. Since IndexWriter can merge segments in the background even without insertions/updates to the index, these merges non-deterministically can trigger a FileNotFoundException:

java.lang.RuntimeException: Caught exception while instantiating the LuceneTextIndexCreator for column: __mergedTextIndex 
     at org.apache.pinot.segment.local.segment.creator.impl.text.LuceneTextIndexCreator.<init>(LuceneTextIndexCreator.java:149) 
<...>
Caused by: java.lang.RuntimeException: Failed to copy the mutable lucene index: java.io.FileNotFoundException: File system element for parameter 'source' does not exist: '/.../dataDir/table_REALTIME/consumers/table__0__3170__20240430T2313Z/col.lucene.v9.index/_3r.si' 
     at org.apache.pinot.segment.local.segment.creator.impl.text.LuceneTextIndexCreator.convertMutableSegment(LuceneTextIndexCreator.java:192) ...

By using NoMergeScheduler, the async merges will never occur. This patch was tested in our staging/prod env and the intermittent segment build failures disappeared.

tag: bugfix

@codecov-commenter
Copy link

codecov-commenter commented May 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.17%. Comparing base (59551e4) to head (ef290b9).
Report is 396 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13050      +/-   ##
============================================
+ Coverage     61.75%   62.17%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2502      +66     
  Lines        133233   136658    +3425     
  Branches      20636    21179     +543     
============================================
+ Hits          82274    84961    +2687     
- Misses        44911    45397     +486     
- Partials       6048     6300     +252     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.13% <100.00%> (+0.42%) ⬆️
java-21 62.05% <100.00%> (+0.42%) ⬆️
skip-bytebuffers-false 62.15% <100.00%> (+0.40%) ⬆️
skip-bytebuffers-true 62.00% <100.00%> (+34.27%) ⬆️
temurin 62.17% <100.00%> (+0.41%) ⬆️
unittests 62.16% <100.00%> (+0.41%) ⬆️
unittests1 46.69% <0.00%> (-0.20%) ⬇️
unittests2 27.96% <100.00%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Jackie-Jiang
Copy link
Contributor

I saw this test being flaky. Do you think it is related?

Error:  Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.580 s <<< FAILURE! -- in org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest
Error:  org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.testQueries -- Time elapsed: 0.289 s
Error:  org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.testQueryCancellationIsSuccessful -- Time elapsed: 0.032 s <<< FAILURE!
org.testng.TestException: 

The exception was thrown with the wrong message: expected ".*TEXT_MATCH query timeout on realtime consuming segment.*" but got "java.util.concurrent.RejectedExecutionException: java.lang.InterruptedException"
	at org.testng.internal.invokers.ExpectedExceptionsHolder.wrongException(ExpectedExceptionsHolder.java:69)
	at org.testng.internal.invokers.TestInvoker.considerExceptionsInternal(TestInvoker.java:885)
	at org.testng.internal.invokers.TestInvoker.considerExceptions(TestInvoker.java:860)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:741)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:230)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:63)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:992)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:203)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:154)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:134)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at org.testng.TestRunner.privateRun(TestRunner.java:739)
	at org.testng.TestRunner.run(TestRunner.java:614)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:421)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:413)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:373)
	at org.testng.SuiteRunner.run(SuiteRunner.java:312)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1274)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1208)
	at org.testng.TestNG.runSuites(TestNG.java:1112)
	at org.testng.TestNG.run(TestNG.java:1079)
	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:155)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:102)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:91)
	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:137)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.RejectedExecutionException: java.lang.InterruptedException
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.testQueryCancellationIsSuccessful(LuceneMutableTextIndexTest.java:105)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:141)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:686)
	... 27 more
Caused by: java.util.concurrent.RejectedExecutionException: java.lang.InterruptedException
	at org.apache.pinot.common.utils.ScalingThreadPoolExecutor$ForceQueuePolicy.rejectedExecution(ScalingThreadPoolExecutor.java:93)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:145)
	at org.apache.pinot.segment.local.realtime.impl.invertedindex.RealtimeLuceneTextIndex.getDocIds(RealtimeLuceneTextIndex.java:149)
	at org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.lambda$testQueryCancellationIsSuccessful$0(LuceneMutableTextIndexTest.java:103)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly(ReentrantLock.java:159)
	at java.base/java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:372)
	at java.base/java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
	at org.apache.pinot.common.utils.ScalingThreadPoolExecutor$ForceQueuePolicy.rejectedExecution(ScalingThreadPoolExecutor.java:90)
	... 9 more

@itschrispeck
Copy link
Collaborator Author

itschrispeck commented May 3, 2024

I don't think it's related, but is probably caused by this behavior: https://julien.ponge.org/blog/not-all-java-single-threaded-executors-are-created-equal-a-java-finalizer-horror-story/

Did you see the flakiness is Java 21? It should be fixed there, but could misbehave in the suites run with 11/17.

Do you mind if I put the fix into another PR since it's not related to the fix for the reuse feature, or want me to bundle it into this one? (there's a couple other nits I'd rather bundle together if no preference).

@chenboat
Copy link
Contributor

chenboat commented May 3, 2024

A general question: if we turn off merge, will it result in too many Lucene segments for realtime part and cause query perf issues?

@itschrispeck
Copy link
Collaborator Author

A general question: if we turn off merge, will it result in too many Lucene segments for realtime part and cause query perf issues?

In theory it could, but in practice Pinot's typical segment sizes prevent this (>5-10GB segments isn't normal), and we don't see a query perf regression with even a small 10MB buffer size and 1.5GB segment size.

FDs are probably more of a concern for systems w/ low limits, but I think can be addressed if needed w/ lucene buffer size/compound file format/Pinot segment size tuning - all of which have existing configs.

@Jackie-Jiang Jackie-Jiang merged commit 6fb315d into apache:master May 3, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants