Improved segment build time for Lucene text index realtime to offline conversion #12744

itschrispeck · 2024-03-28T22:24:39Z

Motivation/implementation doc: https://docs.google.com/document/d/1Leo4mQvR-6Gscseq50oqSnddX8Y3up6fnysKaly-eAU/edit?usp=sharing

Internally we've seen roughly a 40-60% improvement in overall segment build time. In the ingestion delay graph (image not preserved), the lower peaks are from a table/tenant with this change; the higher peaks are from an identical table in a tenant without this change.

Testing: deployed internally, local testing, validated basic pause/restart/reload operations on a table to ensure no regression in TextIndexHandler index build.

tags: ingestion performance

codecov-commenter · 2024-03-28T23:04:02Z

Codecov Report

Attention: Patch coverage is 86.20690%, with 16 lines in your changes missing coverage. Please review.

Project coverage is 62.24%. Comparing base (59551e4) to head (2bddff1).
Report is 298 commits behind head on master.

Files	Patch %	Lines
...ment/creator/impl/text/LuceneTextIndexCreator.java	86.95%	7 Missing and 2 partials ⚠️
...me/impl/invertedindex/RealtimeLuceneTextIndex.java	42.85%	4 Missing ⚠️
...ot/segment/spi/creator/SegmentGeneratorConfig.java	50.00%	2 Missing ⚠️
.../pinot/segment/spi/index/mutable/MutableIndex.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #12744      +/-   ##
============================================
+ Coverage     61.75%   62.24%   +0.49%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2503      +67     
  Lines        133233   136497    +3264     
  Branches      20636    21117     +481     
============================================
+ Hits          82274    84964    +2690     
- Misses        44911    45248     +337     
- Partials       6048     6285     +237

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`62.18% <86.20%> (+0.47%)`	⬆️
java-21	`62.10% <86.20%> (+0.48%)`	⬆️
skip-bytebuffers-false	`62.22% <86.20%> (+0.47%)`	⬆️
skip-bytebuffers-true	`62.08% <86.20%> (+34.35%)`	⬆️
temurin	`62.24% <86.20%> (+0.49%)`	⬆️
unittests	`62.24% <86.20%> (+0.49%)`	⬆️
unittests1	`46.70% <10.34%> (-0.19%)`	⬇️
unittests2	`28.02% <75.86%> (+0.29%)`	⬆️

...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java

chenboat · 2024-04-04T23:21:59Z

...in/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java

@@ -74,7 +85,8 @@ public static HashSet<String> getDefaultEnglishStopWordsSet() {
   * @param column column name
   * @param segmentIndexDir segment index directory
   * @param commit true if the index should be committed (at the end after all documents have
-   *               been added), false if index should not be committed
+   *               been added), false if index should not be
+   * @param sortedDocIds sortedDocIds from segment conversion


Same as my earlier comment on the naming. Also, since this is a public method, it needs more explanation and an example.
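To illustrate the kind of example the reviewer asks for, here is a minimal sketch of what a sortedDocIds array could look like. It rests on an assumption the diff above does not confirm: that entry i holds the mutable-segment doc ID of the row that ends up at position i once the segment is sorted on a column during conversion. The class and method names are hypothetical and are not Pinot code.

```java
import java.util.Arrays;

// Hypothetical illustration; not Pinot code.
final class SortedDocIdsSketch {

  /**
   * Returns the doc IDs of the mutable segment ordered by the sorted column.
   * Assumption: result[i] is the mutable doc ID stored at position i of the converted segment.
   */
  static int[] sortedDocIds(String[] sortedColumnValues) {
    Integer[] docIds = new Integer[sortedColumnValues.length];
    for (int i = 0; i < docIds.length; i++) {
      docIds[i] = i;
    }
    // Arrays.sort on an object array is stable, so equal values keep ingestion order.
    Arrays.sort(docIds, (a, b) -> sortedColumnValues[a].compareTo(sortedColumnValues[b]));
    int[] result = new int[docIds.length];
    for (int i = 0; i < result.length; i++) {
      result[i] = docIds[i];
    }
    return result;
  }

  public static void main(String[] args) {
    // Rows were ingested as doc 0 = "c", doc 1 = "a", doc 2 = "b".
    int[] ids = sortedDocIds(new String[] {"c", "a", "b"});
    System.out.println(Arrays.toString(ids)); // prints [1, 2, 0]
  }
}
```

Under that assumption, a mapping of this shape would let an index built in ingestion order be consulted for a converted row by translating the doc ID instead of re-indexing the text.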

...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java

...rc/main/java/org/apache/pinot/segment/local/realtime/converter/RealtimeSegmentConverter.java

chenboat · 2024-04-09T18:43:28Z

...java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java

+      _indexCreator.getIndexWriter().commit();
+    } catch (Exception e) {
+      LOGGER.error("Failed to commit the realtime lucene text index for column {}, exception {}", _column,
+          e.getMessage());


why not just log the exception e itself?

itschrispeck · 2024-04-10

didn't quite follow this, can you elaborate?

chenboat · 2024-04-13

I meant LOGGER.error("Failed...", _column, e);
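The difference the reviewer is pointing at can be sketched as follows. This example uses java.util.logging so it runs without dependencies; that is a substitution, since the `{}` placeholders in the diff suggest an SLF4J-style logger, where a trailing Throwable argument is treated as the exception to log. The class, method, and column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical illustration using java.util.logging in place of the PR's logger.
final class CommitLoggingSketch {
  private static final Logger LOGGER = Logger.getLogger(CommitLoggingSketch.class.getName());

  /** Collects log records in memory so the example can inspect them. */
  static final class CapturingHandler extends Handler {
    final List<LogRecord> records = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
      records.add(record);
    }

    @Override
    public void flush() {
    }

    @Override
    public void close() {
    }
  }

  /** Logs only the message text; the stack trace and cause chain are lost. */
  static void logMessageOnly(String column, Exception e) {
    LOGGER.log(Level.SEVERE, "Failed to commit index for column " + column + ", exception " + e.getMessage());
  }

  /** Passes the exception itself, so handlers receive the full Throwable. */
  static void logWithThrowable(String column, Exception e) {
    LOGGER.log(Level.SEVERE, "Failed to commit index for column " + column, e);
  }

  /** Runs the action and returns the records it logged. */
  static List<LogRecord> capture(Runnable action) {
    LOGGER.setUseParentHandlers(false); // keep the demo output off the console
    CapturingHandler handler = new CapturingHandler();
    LOGGER.addHandler(handler);
    try {
      action.run();
    } finally {
      LOGGER.removeHandler(handler);
    }
    return handler.records;
  }

  public static void main(String[] args) {
    Exception failure = new IllegalStateException("disk full");
    LogRecord messageOnly = capture(() -> logMessageOnly("myColumn", failure)).get(0);
    LogRecord withThrowable = capture(() -> logWithThrowable("myColumn", failure)).get(0);
    System.out.println(messageOnly.getThrown() == null);      // prints true
    System.out.println(withThrowable.getThrown() == failure); // prints true
  }
}
```

Only the second form hands the Throwable to the logging backend, which is what allows the stack trace to appear in the server logs.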

...in/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java

chenboat · 2024-04-09T19:02:05Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableIndex.java

+
+  /**
+   * Commits the mutable index artifacts to disk. This is used in preparation for realtime segment conversion.
+   * commit() should perform any required actions before using mutable segment artifacts to optimize immutable


The comments on lines 65-66 seem self-contradictory: is committing to disk just one possible action? The implementation could choose other actions for preparation.

itschrispeck · 2024-04-10

Committing to disk should be the main mechanism for index reuse for any index using this path; disk is an intermediary. "Other actions" refers to anything a MutableIndex implementation might want to do before writing its artifacts to disk.
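The contract described in this exchange can be sketched as an interface with an optional commit step. This is a simplified illustration under assumptions: the interface, class, and helper names are hypothetical, the default no-op body is a guess at a reasonable design, and a plain text file stands in for real index artifacts. It is not the Pinot MutableIndex API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable index interface; not the Pinot MutableIndex API.
interface SketchMutableIndex {
  void add(String value);

  /** Prepares on-disk artifacts for reuse; a no-op for indexes with nothing to reuse. */
  default void commit() {
  }
}

final class DiskBackedSketchIndex implements SketchMutableIndex {
  private final Path _artifact;
  private final List<String> _buffer = new ArrayList<>();

  DiskBackedSketchIndex(Path artifact) {
    _artifact = artifact;
  }

  @Override
  public void add(String value) {
    _buffer.add(value);
  }

  @Override
  public void commit() {
    // Any implementation-specific preparation ("other actions") would run here, before the write.
    try {
      Files.write(_artifact, _buffer, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Commits the given values to a temp file and reads them back, as a converter might. */
  static List<String> commitAndReadBack(List<String> values) {
    try {
      Path artifact = Files.createTempFile("sketch-index", ".txt");
      SketchMutableIndex index = new DiskBackedSketchIndex(artifact);
      values.forEach(index::add);
      index.commit();
      List<String> onDisk = Files.readAllLines(artifact, StandardCharsets.UTF_8);
      Files.delete(artifact);
      return onDisk;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch the disk file is the intermediary: the mutable side writes it in commit(), and the conversion side only reads what was committed.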

klsince · 2024-07-09T22:42:20Z

Hi @itschrispeck, is there a feature flag to disable this optimization?

After upgrading to recent code, we found heap usage was higher than before. The server logs showed "Reusing the realtime lucene index for segment", which led me here. A heap dump showed lucene.index.SegmentCommitInfo as a top consumer of heap space, with instances mainly referenced by RealtimeLuceneTextIndex.

num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       5481198      686979272  [B ([email protected])
   2:       9944219      477322512  java.util.HashMap ([email protected])
   3:       4496967      467684568  org.apache.lucene.index.SegmentCommitInfo
   4:       8873625      283956000  java.util.HashMap$Node ([email protected])
...

We didn't config TextIndex for those tables across the upgrades, so I'd assume this optimization was enabled by default. So I'd like to check if there is a feature flag to disable this, so we can validate if this had caused the higher heap usage.

If there is no such feature flag, I can try to add one for your review. I'm thinking of adding one around here:

    // Optimization for realtime segment conversion
    // Note: this looks hard-coded to enable the optimization when committing a mutable segment.
    if (dataSource instanceof RealtimeSegmentSegmentCreationDataSource) {
      _config.setRealtimeConversion(true);
      _config.setConsumerDir(((RealtimeSegmentSegmentCreationDataSource) dataSource).getConsumerDir());
    }

cc @Jackie-Jiang
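One shape the proposed flag could take is sketched below. Everything here is hypothetical: the config class, the `reuseEnabled` parameter, and the `configure` helper are invented for illustration and are not existing Pinot options. The thread does not say how such a flag was or would be implemented.

```java
import java.io.File;

// Hypothetical classes for illustration; not the Pinot SegmentGeneratorConfig API.
final class SketchGeneratorConfig {
  private boolean _realtimeConversion;
  private File _consumerDir;

  void setRealtimeConversion(boolean realtimeConversion) {
    _realtimeConversion = realtimeConversion;
  }

  boolean isRealtimeConversion() {
    return _realtimeConversion;
  }

  void setConsumerDir(File consumerDir) {
    _consumerDir = consumerDir;
  }

  File getConsumerDir() {
    return _consumerDir;
  }
}

final class ReuseFlagSketch {
  /**
   * Enables the conversion optimization only when the data comes from a mutable segment
   * AND the (hypothetical) flag allows it; otherwise the config is left untouched.
   */
  static SketchGeneratorConfig configure(boolean fromMutableSegment, boolean reuseEnabled, File consumerDir) {
    SketchGeneratorConfig config = new SketchGeneratorConfig();
    if (fromMutableSegment && reuseEnabled) {
      config.setRealtimeConversion(true);
      config.setConsumerDir(consumerDir);
    }
    return config;
  }

  public static void main(String[] args) {
    File dir = new File("consumers");
    System.out.println(configure(true, true, dir).isRealtimeConversion());  // prints true
    System.out.println(configure(true, false, dir).isRealtimeConversion()); // prints false
  }
}
```

Guarding the existing branch this way would let an operator fall back to a full index rebuild and compare heap usage with and without the reuse path.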

itschrispeck marked this pull request as draft March 28, 2024 23:24

itschrispeck force-pushed the lucene_testing branch from 4c42c6d to 5e0d5c2 Compare April 1, 2024 19:08

itschrispeck marked this pull request as ready for review April 1, 2024 19:54

wirybeaver mentioned this pull request Apr 2, 2024

Speed up immutable json index build time #12774

Open

chenboat reviewed Apr 4, 2024

...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java (outdated)

chenboat reviewed Apr 4, 2024

Jackie-Jiang added ingestion performance real-time enhancement labels Apr 8, 2024

chenboat reviewed Apr 9, 2024

...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java (outdated)

chenboat reviewed Apr 9, 2024

...rc/main/java/org/apache/pinot/segment/local/realtime/converter/RealtimeSegmentConverter.java (outdated)

chenboat reviewed Apr 9, 2024

...in/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java (outdated)

chenboat reviewed Apr 9, 2024

itschrispeck requested a review from chenboat April 12, 2024 02:12

itschrispeck added 6 commits April 12, 2024 12:58

reuse mutable lucene index during segment conversion

2388793

realtime segment conversion only

9de28bb

add RealtimeSegmentConverter test for index reuse path

60b5386

clarify naming

2ce3e28

fix missed renaming

c3dd605

address comments, close all resources

2bddff1

itschrispeck force-pushed the lucene_testing branch from ce8f690 to 2bddff1 Compare April 12, 2024 21:32

chenboat approved these changes Apr 13, 2024

chenboat merged commit 2c6a84b into apache:master Apr 15, 2024
18 of 19 checks passed

SabrinaZhaozyf mentioned this pull request Apr 30, 2024

Fix Lucene text index realtime to offline conversion #13039

Closed

itschrispeck mentioned this pull request May 2, 2024

bugfix: prevent background merges on the realtime lucene index #13050

Merged
