Extract json individual array elements from json index for the transform function jsonExtractIndex #12466

Merged
merged 4 commits into from
Mar 12, 2024

wirybeaver
Contributor

@wirybeaver wirybeaver commented Feb 22, 2024

As an example, consider the following JSON records:

    String[] records = {
        "{\"foo\":[{\"bar\":[\"x\",\"y\"]},{\"bar\":[\"a\",\"b\"]}],\"foo2\":[\"u\"]}",
        "{\"foo\":[{\"bar\":[\"y\",\"z\"]}],\"foo2\":[\"u\"]}"
    };

Given the JSON path $.foo[0].bar[1], we want to extract all values under this path together with their associated doc IDs (not the flattened doc IDs): {"y":[0], "z":[1]}.

With setDisableCrossArrayUnnest(true), the conceptual flattened doc model is as follows:

| Flattened row | Flattened doc ID | Doc ID |
|---|---|---|
| {foo[0].bar[0]:x, foo2[0]:u} | 0 | 0 |
| {foo[0].bar[1]:y, foo2[0]:u} | 1 | 0 |
| {foo[1].bar[0]:a, foo2[0]:u} | 2 | 0 |
| {foo[1].bar[1]:b, foo2[0]:u} | 3 | 0 |
| {foo[0].bar[0]:y, foo2[0]:u} | 4 | 1 |
| {foo[0].bar[1]:z, foo2[0]:u} | 5 | 1 |

The data structure of json indexing below

Solution
Step 1: Get the flattened doc IDs associated with the given JSON path. We reuse the existing code that processes JSON paths containing array indices. In the example above, the array index keys are .foo.$index\u00000 and .foo..bar.$index\u00001, so the bitmap is the intersection of {0,1,4,5} and {1,3,5}, which is {1,5}.

Step 2: Traverse the inverted index dictionary and find all keys with the prefix .foo..bar.\u0000. If a key's bitmap has a non-empty intersection with the bitmap from Step 1, parse the value from the key and add the intersection to the result map.

.foo..bar.\u0000a -> {2}; intersect {1,5} => empty
.foo..bar.\u0000b -> {3}; intersect {1,5} => empty
.foo..bar.\u0000x -> {0}; intersect {1,5} => empty
.foo..bar.\u0000y -> {1,4}; intersect {1,5} => {1}
.foo..bar.\u0000z -> {5}; intersect {1,5} => {5}

Step 3: Translate the flattened doc IDs into the original doc IDs by looking up the docIdMapping:
{"y":[1], "z":[5]} => {"y":[0], "z":[1]}
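
The three steps can be sketched on the example data as follows. This is a minimal toy model under stated assumptions, not the PR's implementation: the class and method names (`JsonIndexLookupSketch`, `extract`, `exampleIndex`) are hypothetical, `java.util.BitSet` is swapped in for the Roaring bitmaps so that the block needs only the JDK, the key/value separator is assumed to be the NUL character, and the toy index holds only the posting lists quoted in the steps above.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the three-step lookup. All names here are hypothetical, and
// java.util.BitSet stands in for the Roaring bitmaps used by the real index.
class JsonIndexLookupSketch {
  // Assumed key/value separator: the NUL character.
  static final char SEP = '\u0000';

  static BitSet bits(int... ids) {
    BitSet b = new BitSet();
    for (int id : ids) {
      b.set(id);
    }
    return b;
  }

  // Posting lists for the example: index key -> flattened doc IDs.
  static TreeMap<String, BitSet> exampleIndex() {
    TreeMap<String, BitSet> index = new TreeMap<>();
    index.put(".foo.$index" + SEP + "0", bits(0, 1, 4, 5));
    index.put(".foo..bar.$index" + SEP + "1", bits(1, 3, 5));
    index.put(".foo..bar." + SEP + "a", bits(2));
    index.put(".foo..bar." + SEP + "b", bits(3));
    index.put(".foo..bar." + SEP + "x", bits(0));
    index.put(".foo..bar." + SEP + "y", bits(1, 4));
    index.put(".foo..bar." + SEP + "z", bits(5));
    return index;
  }

  static Map<String, BitSet> extract(TreeMap<String, BitSet> index, int[] docIdMapping,
      String valuePrefix, String... arrayIndexKeys) {
    // Step 1: intersect the posting lists of all array index keys.
    BitSet matching = null;
    for (String key : arrayIndexKeys) {
      BitSet posting = index.getOrDefault(key, new BitSet());
      if (matching == null) {
        matching = (BitSet) posting.clone();
      } else {
        matching.and(posting);
      }
    }
    Map<String, BitSet> result = new TreeMap<>();
    // Step 2: scan the sorted keys that start with the value prefix.
    for (Map.Entry<String, BitSet> e : index.tailMap(valuePrefix, true).entrySet()) {
      if (!e.getKey().startsWith(valuePrefix)) {
        break; // keys are sorted, so the prefix range has ended
      }
      BitSet hits = (BitSet) e.getValue().clone();
      if (matching != null) {
        hits.and(matching);
      }
      if (hits.isEmpty()) {
        continue;
      }
      // Step 3: map flattened doc IDs back to original doc IDs.
      BitSet docIds = new BitSet();
      for (int f = hits.nextSetBit(0); f >= 0; f = hits.nextSetBit(f + 1)) {
        docIds.set(docIdMapping[f]);
      }
      result.put(e.getKey().substring(valuePrefix.length()), docIds);
    }
    return result;
  }

  public static void main(String[] args) {
    int[] docIdMapping = {0, 0, 0, 0, 1, 1}; // flattened doc ID -> doc ID
    Map<String, BitSet> result = extract(exampleIndex(), docIdMapping,
        ".foo..bar." + SEP,
        ".foo.$index" + SEP + "0",
        ".foo..bar.$index" + SEP + "1");
    System.out.println(result); // prints {y={0}, z={1}}
  }
}
```

Cloning each posting list before intersecting keeps the index itself unmodified, which is the same concern behind the `clone()` call in the diff further down.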

@@ -209,41 +210,11 @@ private RoaringBitmap getMatchingFlattenedDocIds(Predicate predicate) {
} else {
key = JsonUtils.KEY_SEPARATOR + key;
}

Contributor Author

@wirybeaver wirybeaver Feb 22, 2024

Note: The processing of array indices is moved to the static function processArrayIndex for code reuse.

@wirybeaver wirybeaver marked this pull request as draft February 22, 2024 05:22
@codecov-commenter

codecov-commenter commented Feb 22, 2024

Codecov Report

Attention: Patch coverage is 69.56522%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 61.77%. Comparing base (59551e4) to head (12ddc72).
Report is 101 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...t/index/readers/json/ImmutableJsonIndexReader.java | 59.64% | 19 Missing and 4 partials ⚠️ |
| ...local/realtime/impl/json/MutableJsonIndexImpl.java | 90.90% | 1 Missing and 2 partials ⚠️ |
| ...rm/function/JsonExtractIndexTransformFunction.java | 0.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12466      +/-   ##
============================================
+ Coverage     61.75%   61.77%   +0.02%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2450      +14     
  Lines        133233   133781     +548     
  Branches      20636    20753     +117     
============================================
+ Hits          82274    82643     +369     
- Misses        44911    45034     +123     
- Partials       6048     6104      +56     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.74% <69.56%> (+0.03%) ⬆️
java-21 61.65% <69.56%> (+0.03%) ⬆️
skip-bytebuffers-false 61.76% <69.56%> (+0.01%) ⬆️
skip-bytebuffers-true 61.63% <69.56%> (+33.90%) ⬆️
temurin 61.77% <69.56%> (+0.02%) ⬆️
unittests 61.77% <69.56%> (+0.02%) ⬆️
unittests1 46.88% <0.00%> (-0.01%) ⬇️
unittests2 27.70% <69.56%> (-0.04%) ⬇️

key = key.substring(1);
} else {
key = JsonUtils.KEY_SEPARATOR + key;
}

// Process the array index within the key if exists
Contributor Author

@wirybeaver wirybeaver Feb 26, 2024

Note: The key preprocessing stays here, but the generation of matchingDocIds for JSON paths with array indices is moved to the static function processArrayIndex.

@wirybeaver wirybeaver marked this pull request as ready for review February 26, 2024 03:36
@Jackie-Jiang
Contributor

@itschrispeck Can you please help take a look?

  for (int dictId = dictIds[0]; dictId < dictIds[1]; dictId++) {
  // get docIds from posting list, convert these to the actual docIds
  ImmutableRoaringBitmap flattenedDocIds = _invertedIndex.getDocIds(dictId);
- PeekableIntIterator it = flattenedDocIds.getIntIterator();
+ PeekableIntIterator it = arrayIndexFlattenDocIds == null ? flattenedDocIds.getIntIterator()
+ : intersect(arrayIndexFlattenDocIds.clone(), flattenedDocIds);
Collaborator

RoaringBitmap.and(arrayIndexFlattenDocIds, flattenedDocIds).getIntIterator()

Contributor Author

@wirybeaver wirybeaver Mar 8, 2024

Thanks. RoaringBitmap.and() works for MutableJsonIndex but not for ImmutableJsonIndexReader.
RoaringBitmap.and() requires both inputs to be of type RoaringBitmap.
Neither MutableRoaringBitmap (arrayIndexFlattenDocIds) nor ImmutableRoaringBitmap (flattenedDocIds) is a subclass of RoaringBitmap.

Collaborator

How about MutableRoaringBitmap.and(arrayIndexFlattenDocIds, flattenedDocIds)?

@wirybeaver
Contributor Author

The integration test fails on a multistage query unrelated to JSON. The temurin-11 run succeeds.

@wirybeaver wirybeaver changed the title Add the support to extract json array elements from json index for the transform function jsonExtractIndex Extract json individual elements from json index for the transform function jsonExtractIndex Mar 11, 2024
@wirybeaver wirybeaver changed the title Extract json individual elements from json index for the transform function jsonExtractIndex Extract json individual array elements from json index for the transform function jsonExtractIndex Mar 11, 2024
@Jackie-Jiang Jackie-Jiang merged commit f25f889 into apache:master Mar 12, 2024
19 checks passed
4 participants