0.4.0

0.4.0 release introduced the theta-sketch based distinct count function, an S3 filesystem plugin, a unified star-tree index implementation, migration from TimeFieldSpec to DateTimeFieldSpec, etc.

Summary

0.4.0 release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.

Notable New Features

Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
- Used time column from table config instead of schema (#5320)
- Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392
- Used DATE_TIME as the primary time column for Pinot tables (#5399)
Supported range queries using indexes (#5240)
Supported complex aggregation functions
- Supported Aggregation functions with multiple arguments (#5261)
- Added api in AggregationFunction to get compiled input expressions (#5339)
Added a simple PinotFS benchmark driver (#5160)
Supported default star-tree (#5147)
Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)
- One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregatinoFunction (#5382)
Added access control for Pinot server segment download api (#5260)
Added Pinot S3 Filesystem Plugin (#5249)
Text search improvement
- Pruned stop words for text index (#5297)
- Used 8byte offsets in chunk based raw index creator (#5285)
- Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)
- Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)
- Made text index query cache a configurable option (#5176)
- Added Lucene DocId to PinotDocId cache to improve performance (#5177)
- Removed the construction of second bitmap in text index reader to improve performance (#5199)
Tooling/usability improvement
- Added template support for Pinot Ingestion Job Spec (#5341)
- Allowed user to specify zk data dir and don't do clean up during zk shutdown (#5295)
- Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)
- Update JVM settings for scripts (#5127)
- Added Stream github events demo (#5189)
- Moved docs link from gitbook to docs.pinot.apache.org (#5193)
Re-implemented ORCRecordReader (#5267)
Evaluated schema transform expressions during ingestion (#5238)
Handled count distinct query in selection list (#5223)
Enabled async processing in pinot broker query api (#5229)
Supported bootstrap mode for table rebalance (#5224)
Supported order-by on BYTES column (#5213)
Added Nightly publish to binary (#5190)
Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)
Supported built-in transform functions (#5312)
- Added date time transform functions (#5326)
Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)
APIs Additions/Changes
- Added a new server api for download of segments
  - /GET /segments/{tableNameWithType}/{segmentName}
Upgraded helix to 0.9.7 (#5411)
Added support to execute functions during query compilation (#5406)
Other notable refactoring
- Moved table config into pinot-spi (#5194)
- Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)
- Added jsonExtractScalar function to extract field from json object (#4597)
- Added template support for Pinot Ingestion Job Spec #5372
- Cleaned up AggregationFunctionContext (#5364)
- Optimized real-time range predicate when cardinality is high (#5331)
- Made PinotOutputFormat use table config and schema to create segments (#5350)
- Tracked unavailable segments in InstanceSelector (#5337)
- Added a new best effort segment uploader with bounded upload time (#5314)
- In SegmentPurger, used table config to generate the segment (#5325)
- Decoupled schema from RecordReader and StreamMessageDecoder (#5309)
- Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)
- Improved GroupBy query performance (#5291)
- Optimized ExpressionFilterOperator (#5132)

Major Bug Fixes

Do not release the PinotDataBuffer when closing the index (#5400)
Handled a no-arg function in query parsing and expression tree (#5375)
Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
Fixed missing error message from pinot-admin command (#5305)
Fixed HDFS copy logic (#5218)
Fixed spark ingestion issue (#5216)
Fixed the capacity of the DistinctTable (#5204)
Fixed various links in the Pinot website

Work in Progress

Upsert: support overriding data in the real-time table (#4261).
- Add pinot upsert features to pinot common (#5175)
Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc

Backward Incompatible Changes

TableConfig no longer support de-serialization from json string of nested json string (i.e. no \" inside the json) (#5194)

The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):

void aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
void aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);

Previous0.5.0 Next0.3.0

Last updated 1 year ago