© 2020 SPLUNK INC.
Interactive Querying of
Streams Using Apache
Pulsar™
Jerry Peng
Pulsar Summit | June 2020
Principal Software Engineer | jerryp@splunk.com
Apache {Pulsar, Heron, Storm} committer and PMC member
Agenda 1) General use cases
2) Existing architectures
3) Apache Pulsar overview
4) Pulsar SQL
5) Concrete use case (zhaopin.com)
6) Demo!
7) Questions?
What are Streams?
Continuous flows of data…
Almost all data originate in this form
Interactive Querying of Streams?
Querying both latest and historical data
How is it useful?
● Speed (i.e. data-driven processing)
○ Act faster
● Accuracy
○ In many contexts the wrong decision will be made without visibility into the
most current data
○ For example, historical data may predict that a user is interested in buying a
particular item, but if the analytics don’t also know that the user purchased
that item two minutes ago, they will make the wrong recommendation
● Simplification
○ Single place to go to access current and historical data
Debugging
● Errors and exceptions
● Troubleshooting systems and
networks
● Have we seen these errors before?
General use cases
Monitoring (Audit logs)
● Answering the “What, When, Who,
Why”
● Suspicious access patterns
● Example
○ Auditing CDC logs in financial institutions
General use cases
Exploring
● Raw or enriched data
● Really simplifies access if data is all in
one location
General use cases
Lots of use cases
● Data analytics
● Business Intelligence
● Real-time dashboards
● etc…
General use cases
Stream processing patterns
[Diagram: Messaging (data ingestion) → Compute (data processing / querying) → Storage (data storage, results storage) → Data serving]
Existing Solutions
[Diagram: Messaging (e.g. Cloud Pub/Sub) feeds a real-time compute layer (Apache Storm, Apache Flink, Apache Heron, etc.); data streams are stored in HDFS or cloud storage; querying is handled by Apache Hadoop MR, Apache Spark, Presto, etc.]
Problems with existing solutions
● Multiple Systems
● Duplication of data
○ Data consistency. Where is the source of truth?
● Latency between data ingestion and when data is queryable
THIS IS WHERE APACHE PULSAR
AND PULSAR SQL COME IN…
Apache Pulsar™
Flexible Messaging + Streaming System
backed by durable log storage
Apache Pulsar as an Event Store
Apache Pulsar Overview
Architecture
Multi-layer, scalable architecture
Independent layers for processing, serving and storage
Messaging and processing built on Apache Pulsar
Storage built on Apache BookKeeper
[Diagram: producers and consumers connect to the messaging layer of brokers; function processing runs on workers; event storage is a cluster of bookies]
Segment Centric
Storage
● In addition to partitioning, messages are
stored in segments (based on time and
size)
● Segments are independent of each
other and spread across all storage
nodes
● What this means for Pulsar SQL:
○ Allows the SQL engine to read from
multiple bookies and leverage the disk
I/O and bandwidth of multiple machines,
even if the data is in one partition
Writes
● Every segment/ledger has an ensemble
● Each entry in a ledger has a
○ Write quorum
■ Nodes of the ensemble to which it is written (usually all)
○ Ack quorum
■ Nodes of the write quorum that must respond for that
entry to be acknowledged (usually a majority)
● What this means for Pulsar SQL:
○ Allows users to configure the number of
replicas the SQL engine can read from
○ Trade-off between read bandwidth and
storage cost
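Ensemble size and quorums can be set per namespace through the admin API. Below is a minimal, hedged sketch using the Java admin client; the endpoint and the exact values (ensemble 3, write quorum 3, ack quorum 2) are illustrative assumptions, not settings recommended by this talk.

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class SetPersistenceExample {
    public static void main(String[] args) throws Exception {
        // Illustrative local admin endpoint; adjust to your cluster.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Ensemble of 3 bookies, write quorum 3, ack quorum 2,
            // no mark-delete rate limit.
            admin.namespaces().setPersistence("public/default",
                    new PersistencePolicies(3, 3, 2, 0.0));
        }
    }
}

A larger write quorum gives the SQL engine more replicas to read from in parallel, at the cost of extra storage.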
Apache BookKeeper™ Internals
● Separate IO path for reads and writes
● Optimized for writing, tailing reads,
catch-up reads
● What this means for Pulsar SQL:
○ Queries often involve scanning
the data
○ The read-ahead cache in BookKeeper
allows for fast sequential reads
Tiered Storage
Unlimited topic storage capacity
Achieves the true “stream-storage”:
keep the raw data forever in stream
form
Tiered Storage
● Leverage cloud storage services to offload cold data — Completely transparent to clients
● Extremely cost effective — Backends: S3 (GCS and HDFS coming)
● Example: Retain all data for 1 month — Offload all messages older than 1 day to S3
● What this means for Pulsar SQL:
○ Pulsar SQL can query not only data stored in bookies but also data offloaded
to a cloud storage service
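Offloading can be automated per namespace through the admin API. A hedged sketch with the Java admin client follows; the endpoint and the 1 GB threshold are illustrative assumptions, and the offload driver (e.g. an S3 bucket and credentials) must already be configured on the brokers.

import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadThresholdExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")  // illustrative admin endpoint
                .build()) {
            // Once a topic's backlog in this namespace exceeds ~1 GB,
            // older segments are offloaded to the configured tiered-storage backend.
            admin.namespaces().setOffloadThreshold("public/default", 1024L * 1024 * 1024);
        }
    }
}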
Schema Registry
● Store information on the data structure —
Stored in BookKeeper
● Enforce data types on topic
● Allow for compatible schema evolution
● JSON, Avro, and Protobuf supported
● What this means for Pulsar SQL:
○ Allows data to be structured so
that it becomes queryable via
SQL
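To make the connection concrete, here is a minimal, hedged sketch of a Java producer publishing with a typed schema; the Tweet class, topic name, and service URL are illustrative assumptions. Producing with Schema.JSON registers the schema on the topic, so Pulsar SQL can expose the fields as columns.

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class TweetProducerExample {
    // Hypothetical record type; the registered schema is derived from these fields.
    public static class Tweet {
        public String author;
        public String content;
    }

    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build()) {
            // The JSON schema for Tweet is attached to the topic on creation.
            Producer<Tweet> producer = client.newProducer(Schema.JSON(Tweet.class))
                    .topic("persistent://public/default/tweets")
                    .create();
            Tweet t = new Tweet();
            t.author = "jerry";
            t.content = "hello pulsar sql";
            producer.send(t);
            producer.close();
        }
    }
}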
Pulsar SQL
Interactive SQL queries over data stored in Pulsar
Query old and real-time data
Pulsar SQL / 2
● Based on Presto by Facebook — https://prestodb.io/
● Presto is a distributed query execution engine
● Fetches the data from multiple sources (HDFS, S3, MySQL, …)
● Full SQL compatibility
Pulsar SQL / 3
● Pulsar connector for Presto
○ Read data directly from BookKeeper — bypass Pulsar Broker
■ Can also read data offloaded to Tiered Storage (S3, GCS, etc.)
○ Many-to-many data reads
■ Data is split even on a single partition — multiple workers can read data in
parallel from a single Pulsar partition
■ Time-based indexing — Use “publishTime” in predicates to reduce the data
being read from disk
Pulsar SQL Architecture
Benefits
● Do not need to move data into another
system for querying
● Read data in parallel
○ Performance not impacted by
partitioning
○ Increase throughput by increasing
write quorum
● Newly arrived data can be queried
immediately
Compared to other message buses?
● Other messaging platforms have Presto integrations
● They typically use a consumer to read data from the brokers
● A topic/partition is served by a single broker (limiting disk I/O and
network bandwidth)
User interaction
Connect with CLI client
$./bin/pulsar sql
List Pulsar cluster
presto> show catalogs;
Catalog
---------
pulsar
system
(2 rows)
List Pulsar namespaces
presto> show schemas in pulsar;
Schema
-----------------------
information_schema
public/default
public/functions
sample/standalone/ns1
List Pulsar topics
presto> show tables in pulsar."public/default";
Table
----------------
generator_test
(1 row)
Pulsar SQL
User interaction
Query data in topic
presto> select * from pulsar."public/default".generator_test;
firstname | middlename | lastname | email | username | password | telephonenumber | age | companyemail |
-------------+-------------+-------------+----------------------------------+--------------+----------+-----------------+-----+-------------------------------------+
Genesis | Katherine | Wiley | genesis.wiley@gmail.com | genesisw | y9D2dtU3 | 959-197-1860 | 71 | genesis.wiley@interdemconsulting.eu |
Brayden | | Stanton | brayden.stanton@yahoo.com | braydens | ZnjmhXik | 220-027-867 | 81 | brayden.stanton@supermemo.eu |
Benjamin | Julian | Velasquez | benjamin.velasquez@yahoo.com | benjaminv | 8Bc7m3eb | 298-377-0062 | 21 | benjamin.velasquez@hostesltd.biz |
Michael | Thomas | Donovan | donovan@mail.com | michaeld | OqBm9MLs | 078-134-4685 | 55 | michael.donovan@memortech.eu |
Brooklyn | Avery | Roach | brooklynroach@yahoo.com | broach | IxtBLafO | 387-786-2998 | 68 | brooklyn.roach@warst.biz |
Skylar | | Bradshaw | skylarbradshaw@yahoo.com | skylarb | p6eC6cKy | 210-872-608 | 96 | skylar.bradshaw@flyhigh.eu |
...
Pulsar SQL
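Beyond the CLI, any Presto-compatible client can run the same queries. The sketch below uses the standard Presto JDBC driver (presto-jdbc on the classpath); the coordinator address, port 8081, and the user name are assumptions based on a local standalone SQL worker, and the table is the generator_test topic from the example above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PulsarSqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed coordinator address of a local Pulsar SQL (Presto) worker.
        String url = "jdbc:presto://localhost:8081";
        try (Connection conn = DriverManager.getConnection(url, "test-user", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT firstname, age FROM pulsar.\"public/default\".generator_test LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("firstname") + " " + rs.getInt("age"));
            }
        }
    }
}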
Demo
Performance
Setup
• 3 Nodes
• 12 CPU cores
• 128 GB RAM
• 2 X 1.2 TB NVMe disks
Results
• JSON (Compressed)
• ~60 Million Rows / Second
• Avro (Compressed)
• ~50 Million Rows / Second
Improving query efficiency
● Query by partition
○ Scanning a large amount of data may be costly and time-consuming
○ If the data is keyed and hashed to a specific partition, you can query just that partition
○ For example, if you have tweets keyed by author ingested into Pulsar:
SELECT tweet.author, tweet.content
FROM pulsar."public/default".tweets
WHERE tweet.author = 'jerry' AND __partition__ = 1
● Query by publish time
○ Ledgers/segments are naturally sorted by publish time
○ Only data within the specified publish-time range will be read
○ Select a range of publish times to minimize the data that needs to be read
SELECT tweet.author, tweet.content
FROM pulsar."public/default".tweets
WHERE tweet.author = 'jerry' AND __partition__ = 1 AND __publish_time__ > timestamp '2020-06-15 09:00:00'
Case study: Job search analytics at
zhaopin.com
Background
● About ZhaoPin
○ Chinese job search website (similar to LinkedIn, Indeed, etc.)
● Background
○ ZhaoPin is already a heavy user of Apache Pulsar
○ Using Pulsar to power their enterprise event bus
■ Data involving job position searches, job posts, and resume searches
Case study: Job search analytics at zhaopin.com
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
● Debugging search results
○ “When the search results do not meet expectations…”
● Analyzing and improving search results
○ “Analyze the search criteria associated with a position that a job
seeker applied for, such as when the position was first exposed
to that user, in order to improve the search service.”
● Analyzing search logs
○ “Analyze search logs from different perspectives and generate
charts that summarize data in different ways, such as by city,
vocation, or keyword ranking. In this way, the search service can
be improved by making it more specific.”
Use cases
Case study: Job search analytics at zhaopin.com
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
● ZhaoPin was already using Pulsar
● Pulsar SQL allows queries using standard SQL syntax
● Pulsar can store large amounts of data and is
easy to scale up
Why Pulsar SQL
Case study: Job search analytics at zhaopin.com
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
Quick Start guide:
https://pulsar.apache.org/docs/en/sql-getting-started/
How to get started?
● Performance tuning
● Store data in columnar format
○ Improve compression ratio
○ Materialize relevant columns
● Support different indices
Future work
Questions?
Email: jerryp@splunk.com