© 2020 SPLUNK INC.
Interactive Querying of
Streams Using Apache
Pulsar™
Jerry Peng
Pulsar Summit | June 2020
Principal Software Engineer | jerryp@splunk.com
Apache {Pulsar, Heron, Storm} committer and PMC member
Agenda 1) General use cases
2) Existing architectures
3) Apache Pulsar overview
4) Pulsar SQL
5) Concrete use case (zhaopin.com)
6) Demo!
7) Questions?
What are Streams?
Continuous flows of data…
Almost all data originate in this form
Interactive Querying of Streams?
Querying both latest and historical data
How is it useful?
● Speed (i.e. data-driven processing)
○ Act faster
● Accuracy
○ In many contexts the wrong decision will be made without visibility into the
most current data
○ For example, historical data may predict that a user is interested in buying a
particular item, but if the analytics don’t also know that the user purchased
that item two minutes ago, they will make the wrong recommendation
● Simplification
○ Single place to go to access current and historical data
Debugging
● Errors and exceptions
● Troubleshooting systems and
networks
● Have we seen these errors before?
General use cases
Monitoring (Audit logs)
● Answering the “What, When, Who,
Why”
● Suspicious access patterns
● Example
○ Auditing CDC logs in financial institutions
General use cases
Exploring
● Raw or enriched data
● Really simplifies access if data is all in
one location
General use cases
Lots of use cases
● Data analytics
● Business Intelligence
● Real-time dashboards
● etc…
General use cases
Stream processing patterns
[Diagram: Messaging (data ingestion) → Compute (data processing / querying) → Storage (data storage, results storage) → Data serving]
Existing Solutions
[Diagram: Messaging (e.g. Cloud Pub/Sub) feeds a real-time compute layer (Apache Storm, Apache Flink, Apache Heron, etc.); data streams are stored in HDFS or cloud storage; querying is handled by Apache Hadoop MR, Apache Spark, Presto, etc.]
Problems with existing solutions
● Multiple Systems
● Duplication of data
○ Data consistency. Where is the source of truth?
● Latency between data ingestion and when data is queryable
THIS IS WHERE APACHE PULSAR
AND PULSAR SQL COME IN…
Apache Pulsar™
Flexible Messaging + Streaming System
backed by durable log storage
Apache Pulsar as an Event Store
Apache Pulsar Overview
Architecture
Multi-layer, scalable architecture
Independent layers for processing, serving and storage
Messaging and processing built on Apache Pulsar
Storage built on Apache BookKeeper
[Diagram: producers and consumers connect to the messaging layer of brokers; function processing runs on workers; event storage is a cluster of bookies]
Segment Centric
Storage
● In addition to partitioning, messages are
stored in segments (based on time and
size)
● Segments are independent of each
other and spread across all storage
nodes
● What this means for Pulsar SQL:
○ Allows the SQL engine to read from
multiple bookies and leverage the disk
I/O and bandwidth of multiple machines,
even if the data is in one partition
Writes
● Every segment/ledger has an ensemble
● Each entry in a ledger has a
○ Write quorum
■ Nodes of the ensemble to which it is written (usually all)
○ Ack quorum
■ Nodes of the write quorum that must respond for that
entry to be acknowledged (usually a majority)
● What this means for Pulsar SQL:
○ Allows users to configure the number of
replicas the SQL engine can read from
○ Trade-off between read bandwidth and
storage cost
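Ensemble size and quorums can be set per namespace through the admin API. Below is a minimal, hedged sketch using the Java admin client; the endpoint and the exact values (ensemble 3, write quorum 3, ack quorum 2) are illustrative assumptions, not settings recommended by this talk.

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class SetPersistenceExample {
    public static void main(String[] args) throws Exception {
        // Illustrative local admin endpoint; adjust to your cluster.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Ensemble of 3 bookies, write quorum 3, ack quorum 2,
            // no mark-delete rate limit.
            admin.namespaces().setPersistence("public/default",
                    new PersistencePolicies(3, 3, 2, 0.0));
        }
    }
}

A larger write quorum gives the SQL engine more replicas to read from in parallel, at the cost of extra storage.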
Apache BookKeeper™ Internals
● Separate IO path for reads and writes
● Optimized for writing, tailing reads,
catch-up reads
● What this means for Pulsar SQL:
○ Queries often involve scanning
the data
○ The read-ahead cache in BookKeeper
allows for fast sequential reads
Tiered Storage
Unlimited topic storage capacity
Achieves the true “stream-storage”:
keep the raw data forever in stream
form
Tiered Storage
● Leverage cloud storage services to offload cold data — Completely transparent to clients
● Extremely cost effective — Backends: S3 (GCS and HDFS coming)
● Example: Retain all data for 1 month — Offload all messages older than 1 day to S3
● What this means for Pulsar SQL:
○ Pulsar SQL can query not only data stored in bookies but also data offloaded
to a cloud storage service
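Offloading can be automated per namespace through the admin API. A hedged sketch with the Java admin client follows; the endpoint and the 1 GB threshold are illustrative assumptions, and the offload driver (e.g. an S3 bucket and credentials) must already be configured on the brokers.

import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadThresholdExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")  // illustrative admin endpoint
                .build()) {
            // Once a topic's backlog in this namespace exceeds ~1 GB,
            // older segments are offloaded to the configured tiered-storage backend.
            admin.namespaces().setOffloadThreshold("public/default", 1024L * 1024 * 1024);
        }
    }
}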
Schema Registry
● Store information on the data structure —
Stored in BookKeeper
● Enforce data types on topic
● Allow for compatible schema evolution
● JSON, Avro, and Protobuf supported
● What this means for Pulsar SQL:
○ Allows data to be structured so
that it becomes queryable via
SQL
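To make the connection concrete, here is a minimal, hedged sketch of a Java producer publishing with a typed schema; the Tweet class, topic name, and service URL are illustrative assumptions. Producing with Schema.JSON registers the schema on the topic, so Pulsar SQL can expose the fields as columns.

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class TweetProducerExample {
    // Hypothetical record type; the registered schema is derived from these fields.
    public static class Tweet {
        public String author;
        public String content;
    }

    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build()) {
            // The JSON schema for Tweet is attached to the topic on creation.
            Producer<Tweet> producer = client.newProducer(Schema.JSON(Tweet.class))
                    .topic("persistent://public/default/tweets")
                    .create();
            Tweet t = new Tweet();
            t.author = "jerry";
            t.content = "hello pulsar sql";
            producer.send(t);
            producer.close();
        }
    }
}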
Pulsar SQL
Interactive SQL queries over data stored in Pulsar
Query old and real-time data
Pulsar SQL / 2
● Based on Presto by Facebook — https://prestodb.io/
● Presto is a distributed query execution engine
● Fetches the data from multiple sources (HDFS, S3, MySQL, …)
● Full SQL compatibility
Pulsar SQL / 3
● Pulsar connector for Presto
○ Read data directly from BookKeeper — bypass Pulsar Broker
■ Can also read data offloaded to Tiered Storage (S3, GCS, etc.)
○ Many-to-many data reads
■ Data is split even on a single partition — multiple workers can read data in
parallel from a single Pulsar partition
■ Time-based indexing — Use “publishTime” in predicates to reduce the data
being read from disk
Pulsar SQL Architecture
Benefits
● Do not need to move data into another
system for querying
● Read data in parallel
○ Performance not impacted by
partitioning
○ Increase throughput by increasing
write quorum
● Newly arrived data can be queried
immediately
Compared to other message buses?
● Other messaging platforms have Presto integrations
● They typically use a consumer to read data from the brokers
● A topic/partition is served by a single broker (limiting disk I/O and
network bandwidth)
User interaction
Connect with CLI client
$./bin/pulsar sql
List Pulsar cluster
presto> show catalogs;
Catalog
---------
pulsar
system
(2 rows)
List Pulsar namespaces
presto> show schemas in pulsar;
Schema
-----------------------
information_schema
public/default
public/functions
sample/standalone/ns1
List Pulsar topics
presto> show tables in pulsar."public/default";
Table
----------------
generator_test
(1 row)
Pulsar SQL
User interaction
Query data in topic
presto> select * from pulsar."public/default".generator_test;
firstname | middlename | lastname | email | username | password | telephonenumber | age | companyemail |
-------------+-------------+-------------+----------------------------------+--------------+----------+-----------------+-----+-------------------------------------+
Genesis | Katherine | Wiley | genesis.wiley@gmail.com | genesisw | y9D2dtU3 | 959-197-1860 | 71 | genesis.wiley@interdemconsulting.eu |
Brayden | | Stanton | brayden.stanton@yahoo.com | braydens | ZnjmhXik | 220-027-867 | 81 | brayden.stanton@supermemo.eu |
Benjamin | Julian | Velasquez | benjamin.velasquez@yahoo.com | benjaminv | 8Bc7m3eb | 298-377-0062 | 21 | benjamin.velasquez@hostesltd.biz |
Michael | Thomas | Donovan | donovan@mail.com | michaeld | OqBm9MLs | 078-134-4685 | 55 | michael.donovan@memortech.eu |
Brooklyn | Avery | Roach | brooklynroach@yahoo.com | broach | IxtBLafO | 387-786-2998 | 68 | brooklyn.roach@warst.biz |
Skylar | | Bradshaw | skylarbradshaw@yahoo.com | skylarb | p6eC6cKy | 210-872-608 | 96 | skylar.bradshaw@flyhigh.eu |
...
Pulsar SQL
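Beyond the CLI, any Presto-compatible client can run the same queries. The sketch below uses the standard Presto JDBC driver (presto-jdbc on the classpath); the coordinator address, port 8081, and the user name are assumptions based on a local standalone SQL worker, and the table is the generator_test topic from the example above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PulsarSqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed coordinator address of a local Pulsar SQL (Presto) worker.
        String url = "jdbc:presto://localhost:8081";
        try (Connection conn = DriverManager.getConnection(url, "test-user", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT firstname, age FROM pulsar.\"public/default\".generator_test LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("firstname") + " " + rs.getInt("age"));
            }
        }
    }
}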
Demo
Performance
Setup
• 3 Nodes
• 12 CPU cores
• 128 GB RAM
• 2 X 1.2 TB NVMe disks
Results
• JSON (Compressed)
• ~60 Million Rows / Second
• Avro (Compressed)
• ~50 Million Rows / Second
Improving query efficiency
● Query by partition
○ Scanning a large amount of data may be costly and time-consuming
○ If the data is keyed and hashed to a specific partition, you can query just that partition
○ For example, if you have tweets keyed by author ingested into Pulsar:
SELECT tweet.author, tweet.content
FROM pulsar."public/default".tweets
WHERE tweet.author = 'jerry' AND __partition__ = 1
● Query by publish time
○ Ledgers/segments are naturally sorted by publish time
○ Only data within the specified publish-time range will be read
○ Select a range of publish times to minimize the data that needs to be read
SELECT tweet.author, tweet.content
FROM pulsar."public/default".tweets
WHERE tweet.author = 'jerry' AND __partition__ = 1 AND __publish_time__ > timestamp '2020-06-15 09:00:00'
Case study: Job search analytics at
zhaopin.com
Background
● About ZhaoPin
○ Chinese job search website (similar to LinkedIn, Indeed, etc.)
● Background
○ ZhaoPin is already a heavy user of Apache Pulsar
○ Using Pulsar to power their enterprise event bus
■ Data involving job position searches, job posts, and resume searches
Case study: Job search analytics at zhaopin.com
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
● Debugging search results
○ “When the search results do not meet expectations…”
● Analyzing and improving search results
○ “Analyze the search criteria associated with a position that a job
seeker applied for, such as when the position was first exposed
to that user, in order to improve the search service.”
● Analyzing search logs
○ “Analyze search logs from different perspectives and generate
charts that summarize data in different ways, such as by city,
vocation, or keyword ranking. In this way, the search service can
be improved by making it more specific.”
Use cases
Case study: Job search analytics at zhaopin.com
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
● ZhaoPin was already using Pulsar
● Pulsar SQL allows queries using standard SQL syntax
● Pulsar can store large amounts of data and is
easy to scale up
Why Pulsar SQL
Case study: Job search analytics at zhaopin.com
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
Quick Start guide:
https://pulsar.apache.org/docs/en/sql-getting-started/
How to get started?
● Performance tuning
● Store data in columnar format
○ Improve compression ratio
○ Materialize relevant columns
● Support different indices
Future work
Questions?
Email: jerryp@splunk.com