Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar

Vincent Xie (Bestpay), Jia Zhai (StreamNative)

About us
Vincent (Weisheng) Xie
❏ Current Director @ Orange Financial
❏ Previously tech lead of the ML engineering team @ Intel
Jia Zhai
❏ Co-Founder of StreamNative
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Member

Agenda
❏ Background
❏ Apache Pulsar
❏ Unified Data Processing
❏ Our Practices
❏ Q & A

Orange Financial
Orange Financial Services Group (Chinese: 甜橙金融), formerly known as Bestpay, is an affiliate company of
China Telecom. It reached 1.13 trillion CNY ($18.37 Billion) transaction volume in 2018, with 500 million registered
users and 41.9 million active users.
Subsidiaries:
Bestpay - a mobile wallet and payment app
Jieqian - a consumer loan service
Orange Wealth
Orange Insurance
Orange Credit
Orange Financial Cloud

High Industry Penetration Rate
Source: China Unionpay

Challenges
❏ High concurrency
❏ > 50M transactions, 1 billion events a day (peak: 35K/s)
❏ Low latency demand
❏ response < 200ms
❏ Large number of batch jobs and streaming jobs

“A merchant’s total transaction volume ($) within the past month (30 days), current transaction included”
= sum($past_29days) [batch] + sum($today_upto_current) [streaming]
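The split above can be sketched in a few lines of Python. This is a minimal illustration, not the production logic: the function name `monthly_volume`, the dict of pre-aggregated daily totals, and the list of today's transactions are all hypothetical stand-ins for the batch and streaming layers.

```python
from datetime import date, datetime


def monthly_volume(batch_daily_totals, todays_transactions, now):
    """30-day volume = batch part (past 29 days) + streaming part (today so far).

    batch_daily_totals: {date: total} as a batch job might produce (assumed shape).
    todays_transactions: [(datetime, amount)] as a stream might deliver (assumed shape).
    """
    today = now.date()
    # Batch part: the 29 full days before today.
    past_29_days = sum(
        total
        for day, total in batch_daily_totals.items()
        if 1 <= (today - day).days <= 29
    )
    # Streaming part: today's transactions up to and including the current one.
    today_up_to_now = sum(
        amount
        for ts, amount in todays_transactions
        if ts.date() == today and ts <= now
    )
    return past_29_days + today_up_to_now
```

The two sums come from different systems in a Lambda architecture, which is why the same metric ends up implemented twice.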

Architecture V1 - Lambda
❏ API Gateway
❏ Batch Layer
❏ Speed/Streaming Layer
❏ Serving Layer

Drawbacks
❏ Software stack complexity
❏ Realtime / Offline / Serving stacks
❏ Multiple clusters to maintain (Kafka / Hive / Spark / Flink)
❏ Different skill sets required (Scala / Java / SQL)
❏ Segmented logic
❏ Historical/Current
❏ Data redundancy
❏ Multiple copies of the data to move around

“Flexible Pub/Sub Messaging
Backed by durable log storage”

Pulsar - A cloud-native architecture
❏ Stateless serving layer (brokers)
❏ Durable storage layer (BookKeeper bookies)

Pulsar - Segment Centric Storage
❏ Topic Partition (Managed Ledger)
❏ The storage layer for a single topic partition
❏ Segment (Ledger)
❏ Single writer, append-only
❏ Replicated to multiple bookies
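To make the segment idea concrete, here is a toy model in Python. It is a sketch only and not Pulsar's or BookKeeper's implementation: the class names `Ledger` and `ManagedLedger` borrow the slide's vocabulary, the rollover threshold `max_entries_per_ledger` is an assumed parameter, and replication to bookies is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Ledger:
    """One segment: append-only, written by a single writer until closed."""
    ledger_id: int
    entries: List[bytes] = field(default_factory=list)
    closed: bool = False


class ManagedLedger:
    """A topic partition modeled as an ordered list of ledgers (segments)."""

    def __init__(self, max_entries_per_ledger: int = 3):
        self.max_entries = max_entries_per_ledger
        self.ledgers = [Ledger(0)]

    def append(self, payload: bytes) -> Tuple[int, int]:
        current = self.ledgers[-1]
        if len(current.entries) >= self.max_entries:
            # Roll over: seal the full segment and open a new one.
            current.closed = True
            current = Ledger(current.ledger_id + 1)
            self.ledgers.append(current)
        current.entries.append(payload)
        # A message position is (ledger id, entry id within that ledger).
        return current.ledger_id, len(current.entries) - 1
```

Because closed segments never change, they can be read independently of the broker that wrote them, which is what the segment-level (batch) read path relies on.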

Pulsar - Stream as a unified view on data

Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
❏ Read data from brokers
❏ Consume / Seek / Receive
❏ Subscription Mode - Failover, Shared, Key_Shared
❏ Reprocessing data by rewinding (seeking) the cursors
❏ Segment (Batch)
❏ Read data from storage (BookKeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
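The segment-level read with predicate pushdown on publish timestamp can be illustrated with a small Python sketch. It assumes each segment knows the publish-time range of its messages; the `Segment` class and `read_range` function are hypothetical names for this example, not part of any Pulsar API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    segment_id: int
    # (publish_time_ms, payload) pairs, stored in publish order.
    messages: List[Tuple[int, str]]

    @property
    def min_publish_time(self) -> int:
        return self.messages[0][0]

    @property
    def max_publish_time(self) -> int:
        return self.messages[-1][0]


def read_range(segments: List[Segment], start_ms: int, end_ms: int):
    """Return payloads with start_ms <= publish time < end_ms.

    Also returns the ids of the segments that were actually scanned, to show
    that segments outside the time range are skipped without being read.
    """
    scanned, payloads = [], []
    for seg in segments:
        if seg.max_publish_time < start_ms or seg.min_publish_time >= end_ms:
            continue  # pruned by the publish-time predicate
        scanned.append(seg.segment_id)
        payloads.extend(p for t, p in seg.messages if start_ms <= t < end_ms)
    return payloads, scanned
```

Each surviving segment could be handed to a separate task, which is one way to read the "fine-grained parallelism" bullet.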

Unified Data Processing on Pulsar

Architecture V2
❏ API Gateway
❏ Spark Structured Streaming
❏ Spark SQL
❏ Single Data Store (Pulsar)
❏ Single Computing Engine (Spark)
❏ Unified API

Pulsar-Spark
❏ Deeply integrated with Pulsar schema
❏ Pulsar topics as Structured Streams
❏ Pulsar Connectors for Spark Structured Streaming
❏ Pulsar Connectors for Spark SQL
https://github.com/streamnative/pulsar-spark

Pulsar-Spark / Streaming Queries
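A streaming read through the connector might look like the PySpark sketch below. The `pulsar` format name and the `service.url`, `admin.url`, and `topic` options follow the pulsar-spark README as recalled here and should be checked against the connector version in use; the URLs are local-cluster placeholders and the helper names are hypothetical.

```python
def pulsar_source_options(topic,
                          service_url="pulsar://localhost:6650",
                          admin_url="http://localhost:8080"):
    """Connector options for reading one topic (placeholder URLs, an assumption)."""
    return {
        "service.url": service_url,
        "admin.url": admin_url,
        "topic": topic,
    }


def read_stream(spark, topic):
    """Expose a Pulsar topic as a streaming DataFrame.

    `spark` is an existing SparkSession with the pulsar-spark connector
    on its classpath.
    """
    return (
        spark.readStream
        .format("pulsar")
        .options(**pulsar_source_options(topic))
        .load()
    )
```

The returned DataFrame can then be consumed like any other structured stream, for example with `read_stream(spark, "transactions").writeStream.format("console").start()`, where `transactions` is a made-up topic name.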

Pulsar-Spark / Batch Queries
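A batch query over the same data could be sketched as follows. The `startingOffsets` and `endingOffsets` options are taken from the pulsar-spark README as recalled here; the view name `transactions` and the columns `merchant_id` and `amount` are hypothetical and depend on the topic's schema.

```python
def pulsar_batch_options(topic,
                         service_url="pulsar://localhost:6650",
                         admin_url="http://localhost:8080"):
    """Connector options for a bounded read of everything currently in the topic."""
    return {
        "service.url": service_url,
        "admin.url": admin_url,
        "topic": topic,
        "startingOffsets": "earliest",
        "endingOffsets": "latest",
    }


def merchant_totals(spark, topic):
    """Load a topic as a static DataFrame and aggregate it with Spark SQL.

    `spark` is an existing SparkSession with the connector available.
    Column names below are assumptions about the topic schema.
    """
    df = (
        spark.read
        .format("pulsar")
        .options(**pulsar_batch_options(topic))
        .load()
    )
    df.createOrReplaceTempView("transactions")
    return spark.sql(
        "SELECT merchant_id, SUM(amount) AS total "
        "FROM transactions GROUP BY merchant_id"
    )
```

The only difference from the streaming case is `read` instead of `readStream`, which is the point of using one store and one engine.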

Pulsar-Spark / Write Results to Pulsar
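Writing a batch result back to a topic might look like this sketch. It assumes the connector's `pulsar` sink format with the same `service.url`, `admin.url`, and `topic` options as the source; the function name and URLs are placeholders.

```python
def pulsar_sink_options(topic,
                        service_url="pulsar://localhost:6650",
                        admin_url="http://localhost:8080"):
    """Connector options for writing to one topic (placeholder URLs, an assumption)."""
    return {
        "service.url": service_url,
        "admin.url": admin_url,
        "topic": topic,
    }


def write_to_pulsar(df, topic):
    """Publish each row of a (non-streaming) DataFrame as a Pulsar message."""
    (
        df.write
        .format("pulsar")
        .options(**pulsar_sink_options(topic))
        .save()
    )
```

For a streaming DataFrame, the analogous call would go through `writeStream` and `start()` and would also need a checkpoint location.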

PoC at Bestpay
❏ Ingest data to Pulsar
❏ Realtime Data
❏ pulsar-io-kafka: pipe Kafka messages (JSON) into Pulsar and store them in Avro format with schema information
❏ Historic Data
❏ pulsar-spark: query the Hive table and insert Hive rows into Pulsar as Avro messages
❏ Data Processing
❏ Spark Structured Streaming: for stream processing
❏ Spark SQL: for batch processing and interactive queries

Benefits
❏ Complexity drops 33% (number of clusters from 6 down to 4)
❏ Storage savings of 8.7% (expected to reach 28%)
❏ Time to production improves 11x (backed by streaming SQL)
❏ Higher stability (expected)

Summary
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi-layered architecture
❏ Segment centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Pulsar + Spark for a simple unified data processing

References
❏ pulsar-io-kafka: https://github.com/streamnative/pulsar-io-kafka
❏ pulsar-spark: https://github.com/streamnative/pulsar-spark
❏ Apache Pulsar as One Storage System for Both Real-time and Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017

Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists
dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub
https://github.com/apache/pulsar
❏ Medium
https://medium.com/streamnative
