Real-time computation is now widely used in cases such as online product recommendation and online payment fraud detection. In a streaming pipeline, Kafka is typically used to store a day's or a week's worth of data, but not the years of data needed to look at historical trends. A batch pipeline is therefore needed for computation over historical data, and this is where the Lambda architecture comes in. Lambda has proven effective and offers a good balance of speed and reliability, and we have run many systems on it for years. Its biggest drawback, however, is the need to maintain two distinct (and possibly complex) systems for the batch and streaming layers. First, we have to split our business logic into many segments across different places, which becomes harder to maintain as the business grows and increases communication overhead. Second, the data is duplicated in two systems, and we have to move data between systems for processing. Facing these challenges, we searched for alternatives and found Apache Pulsar to be a great fit. In this talk, I will show how we solve these problems by making Pulsar a unified storage backend for both the batch and streaming pipelines, a solution that simplifies the software stack, improves our work efficiency, and lowers cost at the same time.
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar
1. Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar
Vincent Xie (Bestpay), Jia Zhai (StreamNative)
2. About us
Vincent (Weisheng) Xie
❏ Current Director @ Orange Financial
❏ Previously tech lead of the ML engineering team @ Intel
Jia Zhai
❏ Co-Founder of StreamNative
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Member
5. Orange Financial
Orange Financial Services Group (Chinese: 甜橙金融), formerly known as Bestpay, is an affiliate company of
China Telecom. It reached 1.13 trillion CNY ($18.37 Billion) transaction volume in 2018, with 500 million registered
users and 41.9 million active users.
Subsidiaries:
Bestpay - a mobile wallet and payment app
Jieqian - a consumer loan service
Orange Wealth
Orange Insurance
Orange Credit
Orange Financial Cloud
10. Challenges
❏ High concurrency
❏ > 50M transactions, 1 billion events a day (peak: 35K/s)
❏ Low latency demand
❏ response < 200ms
❏ Large number of batch jobs and streaming jobs
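To put the load figures above in perspective, the short calculation below derives the average event rate implied by 1 billion events a day and compares it with the stated peak. It uses only the two numbers from the slide; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

events_per_day = 1_000_000_000  # "1 billion events a day" (from the slide)
peak_events_per_sec = 35_000    # "peak: 35K/s" (from the slide)

# Average rate if the daily volume were spread evenly over the day.
average_events_per_sec = events_per_day / SECONDS_PER_DAY

# How far the peak sits above that average.
peak_to_average = peak_events_per_sec / average_events_per_sec

print(f"average: {average_events_per_sec:,.0f} events/s")
print(f"peak is {peak_to_average:.1f}x the average")
```

The system therefore has to absorb bursts of roughly three times the average rate while still meeting the latency target.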
11. “A merchant’s total transaction volume ($) within the past month (30 days), current transaction included”
= sum($past_29days) [batch] + sum($today_upto_current) [streaming]
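The split in the formula above (the past 29 days from the batch layer, today so far from the streaming layer) can be sketched in plain Python. This is a minimal illustration under assumed data shapes, not the production logic: the record fields (`merchant`, `amount`, `ts`) and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def batch_daily_totals(transactions):
    """Batch layer: pre-aggregate the amount per (merchant, calendar day)."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["merchant"], t["ts"].date())] += t["amount"]
    return totals


def volume_past_30_days(merchant, now, daily_totals, todays_stream):
    """Combine batch results (past 29 days) with today's streamed events."""
    today = now.date()
    # Batch part: the 29 full days before today.
    past_29_days = sum(
        daily_totals.get((merchant, today - timedelta(days=d)), 0.0)
        for d in range(1, 30)
    )
    # Streaming part: today's events up to and including the current one.
    today_up_to_current = sum(
        t["amount"]
        for t in todays_stream
        if t["merchant"] == merchant
        and t["ts"].date() == today
        and t["ts"] <= now
    )
    return past_29_days + today_up_to_current


history = [
    {"merchant": "m1", "amount": 100.0, "ts": datetime(2020, 1, 30, 10, 0)},
    {"merchant": "m1", "amount": 50.0, "ts": datetime(2020, 1, 2, 10, 0)},
    {"merchant": "m1", "amount": 999.0, "ts": datetime(2020, 1, 1, 10, 0)},  # 30 days back: outside the window
    {"merchant": "m2", "amount": 7.0, "ts": datetime(2020, 1, 30, 10, 0)},
]
stream_today = [
    {"merchant": "m1", "amount": 25.0, "ts": datetime(2020, 1, 31, 9, 0)},
    {"merchant": "m2", "amount": 3.0, "ts": datetime(2020, 1, 31, 10, 0)},
    {"merchant": "m1", "amount": 10.0, "ts": datetime(2020, 1, 31, 11, 0)},
]

now = datetime(2020, 1, 31, 12, 0)
total = volume_past_30_days("m1", now, batch_daily_totals(history), stream_today)
print(total)  # 100 + 50 + 25 + 10
```

Even in this toy form, one business metric needs two code paths over two data sources, which is the maintenance cost of Lambda described in the abstract.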
32. Pulsar-Spark / Write Results to Pulsar
https://github.com/streamnative/pulsar-spark
33. PoC at Bestpay
❏ Ingest data to Pulsar
❏ Realtime Data
❏ pulsar-io-kafka: copy Kafka messages (JSON) into Pulsar and store them in Avro format with schema information
❏ Historic Data
❏ pulsar-spark: query the Hive table and write the Hive rows to Pulsar as Avro messages
❏ Data Processing
❏ Spark Structured Streaming: for stream processing
❏ Spark SQL: for batch processing and interactive queries
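The ingestion step above turns schemaless JSON messages into records that carry schema information. The sketch below shows that idea only, in plain Python: it checks a JSON message against an Avro-style record schema and returns a typed record. It does not use the Avro library or the pulsar-io-kafka connector, and the `Transaction` schema, its fields, and `json_to_record` are hypothetical names chosen for illustration.

```python
import json

# Hypothetical Avro-style record schema for a payment transaction.
TRANSACTION_SCHEMA = {
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "merchant", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "ts", "type": "long"},
    ],
}

# Python types accepted for the few primitive types used in this sketch.
_PY_TYPES = {"string": str, "double": float, "long": int}


def json_to_record(raw, schema):
    """Parse a JSON message and check each field against the schema."""
    data = json.loads(raw)
    record = {}
    for field in schema["fields"]:
        name, field_type = field["name"], field["type"]
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # JSON does not distinguish 12 from 12.0; promote ints for doubles.
        if field_type == "double" and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[field_type]):
            raise ValueError(f"field {name!r} is not of type {field_type}")
        record[name] = value
    return record


record = json_to_record(
    '{"merchant": "m1", "amount": 12, "ts": 1577836800000}',
    TRANSACTION_SCHEMA,
)
print(record)
```

Attaching the schema at ingestion time means downstream batch and streaming jobs can read the same typed records instead of each re-parsing raw JSON.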
34. Benefits
❏ Complexity drops 33% (number of clusters down from 6 to 4)
❏ Storage savings of 8.7% (expected to reach 28%)
❏ Time to production improves 11x (backed by streaming SQL)
❏ Higher stability (expected)
35. Summary
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi-layered architecture
❏ Segment-centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Pulsar + Spark for simple, unified data processing
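The idea of segment-centric storage with two levels of reading can be illustrated with a toy in-memory model. This is not Pulsar's API or its storage implementation: `SegmentedTopic` and its methods are hypothetical, and the only point is that a streaming reader (cursor-based, in order) and a batch reader (whole sealed segments) can share one copy of the data.

```python
class SegmentedTopic:
    """Toy append-only log that is split into fixed-size segments."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.sealed = []  # immutable, full segments
        self.open = []    # segment currently being written

    def publish(self, message):
        """Append a message; seal the open segment once it is full."""
        self.open.append(message)
        if len(self.open) == self.segment_size:
            self.sealed.append(tuple(self.open))
            self.open = []

    def read_from(self, cursor):
        """Pub/Sub-style read: all messages in order from a cursor position."""
        messages = [m for segment in self.sealed for m in segment] + self.open
        return messages[cursor:]

    def read_segment(self, index):
        """Segment-style read: fetch one sealed segment as a whole."""
        return self.sealed[index]


topic = SegmentedTopic(segment_size=3)
for amount in [1, 2, 3, 4, 5, 6, 7]:
    topic.publish(amount)

# Batch job: process sealed segments independently (could run in parallel).
batch_total = sum(sum(topic.read_segment(i)) for i in range(len(topic.sealed)))

# Streaming job: continue from where the cursor left off.
tail = topic.read_from(6)

print(batch_total, tail)
```

Because both readers work on the same log, there is no second copy of the data to keep in sync, which is the property the summary refers to as a unified view of data.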