More and more applications use Flink for low-latency data processing. Flink unifies batch and stream processing in one computation engine. In practice, however, truly unifying batch and stream processing also requires a data system that offers a single data representation for both batch and streaming data. Today, streaming data is typically stored in a log storage or messaging system, while batch data is stored in distributed filesystems and object stores. As a result, data scientists still need to write two different computing jobs to access the same data stored in different data systems.
Apache Pulsar is a next-generation messaging and streaming data system. It was originally built at Yahoo, and has graduated from the Apache Incubator to become a Top-Level Project. Pulsar separates message serving and data storage into two layers. This layered architecture provides high throughput and low latency while ensuring high availability and scalability. Pulsar's segment-centric storage design, together with its layered architecture, makes Pulsar a natural unbounded streaming data system that fits well into Flink's computation model.
In this talk, Sijie Guo of the Apache Pulsar PMC will introduce Pulsar, its layered architecture, and its segment-centric storage, detailing how this architecture integrates with Flink to provide elastic, unified batch and stream processing.
Elastic Data Processing with Apache Flink and Apache Pulsar
21. Segmented Stream
• Segmented Stream Systems
• Apache Pulsar, Twitter EventBus, EMC Pravega
• All are based on Apache BookKeeper (BK)
• Each uses BK in a different way
• Pulsar, EventBus - Uses BK as the segment store
• Pravega - Uses BK as the journal only
32. Tiered Storage
• Offloader
• When: size-based, time-based, or triggered by pulsar-admin
• How: copy a segment to tiered storage, and delete it from BookKeeper
• Access: the broker knows how to read the data back, or readers can bypass the broker and read the offloaded segments directly
• Available Offloaders
• Cloud Offloader: AWS, GCS, Azure, …
• HDFS, Ceph, …
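The "When" rule above can be sketched in a few lines. This is a toy model under stated assumptions, not Pulsar's offloader code: `Segment`, `select_for_offload`, `max_bytes`, and `max_age_s` are hypothetical names introduced for illustration, and the policy shown (offload the oldest sealed segments while the total size kept in BookKeeper exceeds a threshold, or when a segment is older than an age limit) is one plausible reading of "size-based, time-based".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical model of a sealed segment still stored in BookKeeper."""
    segment_id: int
    size_bytes: int
    closed_at: float  # time (in seconds) at which the segment was sealed


def select_for_offload(
    segments: List[Segment],
    now: float,
    max_bytes: Optional[int] = None,    # assumed size-based threshold
    max_age_s: Optional[float] = None,  # assumed time-based threshold
) -> List[int]:
    """Return ids of segments to copy to tiered storage, oldest first."""
    chosen = []
    remaining = sum(s.size_bytes for s in segments)
    for seg in sorted(segments, key=lambda s: s.closed_at):
        too_big = max_bytes is not None and remaining > max_bytes
        too_old = max_age_s is not None and now - seg.closed_at > max_age_s
        if too_big or too_old:
            chosen.append(seg.segment_id)
            # Once copied, the segment no longer counts against BookKeeper.
            remaining -= seg.size_bytes
    return chosen
```

With both thresholds left as `None`, nothing is selected, which corresponds to offloading only when triggered manually.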
33. Stream as a Unified View on Data
[Figure: a stream drawn along a time axis as Segments 1 to 6. Labels: Producers, Consumers, Stream, Segment, Readers.]
34. Data Processing on Pulsar
[Figure: a stream drawn along a time axis as Segments 1 to 6. Labels: two Bounded Stream ranges and two Unbounded Stream ranges.]
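The idea on this slide, that batch and streaming jobs are two views over the same segments, can be illustrated with a small in-memory model. This is a sketch under stated assumptions and does not use the Pulsar or Flink APIs: `Segment`, `bounded_read`, and `unbounded_read` are hypothetical names, and restricting bounded reads to sealed segments is an assumption of the sketch. A real unbounded reader would keep waiting for new records; this one stops at the current end of the stream.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Segment:
    """Hypothetical in-memory stand-in for one segment of a stream."""
    segment_id: int
    records: List[str]
    sealed: bool = True  # the last segment of a live stream is still open


def bounded_read(segments: List[Segment], first_id: int, last_id: int) -> List[str]:
    """Batch view: all records in the sealed segments first_id..last_id."""
    out: List[str] = []
    for seg in segments:
        if first_id <= seg.segment_id <= last_id:
            if not seg.sealed:
                raise ValueError("bounded reads cover sealed segments only")
            out.extend(seg.records)
    return out


def unbounded_read(segments: List[Segment], start_id: int) -> Iterator[str]:
    """Streaming view: records from start_id onward, including the open tail."""
    for seg in segments:
        if seg.segment_id >= start_id:
            yield from seg.records


stream = [
    Segment(1, ["a", "b"]),
    Segment(2, ["c"]),
    Segment(3, ["d"], sealed=False),
]
batch = bounded_read(stream, 1, 2)       # a fixed range of sealed segments
tail = list(unbounded_read(stream, 2))   # the same data, read as a stream
```

Both functions read the same `stream` list, so a job does not need a second copy of the data in a separate system to switch between the two modes.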
42. Zhaopin.com
Zhaopin.com is the biggest online recruitment service provider in China.
Zhaopin.com provides job seekers with a comprehensive resume service, the latest employment and career development information, and in-depth online job search for positions throughout China.
Zhaopin.com provides professional HR services to over 2.2 million clients, and its average daily page views exceed 68 million.