Apache Pulsar and Apache Flink share a similar view of how the data and computation layers of an application can be “streaming-first”, with batch as a special case of streaming. With Apache Pulsar’s Segmented-Stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and to build a real streaming warehouse.
In this talk, Sijie Guo from the Apache Pulsar community will give an overview of Apache Pulsar and how it provides the unified data view needed to fully leverage Apache Flink’s unified computation runtime for elastic data processing. He will share the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.
5. A brief history of Apache Pulsar
❏ 2012: Pulsar idea started
❏ 5+ years in production, 100+ applications, 10+ data centers
❏ 2016/09 Yahoo open sourced Pulsar
❏ 2017/06 Yahoo donated Pulsar to ASF
❏ 2018/09 Pulsar graduated as a Top-Level project
❏ 25+ committers, 154 contributors, 900+ forks, 4000+ stars
❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, ...
14. Data Processing Categories
❏ Interactive
  ❏ Time critical
  ❏ Medium data size
  ❏ Rerun on failures
❏ Batch
  ❏ The amount of data is huge
  ❏ Can run on a huge cluster
  ❏ Fine-grained fault tolerance
❏ Streaming
  ❏ Long running jobs
  ❏ Time critical
  ❏ Need scalability as well as resilience to failures
❏ Serverless
  ❏ Simple, light-weight processing
  ❏ Processing data with high velocity
43. Pulsar-Flink (5) - Topic Partitions Discovery
❏ Find matching topics
❏ Fetch schemas for each topic
❏ Build schema-specific deserializer
❏ Each reader is responsible for one topic partition
❏ Each source task has a partition discovery task to check for newly added partitions
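The discovery steps above can be sketched as a small, self-contained model. This is only an illustration, not the connector's code: the `PartitionDiscoverer` class, the `owner_of` helper, the glob-style topic pattern, and the hash-modulo assignment of partitions to source subtasks are all assumptions made for the example. The slide only states that each reader handles one partition and that each source task periodically checks for new ones.

```python
import fnmatch
import zlib


def owner_of(partition: str, parallelism: int) -> int:
    """Map a partition name to a subtask index (hypothetical assignment rule).

    crc32 is used because it is stable across processes, so every subtask
    computes the same owner for the same partition name.
    """
    return zlib.crc32(partition.encode("utf-8")) % parallelism


class PartitionDiscoverer:
    """Toy model of the per-source-task partition discovery loop."""

    def __init__(self, pattern: str, subtask_index: int, parallelism: int):
        self.pattern = pattern
        self.subtask_index = subtask_index
        self.parallelism = parallelism
        self.known = set()  # partitions this subtask already reads

    def discover(self, all_partitions):
        """Return partitions newly assigned to this subtask since the last call."""
        matching = [p for p in all_partitions
                    if fnmatch.fnmatchcase(p, self.pattern)]
        mine = {p for p in matching
                if owner_of(p, self.parallelism) == self.subtask_index}
        new = sorted(mine - self.known)
        self.known.update(new)
        return new
```

Each call to `discover` would be followed by starting one reader per returned partition. Because the assignment depends only on the partition name and the parallelism, subtasks need no coordination to agree on who owns a newly added partition.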
44. Pulsar-Flink (6) - Exactly-once Source
❏ Message order on partition basis
❏ Seek & read
❏ Checkpoints with MessageID
❏ Durable cursor to keep un-checkpointed messages alive
❏ Move cursor when a checkpoint is completed
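The interplay between checkpoints, message IDs, and the durable cursor can be illustrated with a toy model. This is a sketch under stated assumptions, not the connector's implementation: `PartitionLog` and `SourceReader` are hypothetical classes, a message ID is simplified to a list index, and the method names only loosely mirror the snapshot and checkpoint-complete callbacks of a checkpointed source.

```python
class PartitionLog:
    """Toy stand-in for one topic partition with a durable cursor."""

    def __init__(self, messages):
        self.messages = list(messages)
        # Messages at index >= cursor must be retained by the broker.
        self.cursor = 0

    def read(self, message_id):
        return self.messages[message_id]

    def move_cursor(self, message_id):
        # The cursor only moves forward.
        self.cursor = max(self.cursor, message_id)


class SourceReader:
    """Reads one partition in order and checkpoints its position."""

    def __init__(self, log, start=0):
        self.log = log
        self.next_id = start   # "seek" to the starting message ID
        self.pending = {}      # checkpoint id -> position at snapshot time

    def poll(self):
        """Return the next message, or None when caught up."""
        if self.next_id >= len(self.log.messages):
            return None
        message = self.log.read(self.next_id)
        self.next_id += 1
        return message

    def snapshot_state(self, checkpoint_id):
        """Record the next message ID to read as the checkpointed state."""
        self.pending[checkpoint_id] = self.next_id
        return self.next_id

    def notify_checkpoint_complete(self, checkpoint_id):
        """Move the durable cursor only once the checkpoint is confirmed."""
        self.log.move_cursor(self.pending.pop(checkpoint_id))

    @classmethod
    def restore(cls, log, state):
        """After a failure, seek back to the checkpointed message ID."""
        return cls(log, start=state)
```

The key design point is the delay between `snapshot_state` and `notify_checkpoint_complete`: until the checkpoint is confirmed, the cursor stays behind, so messages read after the snapshot remain available for replay if the job restarts from that checkpoint.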
47. Future directions
❏ Unified Source API for both batch and streaming execution
  ❏ FLIP-27
❏ Pulsar as a catalog
❏ Pulsar as a state backend
❏ Scale-out source parallelism
  ❏ Key_Shared & Sticky consumer
❏ End-to-end exactly-once
  ❏ Pulsar transaction in 2.5.0
49. Key_Shared Subscription
❏ Key based ordering
❏ Key can be the message key or a separate *order* key
❏ HashRing based routing
❏ Key based batcher
❏ Policies for messages without *keys*
https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
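To make the hash-ring routing idea concrete, here is a minimal consistent-hashing sketch. It is an assumption-laden illustration rather than Pulsar's actual dispatcher: the `HashRing` class, the `dispatch` helper, the use of crc32, and the number of ring points per consumer are all choices made for the example. It shows only the property that matters for Key_Shared: every message with the same key goes to the same consumer, so per-key order is preserved while different keys are spread across consumers.

```python
import bisect
import zlib


class HashRing:
    """Consistent-hash ring mapping message keys to consumers (illustrative)."""

    def __init__(self, consumers, points_per_consumer=100):
        # Each consumer owns several points on the ring to even out the load.
        self._ring = sorted(
            (zlib.crc32(f"{consumer}#{i}".encode("utf-8")), consumer)
            for consumer in consumers
            for i in range(points_per_consumer)
        )
        self._hashes = [h for h, _ in self._ring]

    def route(self, key: str) -> str:
        """Return the consumer owning the first ring point after the key's hash."""
        h = zlib.crc32(key.encode("utf-8"))
        index = bisect.bisect_right(self._hashes, h) % len(self._ring)
        return self._ring[index][1]


def dispatch(messages, ring):
    """Append each (key, payload) message to the queue of its consumer."""
    queues = {}
    for key, payload in messages:
        queues.setdefault(ring.route(key), []).append((key, payload))
    return queues
```

One consequence of the ring layout is that when a consumer leaves, only the keys it owned are reassigned; keys routed to the remaining consumers stay where they were.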
50. Conclusion
❏ Apache Pulsar is a cloud-native messaging and streaming system
  ❏ Multi-layered architecture
  ❏ Segment-centric storage
  ❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Apache Flink provides a unified view of computing
❏ Pulsar + Flink for streaming-first, unified data processing