Query Pulsar Streams using Apache Flink
Sijie Guo (@sijieg)
Who am I
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Chair
❏ StreamNative Founder
❏ Ex-Twitter, Ex-Yahoo
❏ Interested in event streaming technologies
What is Apache Pulsar?
“Flexible Pub/Sub Messaging
Backed by durable log storage”
Highlights
❏ Multi-Tenant Data System: Isolation, ACL, Policies
❏ Unified messaging model: Queuing + Streaming
❏ Infinite Segmented Stream Storage: Segment-centric, Tiered storage
❏ Structured Event Streams: Built-in schema management
❏ Cloud-Native Architecture: Simplified ops, Rebalance-free
A brief history of Apache Pulsar
❏ 2012: Pulsar idea started at Yahoo!
❏ 5+ years in production, 100+ applications, 10+ data centers
❏ 2016/09 Yahoo open sourced Pulsar
❏ 2017/06 Yahoo donated Pulsar to ASF
❏ 2018/09 Pulsar graduated as a Top-Level project
❏ 25+ committers, 168 contributors, 1000+ forks, 4200+ stars
❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, THG, OVH, …
http://pulsar.apache.org/en/powered-by/
Pulsar Use Cases
❏ Billing / Payment / Trading Service
❏ Worker Queue / Push Notifications / Task Queue
❏ Unified Messaging Backbone (Queuing + Streaming)
❏ IoT
❏ Unified Data Processing
Pulsar at Tencent
❏ Billing Service (30+ billion)
❏ 500K QPS, 10 billion transaction requests
❏ 600+ Topics
Pulsar Use Cases
❏ Billing / Payment / Trading Service
❏ Worker Queue / Push Notifications / Task Queue
❏ Unified Messaging Backbone (Queuing + Streaming)
❏ IoT
❏ Unified Data Processing with Flink
Infinite segmented streams
(pub/sub + segment)
A Pulsar view on data
Pulsar - Pub/Sub
Pulsar - Topic Partitions
Pulsar - Segments
Pulsar - Stream
Pulsar - Infinite stream storage
Pulsar - Stream as a unified view on data
Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
❏ Read data from brokers
❏ Consume / Seek / Receive
❏ Subscription Mode - Failover, Shared, Key_Shared
❏ Reprocessing data by rewinding (seeking) the cursors
❏ Segment (Batch)
❏ Read data from storage (bookkeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
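The segment-level read path with predicate pushdown on publish timestamp can be pictured with a small sketch. This is only a toy model under stated assumptions: the `Segment` class and its `minPublishTime` / `maxPublishTime` fields are hypothetical and are not Pulsar or BookKeeper types. The idea it illustrates is that a batch reader can skip whole segments whose publish-time range cannot match the query, and read the remaining segments in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of segment pruning by publish time (not a Pulsar API).
final class SegmentPruningSketch {

    // Hypothetical segment descriptor: the publish-time range of its messages.
    static final class Segment {
        final long id;
        final long minPublishTime;
        final long maxPublishTime;

        Segment(long id, long minPublishTime, long maxPublishTime) {
            this.id = id;
            this.minPublishTime = minPublishTime;
            this.maxPublishTime = maxPublishTime;
        }
    }

    // Keep only segments whose time range overlaps the closed interval [from, to].
    static List<Segment> prune(List<Segment> segments, long from, long to) {
        List<Segment> selected = new ArrayList<>();
        for (Segment s : segments) {
            boolean overlaps = s.maxPublishTime >= from && s.minPublishTime <= to;
            if (overlaps) {
                selected.add(s);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment(0, 0, 99),
            new Segment(1, 100, 199),
            new Segment(2, 200, 299));
        // Segment 0 ends before 150, so it is never read.
        for (Segment s : prune(segments, 150, 250)) {
            System.out.println("read segment " + s.id);
        }
    }
}
```

Each surviving segment could then be handed to a separate reader task, which is where the fine-grained parallelism comes from.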
Unified data processing on Pulsar
Unified Data Processing
Flink 1.6 Integration
❏ Available Connectors
❏ Streaming Source
❏ Streaming Sink
❏ Table Sink
When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html
Flink Source
But that’s not cool ...
Flink Source
Flink 1.9 Integration
❏ Pulsar Schema Integration
❏ Table API as first-class citizens
❏ Exactly-once source
❏ At-least-once sink
❏ Flink Catalog Integration
Demo - Pulsar Catalog
Pulsar Schema (1)
❏ Consensus of data at server-side
❏ Built-in schema registry
❏ Data schema on a per-topic basis
❏ Send and receive typed messages directly
❏ Validation
❏ Multi-version
❏ Schema evolution & compatibilities
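To make the per-topic, multi-version idea concrete, here is a minimal in-memory sketch. It is not Pulsar's schema registry: the class name, the string-based schema representation, and the pluggable compatibility rule are all assumptions made for illustration. It only shows the shape of the behaviour listed above: schemas are tracked per topic, each new schema gets the next version, a schema that is already known keeps its version, and an incompatible schema is rejected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Toy per-topic schema registry (hypothetical; not the Pulsar implementation).
final class TopicSchemaRegistrySketch {

    // topic -> schema definitions, where the list index is the version
    private final Map<String, List<String>> versions = new HashMap<>();
    // Compatibility rule: (latest schema, candidate schema) -> accepted?
    private final BiPredicate<String, String> compatible;

    TopicSchemaRegistrySketch(BiPredicate<String, String> compatible) {
        this.compatible = compatible;
    }

    // Returns the version of the schema on the topic, registering it if needed.
    int register(String topic, String schema) {
        List<String> known = versions.computeIfAbsent(topic, t -> new ArrayList<>());
        int existing = known.indexOf(schema);
        if (existing >= 0) {
            return existing; // already registered: reuse its version
        }
        if (!known.isEmpty() && !compatible.test(known.get(known.size() - 1), schema)) {
            throw new IllegalArgumentException("incompatible schema for topic " + topic);
        }
        known.add(schema);
        return known.size() - 1;
    }

    // Looks up a specific version, e.g. to decode an older message.
    String lookup(String topic, int version) {
        return versions.get(topic).get(version);
    }

    public static void main(String[] args) {
        // Toy rule: a new schema may only append fields to the latest one.
        TopicSchemaRegistrySketch registry =
            new TopicSchemaRegistrySketch((latest, candidate) -> candidate.startsWith(latest));
        System.out.println(registry.register("users", "userName"));        // first version
        System.out.println(registry.register("users", "userName,userId")); // evolved schema
        System.out.println(registry.register("users", "userName"));        // known schema, old version
    }
}
```

A real registry would compare structured schema definitions under a configured compatibility strategy rather than strings; the string prefix rule here is only a placeholder.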
Pulsar Schema (2)
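// Create a producer with a struct (Avro) schema and send a typed message.
// The topic name "user-topic" is a placeholder; the slide does not show one.
Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
    .topic("user-topic")
    .create();
producer.newMessage()
    .value(User.builder()
        .userName("pulsar-user")
        .userId(1L)
        .build())
    .send();
// Create a consumer with the same schema and receive a typed message.
// The subscription name "user-sub" is likewise a placeholder.
Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
    .topic("user-topic")
    .subscriptionName("user-sub")
    .subscribe();
Message<User> message = consumer.receive();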
Pulsar Schema (3) - SchemaInfo
{
    "type": "JSON",
    "schema": "{
        "type":"record",
        "name":"User",
        "namespace":"com.foo",
        "fields":[
            {
                "name":"file1",
                "type":["null","string"],
                "default":null
            },
            {
                "name":"file2",
                "type":"string",
                "default":null
            },
            {
                "name":"file3",
                "type":["null","string"],
                "default":"dfdf"
            }
        ]
    }",
    "properties": {}
}
Pulsar Schema (4) - Producer
Pulsar Schema (5) - Consumer
Pulsar Schema (6) - Compatibility Strategy
Pulsar Schema (7) - Multi versions
Pulsar-Flink (1) - Schema <-> Row
https://github.com/streamnative/pulsar-flink
❏ Topics without schema or with primitive schemas
❏ `value` field for message payload
❏ Topics with struct schemas (AVRO, JSON)
❏ Field names and types are kept in the row
❏ Metadata Fields
❏ __key: Binary
❏ __topic: String
❏ __messageId: Binary
❏ __publishTime: Timestamp
❏ __eventTime: Timestamp
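The mapping above can be sketched in a few lines. This is not the pulsar-flink connector code: `Msg` is a hypothetical stand-in for a received message, a struct payload is modelled as a plain `Map`, and the row is modelled as an ordered map of column name to value. Only the column names (`value`, `__key`, `__topic`, `__messageId`, `__publishTime`, `__eventTime`) come from the slide.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy message-to-row mapping (hypothetical; not the connector implementation).
final class MessageToRowSketch {

    // Hypothetical stand-in for a received message; not a Pulsar client type.
    static final class Msg {
        final byte[] key;
        final String topic;
        final byte[] messageId;
        final long publishTimeMillis;
        final long eventTimeMillis;
        final Object payload; // a Map for struct schemas, any other object otherwise

        Msg(byte[] key, String topic, byte[] messageId,
            long publishTimeMillis, long eventTimeMillis, Object payload) {
            this.key = key;
            this.topic = topic;
            this.messageId = messageId;
            this.publishTimeMillis = publishTimeMillis;
            this.eventTimeMillis = eventTimeMillis;
            this.payload = payload;
        }
    }

    static Map<String, Object> toRow(Msg m) {
        Map<String, Object> row = new LinkedHashMap<>();
        if (m.payload instanceof Map) {
            // Struct schema: field names and values are kept as columns.
            for (Map.Entry<?, ?> field : ((Map<?, ?>) m.payload).entrySet()) {
                row.put(String.valueOf(field.getKey()), field.getValue());
            }
        } else {
            // No schema or primitive schema: a single `value` column.
            row.put("value", m.payload);
        }
        // Metadata columns are appended in both cases.
        row.put("__key", m.key);
        row.put("__topic", m.topic);
        row.put("__messageId", m.messageId);
        row.put("__publishTime", Instant.ofEpochMilli(m.publishTimeMillis));
        row.put("__eventTime", Instant.ofEpochMilli(m.eventTimeMillis));
        return row;
    }

    public static void main(String[] args) {
        Msg m = new Msg(new byte[] {7}, "my-topic", new byte[] {1}, 1000L, 900L, "hello");
        // prints [value, __key, __topic, __messageId, __publishTime, __eventTime]
        System.out.println(toRow(m).keySet());
    }
}
```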
Pulsar-Flink (2) - Schema Examples
Primitive Schema Avro Schema
Pulsar-Flink (3) - Pulsar Source
Pulsar-Flink (4) - Streaming Tables
Pulsar-Flink (5) - Pulsar Sink
Pulsar-Flink (6) - Write to streaming tables
Pulsar-Flink (7) - Pulsar Catalog
Lambda
Batch Layer
Speed/Streaming Layer
Serving Layer
Unified Data Stack
Unified Computing
Unified Data Storage
State-Centric
Future Work
❏ New Source API
❏ FLIP-27
❏ Scale-out source parallelism
❏ Key_Shared & Sticky consumer
❏ End-to-end exactly-once
❏ Pulsar transaction in 2.5.0
❏ Pulsar / BookKeeper as a state backend
❏ Schema-aware Offload / Tiered Storage
Key_Shared Subscription
❏ Key based ordering
❏ Key can be the message key or a separate *order* key
❏ HashRing based routing
❏ Key based batcher
❏ Policies for messages without *keys*
https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
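Hash-ring routing, which is what gives Key_Shared its per-key ordering, can be sketched as follows. This is a generic consistent-hashing illustration and not the broker's implementation: the class name, the number of ring points per consumer, and the use of CRC32 (chosen only because it ships with the JDK) are assumptions. The property it demonstrates is that every message with the same key is routed to the same consumer while the set of consumers is unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Toy hash-ring router for key-based dispatch (hypothetical; not broker code).
final class KeySharedRoutingSketch {

    // Ring position -> consumer name.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int pointsPerConsumer;

    KeySharedRoutingSketch(int pointsPerConsumer) {
        this.pointsPerConsumer = pointsPerConsumer;
    }

    // Place several points on the ring for each consumer to spread the load.
    void addConsumer(String name) {
        for (int i = 0; i < pointsPerConsumer; i++) {
            ring.put(hash(name + "#" + i), name);
        }
    }

    void removeConsumer(String name) {
        ring.values().removeIf(name::equals);
    }

    // A key is served by the first ring point at or after its hash, wrapping around.
    String route(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no consumers attached");
        }
        Map.Entry<Long, String> owner = ring.ceilingEntry(hash(key));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    // Stand-in hash function; a real system would likely use a different one.
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        KeySharedRoutingSketch router = new KeySharedRoutingSketch(16);
        router.addConsumer("consumer-1");
        router.addConsumer("consumer-2");
        // Both calls print the same consumer, so ordering per key is preserved.
        System.out.println(router.route("order-42"));
        System.out.println(router.route("order-42"));
    }
}
```

With a ring, removing a consumer only reassigns the keys that consumer owned; keys owned by the remaining consumers stay where they were.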
Conclusion
❏ Apache Pulsar is a cloud-native streaming data storage system
❏ Two levels of reading API: Pub/Sub + Segment
❏ Structured Event Streams via Pulsar Schema
❏ Pulsar is the unified data storage for Flink
❏ Pulsar + Flink for streaming-first, unified data processing stack
Community
❏ Pulsar Website: https://pulsar.apache.org https://streamnative.io
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists
dev@pulsar.apache.org, users@pulsar.apache.org
❏ Github
https://github.com/apache/pulsar
❏ Medium
https://medium.com/streamnative
Pulsar in Europe
❏ First Pulsar Meetup in Paris (@OVHCloud) on Friday 10/11
❏ https://www.meetup.com/Hadoop-User-Group-France/events/264920447/
❏ If you are looking for collaborations on Pulsar events, talk to us :-)
Thanks!
