Query Pulsar Streams using Apache Flink

Sijie Guo (@sijieg)

Who am I
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Chair
❏ StreamNative Founder
❏ Ex-Twitter, Ex-Yahoo
❏ Interested in event streaming technologies

“Flexible Pub/Sub Messaging
Backed by durable log storage”

Highlights
❏ Multi-Tenant Data System: Isolation, ACL, Policies
❏ Unified messaging model: Queuing + Streaming
❏ Infinite Segmented Stream Storage: Segment-centric, Tiered storage
❏ Structured Event Streams: Built-in schema management
❏ Cloud-Native Architecture: Simplified ops, Rebalance-free

A brief history of Apache Pulsar
❏ 2012: Pulsar idea started at Yahoo!
❏ 5+ years in production, 100+ applications, 10+ data centers
❏ 2016/09 Yahoo open sourced Pulsar
❏ 2017/06 Yahoo donated Pulsar to ASF
❏ 2018/09 Pulsar graduated as a Top-Level project
❏ 25+ committers, 168 contributors, 1000+ forks, 4200+ stars
❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, THG, OVH, …
http://pulsar.apache.org/en/powered-by/

Pulsar Use Cases
❏ Billing / Payment / Trading Service
❏ Worker Queue / Push Notifications / Task Queue
❏ Unified Messaging Backbone (Queuing + Streaming)
❏ IoT
❏ Unified Data Processing

Pulsar at Tencent
❏ Billing Service (30+ billion)
❏ 500K QPS, 10 billion transaction requests
❏ 600+ Topics

Pulsar Use Cases
❏ Billing / Payment / Trade Service
❏ IoT
❏ Unified Data Processing with Flink

Infinite segmented streams
(pub/sub + segment)
A Pulsar view on data

Pulsar - Infinite stream storage

Pulsar - Stream as a unified view on data

Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
    ❏ Read data from brokers
    ❏ Consume / Seek / Receive
    ❏ Subscription Mode - Failover, Shared, Key_Shared
    ❏ Reprocess data by rewinding (seeking) the cursors
❏ Segment (Batch)
    ❏ Read data from storage (BookKeeper or tiered storage)
    ❏ Fine-grained parallelism
    ❏ Predicate pushdown (publish timestamp)
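
The segment-level read path can be pictured with a small model. The sketch below is a toy, not Pulsar or BookKeeper code: `SegmentInfo`, `SegmentPruner`, and the per-segment minimum and maximum publish timestamps are hypothetical names and assumptions introduced only for illustration. It shows the idea behind publish-timestamp pushdown: segments whose time range cannot match the query are skipped, and each remaining segment can be handed to its own parallel reader.

```java
import java.util.ArrayList;
import java.util.List;

// Toy description of a stored segment (hypothetical, for illustration only).
final class SegmentInfo {
    final long segmentId;
    final long minPublishTime; // earliest publish timestamp in the segment (ms)
    final long maxPublishTime; // latest publish timestamp in the segment (ms)

    SegmentInfo(long segmentId, long minPublishTime, long maxPublishTime) {
        this.segmentId = segmentId;
        this.minPublishTime = minPublishTime;
        this.maxPublishTime = maxPublishTime;
    }
}

final class SegmentPruner {
    // Keep only the segments whose publish-time range overlaps [from, to].
    // Segments outside the range are never opened.
    static List<SegmentInfo> prune(List<SegmentInfo> segments, long from, long to) {
        List<SegmentInfo> selected = new ArrayList<>();
        for (SegmentInfo s : segments) {
            boolean overlaps = s.maxPublishTime >= from && s.minPublishTime <= to;
            if (overlaps) {
                selected.add(s);
            }
        }
        return selected;
    }
}
```

In this model a query over a narrow time window touches only a few segments, which is where the fine-grained parallelism of the batch path would come from.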

Unified data processing on Pulsar

Flink 1.6 Integration
❏ Available Connectors
    ❏ Streaming Source
    ❏ Streaming Sink
    ❏ Table Sink
When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html

Flink 1.9 Integration
❏ Pulsar Schema Integration
❏ Table API as first-class citizens
❏ Exactly-once source
❏ At-least-once sink
❏ Flink Catalog Integration

Pulsar Schema (1)
❏ Consensus of data at server-side
❏ Built-in schema registry
❏ Data schema on a per-topic basis
❏ Send and receive typed messages directly
❏ Validation
❏ Multi-version
❏ Schema evolution & compatibility
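
To make the compatibility bullet concrete, here is a minimal sketch of one rule from Avro-style schema evolution: a new schema can read data written with an old schema if every field it adds declares a default value. The class `BackwardCheck` and the "field name to has-default flag" representation are assumptions made for this example; a real compatibility check covers more cases, such as type changes.

```java
import java.util.Map;

final class BackwardCheck {
    // Each map goes from field name to "does this field declare a default?".
    // Returns true if a reader using newFields can decode data written with oldFields,
    // considering only the added-field rule.
    static boolean canReadOldData(Map<String, Boolean> oldFields,
                                  Map<String, Boolean> newFields) {
        for (Map.Entry<String, Boolean> field : newFields.entrySet()) {
            boolean addedField = !oldFields.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (addedField && !hasDefault) {
                // Old records have no value for this field and nothing to fall back on.
                return false;
            }
        }
        return true;
    }
}
```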

Pulsar Schema (2)
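// Create a producer with a struct (Avro) schema and send a message.
// The topic name "user-topic" is a placeholder, not from the original slide.
Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
    .topic("user-topic")
    .create();
producer.newMessage()
    .value(User.builder()
        .userName("pulsar-user")
        .userId(1L)
        .build())
    .send();
// Create a consumer with the same schema and receive a typed message.
// Consumers are created with subscribe(); the subscription name is a placeholder.
Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
    .topic("user-topic")
    .subscriptionName("user-sub")
    .subscribe();
Message<User> msg = consumer.receive();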

Pulsar Schema (3) - SchemaInfo
{
    "type": "JSON",
    "schema": "{
        "type":"record",
        "name":"User",
        "namespace":"com.foo",
        "fields":[
            {
                "name":"file1",
                "type":["null","string"],
                "default":null
            },
            {
                "name":"file2",
                "type":"string",
                "default":null
            },
            {
                "name":"file3",
                "type":["null","string"],
                "default":"dfdf"
            }
        ]
    }",
    "properties": {}
}

Pulsar Schema (6) - Compatibility Strategy

Pulsar Schema (7) - Multi versions

Pulsar-Flink (1) - Schema <-> Row
https://github.com/streamnative/pulsar-flink
❏ Topics without schema or with primitive schemas
    ❏ `value` field for message payload
❏ Topics with struct schemas (AVRO, JSON)
    ❏ Field names and types are kept in the row
❏ Metadata Fields
    ❏ __key: Binary
    ❏ __topic: String
    ❏ __messageId: Binary
    ❏ __publishTime: Timestamp
    ❏ __eventTime: Timestamp
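
The mapping above can be sketched as a plain Java model. This is not the connector's code: `RowMapper`, its method names, and the use of an ordered map as a stand-in for a Flink row are assumptions made for illustration, and `Instant` stands in for the Timestamp type. Only the `value` field and the five metadata field names come from the slide.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

final class RowMapper {
    // Topic with a struct schema: payload fields keep their names,
    // and the metadata fields are appended after them.
    static Map<String, Object> structRow(Map<String, Object> payloadFields,
                                         byte[] key, String topic, byte[] messageId,
                                         long publishTimeMs, long eventTimeMs) {
        Map<String, Object> row = new LinkedHashMap<>(payloadFields);
        appendMetadata(row, key, topic, messageId, publishTimeMs, eventTimeMs);
        return row;
    }

    // Topic without a schema or with a primitive schema:
    // the whole payload goes into a single `value` field.
    static Map<String, Object> primitiveRow(Object payload,
                                            byte[] key, String topic, byte[] messageId,
                                            long publishTimeMs, long eventTimeMs) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("value", payload);
        appendMetadata(row, key, topic, messageId, publishTimeMs, eventTimeMs);
        return row;
    }

    private static void appendMetadata(Map<String, Object> row,
                                       byte[] key, String topic, byte[] messageId,
                                       long publishTimeMs, long eventTimeMs) {
        row.put("__key", key);
        row.put("__topic", topic);
        row.put("__messageId", messageId);
        row.put("__publishTime", Instant.ofEpochMilli(publishTimeMs));
        row.put("__eventTime", Instant.ofEpochMilli(eventTimeMs));
    }
}
```

With this shape, a query can filter on a metadata column such as `__publishTime` in the same way as on a payload column.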

Pulsar-Flink (2) - Schema Examples
Primitive Schema Avro Schema

Pulsar-Flink (3) - Pulsar Source

Pulsar-Flink (4) - Streaming Tables

Pulsar-Flink (5) - Pulsar Sink

Pulsar-Flink (6) - Write to streaming tables

Pulsar-Flink (7) - Pulsar Catalog

Lambda
Batch Layer
Speed/Streaming Layer
Serving Layer

Unified Data Stack
Unified Computing
Unified Data Storage
State-Centric

Future Work
❏ New Source API
    ❏ FLIP-27
❏ Scale-out source parallelism
    ❏ Key_Shared & Sticky consumer
❏ End-to-end exactly-once
    ❏ Pulsar transaction in 2.5.0
❏ Pulsar / BookKeeper as a state backend
❏ Schema-aware Offload / Tiered Storage

Key_Shared Subscription
❏ Key based ordering
❏ Key can be the message key or a separate *ordering* key
❏ HashRing based routing
❏ Key based batcher
❏ Policies for messages without *keys*
https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
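
The hash-ring idea behind key-based routing can be illustrated with a small consistent-hashing sketch. It is a toy model and not the broker's implementation: `KeyHashRing`, the CRC32 hash, and the number of ring points per consumer are all assumptions chosen for the example. It shows the property that matters for key-based ordering: every message with the same key is routed to the same consumer, and removing one consumer does not move the keys owned by the others.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class KeyHashRing {
    // Ring position -> consumer name, sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int pointsPerConsumer;

    KeyHashRing(int pointsPerConsumer) {
        this.pointsPerConsumer = pointsPerConsumer;
    }

    // Place several points for the consumer on the ring to spread keys more evenly.
    void addConsumer(String consumerName) {
        for (int i = 0; i < pointsPerConsumer; i++) {
            ring.put(hash(consumerName + "#" + i), consumerName);
        }
    }

    // Drop every ring point that belongs to the consumer.
    void removeConsumer(String consumerName) {
        ring.values().removeIf(consumerName::equals);
    }

    // Route a key to the first consumer point at or after the key's hash,
    // wrapping around to the start of the ring if needed.
    String route(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no consumers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```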

Conclusion
❏ Apache Pulsar is a cloud-native streaming data storage system
❏ Two levels of reading API: Pub/Sub + Segment
❏ Structured Event Streams via Pulsar Schema
❏ Pulsar is the unified data storage for Flink
❏ Pulsar + Flink: a streaming-first, unified data processing stack

Community
❏ Pulsar Website: https://pulsar.apache.org https://streamnative.io
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists
dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub
https://github.com/apache/pulsar
❏ Medium
https://medium.com/streamnative

Pulsar in Europe
❏ First Pulsar Meetup in Paris (@OVHCloud) on Friday 10/11
❏ https://www.meetup.com/Hadoop-User-Group-France/events/264920447/
❏ If you are looking for collaborations on Pulsar events, talk to us :-)
