Integrating Apache Pulsar with Big Data Ecosystem
Yijie Shen
2019-08-17
Data analytics with Apache Pulsar
Why so many analytic frameworks?
Each kind has its best fit
•Interactive Engine
• Time critical
• Medium data size
• Reruns on failure
•Batch Engine
• The amount of data can be very large
• Could run on a huge cluster
• Fine-grained fault tolerance
•Streaming
• Ever-running jobs
• Time critical
• Needs scalability as well as resilience to failures
•Serverless
• Simple processing logic
• Processes data with high velocity
Don’t ask, I don’t know.
Why does Apache Pulsar fit them all?
It’s a Pulsar Meetup, dude...
Pulsar – A cloud-native architecture
• Stateless serving (brokers)
• Durable storage (Apache BookKeeper bookies)
Pulsar – Segment-based storage
•Managed ledger
• The storage layer for a single topic
•Ledger
• Single writer, append-only
• Replicated to multiple bookies
Pulsar – Infinite stream storage
•Reduce storage cost
• Offload segments to tiered storage one by one (sketched below)
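A minimal sketch of triggering that offload by hand through the admin client (brokers can also do it automatically once an offload threshold is configured); the admin endpoint and topic name are assumptions:

import org.apache.pulsar.client.admin.PulsarAdmin
import org.apache.pulsar.client.api.MessageId

val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://localhost:8080") // assumption: local admin endpoint
  .build()

// Ask the broker to offload all segments up to the given message ID
// to the configured tiered storage (e.g. S3), ledger by ledger.
admin.topics().triggerOffload("persistent://public/default/topic1", MessageId.latest)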
Pulsar Schema
•Consensus on the data format at the server side
• Built-in schema registry
• Data schema on a per-topic basis
•Send and receive typed messages directly (sketched below)
• Validation
• Multi-version
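A minimal sketch of typed produce and consume from Scala, assuming a local broker at pulsar://localhost:6650 and a hypothetical topic user-events; the producer registers the schema with the broker, which then validates payloads server-side:

import org.apache.pulsar.client.api.{PulsarClient, Schema}
import scala.beans.BeanProperty

// Hypothetical payload type; bean-style so schema inference can reflect over it.
class UserEvent {
  @BeanProperty var userId: String = _
  @BeanProperty var action: String = _
}

val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650") // assumption: local broker
  .build()

// Registers the JSON schema in the broker's built-in schema registry;
// incompatible schemas are rejected at the server.
val producer = client.newProducer(Schema.JSON(classOf[UserEvent]))
  .topic("user-events")
  .create()

val event = new UserEvent
event.userId = "u-42"
event.action = "login"
producer.send(event)

// The consumer receives typed values directly, no manual deserialization.
val consumer = client.newConsumer(Schema.JSON(classOf[UserEvent]))
  .topic("user-events")
  .subscriptionName("typed-sub")
  .subscribe()

val msg = consumer.receive()
println(s"${msg.getValue.userId}: ${msg.getValue.action}")
consumer.acknowledge(msg)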
Durable and ordered source
•Failures are inevitable for engines
•Re-schedule failed tasks
• Tasks are assigned a fixed (start, end] offset range in Spark
• Tasks recover from the checkpointed start offset in Flink
•Exactly-once
• Based on message order within a topic
• Seek & read (see the sketch below)
•Messages are kept alive by a subscription
• Move the subscription cursor on commit
(Diagram: task1 and task2 each replay a fixed offset range; a durable cursor tracks the committed position.)
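A hedged sketch of the seek-and-read replay pattern a task can use after a failure, reusing the client and topic from the sketch above; seek rewinds the subscription cursor, and a cumulative acknowledgment advances the durable cursor on commit:

import org.apache.pulsar.client.api.{MessageId, Schema}

// Hypothetical state the engine restored from its checkpoint.
val lastCommitted: MessageId = MessageId.earliest

val replayConsumer = client.newConsumer(Schema.BYTES)
  .topic("user-events")
  .subscriptionName("engine-sub")
  .subscribe()

// Rewind the cursor: everything after lastCommitted is re-read in order.
replayConsumer.seek(lastCommitted)

val m = replayConsumer.receive()
// ... process m ...
// On commit, acknowledge cumulatively so the broker moves the durable
// cursor and may reclaim everything up to this point.
replayConsumer.acknowledgeCumulative(m.getMessageId)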
Two levels of reading API
•Consumer
• Subscribe / seek / receive
• Per topic partition (see the Reader sketch below)
• Pulsar-Spark, Pulsar-Flink
•Segment
• Read directly from bookies
• For parallelism
• Presto
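A plausible sketch of per-partition reading with the Reader API (a consumer over a non-durable cursor), the style of access the Pulsar-Spark and Pulsar-Flink connectors build on; the partition name and start position are assumptions:

import org.apache.pulsar.client.api.{MessageId, Schema}

// One reader per topic partition; each engine task owns one partition.
val reader = client.newReader(Schema.BYTES)
  .topic("persistent://public/default/topic1-partition-0") // hypothetical partition
  .startMessageId(MessageId.earliest) // or the position restored from a checkpoint
  .create()

while (reader.hasMessageAvailable) {
  val message = reader.readNext()
  // ... hand the bytes to the engine's deserializer ...
}
reader.close()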
Processing typed records
•Regard Pulsar as structured storage
•Fetching the schema is the first step (see the sketch below)
• With the Pulsar Admin API
• Dynamic / multi-versioned schemas are not supported in Spark/Flink
• But you could try AUTO_CONSUME
•SerDe your messages into InternalRow / Row
• Avro schema and Avro/JSON/Protobuf messages
• Or parse the Avro record as we do in pulsar-spark[1]
•Message metadata exposed as metadata fields
• __key, __publishTime, __eventTime, __messageId, __topic
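A hedged sketch of those two steps outside any engine, reusing the admin client and the typed consumer's msg from the sketches above: fetch the topic's registered schema, then read the per-message metadata that backs the __-prefixed columns:

// Step 1: fetch the registered schema for the topic via the Admin API.
val schemaInfo = admin.schemas().getSchemaInfo("persistent://public/default/topic1")
println(schemaInfo.getSchemaDefinition)

// Step 2: every received message carries the metadata the connectors
// expose as metadata fields alongside the payload columns.
val metadata = (
  msg.getKey,         // __key
  msg.getPublishTime, // __publishTime
  msg.getEventTime,   // __eventTime
  msg.getMessageId,   // __messageId
  msg.getTopicName    // __topic
)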
Topic/Partition add/delete discovery
• Streaming jobs are long-running
• Topics & partitions may be added or removed during a job
• Periodically check topic status (see the sketch below)
• Spark: during incremental planning
• Flink: with a monitoring thread in each task
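A plausible sketch of one discovery round for a partitioned topic, reusing the admin client from above; the connector compares the current partition count against what it saw last time:

// One discovery round, e.g. run every partitionDiscoveryIntervalMillis
// by Flink's monitoring thread or during Spark's incremental planning.
var knownPartitions = 2 // hypothetical count from the previous round

val meta = admin.topics().getPartitionedTopicMetadata("persistent://public/default/topic1")
if (meta.partitions > knownPartitions) {
  // New partitions appeared: schedule readers for them.
  knownPartitions = meta.partitions
}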
Pulsar-Spark as an example
• Happens during logical planning
• getBatch(start: Option[Offset], end: Offset)
• Discover topic differences between start and end (see the diff sketch below)
• Start – the previous batch’s end
• End – getOffset()
• Connector
• Provides available offsets for all topic/partitions on each getOffset
• Creates a DataFrame/Dataset based on the existing topic/partitions
• Structured Streaming takes care of the rest
Offset {
  topicOffsets: Map[String, MessageId]
}
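A hypothetical sketch of that diff step (the real implementation lives in pulsar-spark; the names here are illustrative):

import org.apache.pulsar.client.api.MessageId

// Illustrative stand-in for the connector's Offset shown above.
case class PulsarOffset(topicOffsets: Map[String, MessageId])

// Diff the topic sets between a micro-batch's start and end offsets.
def diffTopics(start: Option[PulsarOffset], end: PulsarOffset): (Set[String], Set[String]) = {
  val before = start.map(_.topicOffsets.keySet).getOrElse(Set.empty[String])
  val after  = end.topicOffsets.keySet
  (after -- before, before -- after) // (added, deleted)
}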
Various APIs use Pulsar as a source

Spark:
val df = spark
  .read
  .format("pulsar")
  .option("service.url", "pulsar://...")
  .option("admin.url", "http://...")
  .option("topic", "topic1")
  .load()

Flink:
val sourceProps = new Properties()
sourceProps.setProperty("service.url", serviceUrl)
sourceProps.setProperty("admin.url", adminUrl)
sourceProps.setProperty("partitionDiscoveryIntervalMillis", "5000")
sourceProps.setProperty("startingOffsets", "earliest")
env.addSource(new FlinkPulsarSource(sourceProps))

Presto:
show tables in pulsar."public/default";
select * from pulsar."public/default".generator_test;
Pulsar-Spark and Pulsar-Flink
•Pulsar-Spark, based on Spark 2.4, is now open source
• https://github.com/streamnative/pulsar-spark
•Pulsar-Flink, based on Flink 1.9, will be open-sourced soon
•Roadmap for the two projects
• End-to-end exactly-once with Pulsar transaction support
• Fine-grained batch parallelism at the segment level
• For both Pulsar-Spark and Pulsar-Flink
