At the Apache Pulsar Beijing Meetup, Yijie Shen gave a presentation on the current state of Apache Pulsar's integration with the big data ecosystem. He explained why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink, and Presto to form a unified data processing system.
4. Why so many analytic frameworks?
Each kind has its best fit:
• Interactive engine
  • Time critical
  • Medium data size
  • Rerun on failure
• Batch engine
  • The amount of data can be very large
  • Could run on a huge cluster
  • Fine-grained fault tolerance
• Streaming
  • Ever-running jobs
  • Time critical
  • Needs scalability as well as resilience to failures
• Serverless
  • Simple processing logic
  • Processes data with high velocity
Don’t ask, I don’t know.
6. Pulsar – A cloud-native architecture
• Stateless serving layer (brokers)
• Durable storage layer (Apache BookKeeper bookies)
7. Pulsar – Segment-based storage
• Managed ledger
  • The storage layer for a single topic
• Ledger
  • Single writer, append-only
  • Replicated to multiple bookies (see the sketch below)
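To make the replication model concrete, here is a minimal sketch of writing to a ledger with the BookKeeper client directly; the metadata service address and the ensemble/quorum sizes are illustrative assumptions, not values from the talk.

import org.apache.bookkeeper.client.BookKeeper
import org.apache.bookkeeper.client.BookKeeper.DigestType

// Connect to the BookKeeper cluster (the ZooKeeper address is an assumption).
val bk = new BookKeeper("localhost:2181")

// A ledger has a single writer and is replicated across multiple bookies:
// ensemble = 3 bookies, write quorum = 3, ack quorum = 2 (illustrative values).
val ledger = bk.createLedger(3, 3, 2, DigestType.CRC32, "passwd".getBytes)

// Append-only writes; entries receive monotonically increasing entry ids.
val entryId = ledger.addEntry("hello segment".getBytes)

ledger.close()
bk.close()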
9. Pulsar Schema
• Consensus on the data format at the server side
  • Built-in schema registry
  • Data schema on a per-topic basis
• Send and receive typed messages directly (example below)
  • Validation
  • Multi-version support
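A minimal sketch of typed messaging against the built-in schema registry; the User class, topic name, and service URL are illustrative assumptions (Schema.AVRO expects a POJO-style class it can reflect on).

import org.apache.pulsar.client.api.{PulsarClient, Schema}

// POJO-style class so the Avro schema can be inferred (illustrative).
class User { var name: String = _; var age: Int = _ }

val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650") // assumption
  .build()

// The producer registers/validates the Avro schema for this topic
// against the broker's built-in schema registry.
val producer = client.newProducer(Schema.AVRO(classOf[User]))
  .topic("users")
  .create()

val u = new User; u.name = "alice"; u.age = 30
producer.send(u) // typed send, no manual serialization

val consumer = client.newConsumer(Schema.AVRO(classOf[User]))
  .topic("users")
  .subscriptionName("user-sub")
  .subscribe()

val typed: User = consumer.receive().getValue // typed receive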
10. Durable and ordered source
• Failures are inevitable for engines
• Re-schedule failed tasks
  • Tasks are assigned a fixed (start, end] offset range in Spark
  • Tasks recover from a checkpointed start offset in Flink
• Exactly-once
  • Based on message order within a topic
  • Seek & read (see the sketch below)
• Messages are kept alive by a subscription
  • The subscription cursor moves forward on commit
(Diagram: task1 and task2 read fixed ranges of a topic; a durable cursor marks committed progress.)
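A minimal sketch of the seek-and-read pattern (the topic, subscription name, and the position being replayed are assumptions):

import org.apache.pulsar.client.api.{MessageId, PulsarClient, Schema}

val client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build()

val consumer = client.newConsumer(Schema.BYTES)
  .topic("events")                  // assumption
  .subscriptionName("replay-sub")   // the durable subscription keeps messages alive
  .subscribe()

// On task restart, rewind to the last committed position and re-read:
// messages are still available because the durable cursor has not passed them.
consumer.seek(MessageId.earliest)   // or a concrete MessageId from a checkpoint

val msg = consumer.receive()
// ... process the message ...
consumer.acknowledgeCumulative(msg) // "commit": moves the subscription cursor forward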
11. Two levels of reading API
• Consumer
  • Subscribe / seek / receive
  • Per topic partition (see the reader sketch below)
  • Used by Pulsar-Spark, Pulsar-Flink
• Segment
  • Read directly from bookies
  • For parallelism
  • Used by Presto
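As a sketch of partition-level reading, the client's Reader interface reads one topic partition from a known position; the partition name and start position here are assumptions, not the connectors' actual internals.

import org.apache.pulsar.client.api.{MessageId, PulsarClient, Schema}

val client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build()

// Partitions of a partitioned topic follow the "<topic>-partition-<n>" naming convention.
val reader = client.newReader(Schema.BYTES)
  .topic("events-partition-0")        // assumption
  .startMessageId(MessageId.earliest) // e.g., a task's assigned start offset
  .create()

while (reader.hasMessageAvailable) {
  val msg = reader.readNext()
  // ... hand the record to the engine's task ...
}
reader.close()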
12. Processing typed records
• Regard Pulsar as structured storage
• Fetch the schema as the first step
  • With the Pulsar Admin API
  • Dynamic / multi-versioned schemas are not supported in Spark/Flink
  • But you could try AUTO_CONSUME
• SerDe your messages into InternalRow / Row
  • Avro schema with Avro/JSON/Protobuf messages
  • Or parse the Avro record as we do in pulsar-spark[1]
• Message metadata exposed as metadata fields (example below)
  • __key, __publishTime, __eventTime, __messageId, __topic
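For example, once a topic is loaded as a DataFrame, the metadata columns listed above sit alongside the schema-derived payload columns. A sketch, assuming an active SparkSession named spark; the URLs, topic, and the payload column name `value` are assumptions:

val df = spark.read
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650") // assumption
  .option("admin.url", "http://localhost:8080")     // assumption
  .option("topic", "generator_test")                // assumption
  .load()

// The metadata fields from the slide, plus an assumed payload column.
df.select("__key", "__publishTime", "__eventTime", "__messageId", "__topic", "value")
  .show()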
13. Topic/partition add/delete discovery
• Streaming jobs are long-running
• Topics & partitions may be added or removed during a job
• Periodically check topic status
  • Spark: during incremental planning
  • Flink: with a monitoring thread in each task

Pulsar-Spark as an example:
• Discovery happens during logical planning
  • getBatch(start: Option[Offset], end: Offset)
  • Discover topic differences between start and end (see the sketch after the Offset definition below)
    • Start – the last batch's end
    • End – the result of getOffset()
• Connector
  • Provides available offsets for all topics/partitions on each getOffset call
  • Creates the DataFrame/Dataset from the currently existing topics/partitions
• Structured Streaming takes care of the rest
Offset {
  topicOffsets: Map[String, MessageId]
}
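A hypothetical sketch of the discovery step described above: diff the topic/partition sets seen at start (the last batch's end) against end (the latest getOffset result). All names here are illustrative, not the connector's actual internals.

import org.apache.pulsar.client.api.MessageId

// Illustrative stand-in for the connector's Offset shown above.
case class TopicOffsets(topicOffsets: Map[String, MessageId])

// Partitions present only in `end` were added since the last batch;
// partitions present only in `start` were removed.
def diffTopics(start: Option[TopicOffsets], end: TopicOffsets): (Set[String], Set[String]) = {
  val old = start.map(_.topicOffsets.keySet).getOrElse(Set.empty)
  val now = end.topicOffsets.keySet
  (now diff old /* added */, old diff now /* removed */)
}

// Example: one partition added, one removed between two batches.
val before = TopicOffsets(Map("t-partition-0" -> MessageId.earliest,
                              "t-partition-1" -> MessageId.earliest))
val after  = TopicOffsets(Map("t-partition-0" -> MessageId.latest,
                              "t-partition-2" -> MessageId.earliest))
val (added, removed) = diffTopics(Some(before), after)
// added == Set("t-partition-2"), removed == Set("t-partition-1")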
14. Various APIs use Pulsar as a source

Spark:
val df = spark
  .read
  .format("pulsar")
  .option("service.url", "pulsar://...")
  .option("admin.url", "http://...")
  .option("topic", "topic1")
  .load()

Flink:
val sourceProps = new Properties()
sourceProps.setProperty("service.url", serviceUrl)
sourceProps.setProperty("admin.url", adminUrl)
sourceProps.setProperty("partitionDiscoveryIntervalMillis", "5000")
sourceProps.setProperty("startingOffsets", "earliest")
env.addSource(new FlinkPulsarSource(sourceProps))

Presto:
show tables in pulsar."public/default";
select * from pulsar."public/default".generator_test;
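The Spark snippet above is a batch read; a streaming read uses the same options through readStream. A sketch, assuming the connector's batch options carry over to the streaming source:

val stream = spark
  .readStream
  .format("pulsar")
  .option("service.url", "pulsar://...")
  .option("admin.url", "http://...")
  .option("topic", "topic1")
  .option("startingOffsets", "earliest") // same option the Flink source uses above
  .load()

stream.writeStream
  .format("console")
  .start()
  .awaitTermination()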
15. Pulsar-Spark and Pulsar-Flink
• Pulsar-Spark, based on Spark 2.4, is now open source
  • https://github.com/streamnative/pulsar-spark
• Pulsar-Flink, based on Flink 1.9, will be open-sourced soon
• Roadmap for the two projects
  • End-to-end exactly-once with Pulsar transaction support
  • Fine-grained batch parallelism at the segment level
  • For both Pulsar-Spark and Pulsar-Flink