Integrating Apache Pulsar with Big Data Ecosystem
Yijie Shen
2019-08-17
Data analytics with Apache Pulsar
Why so many analytic frameworks?
Each kind has its best fit
•Interactive Engine
• Time critical
• Medium data size
• Reruns on failure
•Batch Engine
• The amount of data can be very large
• Could run on a huge cluster
• Fine-grained fault tolerance
•Streaming
• Ever-running jobs
• Time critical
• Needs scalability as well as resilience to failures
•Serverless
• Simple processing logic
• Processes data with high velocity
Don’t ask, I don’t know.
Why does Apache Pulsar fit them all?
It’s a Pulsar Meetup, dude...
Pulsar – A cloud-native architecture
• Stateless serving (brokers)
• Durable storage (Apache BookKeeper bookies)
Pulsar – Segment-based storage
•Managed ledger
• The storage layer for a single topic
•Ledger
• Single writer, append-only
• Replicated to multiple bookies
Pulsar – Infinite stream storage
•Reduce storage cost
• Offload segments to tiered storage one by one (sketched below)
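A minimal sketch of triggering that offload by hand through the admin client (brokers can also do it automatically once an offload threshold is configured); the admin endpoint and topic name are assumptions:

import org.apache.pulsar.client.admin.PulsarAdmin
import org.apache.pulsar.client.api.MessageId

val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://localhost:8080") // assumption: local admin endpoint
  .build()

// Ask the broker to offload all segments up to the given message ID
// to the configured tiered storage (e.g. S3), ledger by ledger.
admin.topics().triggerOffload("persistent://public/default/topic1", MessageId.latest)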
Pulsar Schema
•Consensus on the data format at the server side
• Built-in schema registry
• Data schema on a per-topic basis
•Send and receive typed messages directly (sketched below)
• Validation
• Multi-version
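A minimal sketch of typed produce and consume from Scala, assuming a local broker at pulsar://localhost:6650 and a hypothetical topic user-events; the producer registers the schema with the broker, which then validates payloads server-side:

import org.apache.pulsar.client.api.{PulsarClient, Schema}
import scala.beans.BeanProperty

// Hypothetical payload type; bean-style so schema inference can reflect over it.
class UserEvent {
  @BeanProperty var userId: String = _
  @BeanProperty var action: String = _
}

val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650") // assumption: local broker
  .build()

// Registers the JSON schema in the broker's built-in schema registry;
// incompatible schemas are rejected at the server.
val producer = client.newProducer(Schema.JSON(classOf[UserEvent]))
  .topic("user-events")
  .create()

val event = new UserEvent
event.userId = "u-42"
event.action = "login"
producer.send(event)

// The consumer receives typed values directly, no manual deserialization.
val consumer = client.newConsumer(Schema.JSON(classOf[UserEvent]))
  .topic("user-events")
  .subscriptionName("typed-sub")
  .subscribe()

val msg = consumer.receive()
println(s"${msg.getValue.userId}: ${msg.getValue.action}")
consumer.acknowledge(msg)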
Durable and ordered source
•Failures are inevitable for engines
•Re-schedule failed tasks
• Tasks are assigned a fixed (start, end] offset range in Spark
• Tasks recover from the checkpointed start offset in Flink
•Exactly-once
• Based on message order within a topic
• Seek & read (see the sketch below)
•Messages are kept alive by a subscription
• Move the subscription cursor on commit
(Diagram: task1 and task2 each replay a fixed offset range; a durable cursor tracks the committed position.)
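A hedged sketch of the seek-and-read replay pattern a task can use after a failure, reusing the client and topic from the sketch above; seek rewinds the subscription cursor, and a cumulative acknowledgment advances the durable cursor on commit:

import org.apache.pulsar.client.api.{MessageId, Schema}

// Hypothetical state the engine restored from its checkpoint.
val lastCommitted: MessageId = MessageId.earliest

val replayConsumer = client.newConsumer(Schema.BYTES)
  .topic("user-events")
  .subscriptionName("engine-sub")
  .subscribe()

// Rewind the cursor: everything after lastCommitted is re-read in order.
replayConsumer.seek(lastCommitted)

val m = replayConsumer.receive()
// ... process m ...
// On commit, acknowledge cumulatively so the broker moves the durable
// cursor and may reclaim everything up to this point.
replayConsumer.acknowledgeCumulative(m.getMessageId)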
Two levels of reading API
•Consumer
• Subscribe / seek / receive
• Per topic partition (see the Reader sketch below)
• Pulsar-Spark, Pulsar-Flink
•Segment
• Read directly from bookies
• For parallelism
• Presto
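A plausible sketch of per-partition reading with the Reader API (a consumer over a non-durable cursor), the style of access the Pulsar-Spark and Pulsar-Flink connectors build on; the partition name and start position are assumptions:

import org.apache.pulsar.client.api.{MessageId, Schema}

// One reader per topic partition; each engine task owns one partition.
val reader = client.newReader(Schema.BYTES)
  .topic("persistent://public/default/topic1-partition-0") // hypothetical partition
  .startMessageId(MessageId.earliest) // or the position restored from a checkpoint
  .create()

while (reader.hasMessageAvailable) {
  val message = reader.readNext()
  // ... hand the bytes to the engine's deserializer ...
}
reader.close()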
Processing typed records
•Regard Pulsar as structured storage
•Fetching the schema is the first step (see the sketch below)
• With the Pulsar Admin API
• Dynamic / multi-versioned schemas are not supported in Spark/Flink
• But you could try AUTO_CONSUME
•SerDe your messages into InternalRow / Row
• Avro schema and Avro/JSON/Protobuf messages
• Or parse the Avro record as we do in pulsar-spark[1]
•Message metadata exposed as metadata fields
• __key, __publishTime, __eventTime, __messageId, __topic
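A hedged sketch of those two steps outside any engine, reusing the admin client and the typed consumer's msg from the sketches above: fetch the topic's registered schema, then read the per-message metadata that backs the __-prefixed columns:

// Step 1: fetch the registered schema for the topic via the Admin API.
val schemaInfo = admin.schemas().getSchemaInfo("persistent://public/default/topic1")
println(schemaInfo.getSchemaDefinition)

// Step 2: every received message carries the metadata the connectors
// expose as metadata fields alongside the payload columns.
val metadata = (
  msg.getKey,         // __key
  msg.getPublishTime, // __publishTime
  msg.getEventTime,   // __eventTime
  msg.getMessageId,   // __messageId
  msg.getTopicName    // __topic
)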
Topic/Partition add/delete discovery
• Streaming jobs are long-running
• Topics & partitions may be added or removed during a job
• Periodically check topic status (see the sketch below)
• Spark: during incremental planning
• Flink: with a monitoring thread in each task
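A plausible sketch of one discovery round for a partitioned topic, reusing the admin client from above; the connector compares the current partition count against what it saw last time:

// One discovery round, e.g. run every partitionDiscoveryIntervalMillis
// by Flink's monitoring thread or during Spark's incremental planning.
var knownPartitions = 2 // hypothetical count from the previous round

val meta = admin.topics().getPartitionedTopicMetadata("persistent://public/default/topic1")
if (meta.partitions > knownPartitions) {
  // New partitions appeared: schedule readers for them.
  knownPartitions = meta.partitions
}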
Pulsar-Spark as an example
• Happens during logical planning
• getBatch(start: Option[Offset], end: Offset)
• Discover topic differences between start and end (see the diff sketch below)
• Start – the previous batch’s end
• End – getOffset()
• Connector
• Provides available offsets for all topic/partitions on each getOffset
• Creates a DataFrame/Dataset based on the existing topic/partitions
• Structured Streaming takes care of the rest
Offset {
  topicOffsets: Map[String, MessageId]
}
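A hypothetical sketch of that diff step (the real implementation lives in pulsar-spark; the names here are illustrative):

import org.apache.pulsar.client.api.MessageId

// Illustrative stand-in for the connector's Offset shown above.
case class PulsarOffset(topicOffsets: Map[String, MessageId])

// Diff the topic sets between a micro-batch's start and end offsets.
def diffTopics(start: Option[PulsarOffset], end: PulsarOffset): (Set[String], Set[String]) = {
  val before = start.map(_.topicOffsets.keySet).getOrElse(Set.empty[String])
  val after  = end.topicOffsets.keySet
  (after -- before, before -- after) // (added, deleted)
}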
Various APIs use Pulsar as a source

Spark:
val df = spark
  .read
  .format("pulsar")
  .option("service.url", "pulsar://...")
  .option("admin.url", "http://...")
  .option("topic", "topic1")
  .load()

Flink:
val sourceProps = new Properties()
sourceProps.setProperty("service.url", serviceUrl)
sourceProps.setProperty("admin.url", adminUrl)
sourceProps.setProperty("partitionDiscoveryIntervalMillis", "5000")
sourceProps.setProperty("startingOffsets", "earliest")
env.addSource(new FlinkPulsarSource(sourceProps))

Presto:
show tables in pulsar."public/default";
select * from pulsar."public/default".generator_test;
Pulsar-Spark and Pulsar-Flink
•Pulsar-Spark, based on Spark 2.4, is now open source
• https://github.com/streamnative/pulsar-spark
•Pulsar-Flink, based on Flink 1.9, will be open-sourced soon
•Roadmap for the two projects
• End-to-end exactly-once with Pulsar transaction support
• Fine-grained batch parallelism at the segment level
• For both Pulsar-Spark and Pulsar-Flink
