In this presentation, Yijie Shen shows how to build a unified platform for real-time storage and processing with Apache Pulsar and Apache Spark. He demonstrates the solution using Apache Pulsar as the stream storage layer and Apache Spark as the processing engine, and dives into the implementation details of the integration between the two.
A Unified Platform for Real-time Storage and Processing
1. A Unified Platform for Real-time Storage and Processing
Apache Pulsar as Stream Storage
Apache Spark for Processing as an Example
Yijie Shen
yjshen
20190629
3. Motivation
• Ubiquity of real-time data
• Sensors, logs from mobile apps, IoT devices
• Organizations have gotten better at capturing data
• Data matters
• Batch and interactive analysis, stream processing, machine learning, graph processing
• The evolution of analytics platforms
• Unified / similar APIs for batch/declarative and stream processing
• E.g. Spark, Flink
4. Challenges
• Compatibility with cloud infrastructure
• Multi-tenant management
• Scalability
• Data movement during its lifecycle
• Visibility of data
• Operational cost and problems
• Multiple systems to maintain
• Resource allocation and provisioning
(Diagram: data moving from a Message Queue to Cold Storage during its lifecycle)
5. Pulsar – A cloud-native architecture
• Stateless serving layer (the brokers)
• Durable storage layer (Apache BookKeeper bookies)
6. Pulsar – Segment-based Storage
• Managed ledger
• The storage layer for a single topic
• Ledger
• Single writer, append-only
• Replicated to multiple bookies
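To make the layout concrete, here is a minimal, purely illustrative Scala model of a managed ledger (the class and field names are assumptions, not Pulsar's actual internals): a topic's storage is an ordered sequence of ledgers, only the last of which accepts appends.

// Illustrative model only; Pulsar's real managed ledger lives in the broker and BookKeeper.
final case class LedgerInfo(ledgerId: Long, entryCount: Long, isSealed: Boolean)

final case class ManagedLedgerModel(topic: String, ledgers: Vector[LedgerInfo]) {
  // Sealed ledgers are immutable segments, each replicated to multiple bookies;
  // the single open ledger at the tail takes all new appends (single writer).
  def openLedger: Option[LedgerInfo] = ledgers.lastOption.filterNot(_.isSealed)

  // Total number of entries stored across all segments of this topic.
  def totalEntries: Long = ledgers.map(_.entryCount).sum
}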
8. Pulsar Schema
• Consensus on the data format at the server side
• Built-in schema registry
• Data schema on a per-topic basis
• Send and receive typed messages directly
• Validation
• Multi-version
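A minimal sketch of what typed produce and consume against the schema registry might look like with the Pulsar Java client from Scala; the service URL, topic, and User type are assumptions for illustration.

import org.apache.pulsar.client.api.{PulsarClient, Schema}
import scala.beans.BeanProperty

// Example payload type; its Avro schema is registered with the topic on first use.
class User {
  @BeanProperty var name: String = _
  @BeanProperty var age: Int = _
}

object TypedMessagesExample {
  def main(args: Array[String]): Unit = {
    val client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build()

    // The producer sends typed messages; the broker validates them against the topic schema.
    val producer = client.newProducer(Schema.AVRO(classOf[User]))
      .topic("persistent://public/default/users")
      .create()
    val user = new User
    user.setName("alice")
    user.setAge(30)
    producer.send(user)

    // The consumer receives already-deserialized User instances.
    val consumer = client.newConsumer(Schema.AVRO(classOf[User]))
      .topic("persistent://public/default/users")
      .subscriptionName("typed-sub")
      .subscribe()
    val msg = consumer.receive()
    println(s"${msg.getValue.getName} is ${msg.getValue.getAge}")

    producer.close(); consumer.close(); client.close()
  }
}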
11. Two levels of Reading API
• Consumer
• Subscribe / seek / receive
• Per topic partition
• Segment
• Read directly from Bookies
• For parallelism
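The consumer-level path corresponds to subscribe / seek / receive on a single topic partition, as in the sketch below (service URL, topic, and subscription name are illustrative); the segment level instead reads closed ledgers straight from the bookies, which is what gives the connector its read parallelism.

import org.apache.pulsar.client.api.{MessageId, PulsarClient, Schema, SubscriptionType}

object PartitionConsumerExample {
  def main(args: Array[String]): Unit = {
    val client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build()

    // Consumer-level reading: one subscription bound to one topic partition.
    val consumer = client.newConsumer(Schema.BYTES)
      .topic("persistent://public/default/events-partition-0")
      .subscriptionName("spark-reader")
      .subscriptionType(SubscriptionType.Exclusive)
      .subscribe()

    // Seek makes the stream replayable from an arbitrary position (here, the earliest message).
    consumer.seek(MessageId.earliest)

    val msg = consumer.receive()
    println(s"read ${msg.getData.length} bytes at ${msg.getMessageId}")

    consumer.close()
    client.close()
  }
}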
12. Spark Structured Streaming Overview
• Input and Output
• Input sources must be replayable
• Sinks must support idempotent writes for exactly-once semantics
• Streaming-specific APIs
• Triggers
• How often the engine will attempt to compute a new result and update the output sink
• Event time as watermark
• A policy to determine when enough data has been received
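These pieces map directly onto the Structured Streaming API; the sketch below uses the built-in rate source and console sink purely for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

object StructuredStreamingOverviewExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ss-overview").master("local[*]").getOrCreate()
    import spark.implicits._

    // The rate source is replayable and emits (timestamp, value) rows.
    val input = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Event-time watermark: the policy for deciding when enough data has arrived
    // to finalize a window result.
    val counts = input
      .withWatermark("timestamp", "30 seconds")
      .groupBy(window($"timestamp", "10 seconds"))
      .count()

    // Trigger: how often the engine attempts to compute a new result and update the sink.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}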
13. SS Source and Sink API
trait Source {
  def schema: StructType
  def getOffset: Option[Offset]
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  def commit(end: Offset): Unit
  def stop(): Unit
}

trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit
}
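Conceptually, the micro-batch engine drives these interfaces in a loop: getOffset to learn how far it can read, getBatch for the data of the current batch, addBatch to write it out, then commit. A simplified sketch of one trigger (not Spark's actual scheduler code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{Offset, Sink, Source}

object MicroBatchLoopSketch {
  // One simplified trigger; the real engine also handles checkpointing, retries,
  // and triggers that fire with no new data.
  def runOneBatch(source: Source, sink: Sink, lastEnd: Option[Offset], batchId: Long): Option[Offset] =
    source.getOffset match {
      case Some(newEnd) if !lastEnd.contains(newEnd) =>
        val batch: DataFrame = source.getBatch(lastEnd, newEnd) // data in (lastEnd, newEnd]
        sink.addBatch(batchId, batch)                           // must be idempotent per batchId
        source.commit(newEnd)                                   // source may now discard data up to newEnd
        Some(newEnd)
      case _ =>
        lastEnd // nothing new to process in this trigger
    }
}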
15. Topic/Partition add/delete discovery
• Happens during logical planning
• getBatch(start: Option[Offset], end: Offset)
• Discovers topic differences between start and end
• Start – the previous batch's end
• End – the result of getOffset()
• Connector
• Provides the available offsets for all topic/partitions on each getOffset call
• Creates a DataFrame/DataSet based on the existing topic/partitions
• SS takes care of the rest
Offset {
  topicOffsets: Map[String, MessageId]
}
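A hedged sketch of what the offset type and the start/end topic diff could look like in Scala; the names below are illustrative, not necessarily the connector's actual classes.

import org.apache.pulsar.client.api.MessageId

// Illustrative offset: one MessageId per topic/partition, as on the slide above.
final case class PulsarSourceOffset(topicOffsets: Map[String, MessageId])

object TopicDiscoverySketch {
  // Compare the topics known at the previous batch end with those reported by the
  // latest getOffset: newly added topics are read from their earliest position,
  // deleted topics are dropped from the plan, and SS handles the rest.
  def diff(start: Option[PulsarSourceOffset], end: PulsarSourceOffset): (Set[String], Set[String]) = {
    val startTopics = start.map(_.topicOffsets.keySet).getOrElse(Set.empty[String])
    val endTopics = end.topicOffsets.keySet
    (endTopics -- startTopics, startTopics -- endTopics) // (added, removed)
  }
}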
16. A Little More On Schema
• Regard Pulsar as structured data storage
• The schema is fetched only once, at the very beginning of query planning
• All topics for a DataFrame/DataSet must share the same schema
• Fetched using the Pulsar Admin API
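Fetching the topic schema through the Pulsar Admin API might look like the sketch below (the admin URL and topic name are assumptions); the connector then converts the returned SchemaInfo into a Spark StructType once, at planning time.

import org.apache.pulsar.client.admin.PulsarAdmin

object FetchTopicSchemaExample {
  def main(args: Array[String]): Unit = {
    val admin = PulsarAdmin.builder()
      .serviceHttpUrl("http://localhost:8080")
      .build()

    // One lookup per query plan: every topic behind the DataFrame/DataSet must share this schema.
    val info = admin.schemas().getSchemaInfo("persistent://public/default/users")
    println(s"type=${info.getType}, definition=${new String(info.getSchema)}")

    admin.close()
  }
}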