Scaling Customer Engagement with Apache Pulsar
Greg Methvin
Senior Staff Software Engineer
Agenda
1. About me
2. About Iterable and how we manage scale
3. Challenges with RabbitMQ
4. Why we chose Pulsar
5. Use case: message sends
6. Implementation details
7. Bugs and lessons learned
8. Future plans
About me
● Software engineer at Iterable
○ mainly internal frameworks and messaging architecture
● Previously at several other startups
● Maintainer for OSS libraries:
○ pulsar4s - Scala client for Pulsar
○ Play Framework - web framework
○ directory-watcher - JVM file watching utility
○ several smaller projects
● Speaking as a Pulsar user
○ also from POV of an OSS maintainer
About Iterable
● B2B tools for customer engagement campaigns
○ promote interaction with products
○ wide range of markets
● Customers use Iterable to send messages to users
○ through email, SMS, push, in-app
○ campaigns structured in different ways
How Iterable manages scale
● Handling billions of events per day
○ message sends
○ ingested events and users
○ workflow states
● Services should be as stateless as possible
● Use queues to manage state of long-running tasks
○ messages to be sent to users
○ calling webhooks
○ workflows
How Iterable uses queues
Ideal message broker requirements
● Supports ordered streams of messages
● Supports work queues
● Highly available
● Delayed work
○ initiate some action at a specified time
○ retry this message after N minutes
● Many queues
○ segmenting queues by customer and further by type
○ 10k+ for each service
○ different rate limits
Limitations of RabbitMQ
● Memory use often causes flow control issues
○ blocks publishers
● Delayed messages work, but not well
○ done using dead-letter functionality
○ had to develop workarounds
● Problems with large numbers of queues
○ often need 10k+ per node
● Not built as a distributed system by default
Limitations of RabbitMQ: clustering
● Clustering has severe limitations
○ not designed to handle network partitions
○ messages often lost
● Iterable has avoided using it
Alternatives
● SQS
○ fully managed but less control
○ supports delays, but only up to 15 min
○ our payloads sometimes exceed the 256 KB maximum
● Kafka
○ not ideal for work queue use cases
○ poor support for many topics
○ no delay support
Why Iterable chose Pulsar
● Supports our scale:
○ a million or more topics
○ easy to add storage capacity
○ easy to scale throughput
● Key features:
○ both streaming and queuing
○ negative acknowledgements
○ built-in delay functionality
Pulsar use case: message sends
● First Pulsar use case
● Core to our business - email, SMS, push, in-app
● Different patterns of sends:
○ all sent at one time - simple queue use case
○ “ideal” time for each user
○ individually triggered (possibly delayed), e.g. by an API call
● Handle failures - negative ack
● Shared subscription - can easily scale consumers
● Can add partitions for high-volume customers
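The three send patterns differ mainly in when a message should become visible to consumers. A minimal sketch of that decision in plain Scala follows; `SendPattern`, its cases, and `deliverAtMillis` are hypothetical names for illustration, not Iterable's code. The resulting timestamp is the kind of value one would pass to the Pulsar Java client's `TypedMessageBuilder.deliverAt` (or, as a relative delay, `deliverAfter`).

```scala
import java.time.{Duration, Instant}

// Hypothetical model of the three send patterns listed on the slide.
object SendScheduling {
  sealed trait SendPattern
  case object Immediate extends SendPattern                       // all sent at one time
  final case class AtIdealTime(when: Instant) extends SendPattern // per-user "ideal" time
  final case class Triggered(delay: Duration) extends SendPattern // e.g. API call, possibly delayed

  // Epoch-millisecond timestamp at which the message should be delivered.
  // Never earlier than `now`: a target in the past means "deliver immediately".
  def deliverAtMillis(pattern: SendPattern, now: Instant): Long = {
    val target = pattern match {
      case Immediate         => now
      case AtIdealTime(when) => when
      case Triggered(delay)  => now.plus(delay)
    }
    math.max(target.toEpochMilli, now.toEpochMilli)
  }
}
```

Keeping the computation separate from the producer call makes it easy to unit-test the scheduling rules without a broker.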
Implementation details: design choices
● Topic layout
○ Tenants are services, e.g. email
○ customer-specific “projects” are namespaces, e.g. email/project-123
○ Nice for permissions, limits, etc.
○ e.g. persistent://email/project-123/marketing-type-456
● Regex consumers
○ support many short-lived topics
○ decouple producers and consumers
○ unused topics get deleted automatically (as of 2.5.1)
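The topic layout and the regex subscription can be sketched in plain Scala. This is an illustration only: `TopicName`, `projectPattern`, and `belongsToProject` are hypothetical helpers, and the sketch assumes tenant and namespace names contain no regex metacharacters. Only the `persistent://tenant/namespace/topic` shape and the example identifiers come from the slide.

```scala
import scala.util.matching.Regex

// Hypothetical helpers illustrating the tenant/namespace/topic layout.
object TopicLayout {
  // tenant = service (e.g. "email"), namespace = customer project, topic = message type
  final case class TopicName(tenant: String, namespace: String, topic: String) {
    def fullName: String = s"persistent://$tenant/$namespace/$topic"
  }

  // Pattern a regex consumer could subscribe with to receive every topic in one project.
  // Assumes the names are plain identifiers (letters, digits, hyphens).
  def projectPattern(tenant: String, namespace: String): Regex =
    s"persistent://$tenant/$namespace/.*".r

  // True when the whole topic name matches the project pattern.
  def belongsToProject(name: TopicName, pattern: Regex): Boolean =
    pattern.pattern.matcher(name.fullName).matches()
}
```

Because consumers subscribe by pattern, a producer can start publishing to a new topic in the namespace without any consumer-side change, which is the decoupling the slide refers to.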
Implementation details: tooling
● Pulsar lacks official Scala API
● pulsar4s: https://github.com/sksamuel/pulsar4s
○ Akka-streams-based CommittablePartitionedSource
○ Added support for new Pulsar features
○ Has many integrations:
■ avro4s
■ effect types: cats-effect, monix, ZIO
■ json libs: circe, spray-json, play-json, jackson
Implementation example: pulsar4s + Akka streams
import java.util.UUID
import scala.concurrent.Future
import akka.Done
import akka.stream.scaladsl.Sink
import com.sksamuel.pulsar4s._
import com.sksamuel.pulsar4s.akka.streams._
import com.sksamuel.pulsar4s.avro._ // Use avro4s to generate the schema at compile time

case class EmailMessage(id: UUID, from: String, to: Seq[String], subject: String, body: String, /*...*/)

val client = PulsarClient("pulsar://localhost:6650")
// Consumer factory: subscribes to every topic in the namespace that matches the pattern
val createConsumer = () => client.consumer(ConsumerConfig(
  topicPattern = Some("persistent://email/customer-123/.*".r),
  subscriptionName = Subscription("email-service")))

def handleMessage(msg: CommittableMessage[EmailMessage]): Future[Done] = {
  // ... do something with msg ...
  msg.ack()
}

// Requires an implicit ActorSystem/Materializer in scope; config.parallelism is an application setting
val pulsarSource = committableSource[EmailMessage](createConsumer)
val control = pulsarSource.mapAsync(config.parallelism)(handleMessage).to(Sink.ignore).run()
// When we're done consuming, use control.shutdown() to shut down the stream
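The handler in the example acknowledges every message. To connect it with the "handle failures - negative ack" point from the use-case slide, here is a minimal sketch of an ack-on-success, nack-on-failure wrapper. It is written against a hypothetical `Committable` trait rather than the pulsar4s types, so it stays self-contained; the trait mirrors the `ack()` call shown above and assumes a matching `nack()`.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object AckOrNack {
  // Hypothetical stand-in for a committable message; only ack/nack are modeled.
  trait Committable[A] {
    def payload: A
    def ack(): Future[Unit]
    def nack(): Future[Unit]
  }

  // Runs `process`; acks when it succeeds, nacks when it fails so the broker can redeliver.
  // The result only fails if ack/nack itself fails, so one bad message does not stop the stream.
  def handle[A](msg: Committable[A])(process: A => Future[Unit])(
      implicit ec: ExecutionContext): Future[Unit] =
    Future.unit
      .flatMap(_ => process(msg.payload)) // also captures exceptions thrown synchronously
      .transformWith {
        case Success(_) => msg.ack()
        case Failure(_) => msg.nack()
      }
}
```

In a stream such as the one above, a wrapper like this would sit inside `mapAsync`, so a failed send turns into a redelivery instead of a stream failure.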
Testing and migration plan
● Load tests first
● Test use patterns for new features:
○ negative acknowledgements
○ delayed messages
● Implemented test publishes in parallel with RabbitMQ
○ for “test” topics, simply ack message
○ add random nacks/delays
● Gradually migrate customers
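One plausible reading of "add random nacks/delays" is a test consumer that usually acks but sometimes nacks or defers a message, to exercise those code paths under load. The sketch below shows such a decision function; `Outcome`, `decide`, and the rate parameters are hypothetical names, not taken from the talk.

```scala
import scala.util.Random

object TestTraffic {
  sealed trait Outcome
  case object Ack extends Outcome
  case object Nack extends Outcome
  final case class Delay(seconds: Int) extends Outcome

  // Picks an outcome for a message on a "test" topic.
  // nackRate and delayRate are hypothetical knobs: fractions in [0, 1] that sum to at most 1.
  // maxDelaySeconds must be positive; delays are drawn from 1 to maxDelaySeconds inclusive.
  def decide(rng: Random, nackRate: Double, delayRate: Double, maxDelaySeconds: Int): Outcome = {
    val roll = rng.nextDouble() // uniform in [0, 1)
    if (roll < nackRate) Nack
    else if (roll < nackRate + delayRate) Delay(1 + rng.nextInt(maxDelaySeconds))
    else Ack
  }
}
```

Passing the `Random` in explicitly lets a load test use a fixed seed, which makes a failing run reproducible.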
Bugs and issues
● Race condition with delayed message delivery
○ reported to Streamlio, fixed in https://github.com/apache/pulsar/pull/5499
○ caused consumers to get stuck
● Batching issues
○ negative acknowledgements broken - reported as https://github.com/apache/pulsar/issues/5969
○ batches treated as a single unit - in backlog
● Metrics - “non-delayed backlog” metric
● Docs in general need work
○ community helps make up for that
Message sends: migration results
● Able to smoothly migrate after several months
○ relatively incident-free
● More throughput with much reduced infrastructure costs
○ RabbitMQ was already overprovisioned
● Simpler design with fewer hacks
● Running for over six months without issues
Conclusions and next steps
● Expanding use cases to other parts of the system
○ ingestion
○ webhooks
○ analytics
○ workflows
● Learned a lot about Pulsar
○ New features have risks but risks are manageable
○ Docs need work
○ Pulsar has a very helpful community
● Continue involvement in Pulsar OSS community
○ especially pulsar4s - contributions welcome!
Thank you!
Questions?
gmethvin
P.S.: Iterable is hiring!
Pulsar Slack: Greg Methvin
