Iterable's platform is used by marketers to reach hundreds of millions of users every day, and those numbers are growing quickly. Iterable's infrastructure is built with pub-sub messaging at its core, so the reliability, scalability and flexibility provided by that system are business critical.
In this talk we'll discuss why Iterable chose Pulsar as its pub-sub messaging system, as well as how Iterable is taking advantage of some of the more recently added features in Pulsar. We'll also talk about some of the challenges we encountered, where we think Pulsar can improve, and some contributions we've made to the open source community around Pulsar.
2. Agenda
1. About me
2. About Iterable and how we manage scale
3. Challenges with RabbitMQ
4. Why we chose Pulsar
5. Use case: message sends
6. Implementation details
7. Bugs and lessons learned
8. Future plans
3. About me
● Software engineer at Iterable
○ mainly internal frameworks and messaging architecture
● Previously at several other startups
● Maintainer for OSS libraries:
○ pulsar4s - Scala client for Pulsar
○ Play Framework - web framework
○ directory-watcher - JVM file watching utility
○ several smaller projects
● Speaking as a Pulsar user
○ also from POV of an OSS maintainer
4. About Iterable
● B2B tools for customer engagement campaigns
○ promote interaction with products
○ wide range of markets
● Customers use Iterable to send messages to users
○ through email, SMS, push, in-app
○ campaigns structured in different ways
5. How Iterable manages scale
● Handling billions of events per day
○ message sends
○ ingested events and users
○ workflow states
● Services should be as stateless as possible
● Use queues to manage state of long-running tasks
○ messages to be sent to users
○ calling webhooks
○ workflows
7. Ideal message broker requirements
● Supports ordered streams of messages
● Supports work queues
● Highly available
● Delayed work
○ initiate some action at a specified time
○ retry this message after N minutes
● Many queues
○ segmenting queues by customer and further by type
○ 10k+ for each service
○ different rate limits
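The "delayed work" requirement above can be illustrated with a small in-memory model. This is only a sketch of the semantics, not how any broker implements delays: the names `DelayedQueue`, `publish`, `poll` and `retryAfterMinutes` are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```scala
import scala.collection.mutable

object DelayedWorkSketch {
  // A unit of work that must not be handed out before `deliverAtMillis`.
  final case class Delayed[A](payload: A, deliverAtMillis: Long)

  // In-memory model of a queue with delayed delivery (hypothetical, for illustration only).
  final class DelayedQueue[A] {
    // PriorityQueue is a max-heap, so the ordering is reversed to pop the earliest deadline first.
    private val heap = mutable.PriorityQueue.empty[Delayed[A]](
      Ordering.by[Delayed[A], Long](_.deliverAtMillis).reverse)

    // "Initiate some action at a specified time": publish with an optional delay.
    def publish(payload: A, nowMillis: Long, delayMillis: Long = 0L): Unit =
      heap.enqueue(Delayed(payload, nowMillis + delayMillis))

    // "Retry this message after N minutes" is a publish with a delay of N minutes.
    def retryAfterMinutes(payload: A, nowMillis: Long, minutes: Long): Unit =
      publish(payload, nowMillis, minutes * 60L * 1000L)

    // Returns every payload whose delivery time has passed, earliest deadline first.
    def poll(nowMillis: Long): List[A] = {
      val ready = List.newBuilder[A]
      while (heap.nonEmpty && heap.head.deliverAtMillis <= nowMillis)
        ready += heap.dequeue().payload
      ready.result()
    }
  }
}
```

A message retried after 5 minutes stays invisible to `poll` until 300,000 ms have passed, which is the behaviour a consumer would expect from broker-side delays.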
8. Limitations of RabbitMQ
● Memory use often causes flow control issues
○ blocks publishers
● Delayed messages work, but not well
○ done using dead-letter functionality
○ had to develop workarounds
● Problems with large numbers of queues
○ often need 10k+ per node
● Not built as a distributed system by default
9. Limitations of RabbitMQ: clustering
● Clustering has severe limitations
○ not designed to handle network partitions
○ messages often lost
● Iterable has avoided using it
10. Alternatives
● SQS
○ fully managed but less control
○ supports delays, but only up to 15 min
○ our payloads sometimes exceed the 256k maximum
● Kafka
○ not ideal for work queue use cases
○ poor support for many topics
○ no delay support
11. Why Iterable chose Pulsar
● Supports our scale:
○ a million or more topics
○ easy to add storage capacity
○ easy to scale throughput
● Key features:
○ both streaming and queuing
○ negative acknowledgements
○ built-in delay functionality
12. Pulsar use case: message sends
● First Pulsar use case
● Core to our business - email, SMS, push, in-app
● Different patterns of sends:
○ all sent at one time - simple queue use case
○ “ideal” time for each user
○ individually triggered (possibly delayed) - e.g. API call
● Handle failures - negative ack
● Shared subscription - can easily scale consumers
● Can add partitions for high-volume customers
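One way to picture "handle failures - negative ack" is as a mapping from the result of a send attempt to what the consumer tells the broker. The sketch below is an assumption for illustration: the result types and the rule that permanent failures are acknowledged rather than retried are not from the talk, and all names are hypothetical.

```scala
object SendOutcomeSketch {
  // Possible results of trying to send one message to a user (hypothetical classification).
  sealed trait SendResult
  case object Delivered extends SendResult
  final case class TransientFailure(reason: String) extends SendResult // e.g. provider timeout
  final case class PermanentFailure(reason: String) extends SendResult // e.g. invalid address

  // What the consumer reports back to the broker.
  sealed trait BrokerAction
  case object Ack extends BrokerAction  // done, remove from the subscription backlog
  case object Nack extends BrokerAction // negative ack, ask for redelivery later

  def actionFor(result: SendResult): BrokerAction = result match {
    case Delivered           => Ack
    case TransientFailure(_) => Nack // redelivery may succeed
    case PermanentFailure(_) => Ack  // assumption: retrying cannot help, so record it elsewhere
  }
}
```

Because the decision is made per message, it fits a shared subscription, where any consumer in the group may receive the redelivered message.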
13. Implementation details: design choices
● Topic layout
○ Tenants are services, e.g. email
○ customer-specific “projects” are namespaces, e.g. email/project-123
○ Nice for permissions, limits, etc.
○ e.g. persistent://email/project-123/marketing-type-456
● Regex consumers
○ support many short-lived topics
○ decouple producers and consumers
○ unused topics get deleted automatically (as of 2.5.1)
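The topic layout above can be sketched as a small helper that builds full topic names and a per-namespace pattern of the kind a regex consumer would subscribe to. `TopicName`, `namespacePattern` and `inNamespace` are hypothetical helpers, not part of Pulsar or pulsar4s, and the sketch assumes tenant and namespace names contain no regex metacharacters.

```scala
import scala.util.matching.Regex

object TopicLayoutSketch {
  // tenant = service, namespace = customer-specific project, topic = message type.
  final case class TopicName(tenant: String, namespace: String, topic: String) {
    def fullName: String = s"persistent://$tenant/$namespace/$topic"
  }

  // Pattern covering every topic in one namespace, in the same shape as the slide's example.
  // Assumes the tenant and namespace contain no regex metacharacters.
  def namespacePattern(tenant: String, namespace: String): Regex =
    s"persistent://$tenant/$namespace/.*".r

  // Local check of which topic names the pattern would cover.
  def inNamespace(tenant: String, namespace: String, fullTopicName: String): Boolean =
    namespacePattern(tenant, namespace).pattern.matcher(fullTopicName).matches()
}
```

Producers only need `fullName` to publish, while consumers only need the pattern, which is what decouples the two sides when topics are created and deleted frequently.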
14. Implementation details: tooling
● Pulsar lacks official Scala API
● pulsar4s: https://github.com/sksamuel/pulsar4s
○ Akka-streams-based CommittablePartitionedSource
○ Added support for new Pulsar features
○ Has many integrations:
■ avro4s
■ effect types: cats-effect, monix, ZIO
■ json libs: circe, spray-json, play-json, jackson
15. Implementation example: pulsar4s + Akka streams
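import java.util.UUID
import scala.concurrent.Future
import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl.Sink
import com.sksamuel.pulsar4s.{ConsumerConfig, PulsarClient, Subscription}
import com.sksamuel.pulsar4s.akka.streams._
import com.sksamuel.pulsar4s.avro._ // Use avro4s to generate the schema at compile time
implicit val system: ActorSystem = ActorSystem() // also provides the stream materializer (Akka 2.6+)
case class EmailMessage(id: UUID, from: String, to: Seq[String], subject: String, body: String /*, ... */)
val client = PulsarClient("pulsar://localhost:6650")
// Regex consumer: subscribes to every topic in the customer's namespace
val createConsumer = () => client.consumer[EmailMessage](ConsumerConfig(
  topicPattern = Some("persistent://email/customer-123/.*".r),
  subscriptionName = Subscription("email-service")))
def handleMessage(msg: CommittableMessage[EmailMessage]): Future[Done] = {
  // ... do something with msg ...
  msg.ack()
}
val pulsarSource = committableSource[EmailMessage](createConsumer)
// config.parallelism comes from the application's own configuration
val control = pulsarSource.mapAsync(config.parallelism)(handleMessage).to(Sink.ignore).run()
// When we're done consuming, use control.shutdown() to shut down the stream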
16. Testing and migration plan
● Load tests first
● Test use patterns for new features:
○ negative acknowledgements
○ delayed messages
● Implemented test publishes in parallel with RabbitMQ
○ for “test” topics, simply ack message
○ add random nacks/delays
● Gradually migrate customers
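The "random nacks/delays" step in the test plan might look roughly like the sketch below. It reads the slide as: test consumers randomly nack instead of acking, and test publishers randomly attach a delivery delay. That reading, the rates, and every name here are assumptions for illustration.

```scala
import scala.util.Random

object ShadowTrafficSketch {
  sealed trait ConsumerAction
  case object Ack extends ConsumerAction
  case object Nack extends ConsumerAction

  // Consumer side of a "test" topic: normally just ack, but nack a random fraction of messages.
  def consumerAction(rng: Random, nackRate: Double): ConsumerAction =
    if (rng.nextDouble() < nackRate) Nack else Ack

  // Publisher side: 0 means "deliver immediately"; otherwise a delay in 1..maxDelayMillis.
  def publishDelayMillis(rng: Random, delayRate: Double, maxDelayMillis: Int): Long = {
    require(maxDelayMillis > 0, "maxDelayMillis must be positive")
    if (rng.nextDouble() < delayRate) 1L + rng.nextInt(maxDelayMillis) else 0L
  }
}
```

Passing a seeded `Random` keeps a test run reproducible, which helps when comparing behaviour against the existing RabbitMQ path.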
17. Bugs and issues
● Race condition with delayed message delivery
○ reported to Streamlio, fixed in https://github.com/apache/pulsar/pull/5499
○ caused consumers to get stuck
● Batching issues
○ negative acknowledgements broken - reported as https://github.com/apache/pulsar/issues/5969
○ batches treated as a single unit - in backlog
● Metrics - “non-delayed backlog” metric
● Docs in general need work
○ community helps make up for that
18. Message sends: migration results
● Able to smoothly migrate after several months
○ relatively incident-free
● More throughput with much reduced infrastructure costs
○ RabbitMQ was already overprovisioned
● Simpler design with fewer hacks
● Running for over six months without issues
19. Conclusions and next steps
● Expanding use cases to other parts of the system
○ ingestion
○ webhooks
○ analytics
○ workflows
● Learned a lot about Pulsar
○ New features have risks but risks are manageable
○ Docs need work
○ Pulsar has a very helpful community
● Continue involvement in Pulsar OSS community
○ especially pulsar4s - contributions welcome!