June 18, 2020
Five Years of Operating a
Large Scale Globally
Replicated Pulsar
Installation
Ludwig Pummer
ludwig@verizonmedia.com
Joe Francis
joef@verizonmedia.com
People
Ludwig Pummer ludwig@verizonmedia.com
Principal Production Engineer
Verizon Media
Joe Francis joef@verizonmedia.com
Director, Core Platforms
Verizon Media
2
Agenda
3
1. Focus & Use Cases
2. Scaling up
3. Provisioning and Capacity
4. Hardware Evolution
5. JVM GC Experiences
6. Metrics and Monitoring
7. Deployment
8. Broker Isolation Policies
9. BookKeeper Storage Utilization
10. BookKeeper Rack Awareness
Our Focus
● Operate a hosted pub-sub service within VMG
○ open-sourced as Pulsar
● Global presence
○ 6 DC (Asia, Europe, US)
○ full mesh replication
● Business critical use cases
○ Serving use cases
○ Low-latency bus for other low-latency services
○ Write availability
4
Use Cases
● Application integration
○ Server-to-server control, status, notification messages
● Persistent queue
○ Buffering, feed ingestion, task distribution
● Message bus for large scale data stores
○ Durable log
○ Replication within and across geo-locations
5
Trajectory
2015
1 tenant
2 clusters, 2 DC
60K wps @ 2KB
60K rps
<100 topics
6
2016
20 tenants
12 clusters, 6 DC
500K wps avg
1.1M rps avg
1.4M topics
2020
100+ tenants
18 clusters, 6 DC
1.3M wps avg / 3M peak
2M rps avg / 6M peak
2.8M topics
More deliveries (increase fanout) → Add Brokers
More publishes → Add Bookies, Brokers
More storage → Add Bookies
Massively more topics → Add Clusters (SuperCluster)
See PIP 8 for SuperCluster (peer clusters)
7
Scaling up a Cluster
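The scale-out rules above can be written down as a small lookup table. This is my own sketch, not code from the talk; the axis names are illustrative.

```python
# Scale-out rules from the slide, encoded as a lookup table.
# Axis names ("deliveries", "publishes", ...) are illustrative.
SCALE_OUT = {
    "deliveries": ["brokers"],             # more fanout -> more brokers
    "publishes":  ["bookies", "brokers"],  # more writes hit both tiers
    "storage":    ["bookies"],
    "topics":     ["clusters"],            # SuperCluster (PIP 8)
}

def components_to_add(growth_axes):
    """Ordered, de-duplicated list of components to grow."""
    out = []
    for axis in growth_axes:
        for component in SCALE_OUT[axis]:
            if component not in out:
                out.append(component)
    return out
```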
8
Storage & I/O auto-balancing
[Charts: wps by namespace and wps by bookkeeper, showing automatic distribution as a bookie goes "Out for Reimage" and comes back "In after Reimage"]
Provisioning Model
9
New tenant provides:
● Average message size
● Peak publishes per second
● Steady-state deliveries per second (fan out)
● Per cluster/DC
● Tenant identity: X.509 principal via Athenz
○ https://www.athenz.io/
○ Open source platform for X.509 cert based service authentication and authorization
Calculate:
● Broker messages/sec
● Broker bandwidth
● Bookie MB/sec
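One way to turn the tenant's inputs into the three numbers on the right. This is back-of-envelope arithmetic of my own; the talk does not give its exact formulas.

```python
def provision(avg_msg_bytes, peak_pub_per_sec, deliveries_per_sec):
    """Rough per-cluster sizing from a tenant's inputs.

    Illustrative arithmetic only: brokers handle both publishes and
    deliveries, while bookies only persist the published bytes.
    """
    broker_msgs_per_sec = peak_pub_per_sec + deliveries_per_sec
    broker_mb_per_sec = broker_msgs_per_sec * avg_msg_bytes / 1e6
    bookie_mb_per_sec = peak_pub_per_sec * avg_msg_bytes / 1e6
    return broker_msgs_per_sec, broker_mb_per_sec, bookie_mb_per_sec
```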
Hardware Evolution: Brokers
10
Pre-2015
8-core
24GB RAM
1G NIC
2015
12-core
48GB RAM
10G NIC
2020
12-core
96GB RAM
10G or 25G NIC
Hardware Evolution: BookKeepers
11
Pre-2015
12-core
32GB RAM
1G NIC
12 x 300GB 15K RPM SAS drives
(2 x HW RAID-10 of 6 drives)
2015
12-core
64GB RAM
10G NIC
10 x 4TB 7.2k RPM SATA
2 x 120GB SSD
(1 x HW RAID-10 of 10 drives,
1 x HW RAID-1 of 2 SSDs)
2020
36-core
192GB RAM
25G NIC
4 x 4TB NVMe
2 x 128GB Optane Persistent Memory
Hardware Evolution: ZooKeepers
12
Pre-2015
8-core
24GB RAM
1G NIC
240 GB SSD
2015
12-core
64GB RAM
10G NIC
2x 240 GB SSD
2020
12-core
64GB RAM
10G NIC
2x 960 GB SSD
JVM Garbage Collector
13
G1GC Pulsar 1.x | G1GC Pulsar 2.x | ZGC Pulsar 2.x (AdoptOpenJDK 11)
[Charts: GC pause and GC events in the previous minute; GC time, safepoint pause, and safepoint events in the previous minute]
Metrics and Monitoring
14
Metrics and Monitoring: Too much data
15
Metrics and Monitoring: Tenant metrics
16
Brokers (/admin/destinations) → collector → topic monitor
Metrics and Monitoring: Tenant metrics
17
● No downtime. Manage risk.
● Staged. Sequenced.
● Low parallelism
● Managed by Screwdriver jobs
○ https://docs.screwdriver.cd/
○ Open Source Build Platform designed for Continuous Delivery
● Screwdriver launches an Ansible-like tool
Deployment
18
Deployment
19
Deploy Order → Post-deploy checks
1. ZooKeeper (not prod) → rejoined quorum
2. BookKeeper → port, ZK node, bookiesanity
3. Broker → ports, sanity
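The staged, sequenced, low-parallelism rollout can be sketched as follows. This is my own illustration of the deploy order and checks above, not the actual Screwdriver pipeline.

```python
# Roles deploy in order; within a role, one host at a time; a failed
# post-deploy check halts the rollout. Check names mirror the slide.
DEPLOY_ORDER = [
    ("zookeeper",  ["rejoined quorum"]),   # skipped in prod
    ("bookkeeper", ["port", "zk node", "bookiesanity"]),
    ("broker",     ["ports", "sanity"]),
]

def rolling_deploy(hosts_by_role, deploy, check):
    """deploy(host) upgrades one host; check(host, name) -> bool.

    Returns (hosts successfully deployed, first failing host or None).
    """
    done = []
    for role, checks in DEPLOY_ORDER:
        for host in hosts_by_role.get(role, []):
            deploy(host)                   # one host at a time
            if not all(check(host, c) for c in checks):
                return done, host          # halt on first failed check
            done.append(host)
    return done, None
```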
{
"namespaces" : [ "tenant-one/.*" ],
"primary" : [ "broker6[7-9].example.com" ],
"secondary" : [ "none" ],
"auto_failover_policy" : {
"policy_type" : "min_available",
"parameters" : {
"min_limit" : "0",
"usage_threshold" : "100"
}
}
}
20
Broker Isolation Policies
Uses
● High Profile/Reserved capacity
● Misbehaving tenants
● Debugging
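The primary and secondary broker lists in the policy above are regular expressions. A quick way to sanity-check which brokers a policy captures (a sketch of my own; in practice the policy is applied with `pulsar-admin ns-isolation-policy set` or the equivalent admin REST call):

```python
import re

# The isolation policy from the slide, as a Python dict.
policy = {
    "namespaces": ["tenant-one/.*"],
    "primary": ["broker6[7-9].example.com"],
    "secondary": ["none"],
    "auto_failover_policy": {
        "policy_type": "min_available",
        "parameters": {"min_limit": "0", "usage_threshold": "100"},
    },
}

def matches_primary(broker, policy):
    """True if the broker hostname matches any primary pattern."""
    return any(re.fullmatch(p, broker) for p in policy["primary"])
```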
Factors and Configuration impacting cluster storage utilization
● Number of topics x write throughput
● Increased write-quorum
● Increased topic TTL
● Increased retention period
● Compaction thresholds and intervals
● Over-replication
● MinLedgerRolloverTime, MaxLedgerRolloverTime, CursorRolloverTime
● Crossing BookKeeper compaction threshold
21
BookKeeper Storage Utilization
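Most of the factors above multiply together. A back-of-envelope footprint estimate (my own sketch under simplified assumptions: uniform per-topic throughput, and retention as the only driver of data lifetime):

```python
def estimated_storage_mb(topics, mb_per_sec_per_topic, retention_hours,
                         write_quorum=2):
    """Disk consumed across the cluster: every entry is written to
    write_quorum bookies and kept for the retention window."""
    seconds = retention_hours * 3600
    return topics * mb_per_sec_per_topic * seconds * write_quorum
```

Note how raising the write quorum scales storage linearly, which is why "Increased write-quorum" and "Over-replication" appear in the list above.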
Rack as a failure domain
2 Logical Racks vs. N Logical Racks
22
BookKeeper Rack Awareness
[Diagram: bookie ensembles (labeled A–F) distributed across racks, contrasting placement with 2 logical racks vs. N logical racks]
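A toy version of rack-aware placement: round-robin across racks so consecutive ensemble members land in different failure domains. BookKeeper's actual RackawareEnsemblePlacementPolicy is more sophisticated; this only illustrates the idea of treating the rack as the failure domain.

```python
from itertools import cycle

def rack_aware_ensemble(bookies_by_rack, ensemble_size):
    """Pick an ensemble by cycling through racks, so adjacent bookies
    in the ensemble sit in different racks where possible."""
    racks = {rack: list(bs) for rack, bs in bookies_by_rack.items()}
    if sum(len(bs) for bs in racks.values()) < ensemble_size:
        raise ValueError("not enough bookies for the ensemble")
    ensemble = []
    for rack in cycle(sorted(racks)):
        if len(ensemble) == ensemble_size:
            break
        if racks[rack]:
            ensemble.append(racks[rack].pop(0))
    return ensemble
```

With only 2 logical racks, half of each ensemble shares one failure domain; with N racks, adjacent ensemble members can all be spread apart.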
Thank you.
Ludwig Pummer ludwig@verizonmedia.com
Joe Francis joef@verizonmedia.com
