
Revolutionizing Real-time Streaming Machine Learning at Discord

Background:

Discord, a leading communication platform, aimed to enhance its real-time streaming capabilities to bolster safety and personalization features. David Christle, a staff machine learning engineer at Discord, spearheaded the initiative, sharing the journey in a recent presentation at Pulsar Summit NA 2023 in San Francisco.

Challenge:

Discord faced the challenge of elevating its real-time streaming machine learning platform to address safety and personalization concerns, such as limiting the reach of spam or protecting Discord users' accounts from being compromised. The existing architecture was built for heuristics, not for machine learning. The need for a robust, scalable, real-time solution led the team to explore integrating Apache Flink, Pulsar, and Iceberg.

“The important point in this space is that the speed and scalability of what we build really matter.”

“This framework is so powerful that it lets you filter, transform, join, aggregate; you have so much freedom in how you manipulate the data, and it works very efficiently even in real-time. These pipelines can be very simple, something like event ingestion and deduplication; you can do ETL tasks with them. We can do ML on stream data with pretty low latency.”
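
To make that concrete, here is a minimal sketch of the kind of pipeline the quote describes, written against Flink's DataStream API in Java: it filters a stream and then deduplicates events by key using keyed state. The in-memory source and event values are placeholders for illustration only, not Discord's actual schema or pipeline.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class DedupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; a production job would read from a Pulsar (or similar) source.
        DataStream<String> events = env.fromElements("evt-1", "evt-2", "evt-1", "evt-3", "");

        DataStream<String> deduped = events
                .filter(e -> !e.isEmpty())   // drop empty/invalid events
                .keyBy(e -> e)               // key the stream by event id
                .process(new KeyedProcessFunction<String, String, String>() {
                    private transient ValueState<Boolean> seen;

                    @Override
                    public void open(Configuration parameters) {
                        seen = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("seen", Boolean.class));
                    }

                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out)
                            throws Exception {
                        // Emit only the first occurrence of each key; later duplicates are dropped.
                        if (seen.value() == null) {
                            seen.update(true);
                            out.collect(value);
                        }
                    }
                });

        deduped.print();
        env.execute("dedup-sketch");
    }
}
```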

Solution:

In his presentation, Christle detailed Discord's open-source-first approach to real-time streaming. The transition from Google Cloud Pub/Sub to Pulsar emerged as a pivotal decision, improving efficiency and scalability. Flink and Iceberg played crucial roles as well: Flink for analyzing data in real time, and Iceberg for managing historical events and backfills.
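
To illustrate this division of labor, here is a hedged sketch using Flink's Table API: the live event stream is continuously written into an Iceberg table so it can later be replayed for backfills. The datagen source, catalog name, warehouse path, and table schema are assumptions for illustration; Discord's actual pipeline reads from Pulsar with its own schemas, and the sketch assumes the Iceberg Flink runtime jar is on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergArchiveSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Iceberg commits data on Flink checkpoints, so enable checkpointing for streaming writes.
        tEnv.getConfig().getConfiguration().setString("execution.checkpointing.interval", "30 s");

        // Placeholder for the live event stream; in production this table would be
        // backed by the Pulsar connector rather than the built-in datagen connector.
        tEnv.executeSql(
                "CREATE TABLE live_events ("
                        + "  event_id BIGINT,"
                        + "  user_id  BIGINT,"
                        + "  ts       TIMESTAMP(3)"
                        + ") WITH ('connector' = 'datagen', 'rows-per-second' = '10')");

        // Hypothetical Iceberg catalog backed by a Hadoop-style warehouse path.
        tEnv.executeSql(
                "CREATE CATALOG iceberg WITH ("
                        + "  'type' = 'iceberg',"
                        + "  'catalog-type' = 'hadoop',"
                        + "  'warehouse' = 'hdfs://namenode:8020/warehouse'"
                        + ")");
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS iceberg.events_db");
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS iceberg.events_db.events_archive ("
                        + "  event_id BIGINT, user_id BIGINT, ts TIMESTAMP(3))");

        // Continuously copy the live stream into the Iceberg table so it can be
        // replayed later for backfills and model training.
        tEnv.executeSql(
                "INSERT INTO iceberg.events_db.events_archive SELECT * FROM live_events");
    }
}
```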

“The key for us is that Pulsar not only has the queue delivery mode, which is very popular in Discord, but it also has a partitioned style delivery mode. These partitions have time ordering guarantees that queues don’t, and we take advantage of that to have a very short watermark kind of experience. And so that’s how we can get accurate results with low latency.”
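
As background for the two delivery modes the quote contrasts, here is a small sketch using the Pulsar Java client: one queue-style (Shared) subscription that fans messages out across consumers without ordering guarantees, and one order-preserving (Failover) subscription on a partitioned topic, which is what keeps downstream watermarks tight. The service URL, topic names, and subscription names are illustrative placeholders, not Discord's configuration.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SubscriptionModesSketch {

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")  // placeholder broker URL
                .build();

        // Queue-style delivery: messages are spread across all consumers on the
        // subscription; throughput scales easily, but arrival order is not guaranteed.
        Consumer<byte[]> queueConsumer = client.newConsumer()
                .topic("persistent://public/default/events")
                .subscriptionName("queue-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // Partitioned, order-preserving delivery: each partition is consumed by one
        // active consumer, so per-partition time ordering holds and event-time
        // watermarks downstream can stay short.
        Consumer<byte[]> orderedConsumer = client.newConsumer()
                .topic("persistent://public/default/events-partitioned")
                .subscriptionName("ordered-sub")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // ... receive and process messages here, then clean up.
        queueConsumer.close();
        orderedConsumer.close();
        client.close();
    }
}
```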

Technical Journey:

The presentation then walked through Discord's infrastructure architecture. The move to managed Pulsar clusters and the use of hybrid sources, which stitch together historical replay and real-time streams, marked key steps in the platform's evolution. Christle underscored the benefits of this strategy, emphasizing its impact on improving safety metrics within Discord.
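
A minimal sketch of the hybrid-source pattern described above, assuming a recent Flink release with the file and Pulsar connectors on the classpath: a bounded file source (standing in for the archived history) is read to completion, after which the job switches over to the live Pulsar stream. The paths, URLs, topic, and subscription name are hypothetical.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.pulsar.source.PulsarSource;
import org.apache.flink.connector.pulsar.source.enumerator.cursor.StartCursor;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridReplaySketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded historical source: archived events exported to files
        // (a stand-in here for reading history out of the warehouse).
        FileSource<String> historical = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/events/archive"))
                .build();

        // Unbounded live source: the same event stream, read from Pulsar.
        PulsarSource<String> live = PulsarSource.<String>builder()
                .setServiceUrl("pulsar://localhost:6650")
                .setAdminUrl("http://localhost:8080")
                .setTopics("persistent://public/default/events")
                .setSubscriptionName("flink-replay")
                .setStartCursor(StartCursor.latest())
                .setDeserializationSchema(new SimpleStringSchema())
                .build();

        // Replay the history first, then switch over to the live stream.
        HybridSource<String> hybrid = HybridSource.<String>builder(historical)
                .addSource(live)
                .build();

        env.fromSource(hybrid, WatermarkStrategy.noWatermarks(), "historical-then-live")
                .print();

        env.execute("hybrid-replay-sketch");
    }
}
```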

Results:

Discord's commitment to open-source technologies, coupled with the judicious selection of tools, empowered the team to build a formidable system. Christle highlighted the system's success in achieving significant improvements in safety metrics, showcasing the platform's efficacy in real-world scenarios.

Key Takeaways:

The case study illuminated the practical implementation of real-time streaming machine learning at Discord. Insights into technical choices, infrastructure design, and the impactful switch to open-source technologies provided valuable lessons for the audience. The success story underscored the capability of a small team of engineers to create a powerful ML system.

Future Prospects:

The presentation concluded by discussing potential future applications of the platform. Discord's journey serves as an inspiration for organizations seeking to leverage real-time streaming for enhanced safety and personalization, demonstrating that strategic technology choices can yield substantial benefits.

Conclusion:

Discord's case study stands as a testament to the successful fusion of Apache Flink, Pulsar, and Iceberg in building a real-time streaming machine learning platform. David Christle's presentation not only showcased the technical prowess of the chosen technologies but also highlighted the collaborative success of a small engineering team. Discord's story serves as a beacon for those navigating the dynamic landscape of real-time streaming and machine learning integration.
