Input features are the building blocks of machine learning models. You cannot have a great model without great features. By building on top of Apache Pulsar's infinite retention of events, we built infrastructure to serve features in production and to generate training datasets. This allowed our machine learning teams to change, test, and deploy personalization features at an extraordinary rate to tens of millions of end users.
This talk will discuss:
- What event-sourcing is and why it's so powerful for machine learning infrastructure.
- How we built the StreamSQL feature store on top of Pulsar, Flink, and Cassandra.
- How a feature store accelerates ML development.
Feature Stores: Building Machine Learning Infrastructure on Apache Pulsar (Simba Khadder)
3. Agenda
● The ML process
● Moving our ML Pipelines w/ Pulsar
● Building a Feature Store on top of Pulsar
● Q&A
4. Machine Learning :: Model(Features) = Output
Input features (last 5 articles read, current article, top genre, average content length, diversity of reading tastes, ...) → Model → Output (recommend next article)
6. Behind every great model is a set of great features
Input features (last 5 articles read, current article, top category, total time spent reading, diversity of reading tastes, ...) → Model → Output (recommend next article)
9. Our ML teams spent >80% of their time building and maintaining ML pipelines for feature generation and feature engineering.
10. The Feature Engineering Cycle
● Hypothesize a new feature
● Generate a training dataset with the new feature
● Validate that the new feature increases performance
● Deploy the feature to production
12. Input features (last 5 articles read, current article, top category, total time spent reading, diversity of reading tastes, ...) → Model → Output (recommend next article)
13. Generating Point-in-Time Correct Training Data
[Diagram: Events Storage → Training Data. Along a time axis, each read event is paired with the features at its timestamp.]
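Point-in-time correctness can be illustrated with a small sketch: each training row may only use events that happened strictly before the row's own timestamp, so the model never sees the future. The `ReadEvent` type, its fields, and the sample data are assumptions made for illustration; they do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEvent:
    # Hypothetical event shape, assumed for this sketch.
    user: str
    article: str
    timestamp: int      # seconds since an arbitrary epoch
    seconds_read: int


def total_time_before(events, user, timestamp):
    """Total reading time for `user`, using only events strictly before `timestamp`."""
    return sum(
        e.seconds_read
        for e in events
        if e.user == user and e.timestamp < timestamp
    )


def build_training_rows(events):
    """One row per read event, with the feature value as of that event's timestamp."""
    rows = []
    for e in sorted(events, key=lambda ev: ev.timestamp):
        rows.append({
            "user": e.user,
            "timestamp": e.timestamp,
            "total_time_spent_reading": total_time_before(events, e.user, e.timestamp),
            "label_article": e.article,
        })
    return rows


events = [
    ReadEvent("ann", "a1", 10, 60),
    ReadEvent("ann", "a2", 20, 30),
    ReadEvent("bob", "a1", 25, 45),
    ReadEvent("ann", "a3", 30, 90),
]
rows = build_training_rows(events)
```

The rescan per row keeps the sketch short; a real pipeline would more likely carry running state forward in timestamp order.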
14. Generating Features for Serving (in a perfect world)
[Diagram: Event Stream → Processor → Online Features. Along a time axis, each read event is paired with the features at its timestamp.]
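As a rough sketch of the serving path, a processor can consume events one at a time and keep an in-memory table of online features up to date. The class name `OnlineFeatureProcessor`, its methods, and the choice of "last 5 articles read" as the feature are illustrative assumptions, not the system described in the talk.

```python
from collections import defaultdict, deque


class OnlineFeatureProcessor:
    """Hypothetical stream processor that maintains 'last N articles read' per user."""

    def __init__(self, window=5):
        # One bounded deque per user; old articles fall off automatically.
        self._last_read = defaultdict(lambda: deque(maxlen=window))

    def process(self, user, article):
        """Update the online feature for one incoming read event."""
        self._last_read[user].append(article)

    def last_articles_read(self, user):
        """Serve the current feature value; unknown users get an empty list."""
        return list(self._last_read.get(user, ()))


processor = OnlineFeatureProcessor()
for i in range(1, 8):
    processor.process("ann", f"a{i}")
```

Here `processor.last_articles_read("ann")` holds only the five most recent articles, because the deque discards older entries as new events arrive.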
15. Most Features are Stateful
Input features (last 5 articles read, current article, top category, total time spent reading, diversity of reading tastes, ...) → Model → Output (recommend next article)
16. Stateful Features must be Bootstrapped
Input feature (total time spent reading) → Model → Output
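To see why bootstrapping matters, consider "total time spent reading": a processor that starts from an empty state and sees only new events reports a total that ignores all prior history. The sketch below contrasts a cold start with a state rebuilt by replaying past events first. The event tuples and function names are assumptions made for illustration.

```python
def apply(state, event):
    """Return a new state with one (user, seconds_read) event folded in."""
    user, seconds = event
    new_state = dict(state)
    new_state[user] = new_state.get(user, 0) + seconds
    return new_state


def bootstrap(history):
    """Rebuild feature state by replaying every historical event in order."""
    state = {}
    for event in history:
        state = apply(state, event)
    return state


history = [("ann", 60), ("ann", 30), ("bob", 45)]
live_event = ("ann", 90)

cold = apply({}, live_event)                   # no bootstrap: history is lost
warm = apply(bootstrap(history), live_event)   # bootstrapped from history
```

In this toy data the cold processor believes ann has read for 90 seconds, while the bootstrapped one has the full 180.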
22. Combine Batch & Stream Processing with an Immutable Ledger
● Each new event appends to the end of the ledger
● Cut at an arbitrary point, and the ledger looks like a batch problem
● Only read from the head of the ledger, and it looks like a streaming problem
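The idea can be sketched with a plain in-memory append-only list standing in for the ledger. This is not Pulsar's API; the `Ledger` class and its `batch` and `stream` methods are hypothetical names chosen for the example. Reading everything before a cut is a batch job, reading from the cut onward is a stream, and together they cover every event exactly once.

```python
class Ledger:
    """Hypothetical in-memory stand-in for an immutable, append-only event log."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        """Append an event and return its offset; existing entries never change."""
        self._entries.append(event)
        return len(self._entries) - 1

    def batch(self, cut):
        """Batch view: a fixed snapshot of every event before offset `cut`."""
        return tuple(self._entries[:cut])

    def stream(self, start):
        """Streaming view: yield (offset, event) from `start` to the current end."""
        offset = start
        while offset < len(self._entries):
            yield offset, self._entries[offset]
            offset += 1


ledger = Ledger()
for seconds in (60, 30, 45, 90):
    ledger.append(seconds)

cut = 3
batch_total = sum(ledger.batch(cut))                        # historical backfill
stream_total = sum(event for _, event in ledger.stream(cut))  # events after the cut
full_total = sum(ledger.batch(4))
```

Because the cut is just an offset, the batch result and the streamed remainder add up to the same total as processing the whole ledger in one pass.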
26. Features are the building blocks of ML models; however, they are developed and maintained in ad-hoc ways.
They lack a dedicated system of management.
27. ML Pipelines < Feature Stores
● No concrete feature definitions; feature logic is split across Flink jobs.
● No feature versioning and rollback.
● No feature sharing, re-use, and discovery.
● No integrations into TensorFlow, Jupyter, etc.
28. A platform for features allows teams to work together.
Features are easily defined, shared, and re-used.
There exists a single source of truth for features.
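One way to picture a single source of truth is a registry that maps feature names to versioned definitions, so teams can discover, share, and roll back features. The sketch below is a toy in-memory version under that assumption; `FeatureRegistry` and its methods are hypothetical and are not the StreamSQL API.

```python
class FeatureRegistry:
    """Hypothetical in-memory registry of named, versioned feature definitions."""

    def __init__(self):
        self._versions = {}  # feature name -> list of callables, oldest first

    def register(self, name, fn):
        """Add a new version of a feature and return its 1-based version number."""
        self._versions.setdefault(name, []).append(fn)
        return len(self._versions[name])

    def get(self, name, version=None):
        """Return the latest definition, or a specific version if one is given."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]

    def rollback(self, name):
        """Discard the latest version and return the version now in effect."""
        versions = self._versions[name]
        if len(versions) < 2:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        versions.pop()
        return len(versions)

    def names(self):
        """List registered feature names, for discovery."""
        return sorted(self._versions)


registry = FeatureRegistry()
registry.register("reading_time", lambda secs: sum(secs))              # v1: total
registry.register("reading_time", lambda secs: sum(secs) / len(secs))  # v2: mean

latest_value = registry.get("reading_time")([60, 30])
active_version = registry.rollback("reading_time")
rolled_back_value = registry.get("reading_time")([60, 30])
```

A production registry would keep old versions instead of discarding them on rollback; popping the list simply keeps the sketch short.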
29. Models across an organization may benefit from some of these features.
Input features (last 5 articles read, current article, top category, total time spent reading, diversity of reading tastes, ...) → Model → Output (recommend next article)
30. StreamSQL.io accelerates and enhances machine learning development
● Facilitate model development: Discover, re-use, and share features across teams and models.
● Deploy with confidence: Use a single feature definition for training and serving.
● Limit complexity: Unified streaming and batch processing for feature generation.
● Increase model performance: Use 3rd party features from text embeddings to weather data.
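A single feature definition for training and serving could look roughly like the sketch below: the feature is written once as an initial state, an update step, and a value function, and both the offline and online paths run that same definition. All names here (`FeatureDefinition`, `training_values`, `OnlineFeature`) are illustrative assumptions and not part of StreamSQL.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical feature definition shared by training and serving."""
    name: str
    initial: Callable[[], Any]           # builds an empty state
    update: Callable[[Any, Any], Any]    # (state, event) -> new state
    value: Callable[[Any], Any]          # state -> feature value


def _count_category(state, category):
    new_state = state.copy()
    new_state[category] += 1
    return new_state


top_category = FeatureDefinition(
    name="top_category",
    initial=Counter,
    update=_count_category,
    value=lambda counts: counts.most_common(1)[0][0] if counts else None,
)


def training_values(feature, events):
    """Offline path: the feature value just before each historical event."""
    state, values = feature.initial(), []
    for event in events:
        values.append(feature.value(state))
        state = feature.update(state, event)
    return values


class OnlineFeature:
    """Online path: the same definition, updated one event at a time."""

    def __init__(self, feature):
        self._feature = feature
        self._state = feature.initial()

    def process(self, event):
        self._state = self._feature.update(self._state, event)

    def current(self):
        return self._feature.value(self._state)


events = ["tech", "tech", "sports"]
offline = training_values(top_category, events)

online = OnlineFeature(top_category)
for event in events:
    online.process(event)
```

Since both paths call the same `update` and `value` functions, a change to the definition reaches training and serving together, which is the property the slide highlights.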
37. The Feature Engineering Cycle
● Hypothesize a new feature
● Generate a training dataset with the new feature
● Validate that the new feature increases performance
● Deploy the feature to production