Feature Stores: Building Machine Learning Infrastructure on Apache Pulsar_Simba khadder

Feature Stores
Building ML Infrastructure on Apache Pulsar

Simba Khadder
Co-Founder & CEO
StreamSQL.io
Using Apache Pulsar to power our feature store for
>100M MAU

Agenda
● The ML process
● Moving our ML Pipelines w/ Pulsar
● Building a Feature Store on top of Pulsar
● Q&A

Input Features → Model → Output
Input Features: Last 5 articles read, Current article, Top Genre, Average Content Length, Diversity of reading tastes, ....
Output: Recommend Next Article
Machine Learning :: Model(Features) = Output

Feature Engineering > Model Research*

Current article
Top Category
Total time spent reading
....
Recommend Next Article
Behind every great model is a set of great features


Credit: Microsoft Azure Sales deck

Our ML teams spent >80% of their time
building and maintaining ML pipelines for
feature generation and feature
engineering.

The Feature Engineering Cycle
● Hypothesis New Feature
● Generate Training Dataset with new Feature
● Validate New Feature Increases Performance
● Deploy Feature to Production

Training Data
Online Features Serving
Train
User ID
Feature Set
Arr([FeatureSet, Actual])

Current article
Top Category
....
Recommend Next Article

Generating Point-in-Time Correct Training Data
[Figure: timeline of "read" events along a Time axis, each annotated "Features at timestamp". Diagram labels: Events Storage, Training Data]
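
A minimal sketch of what "point-in-time correct" can mean in practice, assuming a single stateful feature (total read time per user) and a hypothetical `ReadEvent` record; none of these names come from the talk. Each training row pairs a read with the feature value as it stood just before that read, so no future information leaks into the row.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReadEvent:
    # Hypothetical event shape, used only for illustration.
    user: str
    timestamp: int  # event time; the unit is an assumption
    article: str
    readtime: float


def point_in_time_rows(
    events: List[ReadEvent],
) -> List[Tuple[int, str, Dict[str, float], str]]:
    """Pair each read with the user's features as they stood just before it."""
    total_readtime: Dict[str, float] = {}
    rows = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        # Snapshot the feature BEFORE applying this event, so the row
        # never sees information from its own timestamp or later.
        features = {"total_readtime": total_readtime.get(ev.user, 0.0)}
        rows.append((ev.timestamp, ev.user, features, ev.article))
        total_readtime[ev.user] = total_readtime.get(ev.user, 0.0) + ev.readtime
    return rows
```

Replaying stored events in timestamp order is what makes this work: the feature state at any row depends only on events that came earlier.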

Generating Features for Serving (in a perfect world)
[Figure: timeline of "read" events along a Time axis, each annotated "Features at timestamp". Flow: Event Stream → Processor → Online Features]
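
The "perfect world" flow above can be sketched as a processor that folds each incoming event into per-user state and serves the latest value on request. This is an illustrative in-memory stand-in; `OnlineFeatureProcessor` and the event dictionary keys are assumptions, not part of any system named in the talk.

```python
from typing import Dict


class OnlineFeatureProcessor:
    """Hypothetical stream processor keeping one stateful feature per user."""

    def __init__(self) -> None:
        self._total_readtime: Dict[str, float] = {}

    def process(self, event: dict) -> None:
        # Fold one event from the stream into the running per-user total.
        user = event["user"]
        self._total_readtime[user] = (
            self._total_readtime.get(user, 0.0) + event["readtime"]
        )

    def features(self, user: str) -> Dict[str, float]:
        # Serving path: return the current feature set for a user ID.
        return {"total_readtime": self._total_readtime.get(user, 0.0)}
```

Note that the processor starts from empty state, so the served values are only correct if it has seen every event since the beginning of the stream.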

Current article
Top Category
....
Recommend Next Article
Most Features are Stateful

Stateful Features must be Bootstrapped

Bootstrapping Stateful Features with
Historical Data in S3
SELECT user, SUM(readtime) FROM read_events GROUP BY user;
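
To make the bootstrap step concrete, here is a small sketch that runs the query above over a batch of historical events. It uses an in-memory SQLite table as a stand-in for the historical data in S3; the function name and the table schema are assumptions for illustration only.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def bootstrap_total_readtime(
    historical_events: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Compute initial per-user state from (user, readtime) history."""
    conn = sqlite3.connect(":memory:")  # stand-in for the historical store
    try:
        conn.execute("CREATE TABLE read_events (user TEXT, readtime REAL)")
        conn.executemany(
            "INSERT INTO read_events VALUES (?, ?)", historical_events
        )
        rows = conn.execute(
            "SELECT user, SUM(readtime) FROM read_events GROUP BY user"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```

The returned dictionary is the state a stream processor would need to be seeded with before it starts consuming live events.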

[Figure: messages along a Time axis, ordered by MsgID. Older messages: persisted in S3. Messages within the retention period: not in S3, but in Kafka]

Finish bootstrapping & start
stream processing from Kafka
SELECT user, SUM(readtime) FROM read_events GROUP BY user;
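
A rough sketch of the handoff, assuming every message carries a monotonically increasing ID that is shared between the S3 copy and the Kafka stream (an assumption made here for illustration; the function and type names are hypothetical). The bootstrap records the last ID it consumed, and the streaming phase skips anything at or below it so that no event is counted twice.

```python
from typing import Dict, Iterable, Tuple

Message = Tuple[int, str, float]  # (msg_id, user, readtime), hypothetical shape


def bootstrap_then_stream(
    s3_history: Iterable[Message],
    kafka_stream: Iterable[Message],
) -> Dict[str, float]:
    totals: Dict[str, float] = {}
    last_bootstrapped_id = -1

    # Phase 1: batch bootstrap over the history persisted in S3.
    for msg_id, user, readtime in s3_history:
        totals[user] = totals.get(user, 0.0) + readtime
        last_bootstrapped_id = max(last_bootstrapped_id, msg_id)

    # Phase 2: continue from the stream, skipping messages that the
    # bootstrap already counted (the two sources may overlap).
    for msg_id, user, readtime in kafka_stream:
        if msg_id <= last_bootstrapped_id:
            continue
        totals[user] = totals.get(user, 0.0) + readtime
    return totals
```

The fragile part is the boundary: two separate systems have to agree on where one ends and the other begins.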

Full Feature Deployment Process

Combine Batch & Stream Processing with an Immutable Ledger
● Each new event appends to the end of the ledger
● Cut at an arbitrary point, and the ledger looks like a
batch problem
● Only read from the head of the ledger and it looks
like a streaming problem
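
The three points above can be illustrated with a toy append-only ledger. This is a sketch under simplifying assumptions (a single in-memory list, numeric events); the `Ledger` class and its method names are hypothetical.

```python
from typing import Iterator, List, Tuple


class Ledger:
    """Toy immutable ledger: entries are only ever appended."""

    def __init__(self) -> None:
        self._entries: List[float] = []

    def append(self, value: float) -> int:
        # Each new event goes at the end; return its offset.
        self._entries.append(value)
        return len(self._entries) - 1

    def batch(self, cut: int) -> List[float]:
        # Everything before the cut is a fixed, finite dataset.
        return list(self._entries[:cut])

    def stream_from(self, start: int) -> Iterator[Tuple[int, float]]:
        # Reading forward from an offset behaves like a stream.
        for offset in range(start, len(self._entries)):
            yield offset, self._entries[offset]


ledger = Ledger()
for readtime in (10.0, 5.0, 7.0, 1.0):
    ledger.append(readtime)

cut = 2
total = sum(ledger.batch(cut))           # batch phase over the prefix
for _, readtime in ledger.stream_from(cut):
    total += readtime                    # streaming phase from the cut onward
```

Because both phases read the same ledger, the cut offset is the only thing they need to agree on.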

Pulsar-Based Architecture with Infinite Retention

Pulsar’s offloading makes Event-Sourcing achievable

Pulsar’s Tiered Architecture enhances Processing on
Infinite Retention

Features are the building blocks of
ML models; however, they are
developed and maintained in
ad-hoc ways.
They lack a dedicated system of
management.

ML Pipelines < Feature Stores
● No concrete feature definitions; feature logic is split
across Flink jobs.
● No feature versioning and rollback.
● No feature sharing, re-use, and discovery.
● No integrations into TensorFlow, Jupyter, etc.
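
As a rough illustration of what the first three gaps imply, here is a toy registry that holds named feature definitions, keeps every version, and supports rollback and discovery. It is a sketch only; `FeatureRegistry` and its methods are hypothetical and do not represent StreamSQL's API.

```python
from typing import Callable, Dict, List

FeatureFn = Callable[[List[dict]], float]


class FeatureRegistry:
    """Toy feature store: one named definition per feature, versioned."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[FeatureFn]] = {}
        self._active: Dict[str, int] = {}  # 1-based active version per name

    def register(self, name: str, fn: FeatureFn) -> int:
        # A new definition becomes the active version; old ones are kept.
        versions = self._versions.setdefault(name, [])
        versions.append(fn)
        self._active[name] = len(versions)
        return self._active[name]

    def rollback(self, name: str) -> int:
        # Step the active pointer back to the previous definition.
        if self._active[name] <= 1:
            raise ValueError(f"{name} has no earlier version")
        self._active[name] -= 1
        return self._active[name]

    def compute(self, name: str, events: List[dict]) -> float:
        # Every caller resolves the same active definition by name.
        fn = self._versions[name][self._active[name] - 1]
        return fn(events)

    def list_features(self) -> List[str]:
        # Discovery: which features exist for teams to re-use.
        return sorted(self._versions)
```

Because callers look features up by name rather than embedding the logic in their own jobs, the definition lives in one place and can be changed or reverted there.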

A platform for features lets teams work together.
Features are easily defined, shared, and re-used.
There exists a single source of truth for features.

Current article
Top Category
....
Recommend Next Article
Models across an organization may benefit from
some of these features.

StreamSQL.io
accelerates and
enhances machine
learning development
● Facilitate model development
Discover, re-use, and share
features across teams and models.
● Deploy with confidence
Use a single feature definition for
training and serving.
● Limit complexity
Unified streaming and batch
processing for feature generation.
● Increase model performance
Use 3rd party features from text
embeddings to weather data.

[Figure: timeline of labels along a Time axis, each annotated "Features at timestamp"]

StreamSQL
The Feature Store for Machine Learning
Beta

Simba Khadder
simba@streamsql.io
