Article updated on August 11, 2023
Machine learning (ML) has been adopted by companies of all stages of maturity, size, and risk aversion. However, operationalizing ML at enterprise scale remains costly, time-consuming, and potentially risky. Why? Because training an ML model and building a reliable prediction service from that model are two very different things.
Models make predictions with input data. So, when you put a model into production, you need to turn the model into a prediction service that either runs on a schedule or runs 24x7 serving an online application or service. Your prediction service needs to be plugged into data sources: live data, historical data, and contextual data. For example, a service predicting the validity of an insurance claim will take as input the claim details entered on the insurer's website, but will augment that input with the claimant's history and policy details.
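As a sketch of this pattern (all names and data here are hypothetical, for illustration only, not from any specific library), a claim-validity service might combine the live request payload with historical and contextual features before calling the model:

```python
# Minimal sketch of a prediction service that enriches its input with
# precomputed features. All names and data are hypothetical.

# Stand-ins for feature data already computed by upstream pipelines.
CLAIMANT_HISTORY = {42: {"prior_claims": 3, "avg_claim_amount": 1200.0}}
POLICY_DETAILS = {"P-9": {"coverage_limit": 50000.0, "policy_age_years": 4}}

def predict_claim_validity(claim):
    """Turn an information-poor request into a feature-rich model input."""
    features = dict(claim)  # live data: the claim entered on the website
    features.update(CLAIMANT_HISTORY.get(claim["claimant_id"], {}))  # history
    features.update(POLICY_DETAILS.get(claim["policy_id"], {}))      # context
    # A real service would call a trained model here; we use a toy rule.
    suspicious = features.get("prior_claims", 0) > 2 and claim["amount"] > 1000
    return {"valid": not suspicious, "features_used": sorted(features)}

result = predict_claim_validity(
    {"claimant_id": 42, "policy_id": "P-9", "amount": 2500.0}
)
print(result["valid"])  # False: enriched features flagged the claim
```

The key point is the middle two lines: the request alone is information-poor, and the service becomes useful only once it can pull in precomputed history and context.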
One of the main challenges in this process of operationalizing a model is connecting up the different phases of the model’s lifetime: (1) creating features from raw data, (2) training the model, and (3) making predictions (inference) on new data. In large organizations, these phases can span different teams: from data analytics to data science to infrastructure teams.
A feature store for machine learning is a platform that manages and provides access to both historical and live feature data, and that supports creating point-in-time correct datasets from the historical feature data.
Features are simply the data used by models. For example, that data could be a row of an Excel sheet or the pixels in a picture.
Features are “any measurable input that can be used in a predictive model”.
As “fuel for AI systems”, features are used to train ML models and to make predictions. The challenge is that good predictions require a lot of data, or features: in general, the more data, the better the predictions.
Features also need to be organized to make sense: the data for the features must be pulled from somewhere (a data source), and after being computed (feature engineering, i.e., transforming the source data into features), the features must be stored so that the ML pipeline can use them.
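To make "feature engineering" concrete, here is a small sketch (with hypothetical field names) of turning raw source records into per-entity features that a pipeline would then write to the feature store:

```python
# Sketch: transforming raw source data (claim events) into per-claimant
# features. Field names are hypothetical, for illustration only.
from collections import defaultdict

raw_claims = [
    {"claimant_id": 1, "amount": 500.0},
    {"claimant_id": 1, "amount": 1500.0},
    {"claimant_id": 2, "amount": 300.0},
]

def engineer_features(claims):
    """Aggregate raw events into one feature row per claimant."""
    totals, counts = defaultdict(float), defaultdict(int)
    for c in claims:
        totals[c["claimant_id"]] += c["amount"]
        counts[c["claimant_id"]] += 1
    return {
        cid: {"claim_count": counts[cid],
              "avg_claim_amount": totals[cid] / counts[cid]}
        for cid in counts
    }

features = engineer_features(raw_claims)
print(features[1])  # {'claim_count': 2, 'avg_claim_amount': 1000.0}
```

In a real pipeline the aggregation would run in a data-processing engine and the resulting rows would be inserted into the feature store, but the shape of the work is the same: raw events in, feature rows out.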
Machine learning, in general, requires ready-made datasets of features to train models correctly. When we say datasets, we mean that the features are typically accessed as files in a file system (you can also read features directly as DataFrames from the feature store).
The feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where you can create or update groups of features built from multiple different data sources, and create or update datasets from those feature groups for training models. Applications that do not want to compute features themselves can simply retrieve them from the feature store when they need them to make predictions.
The first public feature store, Michelangelo Palette, was announced by Uber in 2017. The main purpose of a feature store is to facilitate the discovery, documentation, and reuse of features and to ensure their correctness, whether they are used by batch or online applications. To this end, the feature store provides a high-throughput batch API for creating point-in-time correct training data and retrieving features for batch predictions, and a low-latency serving API for retrieving features for online predictions. The feature store also helps ensure consistent feature computations across the batch and serving APIs.
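Point-in-time correctness is the key property here: when building training data, each label event must be joined with the latest feature values known at that event's time, never later ones, which would leak future information into training. A minimal sketch of that join, with invented data:

```python
# Sketch of a point-in-time correct join: for each label event, pick the
# most recent feature value with timestamp <= the event's timestamp.
# Entity names, timestamps, and values are hypothetical.
feature_history = {
    "user_1": [(1, 0.2), (5, 0.7), (9, 0.9)],  # (timestamp, value), sorted
}
label_events = [("user_1", 4, 0), ("user_1", 6, 1)]  # (entity, ts, label)

def point_in_time_join(events, history):
    rows = []
    for entity, ts, label in events:
        # Only feature observations at or before the event time are visible.
        past = [value for (t, value) in history[entity] if t <= ts]
        rows.append({"entity": entity, "feature": past[-1], "label": label})
    return rows

training_rows = point_in_time_join(label_events, feature_history)
print(training_rows)
# [{'entity': 'user_1', 'feature': 0.2, 'label': 0},
#  {'entity': 'user_1', 'feature': 0.7, 'label': 1}]
```

Note that the event at time 6 sees the feature value from time 5, but the event at time 4 does not; a naive join on the latest value (0.9) would silently train the model on information from the future.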
Here is where the feature store comes in. Let's say you're working on an e-commerce recommendation system where a search query for items would benefit from personalization. When a user issues the search query, it is typically executed on a stateless web application: modern microservice principles have made many such services stateless, with operational state stored in a database, key-value store, or message bus.
The search query is information poor - it only contains some text, and the only other state available is the user ID and/or session ID. However, the user ID and session ID can be used to retrieve large numbers of precomputed features from the feature store.
The original information-poor signal (search query and user/session IDs) can now be transformed into an information-rich signal by enrichment with features representing the user’s history (items the user interacted with, orders) and features representing context (what items are popular right now). The feature store provides history and context to enable applications to become AI-enabled.
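A sketch of this enrichment step (the store contents and key names are invented for illustration): the stateless service looks up precomputed features by user ID and attaches them, along with global context, to the query before ranking results:

```python
# Sketch: enriching an information-poor search request with precomputed
# user-history and context features from an online store. Data is invented.
ONLINE_STORE = {
    ("user_features", "u-7"): {"recent_categories": ["shoes", "socks"],
                               "orders_last_30d": 2},
    ("context_features", "global"): {"trending_items": ["item-12", "item-99"]},
}

def enrich_query(query_text, user_id):
    """Combine the raw query with user history and global context."""
    enriched = {"query": query_text}
    enriched.update(ONLINE_STORE.get(("user_features", user_id), {}))
    enriched.update(ONLINE_STORE[("context_features", "global")])
    return enriched

signal = enrich_query("running gear", "u-7")
print(sorted(signal))
# ['orders_last_30d', 'query', 'recent_categories', 'trending_items']
```

The service itself stays stateless: everything beyond the query text and user ID lives in the online feature store and is fetched at request time.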
So now in more technical terms, here is why having a feature store will make your life easier:
The feature store is a data platform that connects different ML pipelines (feature, training, and inference pipelines). Internally, existing feature stores are all dual-database systems: a columnar data store holds historical (offline) feature data, while a low-latency, row-oriented online data store serves precomputed features to online applications.
There are feature stores that come with an offline/online database built-in (Hopsworks with Hudi and RonDB, respectively), those that support an external offline database (Hopsworks), and virtual feature stores that don’t come with any built in data stores (FeatureForm), but can be plugged in to existing systems.
Each feature store provides a different solution for plugging into existing Enterprise data infrastructure and existing machine learning tooling. During the Feature Store Summit, many organizations discussed the motivations, benefits, and challenges of their individual approaches.
The easiest way to get started with a feature store today is to use the Hopsworks serverless feature store. It is the only Python-native feature store (as of early 2023), and it has a generous, time-unlimited free tier with up to 25GB of free storage for offline features.