Let me tell you a secret: the art of machine learning is not about data (although lots of people say that it is). Data is the raw material for machine learning, especially for supervised learning or deep learning. The real secret is taking that raw product and turning it into something useful and usable. Part of that is finding what is important within it and potentially cleaning, labelling, and standardising it; the result is a feature that can be used by a model. This takes an investment of time and money.

So, this blog is the first part of a two-part series that looks at how to treat Features as assets, how to maximise the ROI from them (from a business point of view), and how to tackle them from a longer-term (engineering) point of view.

The Issue

The process of taking raw data and producing Features from it is called Feature Engineering. It can take Data Scientists hours, days, and sometimes even weeks to build features, involving lots of experiments and iterations. This is one of the most valuable things a Data Scientist can do.

For clients taking their initial steps into Machine Learning, or for a very simple use case, we typically see that Features are created for each individual model and stored with the model as part of an MVP. This is a good approach to get your initial models delivering results. However, for more mature clients that have several or even hundreds of models live, clients with several Data Science teams, or models with complex needs, Inawisdom has seen the following challenges arise from this approach:

  • Knowing which features exist: Where two different Data Science teams work on different use cases and produce different models, we sometimes see a lack of knowledge sharing about each other’s work, which means that both teams end up creating the same feature. Even worse, they may end up with slightly different implementations, perhaps because one of the Data Scientists does not understand a characteristic of the data.
  • Feature Tracking: This is the inverse of knowing which features exist. It is knowing which models use a feature, when the feature was last updated, and which model version uses which iteration of the feature.
  • Updating features: Over time some features might need to be updated. For example, you are a stationery shop and you’re predicting the stock levels for pens, where the colours available have been black, blue and red. However, last month your supplier started providing green pens and customers have bought some from you. Therefore, this month you want to predict how many green pens you need to order, and to do that you need to update your colour feature to include green as well. This means you may want to release a new version of the feature alongside an updated model.
  • Stale features: This is an extension of updating features. Sometimes you need real-time features, for example the amount of stock in the warehouse, so that your recommendation only suggests items that can be fulfilled. This means that you need to be able to pull in new data and automatically regenerate the features without the need for a Data Scientist.
  • Evolving features: Again, this is an extension of updating features. It occurs when a feature used by one model can later be reused by another model after some slight changes that satisfy both models. The update then needs to be handled in the original model, but this is more efficient than maintaining two different but similar features.
  • Avoiding being stuck in a Notebook: Features typically start life in notebooks when preparing data sets for training, but they are also needed at inference time to take input data and encode it. However, we see situations where they are not saved at all (especially when dealing with categorical features), or where they are saved as binary files and stored inside the codebase. This leads to issues later, especially for real-time inference. For example, upgrading a library can mean the binary can no longer be read, and the binary can inflate your deployment package so much that it cannot be deployed to an edge device or within a Lambda.
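
To make the "Updating features" and "Avoiding being stuck in a Notebook" points more concrete, here is a minimal sketch of how a categorical feature could be kept as plain, versioned JSON rather than as a pickled binary. It uses only the Python standard library. The helper names (`build_encoding`, `one_hot`), the feature name `pen_colour` and the version labels are hypothetical and only there for illustration; they do not come from any particular tool.

```python
import json


def build_encoding(name, version, categories):
    """Describe a categorical feature as a plain, versioned dictionary."""
    return {"name": name, "version": version, "categories": sorted(categories)}


def one_hot(encoding, value):
    """Encode one value as a one-hot list; unknown values become all zeros."""
    return [1 if value == category else 0 for category in encoding["categories"]]


# Version 1 of the hypothetical colour feature: black, blue and red pens.
colour_v1 = build_encoding("pen_colour", "1", ["black", "blue", "red"])

# Version 2 adds green once the supplier starts providing it.
colour_v2 = build_encoding("pen_colour", "2", ["black", "blue", "red", "green"])

# JSON is human-readable and does not depend on a library version,
# unlike a pickled encoder object.
serialised = json.dumps(colour_v2)
restored = json.loads(serialised)

print(one_hot(colour_v1, "green"))  # version 1 cannot represent green
print(one_hot(restored, "green"))   # version 2 can
```

Because the encoding is just a small text document, it can be stored outside the codebase, loaded at inference time, and released as a new version alongside the model that was trained on it.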

These challenges are worth consideration as otherwise you’ll struggle as you try to scale your use of machine learning, with more inefficiencies and friction creeping in. This will lead to wasted effort, increased costs, and a stretched time to market. We don’t want anything stopping you from realising the opportunities in front of you and unleashing the potential of machine learning!

The Feature Store

To address these challenges and to enable the scaling of your Machine Learning efforts, you should look to adopt a Feature Store. A Feature Store provides the following capabilities:

  • Centralisation and Reuse: The core element of any Feature Store is a central location to store any features.
  • Catalogue: Each stored feature must be accurately defined, with additional details that help with searching and discovery. The details need to include a clear description of what the feature is and what it should be used for, what the data source of the feature was, when it was created, when it was modified and by whom. You may want to include more.
  • Management: As with software or product development, it is important to be able to manage features. Feature management means being able to version a feature, govern which versions of a feature can be used, and see which models are using the feature (including which version). This means that if you need to change the feature you can manage the impact on models, especially those that are fully operational.
  • Offline vs Online: Features can be used in two modes: offline and online. Offline is the most used option and is for infrequent access where latency is not an issue. Discovery, Training and Batch Processing are good examples of offline uses. Online is for frequent access or where low latency is a requirement (circa <50ms). Real-time or near-real-time inference, where you make thousands of predictions an hour or a minute, would generally need Online Feature access.
  • Manual, Regular, Real-time Updating: Features have disparate requirements about when they are created and when they need updating. The most common update patterns we see a feature needing to support are:
    • Manual: The feature is created as a one-off or is updated very rarely, e.g., annually. Typically, these features are created manually in code by a Data Scientist. A good example of this is the one-hot encoding of country codes or city names (converting each string category into a set of 0/1 indicator values).
    • Regular: The feature is created at set intervals, e.g., every week or month, and its creation is normally automated. In addition, detection of changes in source data can trigger the updating of a feature. An example might be using the previous 6 months of monthly stock level data (to provide recency or seasonality) as a feature for a demand forecasting model that predicts the next month’s or quarter’s sales.
    • Real Time: The ability to process data from streams and update an online feature. Examples of these are current stock prices or airline ticket prices for use in reinforcement learning.
  • Additional Capabilities: Some Feature Stores have some additional capabilities and integrations:
    • Model Registry: Please see my previous blog on what a Model Registry is. However, this integration (be it part of the same tool or external) provides a richer lineage between source data and predictions.
    • Transformations: The ability to run transformations internally or trigger external transformations for data preparation and/or the data engineering to produce features from source data.
    • Code Linking: Features can be developed in Python, PySpark, SQL and many other languages and tools. To allow you to quickly reference the code, some Feature Stores allow you to view the source code by integrating with your SCM, typically Git.
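
As a rough illustration of the Centralisation, Catalogue and Management capabilities, the sketch below shows a toy in-memory catalogue that stores feature versions with their descriptive details and tracks which models use which version. It is not how any real Feature Store product is implemented. All class, method, feature and model names (`FeatureCatalogue`, `record_usage`, `pen_colour`, `stock_forecast_v3` and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FeatureVersion:
    """Catalogue entry for one version of a feature."""
    name: str
    version: int
    description: str
    source: str
    created_by: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    used_by: set = field(default_factory=set)  # names of models using this version


class FeatureCatalogue:
    """Toy in-memory catalogue: central storage, versioning and usage tracking."""

    def __init__(self):
        self._entries = {}  # (name, version) -> FeatureVersion

    def register(self, name, description, source, created_by):
        """Add a new version of a feature and return its catalogue entry."""
        version = len(self.versions(name)) + 1
        entry = FeatureVersion(name, version, description, source, created_by)
        self._entries[(name, version)] = entry
        return entry

    def versions(self, name):
        """List all known version numbers of a feature, oldest first."""
        return sorted(v for (n, v) in self._entries if n == name)

    def record_usage(self, name, version, model):
        """Note that a model depends on a specific feature version."""
        self._entries[(name, version)].used_by.add(model)

    def models_using(self, name):
        """Map each version of a feature to the models that use it."""
        return {v: sorted(self._entries[(name, v)].used_by) for v in self.versions(name)}

    def search(self, text):
        """Find feature names whose name or description mentions the text."""
        text = text.lower()
        return sorted({e.name for e in self._entries.values()
                       if text in e.name.lower() or text in e.description.lower()})


catalogue = FeatureCatalogue()
catalogue.register("pen_colour", "One-hot colour of a pen", "sales_orders", "alice")
catalogue.register("pen_colour", "One-hot colour of a pen, incl. green", "sales_orders", "bob")
catalogue.record_usage("pen_colour", 1, "stock_forecast_v3")
catalogue.record_usage("pen_colour", 2, "stock_forecast_v4")

print(catalogue.search("colour"))
print(catalogue.models_using("pen_colour"))
```

With this kind of record in place, a team can search for an existing feature before building a new one, and can see which live models would be affected before changing a feature version.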

Conclusion

Features are refined from raw data to power machine learning. However, as your use of Machine Learning grows, so does the number of features you have and the time it takes to create them. As we’ve seen, there are a few challenges that come with having lots of features or complex features to look after.

But a Feature Store (with its ability to centrally manage features for all teams) can help address these challenges and promote better use of features across the business. If you would like to know more about how to implement a simple Feature Store, then please check out part 2!