There is a massive buzz around MLOps in the industry at the moment. Many businesses are finding that they have Machine Learning (ML) models already built in-house but do not know how to take these models to the next level and make them “battle-ready” for use in production.

By exploring how we provided this solution to one of our customers, this article will dive into how a secure, scalable and production-ready MLOps framework can help businesses grow the use of ML in their product offering at speed, without sacrifice.

This particular customer offers a B2B SaaS product focused on real-time price optimisation. To accomplish this, they have built two ML models whose results are combined to produce a final optimised price. Prior to engaging Inawisdom, they had begun an exploratory journey into MLOps with SageMaker Pipelines, so they were aware of what it was that they ultimately wanted to accomplish.

Whilst Inawisdom were not involved in the data science element of this product, it was important to understand how the ML models were being trained in order to architect the MLOps framework correctly.

A key driver behind our customer’s need for a scalable MLOps framework was their future plans for additional ML models. With their existing setup, many resources would be duplicated and the overall system would become increasingly complex to maintain as their use of ML expanded. In addition, because their first iteration was built with SageMaker Pipelines, integration with services outside the SageMaker ecosystem would be less than seamless.

The Solution: A Scalable MLOps Framework

From early discussions with the customer, it was clear that the core of the framework needed to be a platform that could integrate with many different services, be triggered both manually and automatically, and offer an easy debugging experience if the MLOps process were to go wrong. Using Step Functions as the core of the framework seemed like the perfect fit given these requirements.

Step Functions allows workflows to be orchestrated as state machines and integrates with almost any AWS service. These integrations enable direct API calls to be made from the state machine to the required service without needing a Lambda function to act as a proxy, which in turn reduces the complexity of the overall system.
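As a minimal sketch of what this looks like, the snippet below defines a single-state workflow that calls DynamoDB’s GetItem API directly from Amazon States Language (expressed here as a Python dictionary) and registers it with boto3. The table name, key and role ARN are placeholders, not the customer’s actual resources.

```python
import json
import boto3

# Minimal Amazon States Language definition with a direct DynamoDB
# integration -- no Lambda function sits between the state machine
# and the service call. Table, key and role names are illustrative only.
definition = {
    "StartAt": "GetModelConfig",
    "States": {
        "GetModelConfig": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem",
            "Parameters": {
                "TableName": "model-registry",
                "Key": {"ModelName": {"S.$": "$.model_name"}},
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="mlops-example",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-sfn-role",  # placeholder
)
```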

Three key areas make up the overall MLOps framework: the configuration deployment pipeline; the pre-processing, training and evaluation pipeline; and the model deployment pipeline. Each of these is explored in more detail in the remainder of this article.

Configuration deployment pipeline

A key tenet of MLOps is to empower those who are creating the models without burdening them with the management of infrastructure and pipelines. With this customer, the approach taken was to introduce the concept of a ‘model’ Git repository that would contain the configuration file(s) and the Python scripts required to build a model. Using a Git repository to store these files enables version control and additional traceability over who made what changes and why.

Any changes to the model repository would trigger a CodePipeline pipeline that would take the contents, perform any build steps required and ultimately update a DynamoDB table with references to the latest built artifacts. Storing this in DynamoDB allowed low-cost queries to be run for access patterns like “What is the latest pre-processing Docker image URI for Model A?” from the later pipelines within this framework.
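To give a feel for that access pattern, here is a sketch of how a downstream pipeline might look up the latest artifacts for a model with boto3; the table and attribute names are assumptions rather than the customer’s actual schema.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("model-artifacts")  # hypothetical table name

# Fetch the latest built artifacts for a given model, e.g. the
# pre-processing Docker image URI produced by the CodePipeline build.
response = table.get_item(Key={"ModelName": "model-a"})
item = response["Item"]

preprocessing_image_uri = item["PreprocessingImageUri"]  # hypothetical attribute
print(preprocessing_image_uri)
```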

Using this approach gives the data scientists who are building the model full control over the scripts that prepare the data and evaluate the trained model. Through the configuration file, they can also set the resource requirements for each stage of the pipeline. This means that if a data scientist changes the way data is processed for training and therefore needs more compute, they can configure this without having to involve an operations team.
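As an illustration of how such a configuration could flow into the pipeline, the sketch below launches a SageMaker processing job whose instance type and count come from a hypothetical per-stage configuration block; all names, ARNs and values are placeholders.

```python
import boto3

# Hypothetical per-stage resource settings, as a data scientist might
# declare them in the model repository's configuration file.
config = {
    "preprocessing": {
        "instance_type": "ml.m5.xlarge",
        "instance_count": 1,
        "volume_size_gb": 30,
    }
}

sagemaker = boto3.client("sagemaker")
sagemaker.create_processing_job(
    ProcessingJobName="model-a-preprocessing",  # placeholder name
    RoleArn="arn:aws:iam::123456789012:role/example-sagemaker-role",
    AppSpecification={
        # Placeholder image URI; in the framework this comes from the
        # artifact references stored in DynamoDB.
        "ImageUri": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/preprocessing:latest"
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": config["preprocessing"]["instance_count"],
            "InstanceType": config["preprocessing"]["instance_type"],
            "VolumeSizeInGB": config["preprocessing"]["volume_size_gb"],
        }
    },
)
```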

Pre-processing, training, evaluation pipeline

The pre-processing, training and evaluation pipeline is the real core of the MLOps framework. It is in this stage that a trained, usable model is produced, ready for a human decision (aided by automated evaluation) on whether it should be deployed.

This element of the framework is entirely dynamic and adapts to the variety of models that need to be processed through it. To achieve this, Step Functions was used rather than an alternative service such as CodePipeline.

Due to the presence and operation of the configuration deployment pipeline, this pipeline is able to determine which tasks to perform for processing the data, which parameters to use for training the model and how to evaluate it without any of it being hard-coded. The advantage of not needing to hard-code this is that it makes the pipeline repeatable, reducing resource duplication and ultimately simplifying the overall system.

Everything that feeds this pipeline can be defined by a data scientist in the ‘model’ repository, empowering them to control what happens within the pipeline without having to worry about how all the moving parts are orchestrated.
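By way of example, the training step of such a dynamic workflow might look like the following Amazon States Language fragment (expressed as a Python dictionary), where every value is drawn from the workflow input populated from the model configuration rather than hard-coded into the state machine. The field paths and state names are illustrative, not the customer’s actual schema.

```python
# Sketch of a dynamic SageMaker training task: the .sync integration
# makes Step Functions wait for the training job to finish, and every
# ".$" field is resolved from the workflow input at run time.
train_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
    "Parameters": {
        "TrainingJobName.$": "$.training.job_name",
        "RoleArn.$": "$.training.role_arn",
        "AlgorithmSpecification": {
            "TrainingImage.$": "$.training.image_uri",
            "TrainingInputMode": "File",
        },
        "HyperParameters.$": "$.training.hyperparameters",
        "ResourceConfig": {
            "InstanceCount.$": "$.training.instance_count",
            "InstanceType.$": "$.training.instance_type",
            "VolumeSizeInGB": 50,
        },
        "OutputDataConfig": {"S3OutputPath.$": "$.training.output_s3_uri"},
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    },
    "Next": "Evaluate",  # hypothetical next state
}
```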

Model deployment pipeline

The model deployment pipeline is decoupled from the previous pipeline so that it is possible to re-deploy older models without having to go through a training process. Decoupling also creates a hard boundary between a model being evaluated and a model being deployed, which makes it easier to implement various forms of approval process in future.

This pipeline, like the previous one, was implemented as a state machine in Step Functions. The workflow begins by retrieving all the information about the model, based on the model name passed in as the input. Getting this information as part of the workflow, rather than passing it all in directly, keeps the complexity of running it manually to a minimum.

Using this information, the workflow first updates a DynamoDB table to indicate that a new model is being deployed to the endpoint, then creates the resources necessary for updating the endpoint, updates the autoscaling configuration and begins the endpoint update.
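A simplified boto3 equivalent of those steps might look like the following; the model, endpoint and role names are placeholders, and the DynamoDB “deployment in progress” update is omitted for brevity.

```python
import boto3

sagemaker = boto3.client("sagemaker")
autoscaling = boto3.client("application-autoscaling")

# All names, ARNs and instance settings below are placeholders; in the
# framework these values come from the model information looked up at
# the start of the workflow.
model_name = "model-a-v42"
endpoint_name = "pricing-endpoint"
endpoint_config_name = f"{endpoint_name}-{model_name}"

# Register the trained model artifact with SageMaker.
sagemaker.create_model(
    ModelName=model_name,
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/inference:latest",
        "ModelDataUrl": "s3://example-bucket/model-a/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/example-sagemaker-role",
)

# Create an endpoint configuration that references the new model.
sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m5.large",
    }],
)

# Keep the autoscaling configuration in step with the endpoint variant.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Begin the (asynchronous) endpoint update.
sagemaker.update_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
```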

Updating a SageMaker endpoint can take a few minutes and the API call is asynchronous, meaning that it does not wait for the update to complete before returning a result. Because of this, the workflow has a task that checks the status of the update a number of times, with an increasing wait between each retry. If too many retries occur, the deployment is treated as timed out, a notification is sent and a DynamoDB table is updated to record the event.
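A simplified, single-process equivalent of that retry behaviour is sketched below; in the actual framework this logic is expressed as Step Functions states with a retry/backoff policy, and the endpoint name, delays and topic ARN are placeholders.

```python
import time
import boto3

sagemaker = boto3.client("sagemaker")

endpoint_name = "pricing-endpoint"  # placeholder
delay, max_attempts = 30, 10        # illustrative backoff settings

# Poll the endpoint status with an increasing wait between checks.
for attempt in range(max_attempts):
    status = sagemaker.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status == "InService":
        break  # update finished successfully
    if status == "Failed":
        raise RuntimeError("Endpoint update failed")
    time.sleep(delay)
    delay = int(delay * 1.5)  # back off before the next check
else:
    # Too many retries: treat the deployment as timed out and notify.
    # (The corresponding DynamoDB status update is omitted here.)
    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:eu-west-1:123456789012:deployments",  # placeholder
        Message=f"Deployment of {endpoint_name} timed out",
    )
```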

Once the update has succeeded, the DynamoDB table is updated to reference the model that is now deployed to the endpoint and to indicate that there is no longer a deployment in progress.

Conclusion

In conclusion, this was a challenging project that delivered real value to the customer. It enables them to scale their product without having to worry about whether the MLOps framework that underpins it will stand up to the task. New ML models can be introduced without having to redesign pipelines, significantly reducing their go-to-market time.