Back in September, I took part in a video shoot for the AWS “This is my architecture” series. My host Shafreen and I discussed an AWS architecture to train an initial model in SageMaker, deploy it, continually evaluate its performance in production, and then automatically retrain the model if required.

As I mentioned at the time, I had great fun taking part in the video shoot. I’m delighted to say that the video is now published on the AWS website and YouTube here.

Managing Model Drift

One of the topics discussed in this video is how to automate the management of model drift. When we say “model drift”, of course it’s not the model itself that drifts; it’s the environment that changes around the model. The model is static – it is determined by the training data set, the algorithm, and the hyperparameters. And some deliberate randomness is thrown in too! But the environment the model operates in is usually changing in small, subtle ways. For example, customer behaviours change and economic conditions vary.

To deal with this, it is generally necessary to retrain the model periodically on an up-to-date training set. Typically this uses a sliding-window approach, e.g. remove the oldest day of data from the training set and add in the newest day. Then retrain your model, evaluate its performance, and if it really does perform better against current conditions, deploy it.
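The sliding-window step can be sketched in a few lines. This is a minimal illustration, assuming the training data lives in a pandas DataFrame with a `timestamp` column (the column name and window length are my own placeholders, not from the original architecture):

```python
from datetime import datetime, timedelta

import pandas as pd

def sliding_window(df: pd.DataFrame, window_days: int, now: datetime) -> pd.DataFrame:
    """Keep only the most recent `window_days` of rows, so each retraining
    run drops the oldest day as a new day of data arrives."""
    cutoff = now - timedelta(days=window_days)
    return df[df["timestamp"] >= cutoff]

# Hypothetical usage: three days of data, two-day window.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "y": [1.0, 2.0, 3.0],
})
recent = sliding_window(df, window_days=2, now=datetime(2024, 1, 4))
# The oldest day (2024-01-01) has been dropped; `recent` is what you
# would feed into the next SageMaker training job.
```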

A question I am often asked is “how often should I retrain my model?”. Like all the best questions, the answer is “it depends”. For most ML models, daily or monthly is fine, and so a simple scheduled approach is sufficient. This can use a cron job or, preferably, a scheduled Amazon CloudWatch Events rule to trigger a Step Function, which then orchestrates the various data prep, training, evaluation and deployment steps.

Drift in more dynamic environments

Some models operate in a much more dynamic environment, and so a simple scheduled retraining strategy is not sufficient (well – it’s sub-optimal). This is often the case in ML applications where there are other actors at play, such as competitors who may be dynamically varying their strategies (e.g. pricing). As they vary their strategy, your predictive model may start significantly under- or over-shooting the “correct” prediction for this new market state. Another example is when a central bank changes interest rates. It’s quite a rare event, but it may ruin the predictive accuracy of an existing model that was trained on a time period when interest rates never changed.

The solution here is to automatically monitor the performance of your model in production on new data and determine if it is suddenly under-performing. This analysis can again be managed via scheduled Lambda functions and Step Functions. For example, if your mean prediction is X and that drops by 10% over a certain time interval, this may indicate that a sudden model drift issue has occurred, and an automatic model retraining process can be triggered.
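The check itself is simple. Here is a minimal sketch of the kind of test a scheduled Lambda could run, assuming you have a baseline mean prediction and a recent batch of predictions to compare against (the 10% threshold matches the example above):

```python
def drift_detected(baseline_mean: float, recent_predictions: list[float],
                   threshold: float = 0.10) -> bool:
    """Flag drift if the mean of recent predictions has moved more than
    `threshold` (as a relative fraction) away from the baseline mean."""
    recent_mean = sum(recent_predictions) / len(recent_predictions)
    return abs(recent_mean - baseline_mean) / abs(baseline_mean) > threshold

# Baseline mean of 100; recent predictions averaging 86 is a 14% shift,
# so this would trigger the retraining workflow.
drift_detected(100.0, [85.0, 87.0, 86.0])   # → True
drift_detected(100.0, [98.0, 101.0, 102.0]) # → False
```

In practice the Lambda would fetch the recent predictions from wherever they are logged (e.g. S3 or CloudWatch metrics) and, on a `True` result, start the retraining Step Function execution.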

One challenge here is the need to collect sufficient new training data from the new “market state”. There is no point replacing an under-performing model with one that is inaccurate because its training set was too small. So a sub-strategy here is to temporarily deploy a “safe” model whilst collecting fresh training data. Once a statistically significant training set has been compiled, then retrain, evaluate and deploy the new, superior model.
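One simple way to gate the “have we collected enough data yet?” decision is a standard sample-size calculation. This sketch (my own illustration, not part of the original architecture) computes how many observations are needed for the sample mean to be within a chosen margin of the true mean at roughly 95% confidence, assuming approximately normal errors:

```python
import math

def required_sample_size(stddev: float, margin: float, z: float = 1.96) -> int:
    """Samples needed so the estimated mean is within `margin` of the true
    mean at ~95% confidence (z = 1.96), assuming roughly normal errors.
    n = (z * sigma / margin)^2, rounded up."""
    return math.ceil((z * stddev / margin) ** 2)

# e.g. with a standard deviation of 10 and a desired margin of ±1,
# keep the "safe" model deployed until ~385 fresh observations arrive:
n = required_sample_size(stddev=10.0, margin=1.0)
```

The retraining workflow would then only swap out the safe model once `len(new_data) >= n` and the candidate model evaluates better on held-out recent data.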