The use of Machine Learning has become a critical component of many businesses; examples include product recommendations, decisions on stock levels, and the evaluation of credit scores. Uses like these have a direct impact on the top line (how money is made) or the bottom line (where money is spent). However, increasingly, and I would say more importantly, they are having an impact on the decisions and behaviours of individuals or groups of people. I am not the only one who thinks so: the EU has also seen this impact and is looking at regulation in this area (https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence).
This blog, however, does not look at those impacts or how to judge them. Instead, it looks at how to trace back from an impact which models made which predictions, how those models were made, and how new models can be approved with confidence. These concepts are known as model lineage and model versioning. Finally, it looks at how to implement a Model Registry on AWS to govern and implement these two concepts.
Background
Machine Learning models may seem a little more involved than traditional software development, requiring more artefacts and skills. However, standard software engineering practices still apply: you just need to understand the similarities and build upon existing techniques. A Machine Learning model consists of the following parts:
- Algorithm: The algorithm is the data science logic used to implement your prediction. The algorithm is normally part of a framework, which is an open-source library. So, in engineering terms, it is a dependency with a version number and some configuration management.
- Data Set: The data set is the data the model is trained with and, for supervised models, a holdout data set used for evaluation later. To form the data set, a set of data preparation steps is normally undertaken. So, in engineering terms, this is like a database or data store with some queries or transformations run against it to produce information.
- Script: The script(s) are only needed for some algorithms, and they customise the algorithm to your specific use case. Again, in engineering terms, this is like extending or integrating an open-source library with your business logic. The script is normally implemented in Python or R and needs to be stored in version control.
- Hyperparameters: Hyperparameters tune your model to increase its effectiveness and accuracy. In engineering terms, they are like a configuration file, and once the best set of parameters is known, they need controlling like any other development artefact. At Inawisdom we typically save the hyperparameters in a file that we call the manifest (a minimal example follows this list).
- Artifact: You train a model by using the “script” with the “algorithm” on the “data set”, and the outcome is the “artifact”. The “artifact” is a binary in engineering terms, and the training can be thought of as a build process.
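To make the manifest idea concrete, here is a minimal sketch of one, serialised as JSON so it can be version-controlled alongside the script. The field names and values are illustrative assumptions, not a fixed schema:

```python
import json

# A hypothetical manifest capturing the parts described above.
manifest = {
    "algorithm": "xgboost",            # the open-source dependency
    "framework_version": "1.3.1",      # pinned dependency version
    "script": {
        "repository": "codecommit://churn-model",  # hypothetical repo
        "revision": "7f3a9c2",         # source control revision
    },
    "hyperparameters": {"max_depth": 6, "eta": 0.2, "num_round": 100},
    "data_set": "s3://example-bucket/churn/2021-06-01/",  # hypothetical location
}

# Store the manifest next to the script so it is versioned with the code.
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```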
The Issue
Mapping these data science terms to engineering terms helps engineers understand them. However, another important consideration is that a model is not necessarily reproducible from a training process, even if the inputs (data and configuration) are the same. For example, a deep learning model can learn differently every time it is trained, resulting in slightly different models that can, in turn, predict slightly different things. Sometimes, if the data changes enough, the model might not be able to predict anything meaningful.
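As a minimal illustration of this, training the same network twice on identical data, without fixing random seeds, yields different weights (scikit-learn is used here purely as an example):

```python
# Two networks trained on the same data with the same configuration
# still end up with different weights, because initialisation is random.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)  # fixed data

model_a = MLPClassifier(max_iter=500).fit(X, y)  # no fixed seed
model_b = MLPClassifier(max_iter=500).fit(X, y)  # same data, same config

# Typically prints False: the "same" training produced different models.
print(np.allclose(model_a.coefs_[0], model_b.coefs_[0]))
```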
If models are not necessarily reproducible, or the output can change significantly given changes in data, then how can we provide traceability into a prediction? After all, that prediction may have led to a crucial outcome for an individual or contributed to a decision for a company. Providing this traceability and auditability is known as model lineage:

Figure 1: Model Lineage
Figure 1 shows model lineage at a high level. It comprises the following steps to audit a prediction from a model:
- First, we need to know what artifact was used during inference:
- Each artifact produced needs a version number
- Each prediction needs to be signed with that version number (see the sketch after this list)
- For a versioned “artifact” or model, you need to know the following about the training that created it:
- The “hyperparameters” and settings used for training
- The location of the “script” in source control and its revision
- The version number of the “algorithm”, “framework”, and any other dependencies
- The “data set” needs to be versioned and stored somewhere that allows it to be retrieved. A Feature Store is ideal for this, such as SageMaker Feature Store, DynamoDB or S3.
- An approval process, be it automated or manual, that provides a clear audit of whether a model was approved and who approved it
- Lastly, the “data set” itself needs to have data lineage:
- The source of the data
- The processes run on the data, such as one-hot encoding, scaling, normalisation, and imputation (handling missing data)
- The version and source code of the script for those processes
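As a minimal sketch of the signing step, assuming a scikit-learn style model and illustrative names throughout, each prediction can be captured alongside the version of the artifact that produced it:

```python
import datetime
import json

MODEL_VERSION = "2021-06-01-7f3a9c2"  # e.g. date plus Git revision

def predict_with_lineage(model, features):
    """Make a prediction and sign it with the model version."""
    prediction = model.predict([features])[0]
    record = {
        "prediction": float(prediction),
        "model_version": MODEL_VERSION,  # ties the output to an artifact
        "timestamp": datetime.datetime.utcnow().isoformat(),
    }
    # In practice the record would be persisted (e.g. to the model store).
    print(json.dumps(record))
    return record
```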
The Model Registry
To make model lineage as easy as possible, and to scale Machine Learning across a business, we recommend a Model Registry. A Model Registry provides this by treating every model as an immutable set of artefacts, and for some use cases it enforces this immutability by using integrity validation.
A Model Registry has the following features:
- Model Storage: The ability to store all kinds of models, from all the major frameworks or from your own algorithms
- Model Versioning: The ability to store multiple increments of a model and the changes between versions
- Model Approval: The ability to register models as approved versions and trigger deployments, with an immutable record of who approved them
- Stage Transitions: The ability to transition models through the stages of deployment into production, again with an immutable record of who performed them.
- Data Versioning:
- The source of the data and an immutable copy of it in an optimised format
- The detailed processes run on the data to create that optimised format
- The version and source code of the script for those processes
- Metadata: metadata or a manifest that covers:
- The “hyperparameters” and other settings used for training
- The location of the “script” in source control and its revision
- The version number of the “algorithm”, “framework”, and any other dependencies
- API: The ability to load an approved model at inference time, with a clear identifier of its entry in the model registry.
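Jumping ahead to the AWS implementation described below, a hedged sketch of that API might query a DynamoDB meta store (assumed here to be keyed by model name and version) for the latest approved version and fetch its artifact from an S3 model store. The table, bucket and attribute names are assumptions for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def load_approved_model(model_name: str, local_path: str) -> str:
    """Fetch the newest approved artifact and return its version."""
    table = dynamodb.Table("model-registry")  # hypothetical meta store table
    response = table.query(
        KeyConditionExpression=Key("model_name").eq(model_name),
        ScanIndexForward=False,  # newest versions first
    )
    approved = next(
        item for item in response["Items"] if item["status"] == "APPROVED"
    )
    # Download the immutable artifact from its versioned folder.
    s3.download_file("model-store-bucket", approved["artifact_key"], local_path)
    return approved["version"]  # used to sign predictions at inference
```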
The Implementation
There are a number of technologies and implementations that can be used; MLflow and Amazon SageMaker Studio have purpose-built services such as the SageMaker Model Registry. However, you may wish to build your own, either to understand the underlying implementation better or to fit your own needs.
Therefore, let’s look at building our own on AWS:

Figure 2: Model Registry on AWS
In Figure 2 we can see that three AWS services are used to build our Model Registry: S3, CodeCommit and DynamoDB. Here is how they work:
- Source Code
- After model development is completed, all the associated source code and configuration is committed to CodeCommit.
- Model Storage and Model Versioning are done in S3 (known as the model store):
- The important thing is that each execution (trained model) is placed in a new folder in the model store to version it
- That folder must be named after the Git revision, a pipeline execution identifier, or a date
- Data Versioning and Prediction Capture in the model store:
- The data set is stored in the model version folder after it is processed and prepared
- The predictions (or a subset of them if you are making millions of predictions a day) are stored as part of the model version after inference
- Metadata, Approval and Stage Transitions are handled in DynamoDB, which acts as the meta store (a sketch follows this list)
- A candidate model is registered in the meta store after its development is complete:
- This needs to contain the location of any configuration and source files (including revisions) to be used
- This can automatically perform a Stage Transition and trigger a new model version to be created, or this can be done manually
- Upon creation of a model version, the meta store is updated at the following stages:
- For the feature engineering/processing stage, the data location and processes are recorded in the meta store
- For the training stage, the trained model’s location in S3 and the training job details
- For the validation stage, the approval of the model (by automated or human verification processes) as an approved version
- The successful completion of each stage automatically performs a Stage Transition to the next one, or the transition can be done manually by updating the meta store afterwards
- The deployment process is then run to deploy the approved version, and the meta store is updated:
- The model version is marked as deployed and, if needed, the previous version is marked for rollback
- The prediction capture process is updated to record or sign predictions with the new model version.
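Here is a hedged sketch of those meta store updates, assuming a DynamoDB table keyed by model name and version; all names and statuses are illustrative assumptions:

```python
import datetime

import boto3

table = boto3.resource("dynamodb").Table("model-registry")  # hypothetical

def register_candidate(model_name: str, version: str, git_revision: str):
    """Register a candidate model once development is complete."""
    table.put_item(Item={
        "model_name": model_name,
        "version": version,  # e.g. Git revision or pipeline execution id
        "git_revision": git_revision,
        "status": "CANDIDATE",
        "registered_at": datetime.datetime.utcnow().isoformat(),
    })

def transition_stage(model_name: str, version: str, new_status: str, actor: str):
    """Record a Stage Transition (e.g. CANDIDATE -> APPROVED -> DEPLOYED)."""
    table.update_item(
        Key={"model_name": model_name, "version": version},
        # "status" is a DynamoDB reserved word, hence the name placeholder.
        UpdateExpression="SET #s = :s, transitioned_by = :a",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": new_status, ":a": actor},
    )
```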
This process can be controlled and governed by using a manifest file that is uploaded as part of the candidate model and details all the settings for each stage of the pipeline (Stage Transitions). This allows the model to be refreshed by creating a new model version whenever required (e.g., due to time, drift or new data).
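As a hedged sketch, such a manifest might extend the earlier one with a section per pipeline stage; the stage names and settings here are illustrative assumptions:

```python
# A hypothetical per-stage manifest uploaded with the candidate model.
pipeline_manifest = {
    "model_name": "churn-model",
    "stages": {
        "processing": {
            "source": "s3://example-bucket/raw/2021-06-01/",
            "steps": ["impute", "scale", "one-hot-encode"],
        },
        "training": {
            "framework": "xgboost",
            "framework_version": "1.3.1",
            "hyperparameters": {"max_depth": 6, "eta": 0.2},
        },
        "validation": {"approval": "manual", "min_accuracy": 0.85},
    },
    "refresh": "on-new-data",  # when a new model version should be created
}

# Each pipeline stage reads its own section and records its outcome
# against the model version in the meta store.
for stage, settings in pipeline_manifest["stages"].items():
    print(stage, "->", settings)
```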
Conclusion
ML models are now having a real-world impact on our businesses and lives, and the use of machine learning is growing in all sectors and industries. This growth, powered by experimentation and creativity, is unlocking the enormous value contained within our data.
As your use of machine learning grows and matures, more oversight becomes necessary, especially in highly regulated industries like banking or healthcare. This requires you to provide traceability for your models, through Model Lineage, to evidence how models were trained and what data was used, along with how they were then approved and promoted. To help manage Model Lineage, we advise you to look at putting a Model Registry in place. A Model Registry will help you track your models from creation to expiration, including what predictions were made and when models are regenerated due to changes in data or technology.
Having both Model Lineage and a Model Registry in place will allow you to power your next phase of scaled AI and meet any future regulatory needs.
If you need any help with embedding either of these techniques into your ML processes or pipelines then please get in touch.