I hope that those of you who celebrate it had a good Christmas. Here is my first blog of the new year, and it covers a very important subject: how to look after your Machine Learning models.
Previously both Robin, our CTO, and I have blogged about the challenges of getting a Machine Learning model live and delivering value to your business. However, going live is not the end of the story for your model; it is in fact just the start. Your model is now in the real world, where data changes and models drift over time (becoming less accurate). Therefore, you need to care for your model by doing both of the following:
- Nurture your model: Have your Data Scientist routinely look back over any updated data for more features or changes to existing features, then look to adapt or change your model to reflect those changes.
- Feed your model: Build a data and continuous deployment pipeline to handle updates to your data, retrain your model when needed, and deploy the resulting updated model for inference.
This blog looks at the recently announced AWS Step Functions Data Science Software Development Kit, which aims to help engineers implement a data and continuous deployment pipeline for your model. I will review it based on my experience of building data and continuous deployment pipelines for our customers' models at Inawisdom.
From now on in this blog, I will refer to the AWS Step Functions Data Science Software Development Kit as the AWS Data Science SDK.
Background
The machine learning development life cycle has 3 iterations:
- Explore: We call this ‘Discovery’ at Inawisdom. It is about exploring all the data you have within a problem domain, identifying opportunities for using Machine Learning with that data, and creating an initial model to prove an opportunity.
- Refine: Continuously working on the data and model for an identified opportunity, in order to improve accuracy and realise value from your model.
- Repeat: Maintaining accuracy with updated data and retraining.
More details can be found at https://www.jeremyjordan.me/ml-projects-guide/. A few of our clients have reached the “repeat” iteration; examples include both of our public case studies, Drax and Aramex. Aramex is an engagement I have personally worked on for the past year, and one of their biggest uses of machine learning is Address Prediction. Address Prediction takes a descriptive address and predicts the route to deliver a parcel successfully. This is a very complex use case that requires over 190 models, and a subset of those models need retraining weekly with new delivery data. As a result, we have worked closely with Aramex on building a pipeline over the last 6 months. From this experience I would expect to see the following key components supported by the AWS Data Science SDK:
- AWS Glue: Used for raw data ingress, cleaning that data and then transforming that data into a training data set
- Amazon S3: Used to store your data and training data sets
- AWS Lambda: Used to stitch elements together and perform any additional logic
- AWS ECS/Fargate: There are situations where you may need to run very long-running processes over the data to prepare it for training. Lambda is not suitable for this due to its maximum execution time and memory limits, so Fargate is preferred in these situations.
- Amazon SageMaker training jobs: Lastly, the ability to run training on the data that the pipeline has prepared for you.
Some additional features that I would love to see are:
- Deployments to Amazon SageMaker endpoints: The ability to perform deployments from the pipeline, including blue/green, linear and canary style updates.
- AWS CloudFormation (Infrastructure as Code): In reality you will want to version your pipeline and run it in multiple accounts, so that you can test any changes to the pipeline before you update your operational copy.
Using the AWS Data Science SDK
In order to get to grips with the AWS Data Science SDK I took a quick deep dive: I went to its GitHub repository, https://github.com/aws/aws-step-functions-data-science-sdk-python, started up a SageMaker notebook instance, loaded the example notebook, completed all the steps within it and created an AWS Step Function. Once I had got to grips with the basics, I created a full example pipeline that utilises AWS Glue and Amazon ECS, which you can find at the bottom of this blog. In doing so I found the AWS Data Science SDK very clean and simple to use; it successfully addresses my own reservations about defining Step Functions. I find defining Step Functions in JSON or YAML quite limiting if, like me, you are from a development background: they have all the constructs of a programming language (loops, conditions, and so on), yet I cannot “program” them as I normally would. This is exactly what the AWS Data Science SDK gives a Data Scientist: a way to “program” a Step Function from the comfort of their familiar environment, a notebook.
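To give a flavour of what “programming” a Step Function looks like, here is a minimal sketch; the workflow name and role ARN are placeholders of my own rather than values from the SDK examples:
from stepfunctions.steps import Chain, Pass, Succeed
from stepfunctions.workflow import Workflow

# Two trivial states chained together, just to show the programming model.
say_hello = Pass(state_id="SayHello")
done = Succeed("Done")

workflow = Workflow(
    name="hello-world-pipeline",  # placeholder name
    definition=Chain([say_hello, done]),
    role="arn:aws:iam::xxxxxxxxxxxx:role/StepFunctionsWorkflowExecutionRole"  # placeholder role ARN
)
workflow.create()   # creates the state machine in your AWS account
workflow.execute()  # starts an execution of it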
Capabilities
In terms of capabilities, from the full example, I can confirm that the AWS Data Science SDK supports all the mandatory features that I identified earlier. The reason for this is that the SDK is completely dependent on the service integrations that AWS Step Functions supports. Here is a summary:
- ETL: The AWS Data Science SDK allows you to specify an AWS Glue job or an ECS task as a step in your pipeline. However, you still need to create and implement that logic outside of the SDK. If you are going to use ECS, you will also require detailed knowledge of how ECS works; this is more than I would expect a Data Scientist to know, but paired with a DevOps engineer it would be sufficient.
- Training: The AWS Data Science SDK combines the standard interface that Data Scientists use, the SageMaker estimator from the SageMaker SDK, with the Step Functions SageMaker Training Job (sync) service integration. This feels like a very natural fit, and it was very easy compared with wiring up the SageMaker Training Job integration directly (see the sketch after this list).
- Deployment: The AWS Data Science SDK supports making deployments to a SageMaker endpoint. This is done by passing parameters to a service API using the Step Functions feature described at https://docs.aws.amazon.com/step-functions/latest/dg/connect-parameters.html. This is functionally fine and works; however, I would have liked a higher-level integration, like the one for training jobs, instead of the three API calls you have to make. I would also have liked it to support other types of deployment, as only rolling updates are supported, which is why I used a Lambda function in my example to handle the deployment to SageMaker.
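To illustrate the training integration described above, here is a minimal sketch of pairing a SageMaker estimator with a TrainingStep; the role ARN, container image and S3 locations are placeholders, and the full, more realistic version is in the example at the end of this blog:
import sagemaker
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import TrainingStep

# A standard SageMaker estimator, exactly as a Data Scientist would define one.
estimator = sagemaker.estimator.Estimator(
    role="arn:aws:iam::xxxxxxxxxxxx:role/SageMakerExecutionRole",
    image_name="xxxxxxxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/my-training-image:latest",
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/"
)

# The training job name is passed in at execution time so each run gets a unique name.
execution_input = ExecutionInput(schema={"JobName": str})

train_model = TrainingStep(
    state_id="Train model (SageMaker)",
    estimator=estimator,
    job_name=execution_input["JobName"],
    data={"train": "s3://my-bucket/train/"}
)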
Visualisation
To aid you, the AWS Data Science SDK can visualise the Step Function it will create directly within your notebook, and once you have built and run the Step Function it allows you to render its progress. Below is the visualisation of the Step Function that I created earlier, rendered within the notebook I used:
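For reference, producing that rendering only takes a couple of calls; a minimal sketch, assuming the workflow object defined in the full example at the end of this blog:
# Draw the state machine diagram inside the notebook before creating it.
workflow.render_graph(portrait=False)

# Create and run the state machine, then render the live progress of the execution.
workflow.create()
execution = workflow.execute()
execution.render_progress()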
Infrastructure as Code
In terms of Infrastructure as Code and DevOps, the AWS Data Science SDK supports exporting both the Amazon States Language definition and a CloudFormation template as JSON. This means you can easily hand the pipeline over to a DevOps engineer to deploy into production. A word of warning, however: all the ARNs and container URLs are hardcoded, so the first thing the DevOps engineer will need to do is parameterise them and add some mappings. One for the SDK's roadmap. Using the CloudFormation export feature, here is an example JSON output:
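Generating those exports only takes a line each; a minimal sketch, assuming the workflow object from the full example below (the method names are as I recall them from the SDK documentation):
# Export the Amazon States Language definition as JSON.
print(workflow.definition.to_json(pretty=True))

# Export a CloudFormation template for the state machine.
print(workflow.get_cloudformation_template())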
Template
After doing this deep dive, I feel it could still be a big ask for a Data Scientist to program a Step Function from scratch, especially if they have limited AWS experience. To help here, the AWS Data Science SDK has a “pipeline” feature, which builds you a templated Step Function that implements the very simplest of pipelines. Using the training_pipeline_pytorch_mnist example notebook I created the following basic pipeline:
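For reference, creating the templated pipeline looks roughly like this; the estimator is assumed to exist already, and the role ARN and bucket name are placeholders rather than the values from the example notebook:
from stepfunctions.template.pipeline import TrainingPipeline

# The template builds the simple train -> save model -> configure -> deploy workflow for you.
pipeline = TrainingPipeline(
    estimator=estimator,  # any SageMaker estimator, e.g. the PyTorch one from the notebook
    role="arn:aws:iam::xxxxxxxxxxxx:role/StepFunctionsWorkflowExecutionRole",
    inputs="s3://my-bucket/train/",
    s3_bucket="my-bucket"
)
pipeline.render_graph()  # preview the generated state machine
pipeline.create()        # create it in your account
pipeline.execute()       # run the pipeline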
Feature Requests
One additional feature I would like, on top of the ones I have highlighted so far, is for the AWS Data Science SDK to set up an Amazon CloudWatch trigger for the pipeline. If you are going to automate retraining of a model, it is likely to be for one of the following reasons:
- Running the pipeline when data changes or after a set period of time; setting up a scheduled CloudWatch Event is the typical solution here.
- Running the pipeline based on model drift. This can now be achieved using the new SageMaker Model Monitor. SageMaker Model Monitor can output a prediction metric to CloudWatch, and you can use an alarm to trigger the pipeline.
Currently neither is supported by the SDK; you need to handle this yourself.
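For the scheduled case, here is a minimal sketch of wiring it up yourself with boto3; the rule name, state machine ARN and role ARN are placeholders:
import boto3

events = boto3.client("events")

# A CloudWatch Events rule that fires once a week.
events.put_rule(
    Name="weekly-retraining",
    ScheduleExpression="rate(7 days)",
    State="ENABLED"
)

# Point the rule at the pipeline's state machine; the role must allow states:StartExecution.
events.put_targets(
    Rule="weekly-retraining",
    Targets=[{
        "Id": "retraining-pipeline",
        "Arn": "arn:aws:states:eu-west-1:xxxxxxxxxxxx:stateMachine:MyWorkflow_v1234",
        "RoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/CloudWatchEventsStepFunctionsRole"
    }]
)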
Conclusion
The AWS Data Science SDK is very useful and allows the AWS-aware Data Scientist or Machine Learning engineer to quickly get a basic pipeline running. I can see Inawisdom using the SDK on projects at the MVP stage that require us to spin up such a pipeline quickly. For full productionisation, or for more advanced cases, it would act as a starting point for a hardened DevOps engineer to finish off and surround your core logic with the engineering rigour required. However, it certainly beats writing JSON! And that is what makes the AWS Data Science SDK so great: you can use it from any Python file or script to define any type of Step Function.
Full Example
Please see below the example I created. Please note it is adapted from a real-world scenario and will not work out of the box.
import sagemaker
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import (
    Catch,
    Chain,
    EcsRunTaskStep,
    Fail,
    GlueStartJobRunStep,
    LambdaStep,
    Retry,
    Succeed,
    TrainingStep,
)
from stepfunctions.workflow import Workflow

# Replace with the ARN of the IAM role that Step Functions will assume.
workflow_execution_role = "arn:aws:iam::xxxxxxxxxxxx:role/StepFunctionsWorkflowExecutionRole"

# Terminal state used by every catch block below.
failed_state = Fail("HandleFailed")

# Optional preparation step (not wired into the happy path chain below).
prepare = LambdaStep(
    state_id="Prepare",
    parameters={
        "FunctionName": "PrepareFunction"
    }
)
prepare.add_retry(Retry(
    error_equals=["States.TaskFailed"],
    interval_seconds=15,
    max_attempts=2,
    backoff_rate=4.0
))
prepare.add_catch(Catch(
    error_equals=["States.TaskFailed"],
    next_step=failed_state
))

# Raw data ingress: run a Glue job that unloads Redshift data to Parquet in S3.
extract_data = GlueStartJobRunStep(
    state_id="Pull data (Glue job)",
    parameters={
        "JobName": "iaw-bigdata-training-redshift-to-parquet",
        "Arguments": {
            "--destination.$": "$.S3Location"
        }
    }
)
extract_data.add_retry(Retry(
    error_equals=["States.TaskFailed"],
    interval_seconds=15,
    max_attempts=2,
    backoff_rate=4.0
))
extract_data.add_catch(Catch(
    error_equals=["States.TaskFailed"],
    next_step=failed_state
))

# Long-running data preparation: run a Fargate task to turn the raw data into a training set.
transform_data = EcsRunTaskStep(
    state_id="Convert to training set (ECS Fargate)",
    parameters={
        "Cluster": "cfECSCluster-ecsCluster-xxxx",
        "TaskDefinition": "cfDataCleaning-ecsTaskDefinition-xxxH",
        "LaunchType": "FARGATE",
        "PlatformVersion": "1.3.0",
        "NetworkConfiguration": {
            "AwsvpcConfiguration": {
                "AssignPublicIp": "DISABLED",
                "SecurityGroups": ["sg-xxxxxxx"],
                "Subnets": ["subnet-xxxxxx", "subnet-xxxxxx", "subnet-xxxxxx"]
            }
        },
        "Overrides": {
            "ContainerOverrides": [
                {
                    "Name": "amx-dev-bigdata-ecs-apcleaning",
                    "Environment": [
                        {"Name": "SOURCE_LOCATION", "Value.$": "$.S3Location"},
                        {"Name": "OUTPUT_LOCATION", "Value.$": "$.S3Location"}
                    ]
                }
            ]
        }
    }
)
transform_data.add_retry(Retry(
    error_equals=["States.TaskFailed"],
    interval_seconds=15,
    max_attempts=2,
    backoff_rate=4.0
))
transform_data.add_catch(Catch(
    error_equals=["States.TaskFailed"],
    next_step=failed_state
))

# Standard SageMaker estimator describing how the training job should run.
estimator = sagemaker.estimator.Estimator(
    role="IAM role",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    train_volume_size=30,
    train_max_run=86400,
    image_name="520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-scriptmode:1.12.0-gpu-py3",
    input_mode="File",
    output_path="$.S3Location",
    tags=[
        {"Key": "AutoTraining", "Value": "True"}
    ],
    model_channel_name="model",
    enable_sagemaker_metrics=False,
    hyperparameters={
        "clusters": 10,
        "learning-rate": "0.0007",
        "batch-size": "1000",
        "epochs": "1000",
        "min-epochs": "0",
        "patience": "15",
        "sagemaker_program": "runner.py",
        "sagemaker_region": "eu-west-1",
        "sagemaker_submit_directory": "s3://somebucket/sourcedir.tar.gz"
    }
)

# The training job name is supplied as execution input so that every run is unique.
execution_input = ExecutionInput(schema={"JobName": str})

train_model = TrainingStep(
    state_id="Train model (SageMaker)",
    job_name=execution_input["JobName"],
    estimator=estimator,
    data={
        "train": "s3://somebucket/train/data.parquet",
        "val": "s3://somebucket/val/data.parquet",
        "test": "s3://somebucket/test/data.parquet"
    }
)
train_model.add_retry(Retry(
    error_equals=["States.TaskFailed"],
    interval_seconds=15,
    max_attempts=2,
    backoff_rate=4.0
))
train_model.add_catch(Catch(
    error_equals=["States.TaskFailed"],
    next_step=failed_state
))

# Deployment is handled by a Lambda function so we can control how the endpoint is updated.
deploy_model = LambdaStep(
    state_id="Deploy Model",
    parameters={
        "FunctionName": "LambdaFunction",  # replace with the name of the function you created
        "Payload": {
            "input": "HelloWorld"
        }
    }
)
deploy_model.add_retry(Retry(
    error_equals=["States.TaskFailed"],
    interval_seconds=15,
    max_attempts=2,
    backoff_rate=4.0
))
deploy_model.add_catch(Catch(
    error_equals=["States.TaskFailed"],
    next_step=failed_state
))

succeed_state = Succeed("HandleSuccess")

# Chain the steps together into the happy path.
happy_path = Chain([extract_data, transform_data, train_model, deploy_model, succeed_state])
workflow_definition = happy_path

# Next, we define the workflow
workflow = Workflow(
    name="MyWorkflow_v1234",
    definition=workflow_definition,
    role=workflow_execution_role
)