At Inawisdom we use Serverless technologies and Micro Service architectures for embedding Machine Learning models within applications or creating distributed solutions. We have found that in production Serverless architectures are cost-efficient and address many of the classic non-functional operational and monitoring requirements of our customers.
This blog will use an example Serverless architecture for the inference process (making a prediction) and it will look at how to monitor its operation, performance tune its implementation, and eliminate bottlenecks.
In order to accomplish this, we will be using the following AWS services/tools:
REFERENCE ARCHITECTURE
Above is our reference architecture for this blog. Here is a quick rundown of the components (each one could be the subject of its own blog):
- Amazon API Gateway exposes and manages the API and is our contract with the rest of an enterprise (or it could be consumed by a mobile client)
- Amazon DynamoDB will store any reference data
- Amazon SageMaker Endpoint will host the deployed ML model
- AWS Lambda will contain the application logic; it processes the API request, looks up the reference data, calls the model for a prediction and then formats the response
The target uptime for the reference architecture is 99.95% (the same as API Gateway). The solution is business-critical, and a response time below 200ms at the 90th percentile is required. Let’s assume that everything is Multi-AZ. If you would like to know more about High Availability and Fault Tolerance, see “Amazon SageMaker Endpoints: inference at scale with high availability”. This blog, however, focuses on how to prove that the SLA is met and the performance targets are not breached.
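To put those targets into numbers, here is a minimal sketch of the arithmetic. The function names, the 30-day month and the nearest-rank percentile definition are our assumptions for illustration, not part of any AWS API.

```python
import math


def downtime_budget_minutes(target_uptime_pct, days=30):
    """Minutes of downtime allowed in a period of `days` days at the given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - target_uptime_pct / 100)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_target(latencies_ms, pct=90, limit_ms=200):
    """True if the pct-th percentile response time is below limit_ms."""
    return percentile(latencies_ms, pct) < limit_ms


if __name__ == "__main__":
    # 99.95% over an assumed 30-day month leaves roughly 21.6 minutes of downtime.
    print(round(downtime_budget_minutes(99.95), 1))
```

With a one-minute health-check, that budget corresponds to roughly 21 missed checks in a 30-day month before the uptime target is breached.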
UPTIME AND USAGE
API Gateway is a great service and has lots of built-in Amazon CloudWatch metrics for your KPIs. However, a couple of metrics are noticeably absent: availability and Usage Plan usage. A Usage Plan is a set of limits that you can place on your API to stop abuse and control expenditure. Your API will return HTTP status code 429 if the limits are breached. Without any additional logging or monitoring, it is hard to spot that an issue is Usage Plan related. Both gaps can be closed using AWS Lambda.
Firstly, use a Lambda function to implement a health-check that fires every minute, triggered by a scheduled Amazon CloudWatch Events rule. The health-check polls the API with a single ‘meaningful’ request from a different account than the one your API is deployed in. A ‘meaningful’ request is important because it needs to exercise the solution end-to-end. When the API returns successfully, the health-check Lambda registers the success by putting a metric into Amazon CloudWatch to update a simple counter. To detect a failing health-check, configure an Amazon CloudWatch alarm that signals an alert if the metric is missing for 5 minutes.
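The health-check described above could be sketched roughly as follows. The event shape (`url`, `payload`, `api_name`), the namespace and the metric name are hypothetical, and the `cloudwatch` and `opener` parameters exist only so the logic can be exercised without AWS access.

```python
import json
import urllib.request

NAMESPACE = "Custom/ApiHealth"       # hypothetical custom namespace
METRIC_NAME = "HealthCheckSuccess"   # hypothetical metric name


def check_api(url, payload, opener=urllib.request.urlopen, timeout=5):
    """Send one end-to-end request; return True only for an HTTP 2xx reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive from OSError.
        return False


def success_metric(api_name):
    """Arguments for CloudWatch put_metric_data: one success counted."""
    return {
        "Namespace": NAMESPACE,
        "MetricData": [{
            "MetricName": METRIC_NAME,
            "Dimensions": [{"Name": "ApiName", "Value": api_name}],
            "Value": 1,
            "Unit": "Count",
        }],
    }


def missing_data_alarm(api_name, topic_arn):
    """Arguments for CloudWatch put_metric_alarm: alarm after 5 one-minute periods without a success."""
    return {
        "AlarmName": f"{api_name}-health-check-missing",
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # a missing data point counts as a failure
        "AlarmActions": [topic_arn],
    }


def handler(event, context, cloudwatch=None, opener=urllib.request.urlopen):
    """Lambda entry point: record a success metric only when the API answered correctly."""
    if cloudwatch is None:
        import boto3  # imported lazily; bundled with the Lambda Python runtime
        cloudwatch = boto3.client("cloudwatch")
    healthy = check_api(event["url"], event["payload"], opener=opener)
    if healthy:
        cloudwatch.put_metric_data(**success_metric(event["api_name"]))
    return {"healthy": healthy}
```

Because a failed check writes nothing, the alarm relies on `TreatMissingData="breaching"` to notice the gap, which matches the "missing for 5 minutes" rule.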
Secondly, in the same account as the API, create a different Amazon CloudWatch Events rule that fires daily and invokes a new AWS Lambda function. The new Lambda function calls the management API of the API Gateway service to get the previous day’s usage details for each Usage Plan and then puts corresponding metrics into Amazon CloudWatch. As before, set up an Amazon CloudWatch alarm that signals an alert if the usage metric is within 10% of the Usage Plan limit.
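A rough sketch of that daily usage function is shown below. It assumes each entry returned by `get_usage` is a `[used, remaining]` pair per day, and the metric name, namespace and dimension names are hypothetical. Pagination of the API Gateway calls is left out for brevity.

```python
from datetime import date, timedelta


def previous_day(today):
    """ISO date string (YYYY-MM-DD) for the day before `today`."""
    return (today - timedelta(days=1)).isoformat()


def usage_metrics(plan, usage):
    """Convert one get_usage response into CloudWatch metric data, one entry per API key."""
    limit = plan["quota"]["limit"]
    data = []
    for key_id, days in usage.get("items", {}).items():
        # Assumption: each daily entry is [used, remaining]; take the latest day.
        _used, remaining = days[-1]
        data.append({
            "MetricName": "QuotaUsedPercent",  # hypothetical metric name
            "Dimensions": [
                {"Name": "UsagePlan", "Value": plan["name"]},
                {"Name": "ApiKeyId", "Value": key_id},
            ],
            "Value": 100.0 * (limit - remaining) / limit,
            "Unit": "Percent",
        })
    return data


def handler(event, context, apigateway=None, cloudwatch=None, today=None):
    """Lambda entry point: publish the previous day's quota consumption for every Usage Plan."""
    if apigateway is None or cloudwatch is None:
        import boto3  # imported lazily; bundled with the Lambda Python runtime
        apigateway = apigateway or boto3.client("apigateway")
        cloudwatch = cloudwatch or boto3.client("cloudwatch")
    day = previous_day(today or date.today())
    for plan in apigateway.get_usage_plans()["items"]:
        if "quota" not in plan:
            continue  # plans without a quota have no limit to compare against
        usage = apigateway.get_usage(usagePlanId=plan["id"], startDate=day, endDate=day)
        data = usage_metrics(plan, usage)
        if data:
            cloudwatch.put_metric_data(Namespace="Custom/ApiUsage", MetricData=data)
```

An alarm on `QuotaUsedPercent` with a threshold of 90 would then correspond to the "within 10% of the limit" rule.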
Lastly, Amazon CloudWatch dashboards are great for transparency and spotting issues during an incident. Therefore, create an Amazon CloudWatch dashboard for the API and solution that includes availability and usage.
AWS Lambda, API Gateway and SageMaker Endpoints record numerous Amazon CloudWatch metrics. The key ones to look at for KPIs are:
- API Gateway 4XX and 5XX errors
- Lambda Duration (latency)
- Lambda Invocations
- Lambda Errors
- SageMaker Endpoints Model Latency and Invocations
- SageMaker Endpoints Disk, CPU and memory usage
- SageMaker Endpoints 4XX and 5XX errors
For the above metrics, configure Amazon CloudWatch alarms according to your KPIs. Our top tips here are:
- Detect and alarm on any abnormal changes in traffic volume. For example, for invocations create an alarm with a “less than” threshold and combine this with an alarm for a “greater than” threshold. The new anomaly detection feature in CloudWatch can also be used for these purposes.
- Design your alarms to work well together. For example, avoid triggering multiple alarms if the same metric is missing for a period. Instead, have one alarm that breaches on missing data and set the other alarms to ignore it.
- Dealing with something as soon as possible reduces the business/user impact. To take proactive steps, consider levelling your alarms, e.g. 80% warn / 95% critical.
- Standardise your alarms by creating them using a reusable CloudFormation template. This also allows you to track changes. For example, place the Amazon SageMaker Endpoints alarms into the same template created in the “Amazon SageMaker Endpoints: inference at scale with high availability” blog.
- Make sure you enable Amazon CloudWatch detailed metrics and logging on each stage of the API Gateway deployment.
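As an illustration of levelled alarms, the sketch below builds a warn/critical pair for one metric. The function names are hypothetical, and it uses boto3-style `put_metric_alarm` arguments for brevity rather than the CloudFormation template recommended above. Both alarms ignore missing data, on the assumption that a separate alarm handles the missing-metric case.

```python
def levelled_alarms(name, namespace, metric, dimensions, limit,
                    warn_pct=80, critical_pct=95,
                    period=60, evaluation_periods=3, statistic="Average"):
    """Return put_metric_alarm arguments for a warn and a critical alarm on one metric."""
    alarms = []
    for level, pct in (("warn", warn_pct), ("critical", critical_pct)):
        alarms.append({
            "AlarmName": f"{name}-{level}",
            "Namespace": namespace,
            "MetricName": metric,
            "Dimensions": dimensions,
            "Statistic": statistic,
            "Period": period,
            "EvaluationPeriods": evaluation_periods,
            "Threshold": limit * pct / 100,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            # Leave missing-data detection to a single dedicated alarm.
            "TreatMissingData": "ignore",
        })
    return alarms


def create_alarms(cloudwatch, alarms):
    """Create each alarm with a CloudWatch client (for example boto3.client('cloudwatch'))."""
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)


# Example: CPU alarms for a SageMaker endpoint variant (endpoint and variant names are made up).
cpu_alarms = levelled_alarms(
    name="my-endpoint-cpu",
    namespace="/aws/sagemaker/Endpoints",
    metric="CPUUtilization",
    dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    limit=100,  # adjust for multi-CPU instances, where utilisation can exceed 100%
)
```

Generating both levels from one definition keeps the thresholds consistent across metrics, which is the same benefit a shared template gives.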
Lastly, again for transparency and ease of spotting issues during an incident, add your KPIs to the Amazon CloudWatch dashboard for your API and solution. Here is an example:
To understand the performance of a solution and how its services interrelate, we use AWS X-Ray. AWS X-Ray is an application performance management tool that observes your solution at run-time. It does this by capturing data from the invocation of AWS services and your application code. The X-Ray view in the AWS console provides a full overview of how each element is performing. In the example below, the AWS console shows all the AWS services being used in our reference architecture, including the average response times of the model in Amazon SageMaker:
To use AWS X-Ray, you need to download the AWS X-Ray SDK. The AWS X-Ray SDK works by monkey patching the main AWS SDK and a number of open source SDKs. Monkey patching means wrapping a class at runtime with another that records its invocation and completion. AWS X-Ray takes these captures locally, then in the background sends them asynchronously in batches over UDP to the X-Ray Daemon. The X-Ray Daemon runs as a sidecar to your main application (note that for Lambda, running the X-Ray Daemon is handled for you by the Lambda runtime). The X-Ray Daemon then sends the data to the AWS X-Ray API.
Here is how to monkey patch X-Ray in Python:
from aws_xray_sdk.core import patch_all
patch_all()  # patch every supported library that is installed, e.g. boto3
For using X-Ray, here are some top tips:
- X-Ray is not part of the AWS SDK; this means you need to add it to your dependency management framework. This also applies to the Lambda runtime, as it requires you to deploy the X-Ray SDK yourself. This seems a bit strange as the X-Ray Daemon is available as a sidecar within the Lambda runtime. This makes the X-Ray SDK ideal for putting into a Lambda Layer.
- Cold Starts are shown in the AWS console in amber – look into optimising these as much as you can
- Having monkey patched X-Ray, X-Ray will not record anything until you complete the following steps:
- Enable X-Ray for your stage in API Gateway
- Ensure any Lambda functions have IAM roles that include a policy with write access for the X-Ray API
- Use AWS X-Ray everywhere (production, QA and development) and all of the time. X-Ray has a feature where you can use different Sample Rates, and it loads them at run-time. This means that in production the Sample Rate can be lowered to only 1% of your traffic. This can help in spotting any issues and identifying any changes in performance between deployments.
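For the IAM step in the tips above, a minimal policy document granting write access to the X-Ray API might look like the sketch below. The function name is hypothetical. AWS also provides a managed policy, `AWSXRayDaemonWriteAccess`, which can be attached to the role instead.

```python
import json


def xray_write_policy():
    """Minimal IAM policy document that lets a role send trace data to X-Ray."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords",
            ],
            "Resource": "*",
        }],
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into a role or a template.
    print(json.dumps(xray_write_policy(), indent=2))
```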
AWS X-Ray is heavily used at Inawisdom as it is especially useful for performance tuning and solving bottlenecks. In the AWS Console, AWS X-Ray visualises the captures it records in the form of traces. A trace shows the complete execution time of the Lambda, including the processing time of any downstream services and how many times each service is invoked. The picture below is a trace from the reference architecture. The trace clearly shows the calls to the model (Amazon SageMaker); from it we can see that the model took less than 15ms to make a prediction:
In the above example trace there are ‘## predict’ entries; these entries come from another feature of X-Ray. X-Ray allows you to instrument the application logic within Lambda functions for even deeper analysis. Here is an example of how to annotate a function in Python. Firstly, include this:
from aws_xray_sdk.core import xray_recorder
Then add the X-Ray annotation (a Python decorator) to your functions. The annotation means that every time the function is run it will be recorded by X-Ray. Here is an example:
@xray_recorder.capture('## predict')
def __predict(self, req): ...  # body omitted
Annotations and multiple functions work as follows:
- Calling an annotated function from another annotated function nests the captures
- Calling an annotated function more than once from another annotated function nests the multiple captures under the parent capture
- Calling an annotated function from within a loop means each iteration will create a new capture, under the calling annotated function
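To make the nesting rules above concrete, here is a toy stand-in that mimics how captures nest. It is not the X-Ray SDK: `ToyRecorder`, `outline` and the function names are invented for illustration, and it only reproduces the parent/child structure, not timings or any data sent to X-Ray.

```python
import functools


class ToyRecorder:
    """Illustrative stand-in (NOT the X-Ray SDK) that records nested captures as a tree."""

    def __init__(self):
        self.root = {"name": "handler", "children": []}
        self._stack = [self.root]  # the innermost open capture is last

    def capture(self, name):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                node = {"name": name, "children": []}
                self._stack[-1]["children"].append(node)  # attach under the caller
                self._stack.append(node)
                try:
                    return func(*args, **kwargs)
                finally:
                    self._stack.pop()  # close the capture even if func raises
            return wrapper
        return decorator


def outline(node, depth=0):
    """Render the capture tree as indented lines."""
    lines = ["  " * depth + node["name"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


recorder = ToyRecorder()


@recorder.capture("## lookup")
def lookup(key):
    return key.upper()


@recorder.capture("## predict")
def predict(keys):
    # Each loop iteration creates a new '## lookup' capture under '## predict'.
    return [lookup(key) for key in keys]


if __name__ == "__main__":
    predict(["a", "b", "c"])
    print("\n".join(outline(recorder.root)))
```

Running it prints one `## predict` entry with three `## lookup` entries indented beneath it, the same shape the bullets describe for a loop inside an annotated function.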
Sometimes we have cases where we build or deal with custom models and SageMaker Endpoints. In this situation we are not using the stock AWS TensorFlow or PyTorch SageMaker Docker images for inference, but instead building a model in Python and serving it using Gunicorn. Looking at X-Ray, we found that it supports Gunicorn, and we therefore integrated the two. As a result, when we deploy a custom model into Amazon SageMaker, we can record traces from inside the model during live inference.
At Inawisdom we want to deliver as much value as possible for our customers’ businesses, and to do so at every stage of their Machine Learning journey – from initial discovery and hypothesis testing through to high-volume business-critical inferences. In this article I’ve shown how we combine the latest DevOps tools and services such as X-Ray, Lambda, CloudWatch, SageMaker and API Gateway to meet the considerations outlined in the Operational Excellence pillar of the AWS Well-Architected Framework for the most demanding machine learning production workloads.