At Inawisdom we routinely take our clients’ Machine Learning models and productionise them. In this blog I am going to cover some of the aspects of how we accomplish this, offer some top tips, and also share some things we’ve found along the way as we’ve lifted the bonnet on how Amazon SageMaker implements endpoints for performing predictions.

What are Amazon SageMaker Endpoints?

If you have read any of our previous blogs or watched this video about AWS SageMaker, you’ll know that here at Inawisdom we are big fans of AWS SageMaker. It has lots of neat features that help Data Scientists massively in preparing data and in training models. But what I really love is how AWS SageMaker handles the next step in the application of Machine Learning: using trained model artefacts to make real-time inferences or batch predictions. This is where an Amazon SageMaker endpoint steps in. An Amazon SageMaker endpoint is a fully managed service that allows you to make real-time inferences via a REST API. It takes away the pain of running your own EC2 instances, loading artefacts from S3, wrapping the model in some lightweight REST application, attaching GPUs and much more. This is great as it means that with a single click or command you have a fully working solution. Here is an example deployment for XGBoost from a notebook in AWS SageMaker:
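As a rough illustration of what such a deployment involves, here is a minimal sketch using the three underlying SageMaker API calls (create a model, create an endpoint configuration, create an endpoint). It is not the notebook's own code. The function name, the "AllTraffic" variant name and every argument value are placeholders I have assumed for illustration, and `sm_client` is expected to behave like `boto3.client("sagemaker")`.

```python
def deploy_xgboost_endpoint(sm_client, name, image_uri, model_data_url, role_arn,
                            instance_type="ml.t2.medium", instance_count=1):
    """Create a model, an endpoint configuration and an endpoint.

    sm_client is assumed to behave like boto3.client("sagemaker").
    All names and values passed in are the caller's own (hypothetical here).
    """
    # 1. Register the trained artefact (a model.tar.gz in S3) with its container image.
    sm_client.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
    )

    # 2. Describe the instances that should sit behind the endpoint.
    config_name = f"{name}-config"
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # illustrative variant name
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    )

    # 3. Ask SageMaker to launch the instances and expose the REST endpoint.
    sm_client.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return name

# Usage (needs AWS credentials, so it is left as a comment):
#   import boto3
#   deploy_xgboost_endpoint(boto3.client("sagemaker"), "my-xgboost",
#                           "<xgboost image URI>", "s3://<bucket>/model.tar.gz",
#                           "<execution role ARN>")
```

Passing the client in as a parameter keeps the sketch easy to exercise with a stub before pointing it at a real account.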

Production Workloads

For basic workloads a simple deployment of the kind above will allow you to get up and running, then sit back and watch inferences/predictions being made. However, for production workloads that have high throughput or are mission critical, a lot more needs to be considered. Currently, for an AWS SageMaker endpoint you pay an on-demand price for the number of hours that the instances behind it are up and running (including idle time). Therefore, as with EC2 instances, the following need to be considered and balanced:

  • Cost Optimisation: For an AWS SageMaker endpoint you need to settle on an instance type (with or without an Elastic Inference accelerator) that satisfies your baseline usage
  • Elastic Scaling: You need to tune how the instances behind an AWS SageMaker endpoint scale in and scale out with the amount of load, handling fluctuations between low and high usage
  • High Availability: Black Swan events like Availability Zone failures can occur, so mission critical systems must handle such events smoothly. High Availability also allows model updates to be rolled out seamlessly, and rolled back to the previous good version, without any disruption to service.

The Underlying Nature of SageMaker Endpoints

To understand how to optimise and tune an AWS SageMaker endpoint we need to understand more about its nature and how it uses instances. Firstly, AWS openly publishes that AWS SageMaker uses Docker for training jobs and endpoints. More details can be found at:

Using Docker immediately raises the following questions:

  • How many docker containers will run on a single underlying EC2 instance?
  • Is Kubernetes or ECS used? And do I have to become a docker expert?
  • How quickly are instances started and stopped?
  • How do instances reside within the VPC and use network resources? For example, can the number of instances exhaust the network addresses of a VPC?
  • How isolated are my models, given that Docker uses soft CPU and memory units?
  • Will I suffer issues if containers are bin packed or re-distributed?

The Experiment

In order to lift the bonnet of an Amazon SageMaker endpoint and to look into all these considerations I conducted a series of experiments, and here is how I did it:

    1. I created a dedicated VPC in eu-west-1 with three /20 subnets (4091 usable addresses each), one located in each of the three Availability Zones
    2. I deployed a notebook instance of Amazon SageMaker into the VPC in AZ 1a and then downloaded this example notebook xgboost byom.
    3. In the notebook I changed the model configuration to use the VPC:
    4. I changed the endpoint configuration to use more than one instance:
    5. I followed all the steps up to “Create endpoint” and then noted the addresses available in the subnets
    6. I then ran the “Create endpoint” section and waited until the endpoint was in service and then noted the addresses available in the VPC
    7. I then created a new AWS Lambda function that takes the body of an HTTP POST request and forwards it to the AWS SageMaker endpoint. Once a response is returned from the endpoint, the Lambda returns it to the invoker
    8. Next I created an AWS API Gateway instance with a regional endpoint and used an API Key for authentication. Within the AWS API Gateway I created a resource with an AWS Lambda integration that invoked the lambda I created in step 7
    9. I downloaded, installed and configured serverless-artillery to hit my API Gateway:
    10. I started the load test from serverless-artillery
    11. Every 15 minutes for 2 hours I recorded the available addresses in the VPC and the metrics from CloudWatch
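Steps 3, 4 and 7 above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the experiment: the helper names, the "AllTraffic" variant name and the "text/csv" content type are my own choices, and `runtime_client` is assumed to behave like `boto3.client("sagemaker-runtime")`.

```python
def vpc_model_request(name, image_uri, model_data_url, role_arn,
                      subnet_ids, security_group_ids):
    """Step 3: create_model arguments with a VpcConfig added."""
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        # One subnet per Availability Zone lets instances spread across AZs.
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }


def two_instance_variant(model_name, instance_type="ml.t2.medium"):
    """Step 4: a production variant that asks for more than one instance."""
    return {
        "VariantName": "AllTraffic",  # illustrative variant name
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 2,
    }


def make_handler(runtime_client, endpoint_name, content_type="text/csv"):
    """Step 7: build a Lambda handler that forwards the POST body to the endpoint.

    content_type="text/csv" is an assumption about the payload format.
    """
    def handler(event, context):
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=content_type,
            Body=event["body"].encode("utf-8"),
        )
        # The response body is a stream; read it and hand it back to API Gateway.
        prediction = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": prediction}
    return handler

# In a real Lambda module the handler would be created once at import time, e.g.
#   handler = make_handler(boto3.client("sagemaker-runtime"), os.environ["ENDPOINT_NAME"])
# where ENDPOINT_NAME is a hypothetical environment variable.
```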

Please note:

  • The ‘ml.t2.medium’ instance type was used as there was a very tight service limit on the AWS account I was using. This highlights a very important point, AWS Service limits for AWS SageMaker are very restricted. You should be proactive in asking for a limit increase once you know your instance type and the number of instances you will need for an endpoint. I would advise making the support request in advance of commissioning any model into production.
  • An API Key was used instead of IAM roles for API Gateway to make the integration with serverless-artillery as easy as possible


The load test of the XGBoost model lasted for 2 hours with a slow ramp up and then sustained level of load. I was able to confirm this and the fact that I was using two instances by looking at the endpoint’s total CPU usage in AWS CloudWatch. Here is the graph:


Some noticeable interpretations from this graph are:

  • There was a slow ramp-up for the first 15 minutes until we hit around the 200% CPU usage mark. 200% CPU usage means we are using more than the capacity of a single endpoint instance.
  • At 13:20 we saw the start of a drop in CPU usage, which bottomed out at 100% at 13:40. Why was this? From the load script I configured we know that this is when serverless-artillery entered the 2nd phase of sustained load.
  • We then saw a return to 200% CPU usage 10 minutes later.
  • At 14:40 we saw a complete stop in load and this is when the serverless-artillery job completed.

In order to look into what happened at 13:40 I looked at the AWS CloudWatch Logs streams:

From the AWS CloudWatch Logs we can see a few things:

  • During the entire run there were only three log streams, each with an ‘instance id’ as its identifier. Combined with the fact that we only ever saw two IP addresses in use in the subnets, this means there were only ever three instances in total and never more than two at once.
  • There were 2 instances at the start of the run and at the end we still had only two instances. However, in between at 13:24 one of them stopped making log entries and at 13:38 just before 13:40 (the 100% CPU marker) we can see in CloudWatch a new instance starts to record log entries.

To look in more detail at what happened on instance i-01be56xxx, let’s look at the log stream entries:

Here is a view of the same entries in AWS CloudWatch Logs Insights for 3 minutes:

The findings are:

  • There are no errors in the logs and the last entry was at 13:24. This would imply the instance did not crash due to a fatal error.
  • Instance i-010adxx started logging at 13:38 but we only saw the CPU start to increase at 13:45, i.e. a gap of 7 minutes to start up, which is in line with the default start-up of 5 minutes.
  • The output from AWS CloudWatch Logs Insights confirms that each instance was producing 100-150 entries per second (one entry per request).
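For anyone wanting to reproduce the per-second counts, a query along the following lines could be run against the endpoint's log group. This is a sketch under assumptions: the query string and function name are mine rather than the ones used in the experiment, and `logs_client` is assumed to behave like `boto3.client("logs")`.

```python
import time

# Count log entries per log stream (one stream per instance) in 1-second buckets.
QUERY = "stats count(*) as entries by @logStream, bin(1s)"


def entries_per_second(logs_client, endpoint_name, start_epoch, end_epoch,
                       poll_seconds=1.0):
    """Run a Logs Insights query over an endpoint's log group and return its rows."""
    query_id = logs_client.start_query(
        logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",
        startTime=start_epoch,  # seconds since the epoch
        endTime=end_epoch,
        queryString=QUERY,
    )["queryId"]

    # Logs Insights queries are asynchronous, so poll until a final status.
    while True:
        result = logs_client.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)

    if result["status"] != "Complete":
        raise RuntimeError(f"query ended with status {result['status']}")

    # Each row is a list of {"field": ..., "value": ...} pairs; flatten to dicts.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in result["results"]]
```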

From AWS CloudWatch Logs we can also find more indications on how endpoint instances work. Here is the top of that same log stream of instance i-01be56xxx:

The log entries above tell us the following:

  • The XGBoost inference engine is implemented using Gunicorn. This means the endpoint communicates over a lightweight REST API with the model located on each of the instances.
  • 5 Gunicorn workers were created. This is very important: each Python worker process is effectively single threaded, so you need to relate the number of workers to the number of CPU cores. Here is the explanation from Gunicorn:

“Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request” 2019 ©
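Applying that formula gives the 5 workers seen in the log, assuming the ‘ml.t2.medium’ instance exposes 2 vCPUs to the container (my assumption; the logs do not state the core count). A small worked sketch, with a function name of my own choosing:

```python
import multiprocessing


def recommended_workers(num_cores=None):
    """Gunicorn's starting-point rule of thumb: (2 x cores) + 1."""
    if num_cores is None:
        # Fall back to the cores visible on the current machine.
        num_cores = multiprocessing.cpu_count()
    return 2 * num_cores + 1


print(recommended_workers(2))  # 2 vCPUs assumed for ml.t2.medium -> 5
```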


The key findings from the experiment were:

  • Load was spread evenly over the 2 instances and each instance can serve a large amount of inference requests per second. This means that only a few instances are required, and this is reflected in the AWS SageMaker service limits.
  • The failure of instance i-01be56xxx shows that High Availability is possible and that instances can auto-heal. Instance i-01be56xxx was an ‘ml.t2.medium’, and we can assume it used up all of its CPU credits, became overwhelmed and stopped responding once there was not enough CPU left to answer health checks.
  • Terminating instance i-01be56xxx took 5 minutes and instance i-010adxx took 7 minutes to start. Such timings are indicative of the time it takes to start EC2 instances (5 minutes or more). By comparison, Docker containers would normally take minutes and Lambda functions would take seconds.
  • The ‘InitialInstanceCount’ was set to 2 as a hard limit. In neither the AWS CloudWatch outputs nor the VPC monitoring did we ever see fewer or more than 2 instances running at the same time. We did, however, show that if an instance crashes or terminates it will be auto-healed.
  • We have proven from the endpoint creation process and from building a model that an endpoint uses Docker containers on EC2 instances. The load test confirms that the containers are invoked using requests.
  • We have seen that each instance can serve a high number of requests per second and can handle multiple requests at once. This means a low number of instances is used for inferences/predictions.
  • Each instance if deployed into a VPC will use a single IP Address. This means that if a reasonable subnet range is used then the number of IP addresses will not be exhausted.
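To put the address usage into numbers, here is a small worked sketch. The CIDR blocks are hypothetical (the experiment's actual ranges are not given), and the figure of 5 reserved addresses per subnet is AWS's standard VPC reservation.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address


def usable_addresses(cidr):
    """Number of addresses AWS leaves available in a subnet of the given CIDR."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


# Hypothetical /20 ranges, one per Availability Zone.
subnets = ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]
total = sum(usable_addresses(cidr) for cidr in subnets)

print(usable_addresses(subnets[0]))  # 4091
print(total)                         # 12273
print(total - 2)                     # 12271 left with two endpoint instances running
```

At one address per instance, even a much smaller subnet would comfortably cover the handful of instances a typical endpoint needs.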

Further Reading

We could stop there but let’s think about ‘InitialInstanceCount’ again and discuss a “top tip”. From the findings in this blog an endpoint does not seem very elastic, but this is actually not correct. You can make the instances an AWS SageMaker endpoint uses scale in and out with your load. The way to achieve this is with AWS Application Auto Scaling: you register the endpoint’s production variant as a scalable target and attach a scaling policy to it. This lets you scale the variant much like a regular EC2 auto-scaling group, with the desired instance count as the scalable dimension. How to do it is hidden in the depths of the AWS documentation (AWS : Endpoint auto scaling add policy). You can set up auto scaling from the AWS Console, the CLI or CloudFormation. Here is a CloudFormation example:
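As a sketch of what such a template can contain, the snippet below builds the two Application Auto Scaling resources as a Python dictionary and prints them as CloudFormation JSON. The endpoint name, logical resource IDs, capacity limits, cooldowns and target value are all illustrative assumptions, and `RoleARN` is omitted on the assumption that the service-linked role is used.

```python
import json


def endpoint_autoscaling_resources(endpoint_name, variant_name="AllTraffic",
                                   min_instances=2, max_instances=4,
                                   invocations_per_instance=1000.0):
    """CloudFormation resources that let an endpoint variant scale in and out."""
    return {
        "EndpointScalableTarget": {
            "Type": "AWS::ApplicationAutoScaling::ScalableTarget",
            "Properties": {
                "ServiceNamespace": "sagemaker",
                "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
                "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
                "MinCapacity": min_instances,
                "MaxCapacity": max_instances,
            },
        },
        "EndpointScalingPolicy": {
            "Type": "AWS::ApplicationAutoScaling::ScalingPolicy",
            "Properties": {
                "PolicyName": f"{endpoint_name}-invocations-target",
                "PolicyType": "TargetTrackingScaling",
                "ScalingTargetId": {"Ref": "EndpointScalableTarget"},
                "TargetTrackingScalingPolicyConfiguration": {
                    # Illustrative target: invocations per instance per minute.
                    "TargetValue": invocations_per_instance,
                    "PredefinedMetricSpecification": {
                        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
                    },
                    "ScaleInCooldown": 300,   # seconds, illustrative
                    "ScaleOutCooldown": 300,  # seconds, illustrative
                },
            },
        },
    }


template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": endpoint_autoscaling_resources("my-xgboost-endpoint"),  # hypothetical name
}
print(json.dumps(template, indent=2))
```

Target tracking on invocations per instance is only one option; step scaling on other CloudWatch metrics follows the same scalable-target pattern.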

Please note: once auto scaling is in place, ‘InitialInstanceCount’ is truly an initial ‘Desired’ instance count.


We have clearly shown that AWS SageMaker uses Docker for endpoints to provide isolation and abstraction. It allows the inference engine of the machine learning model to be coded in whatever programming language is your preference. It also provides the inference engine with isolation from the Operating System of the underlying instance. Docker is not being used for consolidation (i.e. a single Docker container is hosted per instance) and using the correct instance type in production is very important. You have to be very careful that the instance size suits your model.

In order to get both the best performance and the most optimised costs you need to load test your model. The target to aim for is to work out the number of invocations your model can handle on your instance type – balancing the CPU utilisation with the error rate and reaching around 80% usage to allow for some spikes in traffic. We also highly recommend utilising all the AWS CloudWatch metrics that AWS SageMaker provides, including putting alarms on model latency and invocations per instance.
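One way to turn a load-test result into a scaling target is the calculation below: peak requests per second per instance, multiplied by a safety factor, multiplied by 60 to get a per-minute figure. The 100 requests per second is taken as an illustrative value from the low end of the range seen in the experiment, and the 0.8 safety factor mirrors the 80% usage suggested above. The function name is my own.

```python
def invocations_per_instance_target(max_rps_per_instance, safety_factor=0.8):
    """Per-minute invocation target for one instance, leaving headroom for spikes."""
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return max_rps_per_instance * safety_factor * 60


# 100 requests/second at saturation, 80% headroom target -> 4800 per minute.
print(invocations_per_instance_target(100))
```

The result is the kind of value you would feed into an invocations-per-instance alarm or scaling policy, then refine with further load tests.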

AWS SageMaker provides high availability and fault tolerance for endpoints by allowing you to configure multiple instances spread over multiple Availability Zones. Deploying an AWS SageMaker endpoint into a VPC gives the model a connection to resources in that VPC and can add additional security. Because each instance uses only a single IP address, you do not need to worry about exhausting the network’s IP addresses in a set of reasonably sized subnets.

I hope this blog has provided a detailed insight into how AWS SageMaker implements endpoints and how endpoints work under the hood. This has been invaluable information for us at Inawisdom when productionising Machine Learning models!