Last week, following an invitation from AWS, Inawisdom supervised and participated in a SageMaker Data Science Hackathon in London. The goal of the Hackathon was to help a digital media organisation explore Amazon’s SageMaker capabilities and assess their fit to its business needs. This will allow the organisation to innovate faster and unlock the team’s creativity. For me, this was an excellent opportunity to work beside other Data Scientists and compare techniques and ideas. I also got to dig deeper into some of SageMaker’s more advanced functions.

At Inawisdom, using SageMaker is a no-brainer. It’s a tool that allows us to work on Data Science projects collaboratively, keep client data safe, and move from training to production quickly. Besides, Inawisdom’s acceleration platform, RAMP, is based on AWS technology, so using SageMaker makes sense as we can integrate systems and services with ease.

Media and AI

Media and Ad companies have an interesting challenge: theirs is a sector that has used data science to a considerable extent for a long time. In many other sectors we see organisations really at “day 1” when it comes to AI/ML; these are organisations predominantly seeking to augment their operations using AI. In the Media and Ad sector, however, data science plays a pivotal role in the business model itself. That very maturity leads to a curious “first mover disadvantage” in some ways. The time investment that has gone into their existing models means that…

  • Changing models (e.g. to use SageMaker-hosted neural networks or custom algorithms) overnight is not easy – as there are existing production deployments to transition over
  • Any new models will take time to tune and refine to get to the accuracy and (probably more importantly) business confidence levels of the existing battle-tested models

Both of these factors create a “drag” effect on the industry sectors that adopted data science early. Ironically, the latest data science technologies like SageMaker may find their adoption held back by the most data-science-aware organisations for this reason: not because they don’t want to adopt, but because there is some inertia to overcome. Hence, a Bring Your Own Model (BYOM) approach is an obvious evolutionary step towards SageMaker adoption. Organisations already have good, sophisticated models, and using SageMaker to handle the heavy lifting of highly scalable training and endpoint inference hosting for those models is a good place to start.

The sector often uses straightforward ML models such as Generalised Linear Models and Decision Trees, mainly for their interpretability and the staff’s experience with them. These models have been around for a long time, and there are many scientific papers and case studies that portray them as the industry standard. We discussed the next generation of these models, and we believe it will feature an element of Deep Learning / Neural Networks to boost prediction performance and capture hidden nuances.

SageMaker and BYOM

My first observation was that everyone picked up SageMaker really quickly. This is not surprising, as it uses Jupyter Notebooks, a familiar tool in data science. When it came to SageMaker-specific commands, the online documentation proved very useful, as did Stack Overflow and the AWS blogs.

Second observation, and as expected: the need to BYOM. Luckily, SageMaker allows users to bring their own bespoke models into the AWS environment and use its flexible capabilities. Users can package their own algorithms by building a Docker container and using it for both model training and inference (see the sketch below). In cases where a framework/model has direct support in SageMaker (XGBoost, k-means, TensorFlow, MXNet, etc.), users can use the existing SageMaker containers and load their models straight away. In general, BYOM is not that hard, but Data Scientists / Engineers should expect to spend a couple of days setting it up. Knowing Docker containers can really expedite this.
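To make this concrete, here is a minimal sketch of BYOM training and hosting with the SageMaker Python SDK. The ECR image URI, bucket names, and instance types are hypothetical placeholders, not values from the hackathon:

    import sagemaker
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # IAM role attached to the notebook instance

    # Hypothetical ECR image containing your own training and inference code
    image_uri = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-byom-algo:latest"

    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/model-artifacts/",  # where model.tar.gz is written
        sagemaker_session=session,
    )

    # SageMaker launches the container, mounts the S3 data under
    # /opt/ml/input/data/train, and runs the container's training entry point
    estimator.fit({"train": "s3://my-bucket/training-data/"})

    # Host the trained model behind a real-time HTTPS endpoint
    predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")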

Third observation: the need to write in other programming languages. Again, SageMaker has the answer. If your team’s weapon of choice is R, you can install an R kernel, build an R Docker container, and use SageMaker for training and hosting.
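Hosting follows the same pattern whatever language lives inside the container. As an illustrative sketch (the image URI and S3 path are again placeholders), an already-trained R model packaged in a custom serving container could be deployed like this:

    import sagemaker
    from sagemaker.model import Model

    role = sagemaker.get_execution_role()

    # Custom container with an R runtime and your serving code baked in
    r_model = Model(
        image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-r-serving:latest",
        model_data="s3://my-bucket/r-model/model.tar.gz",  # pre-trained model artefact
        role=role,
    )

    # SageMaker starts the container and routes invoke-endpoint calls to it
    predictor = r_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")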

Finally, one thing that stuck with me was the interest in batch inference. SageMaker makes predictions when you call an API (“invoke-endpoint”) against a deployed model. Although one can make a single prediction from one line of input data, batch functionality also exists, allowing users to make predictions on larger volumes of input data. Doing so may be tricky: the batch data needs to be serialised to CSV string payloads (as per the API’s request body) and deserialised upon receipt. AWS provides examples of how to do this in the SageMaker Examples tab (see here).
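The shape of that serialise/deserialise round trip is roughly as follows. This is a minimal sketch using boto3, with a hypothetical endpoint name and feature rows:

    import csv
    import io

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Hypothetical batch of feature rows to score in a single call
    rows = [
        [5.1, 3.5, 1.4, 0.2],
        [6.2, 2.9, 4.3, 1.3],
    ]

    # Serialise the batch into one CSV string payload
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)

    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",  # placeholder for your deployed endpoint
        ContentType="text/csv",
        Body=buf.getvalue(),
    )

    # Deserialise the response body: one prediction per line
    predictions = response["Body"].read().decode("utf-8").strip().split("\n")
    print(predictions)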

The entire list of SageMaker example notebooks can be found here.