“Evolve or die.” These were the words that Werner Vogels used in his closing keynote at re:Invent 2022. He was alluding to the rapid pace at which the landscape around us all is changing and the fact that organisations need to continue to invest in digital transformation and modernisation to keep up with the competition.
Let’s take a quick look at just a handful of my favourite announcements from re:Invent, out of the many, many new features and services that emerged during the event. This post focuses only on the data-related announcements. Inawisdom CTO of AI/ML Phil Basford will be covering the machine learning announcements in a future post. I’ll also be publishing another post looking at some exciting new serverless announcements. Some of these features are generally available, while others have only been released in preview.
Let’s look at Redshift. Redshift is AWS’ data warehouse offering, designed to allow businesses to exploit insights from massive amounts of data. There are three new pieces of functionality in the Redshift arena that I consider to be notable.
Transactional data and analytical data are often stored in separate databases, due to different database engines being better suited to OLAP (online analytical processing) vs OLTP (online transactional processing) workloads. This means that if analysts want to derive insights from transactional data, they are often forced to build ETL (extract-transform-load) pipelines to move the data from one database to another. A key theme throughout the whole of re:Invent was AWS’ desire for a zero-ETL future. By this, they mean a cloud where moving or accessing data between services is easy and there’s little-to-no need for ‘glue’ code (not to be confused with AWS Glue).
The first announcement is Aurora zero-ETL integration with Redshift. This addresses the problem we’ve just talked about: the friction involved in analysing transactional data. With this new feature, data from Aurora will be accessible in Redshift in near real time. This allows data analysts to enrich queries with live business data, delivering more accurate insights to the business more quickly. All without data engineers having to build custom ETL pipelines.
Next up, let’s talk about getting data into Redshift. There are two main ways in which data gets into Redshift: the first is via the INSERT INTO SQL query and the second is using the COPY command to load data from S3 objects. The latter is the option that most use, as it makes loading large amounts of data much easier than specifying all the values in a SQL query.
Competitors to Redshift, such as Snowflake, have features like Snowpipe, which automatically loads data into the data warehouse when it is uploaded to an S3 bucket. This is great for architectures where ELT (extract-load-transform) is preferred over ETL. The data is loaded into a table in the warehouse without any transformation, opening the door for a tool like dbt (Data Build Tool) to curate the raw data and build data models.
Prior to re:Invent, the only way to get this functionality with Redshift as your data warehouse was to build a custom pipeline (I even wrote a blog post about it). But now, AWS have announced the preview of auto-copy from S3 to Redshift. This feature allows the creation of ‘copy jobs’ which automatically run a COPY command against Redshift when an object is created in S3. This is a big step in the right direction for Redshift. However, the presence of the VARIANT data type in Snowflake, which allows the storage of semi-structured data in a single column without having to define a target schema upfront, means that auto-loading of data in an ELT pipeline with Redshift still has a little way to go in order to catch up with the competition.
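As a rough illustration of what a copy job looks like, the sketch below builds the SQL for one in Python. The `JOB CREATE ... AUTO ON` clause reflects my reading of the preview documentation, so treat the exact syntax as an assumption and check the current Redshift docs before relying on it, since preview features can change. The helper name `build_copy_job`, the table, the bucket, the role ARN and the job name are all hypothetical.

```python
def build_copy_job(table: str, s3_prefix: str, iam_role_arn: str, job_name: str) -> str:
    """Hypothetical helper: a COPY statement registered as an auto-copy job.

    Assumption: the clause order below (COPY ... JOB CREATE <name> AUTO ON)
    matches the preview documentation at the time of writing.
    """
    return "\n".join([
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"JOB CREATE {job_name}",
        "AUTO ON;",
    ])


if __name__ == "__main__":
    print(build_copy_job(
        table="raw.orders",                                        # assumed target table
        s3_prefix="s3://example-bucket/landing/orders/",           # assumed landing prefix
        iam_role_arn="arn:aws:iam::123456789012:role/ExampleRole", # assumed role ARN
        job_name="raw_orders_job",                                 # assumed job name
    ))
```

The appeal is that the statement is written once; after that, new objects arriving under the prefix are loaded without any custom pipeline code.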
The final Redshift announcement I want to look at is Redshift Multi-AZ (preview). RA3 clusters can now be created in Multi-AZ mode to achieve high availability for those businesses that deem 24/7 access to their data warehouse business-critical. Redshift deploys two clusters in an active-active configuration across two Availability Zones, automatically shifting new queries and connections to the healthy cluster in the event of a single AZ becoming unavailable. On paper, this seems like a very useful announcement; however, there is a long list of limitations detailed in the documentation to be aware of before exploring it for a future project. Hopefully the majority of these will be resolved before general availability (GA).
QuickSight is AWS’ data visualisation tool for business users. Typically when using QuickSight, a visualisation engineer will set up a suite of dashboards and reports that meet the business’ requirements, and these are consumed until the needs change. With the advancements in Natural Language Processing (NLP), QuickSight Q was released in September 2021. QuickSight Q enables the exploration of data by asking questions such as “What was the top product sold this week?” rather than having to build a chart from scratch. This reduces the time to insight for data and business intelligence analysts.
A new announcement from re:Invent was the availability of two new question types in QuickSight Q. Now, analysts are able to explore data with questions like “Why were my sales lower in EMEA in Q3 2022?” as well as “Forecast sales in NAMER in Q1 2023”. I saw a demo of both of these in a session at re:Invent and they look pretty impressive. It’ll be interesting to see how well they perform in a real-life environment, especially on the ML-powered forecasting front.
Glue Data Quality
Any data platform is only as good as the data within it. If the quality of the data within the data lake deteriorates it can become a data swamp. This massively reduces the usefulness of the platform and inhibits the ability to extract any valuable insights from the data.
In order to prevent data swamps from forming, monitoring and enforcing data quality within the ETL or ELT pipeline is essential. There are plenty of choices available, such as Great Expectations, but there has been no ‘easy’ AWS first-party offering up to now.
At re:Invent, Glue Data Quality was announced as a way to monitor data quality both at rest in a data lake and in transit as part of a data pipeline. Glue Data Quality is built on top of an open-source library, Deequ, an AWS project that aims to implement unit testing for data on top of Apache Spark. This new service supports the definition of custom rules in addition to suggesting rules based on existing data.
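To give a feel for what ‘unit testing for data’ means, here is a minimal sketch in plain Python. It does not use Glue Data Quality or Deequ, and none of these names come from either API; `Rule`, `is_complete`, `is_unique`, `values_between` and `evaluate` are hypothetical helpers written purely to illustrate the idea of declaring rules and evaluating them against a dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


@dataclass(frozen=True)
class Rule:
    """A named check that passes or fails for a whole dataset (hypothetical type)."""
    name: str
    check: Callable[[Rows], bool]


def is_complete(column: str) -> Rule:
    """Passes when no row has a missing (None) value in the column."""
    return Rule(
        f"complete({column})",
        lambda rows: all(row.get(column) is not None for row in rows),
    )


def is_unique(column: str) -> Rule:
    """Passes when every value in the column appears exactly once."""
    def check(rows: Rows) -> bool:
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return Rule(f"unique({column})", check)


def values_between(column: str, low: float, high: float) -> Rule:
    """Passes when every value is present and within [low, high]."""
    return Rule(
        f"between({column}, {low}, {high})",
        lambda rows: all(
            row.get(column) is not None and low <= row[column] <= high
            for row in rows
        ),
    )


def evaluate(rows: Rows, rules: List[Rule]) -> Dict[str, bool]:
    """Run every rule against the dataset and report pass/fail by rule name."""
    return {rule.name: rule.check(rows) for rule in rules}


if __name__ == "__main__":
    orders = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 2, "amount": 25.5},
        {"order_id": 2, "amount": None},  # duplicate id and missing amount
    ]
    rules = [is_complete("order_id"), is_unique("order_id"), is_complete("amount")]
    for name, passed in evaluate(orders, rules).items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

A pipeline could run checks like these after each load and stop, or raise an alert, when a rule fails, which is the kind of gate that keeps a data lake from turning into a swamp.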
I’m excited to see how effective this service turns out to be – data quality monitoring is an essential part of a mature data platform and any way that AWS can make that more accessible is a good thing.
Last, but very much not least, DataZone. This is a brand-new service for AWS in an area that they haven’t really explored before.
Amazon DataZone is a data cataloguing and governance tool to help “unlock data across organizational boundaries”. DataZone is based around the concept of data producers and data consumers. Data producers make data available from sources such as Redshift and Glue Catalog in projects. Data consumers then subscribe to these projects to gain access to the data. Consumers can then explore the data by utilising federated access to the Redshift or Athena query console.
Whilst this service is still in closed preview, it looks to be a great addition to the AWS suite and I can see this being useful for larger organisations where the practice of data sharing can yield incredible benefits and increased agility.
AWS released some interesting new services and features in the data space at re:Invent. Overall, I was relatively pleased with the quality of what was announced – however, I’ll reserve absolute judgement until preview features become generally available and can be properly evaluated.
Keep an eye out for Phil Basford’s post on the AI/ML announcements, as well as my next post on the general cloud and serverless announcements, coming soon!