Inawisdom engagements invariably involve lots of data wrangling – whether that’s data cleansing and feature engineering for a Machine Learning DaaS , or migrating customer data into a new AWS RAMP environment.
Our rapid, agile delivery approach requires us to demonstrate value quickly, often by finding innovative ways to ingest and transform data.
In this blog, I’ll explain how we took 2000 Excel workbooks, which contained 8 million cell values and 500k embedded comments and processed them in 10 seconds.
For a recent data migration project, the customer had over 2000 Excel workbooks, each of which held a mixture of values and embedded cell comments.
Embedded comments present a unique problem because they are invisible to most import tools – we had to find a way to expose the data without writing reams of VBA macro code.
The workbooks started life as well-structured, machine-readable templates. Over time they were curated by users who accidentally deleted rows and columns and distorted the machine-friendly meta-data.
Our challenge was to extract the Excel data and transform it into S3 files, so that it could loaded into an AWS Redshift data warehouse.
The project fell neatly into 4 phases:
1. Unpacking a single workbook
2. Scaling the solution to handle 2000 files
3. Validating the process
4. Transferring the raw data to Redshift
Phase 1: Unpacking a single Excel workbook
Having settled on the versatile OpenPyXL Python library to unpack the Excel workbooks, we needed an iterative development environment, which would allow us to examine the data at each stage of its transformation; and to demonstrate the capability to the customer.
Inawisdom adopt a defensive security posture – customer data must remain in the AWS cloud during processing. SageMaker Notebook Instances gave us the perfect blend of development flexibility and cloud security.
The notebooks run entirely in the browser (so data does not leave the customer’s AWS account) and they can be stopped when not in use, which minimises the cloud computing overhead.
They combine Python code, development notes and query results in a single document, which enables us to bring the customer alongside the development work.
We developed Python code to read a single XLSX file into a Pandas DataFrame. The Excel files had multi-line headers and gaps in the data, so we applied a series of transformations until we arrived at a conventional table structure.
The OpenPyXL library allowed us to fetch the underlying cell comments, as well as the actual cell values.
Each Excel workbook had a different width. To homogenise the data across 2000 files, we used the Pandas “melt” function to UNPIVOT the data into a uniform structure.
The Python code produced 3 outputs for each workbook processed: (1) a CSV of extracted cell values, (2) a CSV of cell comments and (3) a CSV of statistics for that file (e.g. filename, success/failure during processing, count of values, count of comments, etc.)
The “Values” and “Comments” data from all the workbooks would eventually be loaded into two Redshift tables.
The “Statistics” CSVs would be used by AWS Glue & Athena to validate the end-to-end ETL process.
Phase 2: Scaling the code to 2000 files
It took only 10 seconds to strip an Excel workbook of its data and comments and produce the resulting CSVs.
Although this was quick for one file, processing 2000 files serially would have taken around 6 hours.
We knew we would be running the routine several times during testing; and the go-live window was small.
6 hours was too long – the files would have to be processed in parallel.
By packaging the Pandas and OpenPyXL libraries into a Lambda Layer, the final Python notebook code could be ported directly to a single Lambda function.
An S3 bucket was configured to trigger a separate instance of the Lambda function for each Excel file that arrived in a designated “file drop” folder.
At run-time, the customer copied 2000 on-premise Excel workbooks into this “file drop” folder in one go.
In response, 2000 parallel invocations of the Lambda function worked in unison to unpack the files, outputting 2000 CSVs each of values, comments, and status logs into their respective output folders.
AWS CloudWatch monitoring helped us track the Lambda activity – we could see the function being called in parallel and we had early visibility on any processing errors:
We now had the raw data, stored as CSV files in S3.
Phase 3: Validating the process
The Excel migration formed part of a wider web application delivery. Once the data was migrated in, the Excel source files would be decommissioned, so we had to guarantee the accuracy of the data being loaded.
We used AWS Glue to “crawl” the status logs and create a Glue Data Catalog meta-data table, which could then be queried directly using AWS Athena.
Athena provided the framework to write SQL queries directly against the CSV status logs in S3. The whole setup process took no more than 5 minutes, from crawling to querying.
We wrote a set of SQL validation queries to count the files processed; and tally the values and comments extracted from all files:
We could now ensure that all files were accounted for, and we could check the value and comment tallies once the data had been loaded into the Redshift data warehouse.
The Athena SQL queries helped us to quickly identify the corrupt, unreadable files, which Lambda had failed to process. These were cases where the users had inadvertently broken the template structure.
Because the whole extraction took 10 seconds (not 6 hours), we could quickly identify the rogue files, which the customer could repair and resubmit for processing within minutes.
Phase 4: Transferring the raw data to Redshift
Once all the “Values” CSVs were homogenised and stored in the same S3 folder, we were able to issue a single Redshift COPY command to load the data from all 2000 files into a Redshift table in one go. The same process was applied for the “Comments” CSVs.
Once in Redshift, the extracted values and comments would go on to be merged with other sources to support the customer’s migration project.
This project involved 2000 Excel workbooks, but the same platform (SageMaker Notebooks, Lambda & Athena) would have comfortably scaled to 200k files or 2 million.
The Data Engineering team at Inawisdom love a technical challenge. However, we measure ourselves by the long-term relationships we foster with our customers.
In this instance, sophisticated AWS tooling enabled us to rapidly develop a scalable solution; and provided the mechanism to communicate clearly with the customer at all stages of development.
AWS provide a vast and versatile range of tools for Big Data engineering. The number of options for any given problem can be bewildering, which is why the AWS certification process is so rigorous. Our customers have confidence in our solutions because they know our engineers are qualified to make the right choice at the right time.
Do you have a data wrangling challenge you need help with? If so, we’d love to hear from you. Please get in touch by contacting us on firstname.lastname@example.org