We’re obsessed with helping our customers better exploit their data, and always challenging ourselves to see if we can do things quicker, better and simpler. When new technologies and approaches appear and there’s an opportunity to speed up and improve what we deliver, then we’re all over it.
dbt (data build tool) is one such example, and in this post I’ll talk through how we’ve handled and overcome the typical challenges we face when using dbt on customer projects.
First, let’s cover the high-level dbt project structure. We use a similar approach to dbt’s best-practice guidance, tweaked slightly to align with Inawisdom’s logical representation of the main stages of data flow: Landing, Sanitised, Curated and Optimised.
Whether it be AWS native services such as Glue or Lambda, or 3rd party tools such as Matillion, this structure and ELT approach applied to data ingestion and transformation has served us well on many Inawisdom data platform projects. It enables us to quickly ingest disparate sources, while investing time in the right places, focusing on business value when it comes to curation.
In dbt, our approach is no different; we create a dbt model structure that represents each of the layers as shown below.
To bring this to life, we’ll describe each of the layers in turn…
Landing is represented through dbt Source definitions in source.yml as per https://docs.getdbt.com/docs/build/sources.
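As a minimal sketch, a Landing source definition might look like this (the source, database and table names are illustrative, not from a real project):

```yaml
# models/landing/sources.yml -- names are illustrative
version: 2

sources:
  - name: crm                # source system
    database: landing_db     # Landing-layer database
    schema: crm_raw
    tables:
      - name: customers
      - name: orders
```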
At this stage, data is source system aligned and represented in its raw form with no transformation. We separate each source into its own model folder (and underlying schema), and use database override macros to create neatly separated and secure environments.
Sanitised is also source system aligned, with each table/object defined as a dbt Stage model.
Here we apply lightweight transforms scoped purely to the object (i.e. no joins), such as column renames, date format standardisation and, in the case of JSON data, defining the data types. This ensures we can handle the variety challenge by only selecting the attributes we need downstream.
When consuming from Kafka topics, which often contain multiple message types, this too is the ideal place to apply a WHERE clause to limit the scope of messages to those we’re interested in.
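A Sanitised-layer stage model along those lines might look like the following sketch (Snowflake JSON syntax; the source, column and message-type names are assumptions for illustration):

```sql
-- models/sanitised/crm/stg_crm__customer_events.sql (illustrative)
-- Object-scoped transforms only: renames, date standardisation, JSON typing.
select
    raw_json:customer_id::number                as customer_id,
    raw_json:email::varchar                     as email_address,  -- rename
    to_timestamp_ntz(raw_json:updated::varchar) as updated_at      -- standardise
from {{ source('crm_events', 'customer_topic') }}
-- Limit a multi-message-type topic to the messages we care about
where raw_json:message_type::varchar = 'CustomerUpdated'
```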
Curated, represented as Core in the dbt model, is where the bulk of the action takes place.
Why Core and not Curated? It’s only a naming thing, but we find in the early stages of a project where the target data model is evolving and data domains may not be fully fleshed out, we place everything in Core – but with the freedom to easily extend the structure to domains such as Finance or Marketing. Whereas we see Curated as the logical layer above that contains both Core and data domains.
We’ll also override the default dbt schema behaviour to ensure each folder is represented as a separate schema to keep everything nicely aligned and easily manageable.
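One common way to sketch that override is dbt’s documented `generate_schema_name` macro, adjusted so a custom schema name is used as-is rather than prefixed with the target schema:

```sql
-- macros/generate_schema_name.sql
-- Map each model folder to its own schema, instead of dbt's default
-- behaviour of prefixing the custom schema with the target schema.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```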
Tip: Notice the macro in the Core section that sets tables to transient in non-prod environments to save on Snowflake fail-safe costs (you can read more about this here).
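One way to sketch that behaviour is a conditional config in `dbt_project.yml` (the project and folder names are assumptions):

```yaml
# dbt_project.yml (sketch)
models:
  my_project:
    core:
      # Transient tables skip Snowflake's fail-safe period, saving storage
      # costs; keep permanent tables in prod only.
      +transient: "{{ target.name != 'prod' }}"
```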
Optimised is where we create models that will ultimately serve the end-user community. These are initially deployed as views (defined in dbt_project.yml) to make it quick and easy to refine business logic and re-deploy. This pays dividends in the early stages of a project, where the amount of change is high and a quick feedback loop is critical to maintaining momentum.
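In `dbt_project.yml`, that can be as simple as the following sketch (project and folder names are illustrative):

```yaml
# dbt_project.yml (sketch)
models:
  my_project:
    optimised:
      +materialized: view  # fast to refine and re-deploy while logic evolves
```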
In the Curated layer, our view here is pragmatism, whether the modelling approach is 3NF, Star, Snowflake, Operational Data Store, One-Big-Table… the focus is on building a data model that is flexible and ultimately, one that makes sense to those working with it. With dbt, the essence is iteration and speed; we can change the target model relatively easily, and the learnings from deploying and using a first iteration of the model are far more valuable than overthinking and analysis-paralysis. That’s definitely not to say we don’t give it any thought, hence our emphasis on domain-orientated thinking.
Typically, in the early stages of understanding a customer’s data, we’ll be mindful of the data domains. It’s difficult to see the wood for the trees when there are hundreds, if not thousands, of tables in scope.
But grouping tables into logical domains helps not only structure the thinking and subsequent data model but, more importantly, helps you understand the business and how it works.
Examples of data domains in Financial Services could include: Quote, Agreement, Credit Risk, Collections. In Healthcare, we might see domains such as: Patient, Payer, Provider, Caregiver.
By understanding and thinking about domains, you’re naturally thinking about business context, rather than underlying physical tables and systems.
The other important point is that, wherever possible, you should decouple the end-users from the Curated layer by only exposing the Optimised layer. That way if you need to change the underlying data model, you’ve got the freedom to do so, whilst insulating end-users from those changes. If there’s a fairly simple hierarchy which makes sense to de-normalise into a single table Star schema style, then do so.
Streaming data presents its own unique challenges when it comes to ingesting and transforming data. For consuming from Kafka, Amazon MSK Connect provides a simple managed service for reliable transfer of data to your chosen data sink in near real-time. In our case, we’re transferring to S3 using the Confluent S3 Sink.
It is important to optimise the file size when loading data into Redshift or Snowflake, as you don’t want a high number of small files. So be mindful that whilst you can easily consume from multiple Kafka topics, the configuration settings that control the frequency that messages are written to S3 (using either rotate.interval.ms or rotate.schedule.interval.ms combined with flush.size) are the same for all topics. Therefore, you may want to consider creating multiple MSK Connect clusters with settings aligned to topics with similar message volumes.
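A sketch of the relevant connector settings (bucket and topic names are assumptions) might look like:

```properties
# Confluent S3 Sink connector config (sketch)
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=orders,payments                # illustrative topic names
s3.bucket.name=my-landing-bucket      # assumption
flush.size=10000                      # max records per S3 object
rotate.interval.ms=600000             # rotate files at least every 10 minutes
# Note: these rotation settings apply to ALL topics on the connector,
# hence the suggestion of separate connectors for dissimilar volumes.
```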
Challenge 1: Combining Current and Historic Data
Typically in a Kafka topic you’ll only have access to a small window of messages (by default, seven days’ worth), so you’ll have to supplement the dbt full load with historic data. Once the dbt full refresh is complete, you’ll no longer want to process the historic data as part of the ongoing incremental runs.
Chances are, at some point you’ll need to repeat a full refresh (for example when a new column is added to the dataset) and therefore having a configurable and flexible approach is ideal. By creating an Historic Load Control table, similar to that shown below, we can use this within our dbt models to control whether we process the historic data.
The table comprises the model name and a Boolean flag to indicate whether we want to process historic data. Ninety-nine times out of 100, all these settings will be false.
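A minimal sketch of such a control table (schema, table and model names are illustrative):

```sql
-- Historic load control table (sketch; names are assumptions)
create table if not exists control.historic_load_control (
    model_name            varchar not null,
    process_historic_data boolean not null default false
);

-- Almost always false; flipped to true only when repeating a full refresh
insert into control.historic_load_control values
    ('core_orders',    false),
    ('core_customers', false);
```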
Within our dbt model, it’s then very easy to create two additional CTEs that determine whether we pull in and process the historic data, combined with our Kafka event data.
With this approach it’s then very simple to UNION the historic and event data and process both without having to create separate models, and it gives us a simple switch should we need to repeat the exercise in the future (as I can assure you, you will 😊).
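A sketch of that pattern, with illustrative source, model and column names:

```sql
-- Combining historic and event data in one model (sketch)
with historic_flag as (
    select process_historic_data
    from {{ source('control', 'historic_load_control') }}
    where model_name = 'core_orders'          -- illustrative model name
),

historic_data as (
    select order_id, order_date, amount
    from {{ source('historic', 'orders') }}
    -- returns no rows unless the control flag is set to true
    where (select process_historic_data from historic_flag)
),

event_data as (
    select order_id, order_date, amount
    from {{ ref('stg_kafka__orders') }}
)

select * from historic_data
union all
select * from event_data
```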
Challenge 2: Late-Arriving Data
The second challenge is handling late-arriving data. When consuming from an events source such as Kafka, some data is inevitably going to arrive late and worse still, sometimes out of order.
Having a source date timestamp is crucial: by combining it with a handy little tweak to dbt’s get_merge_sql macro, we can ensure target data is only updated when it’s valid to do so.
The additional add_merge_date_condition macro adds a clause to ensure the date timestamp of the source row is greater than or equal to the target’s.
We use this macro to add the extra comparison logic to the when matched condition, so that if an update_date_column is found in the model config, it’s used in the merge statement.
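A sketch of what such a macro might look like, assuming it is invoked from an overridden get_merge_sql when building the "when matched" predicate (DBT_INTERNAL_SOURCE and DBT_INTERNAL_DEST are the aliases dbt uses in its generated MERGE statements):

```sql
-- macros/add_merge_date_condition.sql (sketch of the described approach)
-- Only update a target row when the incoming row's timestamp is not
-- older than the one already stored, guarding against late/out-of-order data.
{% macro add_merge_date_condition() -%}
    {%- set col = config.get('update_date_column') -%}
    {%- if col -%}
        and DBT_INTERNAL_SOURCE.{{ col }} >= DBT_INTERNAL_DEST.{{ col }}
    {%- endif -%}
{%- endmacro %}
```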
Tag, tag and tag some more
I can’t stress enough how important tagging your models in dbt is. When your models get complex (which they will), you’ll want to be able to filter and view lineage in a variety of ways, and the more tags you have, the more flexibility you’ll get.
It goes without saying, the obvious data governance tags such as PII and PHI are fundamental in order to help satisfy data protection and privacy requirements, such as GDPR.
Less obvious are tags indicating which domain or schema they belong to. Or tags indicating the frequency at which models should be refreshed, such as hourly, daily, weekly, etc.
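As a sketch, those tags can sit alongside the model definition (model name and tag values are illustrative):

```yaml
# models/core/schema.yml (sketch)
version: 2

models:
  - name: core_customers
    config:
      # governance, domain and refresh-frequency tags
      tags: ['pii', 'core', 'daily']
```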
Whilst initially you may re-process everything on a simple daily overnight schedule, this is sure to change as business requirements demand more up-to-date data. It’s then simple to create dbt jobs at varying frequencies that filter the models they process based on their tag values.
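For example, jobs at different frequencies can select models by tag (tag names are illustrative):

```shell
# Run only the models tagged for hourly refresh
dbt run --select tag:hourly

# Exclude PII-tagged models from an ad-hoc run
dbt run --exclude tag:pii
```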
The elements described in this blog post are part of our Inawisdom dbt template project, an accelerator we use on customer engagements, which also includes Snowflake or Redshift scripts to create the necessary Warehouses, Databases, Schemas and Roles. This accelerator allows us to hit the ground running on each new project, so that we can focus on building the necessary transforms to deliver business value.
Hopefully this has given you some tips and ideas to feed into your own dbt projects.