When it comes to next-generation data processing architectures, there’s still a lot of innovation going on. It’s great to see the advancements being made to enhance, simplify and democratise the use of data.
One technology we’ve seen gaining great traction over the past few years is DBT – a data transformation workflow tool that enables teams to build data pipelines using SQL. It’s something we are increasingly seeing adopted by customers who are looking to empower their data teams and speed up time to insight.
To find out a bit more about what DBT can do and why it’s becoming such a popular tool, we sat down with Tobey Agboola, a Data Engineer within our Data Management practice, to talk about his experience.
Why has DBT been gaining traction? How does the data transformation process normally work, and what does it look like with DBT in place?
Data has to come from somewhere, so the first stage of the process always involves collecting and ingesting data from whatever sources you have – that could be things like a point-of-sale system or business software. Then a data engineer will create a data ingestion pipeline to get it into the warehouse environment.
But at this point, it’s still not usable for the data analysts. They usually need to transform it in some way – do some filtering, aggregation, summarisation – to plug it into a BI tool and answer the questions they’re trying to explore.
So, they would come to me as the data engineer, and I would follow the business logic they’ve given me to do the transformation – creating tables and using tools such as AWS Glue for writing code. But in some cases, if a project was urgent or time-critical, data analysts were somewhat dependent on the engineers to be able to answers the questions they needed from the data.
Where DBT comes in is this data transformation stage. It gives data analysts the ability to build the data pipeline just by writing some simple SQL; DBT does the rest. The premise with DBT is that the majority of data analysts will be familiar with SQL, so they won’t need to know about the rest of the infrastructure needed to build the pipeline or be experts with programming languages.
It’s open source and can save time while not being that complex to learn. There is of course always a need for full-scale data engineering and utilising all the incredible AWS tools, but for certain tasks, DBT can offer data analysts the ability to get what they need without waiting for data engineering skills. As long as the data is in a warehouse environment, the data analyst can write some simple SQL, configure it with DBT and access the data.
It also works natively, out-of-the-box, with popular new generation cloud-based data warehouses like Redshift, Snowflake, and other data warehouses that deal with big data. And because it’s just based mainly on SQL, it doesn’t require moving data across the network like some of the older tools – which makes it faster and more secure. In this case, the processing is happening in the database/warehouse engine – so moving from an ETL process to an ELT one.
What skills do you need to use DBT?
The target market for DBT is data analysts, data scientists and analytics personnel with SQL skills. So even though it’s very easy to use, you still need some specific skills sets. You definitely need to know SQL, and it would also be useful to be familiar with Jinja/Python – I wouldn’t say it’s mandatory but would certainly help you make the most of the features of DBT.
Database design skills will also help when you’re writing models, as well as knowledge of using the CLI command. And finally, some understanding of GIT or another version control system will help with version controlling the models you develop.
Would you say DBT is good for all projects or just for getting started? Can you grow its complexity?
I would say it’s good for all projects that need the data transformation step – as long as you have data already in the data warehouse, because DBT doesn’t handle the ingestion stage. But if you need to transform data within a data warehouse, then it’s good for all projects. And you can definitely scale it.
Is it fit for event-driven architectures, or mainly batch?
In my experience, it’s batch only – I haven’t seen it used for event-driven architecture. I don’t think it’s really a fit for an event-driven model.
Here at Inawisdom, we’re all-in on AWS. So, do AWS and DBT work well together?
Yes, definitely – DBT complements all the AWS architecture we already use and integrates seamlessly with the likes of Redshift. It doesn’t really replace anything, it just makes that last stage of data transformation easier.
Previously we were using AWS Glue for all end-to-end data pipelines we were doing (the data ingestion/extraction, load and transform) up to the presentation layer for BI tools, but we found that the requests from the analytics team/data scientists were increasing and we were getting a bit of a backlog. So we decided to implement a DBT project, just in that last stage of data transformation, to enable the analysts to do more themselves, in terms of building data pipelines/models using the raw data already loaded into the warehouse. Of course, adding an additional tool into the mix rather than sticking to a single one can add a more complex arch to a project so it’s really a choice for the user/company to decide on and understand whether the trade-off is right for them.
We have found that DBT is very good for collaboration as well – the data engineers often end up using it over other tools, so they can reuse and repurpose what the business analysts have already created.
There is a bit of a tussle between UI-based and code-based ETL tools – in your opinion, is one approach a more natural fit with automation generally?
I think it really depends on your comfort zone. There are benefits to both. We utilise multiple tools for ETL as I mentioned – AWS Glue, Matillion – and it’s really just matching the capabilities you have and the developer experience you require. For instance, DBT offers dbt-cloud, a managed SaaS IDE/UI-based (visual), and dbt-core, a CLI command-based (non-visual) option – but it really just comes down to who’s using it, what they want from their project and also their skill set. The visual ones are easier to use, as not everyone is comfortable with CLI commands. Additionally, dbt-cloud version is a paid service, whilst the dbt-core version is open-source and free.
Can it be used by anyone with a bit of technical skills? Or do you need a data scientist?
Anyone with basic SQL skills can use it, but to get the most out of it you really need a good knowledge of both SQL and Jinja. It depends on the complexity of what you’re trying to achieve. It’s really usable – I’ve even seen CEOs using it to do some simple data transformation. But to use it to best effect, you’d benefit from having someone who is slightly more technical, especially if you’re trying to do more complex data transformation.
For production-grade systems, there is no substitute for well thought out data model design.
Are there any limitations with DBT?
The main limitation that comes to mind is that it’s a not a solution for ETL, it’s just for the transformation stage. Some people might want one tool for the whole process. Depending on an organisation’s businesses processes and environment, they may want just one tool that does everything for them to reduce complexity, so again, it’s a choice.
What customers does it suit?
DBT works well for any industry, but it’s best for businesses that are already mature in their use of cloud, because it fits well into the existing architecture. It’s the most natural fit for fast-moving, high-pressure industries, where teams need to access data quickly without waiting around – being able to answer questions with their data as soon as they think of them and being able to spot patterns and trends quickly to maintain their competitive edge.
For example, I’ve used it before on a project for a fintech company. They wanted to be a more data-driven organisation in terms of their decision making, and particularly, to better understand their customers to make their marketing more effective. They needed to be able to access the data faster, to enable the team to turn around customer insights quickly.
Previously, if an analyst had asked the data engineers to transform some data, it would normally take about 3 days because we had to build the pipeline, test it, and integrate it with the architecture. But with DBT, it could be done in half a day. Sometimes requirements would change as they investigated the data – for example, the data team might realise that they needed YTD revenue instead of just revenue, so the data would need to go through more transformation. With DBT, the data team could run as many queries or models as necessary, to get the answers they needed.
However, the freedom of enabling analysts and business users with self-service analytics must be tempered by a suitable governance framework, providing the guardrails and protection needed to avoid “spaghetti code” that cannot be run in production.
As our chat with Tobey has shown, the growing popularity of DBT isn’t without reason. This is a useful tool for any business that wants to empower their data analysts, reduce dependencies and optimise the process of transforming and analysing their data.
It’s of particular use in fast-moving industries, where speed to insight can help companies maintain their competitive advantage. And while it’s simple to use (with a bit of SQL knowledge), and mostly applicable for batch processing – it still comes equipped with robust features and integration so with the right skills, can potentially be suitable for more complex projects.
On the flipside, it’s important to recognise you are adding in a supplementary tool to your data process, so you need to understand the trade-off that brings in terms of increased operational overhead and the potential intricacies you may well have to account for. It is also worth a word of caution that while speed and agility are incredible benefits, there is also the loss of control; that by giving access to multiple users to quickly create and curate code, you risk unleashing an over-abundance of it which can make versioning and ongoing management difficult in the long term.