Recently I talked to a company trying to fill a vacancy for a “Head of Business Intelligence”. They insisted on this person being experienced in dbt; they mentioned it four times. So, I got curious and asked the person currently occupying this position a few pointed questions about why they insisted on this tool, and they in turn asked a few questions about why I was skeptical about it. We established the following facts, which I want to share to give the reader some context. They

  • are an SME SaaS company
  • do not use any version control
  • ingest most of their data via Fivetran
  • think they have clean data
  • want to build a team of 4 BI people
  • have a lot of raw data and many stakeholders query this raw data via Metabase
  • have very little established in terms of data enrichment

So, I was wondering: what problems does dbt actually solve? And I was even more interested in other methods to solve these problems - ideally for free. I also wanted to challenge my own skepticism about this product, since I have decided against using it in the past. And we all know that tools are changing incredibly fast nowadays, so it might not even be the same tool I checked out years ago.

My understanding of the USP of dbt is that it creates the Analytics Engineer, i.e. it makes the skills and practices of a data engineer available to an analyst. Much like a low-code solution for data engineering problems, just with … lots of code.

Let’s dive into it:

Workflow

dbt allows for a better UX layered on top of git, making the creation of branches, switching between them, committing, and the creation of merge requests available to people who wouldn’t touch git with a ten-foot barge pole.

However, no best practices are enforced. Commits can be arbitrarily atomic or universal. Commit messages do not get any hinting (which would be no more complicated to implement than the automatic documentation). Readability of the queries is up to the individual contributor.

So, what’s the value created here? A purely mechanical mastery of the version control workflow, especially the creation of branches, committing, and merging.

Code Readability/Quality

Another point that dbt markets is enhanced readability of the code. The use of macros is encouraged. Macros are snippets that expand into SQL code when the dbt files are compiled.
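For illustration, here is a minimal sketch of such a macro - the macro name and column are hypothetical, but the Jinja syntax is dbt’s:

{% macro cents_to_dollars(column_name) %}
    round({{ column_name }} / 100, 2)
{% endmacro %}

A model would then call {{ cents_to_dollars('amount_cents') }}, and the compiled SQL simply contains round(amount_cents / 100, 2).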

Besides that, the examples in the documentation all employ the following patterns:

  • select * from cte
  • select * from {{ ref("other model") }}
  • group by 1,2,3

This is not what I consider a readable coding style, because all semantic information is hidden until the code is compiled, and even then there is extensive use of shorthands.

Here is a simple example of what dbt compiles:


with
    customers as (

        select *
        from `project`.`schema`.`stg_customers`

    ),

    orders as (

        select *

        from `project`.`schema`.`stg_orders`

    ),

    customer_orders as (

        select
            customer_id,

            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from orders

        group by 1

    ),

    final as (

        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders

        from customers

        left join customer_orders using (customer_id)

    )

select *
from final

I prefer a much more expressive style of SQL coding, even going so far as to repeat the name of a column as an alias, like so:


WITH customer_details AS ( # CAPITALIZATION of keywords increases readability

    SELECT 
      customer_id AS customer_id  # Explicit naming of columns encourages thinking
      , first_name AS first_name  # of the appropriate scope
      , last_name AS last_name
    FROM `project`.`schema`.`stg_customers`

    ),

  orders AS (

    SELECT
      order_id AS order_id
      , customer_id AS customer_id
      , order_date AS order_date
      , status AS status
    FROM `project`.`schema`.`stg_orders`
    
    ),

  customer_stats AS (

    SELECT
        customer_id AS customer_id
        , MIN(order_date) AS first_order_date
        , MAX(order_date) AS most_recent_order_date
        , COUNT(order_id) AS number_of_orders
    FROM orders
    GROUP BY customer_id  # In this toy example it makes no difference
                          # if the shorthand is used, but when there are
                          # 5 or more columns to group by
                          # it is much easier to understand and debug

    ),

  customers_details_with_stats AS (

    SELECT
      customer_details.customer_id AS customer_id
      , customer_details.first_name AS first_name
      , customer_details.last_name AS last_name
      , customer_stats.first_order_date AS first_order_date
      , customer_stats.most_recent_order_date AS most_recent_order_date
      , COALESCE(customer_stats.number_of_orders, 0) AS number_of_orders
    FROM customer_details
    LEFT JOIN customer_stats
      ON customer_details.customer_id = customer_stats.customer_id

    # Similarly, explicit naming of tables and columns
    # increases understanding of the relationships between tables
    # when performing non-trivial JOINs

    )

SELECT
  customer_id
  , first_name
  , last_name
  , first_order_date
  , most_recent_order_date
  , number_of_orders
FROM customers_details_with_stats

That might seem like a lot of unnecessary characters, but with some clever use of VS Code keyboard shortcuts it is actually not that bad. If the model gets sufficiently complicated, though, this style is much easier to understand, as everything is explicit.

There are macros in dbt that help make the compiled code more expressive, like the star macro that generates the list of columns for the user. This macro is only evaluated upon request, though, and is not in front of the user’s eyes constantly.
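For reference, this is what the star macro from the dbt_utils package looks like in use (assuming dbt_utils is installed; the except argument is optional):

SELECT
  {{ dbt_utils.star(from=ref('stg_customers'), except=['email']) }}
FROM {{ ref('stg_customers') }}

Only the compiled output contains the full, explicit column list - the file you actually edit keeps hiding it.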

Data Lineage

Data lineage graph in dbt
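The graph itself is derived from the {{ ref() }} calls in the models, so it comes for free once the models are written - a minimal sketch with hypothetical model names:

# models/customer_orders.sql
SELECT
  customer_id AS customer_id
  , COUNT(order_id) AS number_of_orders
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

Because customer_orders selects from {{ ref('stg_orders') }}, dbt records an edge from stg_orders to customer_orders and renders it in the lineage graph without any extra annotation.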

Model Flexibility

How flexible do you need your models to be? If your organization is in its adolescence, probably a lot.

When does our model change? A major selling point is the flexibility to replace one model with another without rewriting a majority of your code - see the sketch after the list below.

So, what are some hypothetical scenarios where your data model might change?

  • Implementation of a new payment/shopping cart/CRM software
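To make this concrete, here is a minimal sketch of how {{ ref() }} isolates such a change - the model and source names are hypothetical. All downstream models reference only the staging model:

# models/staging/stg_payments.sql
SELECT
  id AS payment_id
  , order_id AS order_id
  , amount / 100 AS amount
FROM {{ source('new_payment_provider', 'payments') }}

If the payment software is replaced, only this one staging model has to be rewritten; every downstream model that uses {{ ref('stg_payments') }} stays untouched.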

Running pipelines

Testing

Documentation

Trade-offs

Which problems should you have to warrant dbt?

If your transformations become too wide, too long, or too disconnected because they serve multiple use cases at once, don’t use dbt.

An undeniable plus is the environment management, which enables a clean separation between dev environments and the production environment.
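As an example of what this separation enables, a model can adapt to the current target (target.name is a built-in dbt variable; the 30-day cutoff is an arbitrary choice for illustration):

SELECT *
FROM {{ ref('stg_orders') }}
{% if target.name == 'dev' %}
# In dev runs, only build on recent data to keep iteration fast
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
{% endif %}

Combined with per-target schemas, dev runs build into their own dataset and never touch production tables.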

So, in conclusion, strongly consider using dbt when:

  • your organization is still figuring things out and
  • your team size is small enough that constant communication and coordination is possible
  • but so small that it needs a little productivity boost to keep up with the demands of your org

A Few More Bonus Annoyances

  • Diff is not semantic and - AFAIK - not replaceable with a different diff engine
  • Merge request workflow is confusing when deleting
  • Relying on the default dataset location did not work well - I had to manually create my schemas/datasets in the region of my choice
  • I ran into problems very quickly and the error messages were no help in figuring out what went wrong