Building Data Products with dbt
dbt is one of the most popular tools for data transformation. Here is how we structure dbt projects as data products with clear ownership, output ports, data contracts, and quality tests.
Why dbt for Data Products?
dbt (data build tool) is a SQL-first transformation framework that runs on top of data warehouses like Snowflake, BigQuery, Databricks, AWS Athena, and DuckDB. It brings software engineering best practices to data: version control, modularity, testing, and documentation.
These properties make dbt a natural fit for implementing data products. A dbt project can represent a data product: it has clear inputs, transformation logic, output models, tests, and documentation, all in one Git repository owned by one team.
Start with a Data Contract
Let's use a real-life example: a data product we at Entropy Data use internally for customer success. It helps us understand how our customers use the application and whether we can proactively support them in building better data products with data contracts.
Before writing any dbt model, we start by defining a data contract for the target data product (or more precisely, for the data product's output port). The data contract describes the data we need to provide: its schema, quality guarantees, and terms of use. Think of it as the requirements specification for our data product.
By designing the contract first, we align on what information is actually needed before investing time into implementation. It keeps the data product small and focused. This contract-first approach avoids building data sets that nobody uses or that do not meet consumers' expectations.
The data contract is written in the Open Data Contract Standard (ODCS) format and stored alongside the dbt project:
```yaml
# datacontracts/customer_activity_v1.odcs.yaml
apiVersion: v3.1.0
kind: DataContract
id: entropydata-customer-activity-v1
name: Customer Activity
version: 1.0.0
status: draft
description:
  usage: analytics
  purpose: Customer activity within Entropy Data for customer success.
schema:
  - name: customer_activity
    physicalType: table
    properties:
      - name: organization_id
        description: Unique identifier of the organization.
        logicalType: string
        required: true
        primaryKey: true
        examples:
          - 550e8400-e29b-41d4-a716-446655440000
      - name: organization_vanity_url
        description: Vanity URL of the organization.
        logicalType: string
        examples:
          - acme-corp
      - name: organization_created_by
        description: Email address of the user that created the organization.
        logicalType: string
        examples:
          - admin@acme.com
      - name: organization_created_at
        description: Timestamp when the organization was created.
        logicalType: timestamp
        examples:
          - "2024-03-15T10:00:00Z"
      - name: users_total
        description: Total number of users in the organization.
        logicalType: integer
        required: true
        examples:
          - 12
      - name: users_added_30d
        description: Number of users added in the last 30 days.
        logicalType: integer
        required: true
        examples:
          - 3
      - name: users_last_signed_up_at
        description: Timestamp when the last user signed up.
        logicalType: timestamp
        examples:
          - "2024-09-10T14:22:00Z"
      - name: dataproducts_total
        description: Total number of data products in the organization.
        logicalType: integer
        required: true
        examples:
          - 42
      - name: dataproducts_added_30d
        description: Number of data products added in the last 30 days.
        logicalType: integer
        required: true
        examples:
          - 5
      - name: dataproducts_outputports_total
        description: Total number of output ports across all data products.
        logicalType: integer
        required: true
        examples:
          - 58
      - name: dataproducts_outputports_with_datacontract_percentage
        description: Percentage of output ports that have an active data contract.
        logicalType: number
        required: true
        examples:
          - 72.5
      - name: dataproducts_outputports_with_testresults_percentage
        description: Percentage of output ports that have test results.
        logicalType: number
        required: true
        examples:
          - 65.0
      - name: dataproducts_last_updated_at
        description: Timestamp when the last data product was updated.
        logicalType: timestamp
        examples:
          - "2024-09-14T09:30:00Z"
      - name: assets_total
        description: Total number of assets in the organization.
        logicalType: integer
        required: true
        examples:
          - 87
      - name: data_platform
        description: The data contract / output port type used for most data products.
        logicalType: string
        examples:
          - bigquery
          - snowflake
          - s3
      - name: updated_at
        description: Timestamp when this record was written by the dbt execution.
        logicalType: timestamp
        required: true
        examples:
          - "2024-09-16T06:05:23Z"
team:
  name: customer-success
servers:
  - server: production
    type: databricks
    catalog: entropy_data_prod
    schema: dp_entropydata_customer_activity_v1
roles:
  - role: customer_success
    access: read
```
A data product matching this specification would help our customer success team to identify a customer's activity and give quick information on the customer's target data platform.
In Entropy Data, the data contract can be edited and managed through the Data Contract Editor.
With the contract in place, the dbt implementation becomes straightforward: build the models that fulfill the contract.
Data Product Design
Our Customer Activity data product is a consumer-aligned data product: it takes raw operational data, transforms it, and serves it for a specific use case: customer success.
The data flows from our Entropy Data Cloud application (the operational Postgres database) through a source-aligned raw data product that extracts data from Postgres to Databricks (using dlt), into the Customer Activity data product that we want to build now with dbt.
One dbt Project per Data Product
Our recommended approach is to use one dbt project per data product. Each project is owned by one team and lives in its own Git repository, so every data product can be implemented and deployed independently, a best practice we have learned from Self-contained Systems. This ensures clear ownership and autonomy, core principles of data mesh.
A typical dbt data product project structure looks like this:
```
dp_entropydata_customer_activity/
├── dbt_project.yml
├── customer-activity.odps.yaml        # Data product metadata
├── models/
│   ├── input_ports/                   # Input ports
│   │   └── sources.yml
│   ├── staging/                       # Internal: cleaning and normalization
│   │   ├── stg_organizations.sql
│   │   ├── stg_members.sql
│   │   ├── stg_data_products.sql
│   │   └── stg_assets.sql
│   ├── intermediate/                  # Internal: business logic
│   │   └── int_customer_activity.sql
│   └── output_ports/                  # Public: output port models
│       ├── customer_activity.sql
│       └── customer_activity.yml
├── tests/                             # Unit tests
│   └── assert_users_total_non_negative.sql
├── datacontracts/                     # Data contract
│   └── customer_activity_v1.odcs.yaml
└── macros/
```
The key idea: input_ports/, staging/, and intermediate/ models are internal to the data product.
Only the output_ports/ models are the public interface, the output ports that other teams consume.
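The project's dbt_project.yml can encode this split through materialization defaults. Here is a minimal sketch; the project and profile names are taken from this article, while the view materializations for staging and intermediate models are an assumption (any internal materialization works, since those models are not part of the public interface):

```yaml
# dbt_project.yml (minimal sketch; staging/intermediate defaults are assumptions)
name: dp_entropydata_customer_activity
version: "1.0.0"
profile: dp_entropydata_customer_activity

models:
  dp_entropydata_customer_activity:
    staging:
      +materialized: view       # internal, cheap to rebuild
    intermediate:
      +materialized: view       # internal business logic
    output_ports:
      +materialized: table      # the public interface consumers query
```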
Output Ports as dbt Models
Output port models are usually implemented as SQL views or materialized tables that serve as the stable API of the data product. They abstract away the internal staging and transformation logic and allow changing the implementation without affecting consumers.
Instead of writing dbt model definitions by hand, the Data Contract CLI can generate them directly from the data contract:
```shell
datacontract export --format dbt \
  --output models/output_ports/customer_activity.yml \
  datacontracts/customer_activity_v1.odcs.yaml
```
This generates the dbt model YAML with columns, descriptions, and data types:
```yaml
# models/output_ports/customer_activity.yml (generated)
version: 2
models:
  - name: customer_activity
    description: Customer activity within Entropy Data for customer success.
    config:
      meta:
        data_contract: entropydata-customer-activity-v1
        owner: customer-success
      materialized: table
      contract:
        enforced: true
    columns:
      - name: organization_id
        description: Unique identifier of the organization.
        data_type: STRING
        constraints:
          - type: not_null
          - type: unique
      - name: organization_vanity_url
        description: Vanity URL of the organization.
        data_type: STRING
      - name: organization_created_by
        description: Email address of the user that created the organization.
        data_type: STRING
      - name: organization_created_at
        description: Timestamp when the organization was created.
        data_type: TIMESTAMP
      - name: users_total
        description: Total number of users in the organization.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: users_added_30d
        description: Number of users added in the last 30 days.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: users_last_signed_up_at
        description: Timestamp when the last user signed up.
        data_type: TIMESTAMP
      - name: dataproducts_total
        description: Total number of data products in the organization.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: dataproducts_added_30d
        description: Number of data products added in the last 30 days.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: dataproducts_outputports_total
        description: Total number of output ports across all data products.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: dataproducts_outputports_with_datacontract_percentage
        description: Percentage of output ports that have an active data contract.
        data_type: DOUBLE
        constraints:
          - type: not_null
      - name: dataproducts_outputports_with_testresults_percentage
        description: Percentage of output ports that have test results.
        data_type: DOUBLE
        constraints:
          - type: not_null
      - name: dataproducts_last_updated_at
        description: Timestamp when the last data product was updated.
        data_type: TIMESTAMP
      - name: assets_total
        description: Total number of assets in the organization.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: data_platform
        description: The data contract / output port type used for most data products.
        data_type: STRING
      - name: updated_at
        description: Timestamp when this record was written by the dbt execution.
        data_type: TIMESTAMP
        constraints:
          - type: not_null
```
Generating the dbt model file is usually a one-time operation.
You can now save the model in Git and add further details as needed.
Now, implement the actual SQL transformation logic for each output model:
```sql
-- models/output_ports/customer_activity.sql
{{ config(
    materialized='table',
    schema='op_v1'
) }}

select
    organization_id,
    organization_vanity_url,
    organization_created_by,
    organization_created_at,
    users_total,
    users_added_30d,
    users_last_signed_up_at,
    dataproducts_total,
    dataproducts_added_30d,
    dataproducts_outputports_total,
    dataproducts_outputports_with_datacontract_percentage,
    dataproducts_outputports_with_testresults_percentage,
    dataproducts_last_updated_at,
    assets_total,
    data_platform,
    current_timestamp() as updated_at  -- fulfills the contract's required updated_at field
from {{ ref('int_customer_activity') }}
```
Input Ports
A data product typically consumes data from operational systems or other data products. In our data product, we use data that was extracted via dlt from our operational Postgres database to a raw / source-aligned data product on the data platform (Databricks).
In dbt, inputs are modeled as sources.
The input_ports/ folder contains the source definitions and simple select models that reference them:
```yaml
# models/input_ports/sources.yml
version: 2
sources:
  - name: entropydata_postgres
    description: >
      Operational database of the Entropy Data platform.
    database: entropy_data_prod
    schema: entropydata_postgres
    tables:
      - name: organization
      - name: organization_member
      - name: data_product
      - name: asset
```
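The simple select models mentioned above can be thin passthroughs over the sources, so downstream models reference the input port rather than the raw source directly. A sketch of such a model (the file name is illustrative):

```sql
-- models/input_ports/organization.sql (hypothetical passthrough model)
select * from {{ source('entropydata_postgres', 'organization') }}
```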
Sources referencing other data products require an active data contract between the provider and consumer.
Transformation
Between input and output, the internal models handle the transformation work and business logic. These are not exposed to consumers.
Staging models clean and normalize raw source data: deduplicate to the latest version, cast types, and rename columns.
```sql
-- models/staging/stg_organizations.sql
with source as (
    select
        *,
        row_number() over (partition by organization_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'organization') }}
)
select
    organization_id,
    vanity_url as organization_vanity_url,
    created_by as organization_created_by,
    cast(created_at as timestamp) as organization_created_at
from source
where _row_num = 1
```

```sql
-- models/staging/stg_members.sql
with source as (
    select
        *,
        row_number() over (partition by organization_member_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'organization_member') }}
)
select
    organization_member_id,
    organization_id,
    user_id,
    role,
    cast(created_at as timestamp) as created_at
from source
where _row_num = 1
```

```sql
-- models/staging/stg_data_products.sql
with source as (
    select
        *,
        row_number() over (partition by data_product_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'data_product') }}
)
select
    data_product_id,
    organization_id,
    name,
    status,
    specification_type,
    cast(created_at as timestamp) as created_at,
    cast(updated_at as timestamp) as updated_at
from source
where _row_num = 1
```
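The stg_assets model, which the intermediate model references, follows the same deduplication pattern. A sketch with assumed column names (the asset table's primary key is not shown in this article):

```sql
-- models/staging/stg_assets.sql (sketch; asset_id is an assumed column name)
with source as (
    select
        *,
        row_number() over (partition by asset_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'asset') }}
)
select
    asset_id,
    organization_id,
    cast(created_at as timestamp) as created_at
from source
where _row_num = 1
```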
Intermediate models apply business logic: join tables, enrich with master data, compute derived fields.
```sql
-- models/intermediate/int_customer_activity.sql
with members_per_org as (
    select
        organization_id,
        count(*) as users_total,
        count(case when created_at >= date_sub(current_date(), 30) then 1 end) as users_added_30d,
        max(created_at) as users_last_signed_up_at
    from {{ ref('stg_members') }}
    group by organization_id
),

data_products_per_org as (
    select
        organization_id,
        count(*) as dataproducts_total,
        count(case when created_at >= date_sub(current_date(), 30) then 1 end) as dataproducts_added_30d,
        -- TODO: add output_port source table to compute these metrics
        cast(0 as bigint) as dataproducts_outputports_total,
        cast(0 as double) as dataproducts_outputports_with_datacontract_percentage,
        cast(0 as double) as dataproducts_outputports_with_testresults_percentage,
        max(updated_at) as dataproducts_last_updated_at,
        first(specification_type) as data_platform
    from {{ ref('stg_data_products') }}
    group by organization_id
),

assets_per_org as (
    select
        organization_id,
        count(*) as assets_total
    from {{ ref('stg_assets') }}
    group by organization_id
)

select
    o.organization_id,
    o.organization_vanity_url,
    o.organization_created_by,
    o.organization_created_at,
    coalesce(m.users_total, 0) as users_total,
    coalesce(m.users_added_30d, 0) as users_added_30d,
    m.users_last_signed_up_at,
    coalesce(dp.dataproducts_total, 0) as dataproducts_total,
    coalesce(dp.dataproducts_added_30d, 0) as dataproducts_added_30d,
    coalesce(dp.dataproducts_outputports_total, 0) as dataproducts_outputports_total,
    coalesce(dp.dataproducts_outputports_with_datacontract_percentage, 0) as dataproducts_outputports_with_datacontract_percentage,
    coalesce(dp.dataproducts_outputports_with_testresults_percentage, 0) as dataproducts_outputports_with_testresults_percentage,
    dp.dataproducts_last_updated_at,
    coalesce(a.assets_total, 0) as assets_total,
    dp.data_platform
from {{ ref('stg_organizations') }} o
left join members_per_org m on o.organization_id = m.organization_id
left join data_products_per_org dp on o.organization_id = dp.organization_id
left join assets_per_org a on o.organization_id = a.organization_id
```
Building the Data Product
When we execute dbt run, dbt materializes all models and creates the output table on Databricks.
The result is a customer_activity table in the dp_entropydata_customer_activity_op_v1 schema, ready to be consumed by the customer success team.
Testing
Testing is what turns a dbt project into a trustworthy data product. Without tests, consumers have no reason to trust the data — they will build their own checks, duplicate logic, or simply avoid using the data product altogether. Automated tests make quality visible and give consumers confidence that the data they rely on is correct and complete.
Testing a data product works on two levels: unit tests that verify the internal implementation, and contract tests that verify the output from a consumer's perspective.
Unit Testing
dbt tests verify individual steps across the data pipeline. They are tightly coupled to the implementation and change as the dbt models evolve.
Schema tests are declared in the YAML files and validate not_null, unique, accepted_values, and relationships on each model.
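For the output port, such schema tests could look like the following excerpt. The accepted values list and the warn severity are illustrative choices, not part of the generated model above:

```yaml
# models/output_ports/customer_activity.yml (excerpt; tests are illustrative)
version: 2
models:
  - name: customer_activity
    columns:
      - name: organization_id
        tests:
          - not_null
          - unique
      - name: data_platform
        tests:
          - accepted_values:
              values: [bigquery, snowflake, s3, databricks]
              config:
                severity: warn  # warn rather than fail when a new platform type appears
```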
Custom data tests are SQL queries in the tests/ folder that assert business rules:
```sql
-- tests/assert_users_total_non_negative.sql
select organization_id
from {{ ref('customer_activity') }}
where users_total < 0
```
If the query returns any rows, the test fails.
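Another plausible business-rule test in the same style would check that the percentage metrics stay within bounds. This test is hypothetical and not part of the project structure shown above:

```sql
-- tests/assert_percentages_in_range.sql (hypothetical additional test)
select organization_id
from {{ ref('customer_activity') }}
where dataproducts_outputports_with_datacontract_percentage not between 0 and 100
   or dataproducts_outputports_with_testresults_percentage not between 0 and 100
```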
dbt contracts (since dbt 1.5) additionally enforce column names and data types at build time.
The generated dbt model already includes contract.enforced: true.
Contract Testing
The Data Contract CLI tests the output port from the consumer's point of view, against the data contract. This is an acceptance test: it connects to the actual data platform, queries the output tables, and checks schema, row counts, and quality rules defined in the contract.
```shell
datacontract test datacontracts/customer_activity_v1.odcs.yaml
```
Contract tests are more stable than dbt tests. They don't change when you refactor the internal staging or intermediate models. As long as the output port still fulfills the contract, the tests pass. This makes them ideal for CI/CD pipelines and for building consumer confidence.
CI/CD Pipeline
Each data product has its own CI/CD pipeline that runs on every push to main and on a daily schedule:
- dbt run materializes all models
- dbt test runs all schema and custom data tests
- Publish the data product and data contract to Entropy Data
- datacontract test runs contract tests against the output port
We use a GitHub Actions workflow for this:
```yaml
# .github/workflows/data-product.yml
name: Customer Activity Data Product

on:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *"

env:
  API: https://api.entropy-data.com/api
  DBT_DATABRICKS_HOST: ${{ secrets.DBT_DATABRICKS_HOST }}
  DBT_DATABRICKS_HTTP_PATH: ${{ secrets.DBT_DATABRICKS_HTTP_PATH }}
  DBT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }}
  DATACONTRACT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }}
  DATACONTRACT_DATABRICKS_SERVER_HOSTNAME: ${{ secrets.DBT_DATABRICKS_HOST }}
  DATACONTRACT_DATABRICKS_HTTP_PATH: ${{ secrets.DBT_DATABRICKS_HTTP_PATH }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install dbt-databricks openlineage-dbt datacontract-cli[databricks]

      - name: Create profiles.yml
        run: |
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml <<EOF
          dp_entropydata_customer_activity:
            target: prod
            outputs:
              prod:
                type: databricks
                catalog: entropy_data_prod
                schema: dp_entropydata_customer_activity
                host: ${DBT_DATABRICKS_HOST}
                http_path: ${DBT_DATABRICKS_HTTP_PATH}
                token: ${DBT_DATABRICKS_TOKEN}
                threads: 4
          EOF

      - name: dbt deps
        run: dbt deps

      - name: dbt run
        run: dbt-ol run --target prod

      - name: dbt test
        run: dbt test --target prod

      - name: Publish data product
        run: |
          yq -o=json customer-activity.odps.yaml \
            | curl -s -X PUT \
              -H 'Content-Type: application/json' \
              -H "x-api-key: ${{ secrets.ENTROPY_DATA_API_KEY }}" \
              -d @- $API/dataproducts/entropydata-customer-activity

      - name: Publish data contract
        run: |
          yq -o=json datacontracts/customer_activity_v1.odcs.yaml \
            | curl -s -X PUT \
              -H 'Content-Type: application/json' \
              -H "x-api-key: ${{ secrets.ENTROPY_DATA_API_KEY }}" \
              -d @- $API/datacontracts/entropydata-customer-activity-v1

      - name: Data contract test
        run: |
          datacontract test datacontracts/customer_activity_v1.odcs.yaml \
            --server production \
            --publish $API/test-results
        env:
          ENTROPY_DATA_API_KEY: ${{ secrets.ENTROPY_DATA_API_KEY }}
```
dbt run materializes the models, and dbt test runs the unit tests.
Then, the data product and data contract metadata are published to Entropy Data via its REST API using yq to convert YAML to JSON.
Finally, datacontract test validates the output port from the consumer perspective.
Entropy Data
Entropy Data is a data product marketplace that manages data products, data contracts, and access requests. It integrates with the dbt workflow described above to provide a complete data product experience:
- Discovery: consumers browse and find data products in a self-service marketplace
- Access management: consumers request access, approvals can trigger RBAC provisioning on the data platform
- Governance: track ownership, quality, and lineage across all data products
Here is how our Customer Activity data product looks in the Entropy Data marketplace.
Business Value
With this data product in place, our customer success team (actually these are our co-founders...) can now query the customer_activity table directly on Databricks to understand each customer's engagement:
how many users they have, whether they are actively adding data products, and what data platform they use.
This enables proactive outreach to customers who are ramping up, early detection of inactive accounts, and data-driven prioritization of support efforts.
It can support CRM activities and power AI agents that proactively identify where we can help customers build better data products or set up integrations.
The data contract guarantees the schema and quality, so they can build dashboards, agents, and automations on top of it with confidence.
Sign up now for free, or explore the clickable demo of Entropy Data.