Building Data Products with dbt
dbt is one of the most popular tools for data transformation. Here is how we structure dbt projects as data products with clear ownership, output ports, data contracts, and quality tests.
Why dbt for Data Products?
dbt (data build tool) is a SQL-first transformation framework that runs on top of data warehouses like Snowflake, BigQuery, Databricks, AWS Athena, and DuckDB. It brings software engineering best practices to data: version control, modularity, testing, and documentation.
These properties make dbt a natural fit for implementing data products. A dbt project can represent a data product: it has clear inputs, transformation logic, output models, tests, and documentation, all in one Git repository owned by one team.
Start with a Data Contract
Let's use a real-life example: a data product we at Entropy Data use internally for customer success. It helps us understand how our customers use the application and whether we can proactively support them in building better data products with data contracts.
Before writing any dbt model, we start by defining a data contract for the target data product (or more precisely, for the data product's output port). The data contract describes the data we need to provide: its schema, quality guarantees, and terms of use. Think of it as the requirements specification for our data product.
By designing the contract first, we align on what information is actually needed before investing time into implementation. It keeps the data product small and focused. This contract-first approach avoids building data sets that nobody uses or that do not meet consumers' expectations.
The data contract is written in the Open Data Contract Standard (ODCS) format and stored alongside the dbt project:
```yaml
# datacontracts/customer_activity_v1.odcs.yaml
apiVersion: v3.1.0
kind: DataContract
id: entropydata-customer-activity-v1
name: Customer Activity
version: 1.0.0
status: draft
description:
  usage: analytics
  purpose: Customer activity within Entropy Data for customer success.
schema:
  - name: customer_activity
    physicalType: table
    properties:
      - name: organization_id
        description: Unique identifier of the organization.
        logicalType: string
        required: true
        primaryKey: true
        examples:
          - 550e8400-e29b-41d4-a716-446655440000
      - name: organization_vanity_url
        description: Vanity URL of the organization.
        logicalType: string
        examples:
          - acme-corp
      - name: organization_created_by
        description: Email address of the user that created the organization.
        logicalType: string
        examples:
          - admin@acme.com
      - name: organization_created_at
        description: Timestamp when the organization was created.
        logicalType: timestamp
        examples:
          - "2024-03-15T10:00:00Z"
      - name: users_total
        description: Total number of users in the organization.
        logicalType: integer
        required: true
        examples:
          - 12
      - name: users_added_30d
        description: Number of users added in the last 30 days.
        logicalType: integer
        required: true
        examples:
          - 3
      - name: users_last_signed_up_at
        description: Timestamp when the last user signed up.
        logicalType: timestamp
        examples:
          - "2024-09-10T14:22:00Z"
      - name: dataproducts_total
        description: Total number of data products in the organization.
        logicalType: integer
        required: true
        examples:
          - 42
      - name: dataproducts_added_30d
        description: Number of data products added in the last 30 days.
        logicalType: integer
        required: true
        examples:
          - 5
      - name: dataproducts_outputports_total
        description: Total number of output ports across all data products.
        logicalType: integer
        required: true
        examples:
          - 58
      - name: dataproducts_outputports_with_datacontract_percentage
        description: Percentage of output ports that have an active data contract.
        logicalType: number
        required: true
        examples:
          - 72.5
      - name: dataproducts_outputports_with_testresults_percentage
        description: Percentage of output ports that have test results.
        logicalType: number
        required: true
        examples:
          - 65.0
      - name: dataproducts_last_updated_at
        description: Timestamp when the last data product was updated.
        logicalType: timestamp
        examples:
          - "2024-09-14T09:30:00Z"
      - name: assets_total
        description: Total number of assets in the organization.
        logicalType: integer
        required: true
        examples:
          - 87
      - name: data_platform
        description: The data contract / output port type used for most data products.
        logicalType: string
        examples:
          - bigquery
          - snowflake
          - s3
      - name: updated_at
        description: Timestamp when this record was written by the dbt execution.
        logicalType: timestamp
        required: true
        examples:
          - "2024-09-16T06:05:23Z"
team:
  name: customer-success
servers:
  - server: production
    type: databricks
    catalog: entropy_data_prod
    schema: dp_entropydata_customer_activity_v1
roles:
  - role: customer_success
    access: read
```
A data product matching this specification would help our customer success team to identify a customer's activity and give quick information on the customer's target data platform.
In Entropy Data, the data contract can be edited and managed through the Data Contract Editor.
With the contract in place, the dbt implementation becomes straightforward: build the models that fulfill the contract.
Data Product Design
Our Customer Activity data product is a consumer-aligned data product: it takes raw operational data, transforms it, and serves it for a specific use case: customer success.
The data flows from our Entropy Data Cloud application (the operational Postgres database) through a source-aligned raw data product that extracts data from Postgres to Databricks (using dlt), into the Customer Activity data product that we want to build now with dbt.
One dbt Project per Data Product
Our recommended approach is to use one dbt project per data product. Each project is owned by one team and lives in its own Git repository, so every data product can be implemented and deployed independently, a best practice we have learned from Self-contained Systems. This ensures clear ownership and autonomy, core principles of data mesh.
A typical dbt data product project structure looks like this:
```
dp_entropydata_customer_activity/
├── dbt_project.yml
├── customer-activity.odps.yaml        # Data product metadata
├── models/
│   ├── input_ports/                   # Input ports
│   │   └── sources.yml
│   ├── staging/                       # Internal: cleaning and normalization
│   │   ├── stg_organizations.sql
│   │   ├── stg_members.sql
│   │   ├── stg_data_products.sql
│   │   └── stg_assets.sql
│   ├── intermediate/                  # Internal: business logic
│   │   └── int_customer_activity.sql
│   └── output_ports/                  # Public: output port models
│       ├── customer_activity.sql
│       └── customer_activity.yml
├── tests/                             # Unit tests
│   └── assert_users_total_non_negative.sql
├── datacontracts/                     # Data contract
│   └── customer_activity_v1.odcs.yaml
└── macros/
```
The key idea: input_ports/, staging/, and intermediate/ models are internal to the data product.
Only the output_ports/ models are the public interface, the output ports that other teams consume.
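The project's dbt_project.yml can encode this split through materialization defaults. Here is a minimal sketch; the project and profile names are taken from this article, while the view materializations for staging and intermediate models are an assumption (any internal materialization works, since those models are not part of the public interface):

```yaml
# dbt_project.yml (minimal sketch; staging/intermediate defaults are assumptions)
name: dp_entropydata_customer_activity
version: "1.0.0"
profile: dp_entropydata_customer_activity

models:
  dp_entropydata_customer_activity:
    staging:
      +materialized: view       # internal, cheap to rebuild
    intermediate:
      +materialized: view       # internal business logic
    output_ports:
      +materialized: table      # the public interface consumers query
```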
Output Ports as dbt Models
Output port models are usually implemented as SQL views or materialized tables that serve as the stable API of the data product. They abstract away the internal staging and transformation logic and allow changing the implementation without affecting consumers.
Instead of writing dbt model definitions by hand, the Data Contract CLI can generate them directly from the data contract:
```shell
datacontract export --format dbt \
  --output models/output_ports/customer_activity.yml \
  datacontracts/customer_activity_v1.odcs.yaml
```
This generates the dbt model YAML with columns, descriptions, and data types:
```yaml
# models/output_ports/customer_activity.yml (generated)
version: 2
models:
  - name: customer_activity
    description: Customer activity within Entropy Data for customer success.
    config:
      meta:
        data_contract: entropydata-customer-activity-v1
        owner: customer-success
      materialized: table
      contract:
        enforced: true
    columns:
      - name: organization_id
        description: Unique identifier of the organization.
        data_type: STRING
        constraints:
          - type: not_null
          - type: unique
      - name: organization_vanity_url
        description: Vanity URL of the organization.
        data_type: STRING
      - name: organization_created_by
        description: Email address of the user that created the organization.
        data_type: STRING
      - name: organization_created_at
        description: Timestamp when the organization was created.
        data_type: TIMESTAMP
      - name: users_total
        description: Total number of users in the organization.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: users_added_30d
        description: Number of users added in the last 30 days.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: users_last_signed_up_at
        description: Timestamp when the last user signed up.
        data_type: TIMESTAMP
      - name: dataproducts_total
        description: Total number of data products in the organization.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: dataproducts_added_30d
        description: Number of data products added in the last 30 days.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: dataproducts_outputports_total
        description: Total number of output ports across all data products.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: dataproducts_outputports_with_datacontract_percentage
        description: Percentage of output ports that have an active data contract.
        data_type: DOUBLE
        constraints:
          - type: not_null
      - name: dataproducts_outputports_with_testresults_percentage
        description: Percentage of output ports that have test results.
        data_type: DOUBLE
        constraints:
          - type: not_null
      - name: dataproducts_last_updated_at
        description: Timestamp when the last data product was updated.
        data_type: TIMESTAMP
      - name: assets_total
        description: Total number of assets in the organization.
        data_type: BIGINT
        constraints:
          - type: not_null
      - name: data_platform
        description: The data contract / output port type used for most data products.
        data_type: STRING
      - name: updated_at
        description: Timestamp when this record was written by the dbt execution.
        data_type: TIMESTAMP
        constraints:
          - type: not_null
```
Generating the dbt model file is usually a one-time operation.
You can now save the model in Git and add further details as needed.
Now, implement the actual SQL transformation logic for each output model:
```sql
-- models/output_ports/customer_activity.sql
{{ config(
    materialized='table',
    schema='op_v1'
) }}

select
    organization_id,
    organization_vanity_url,
    organization_created_by,
    organization_created_at,
    users_total,
    users_added_30d,
    users_last_signed_up_at,
    dataproducts_total,
    dataproducts_added_30d,
    dataproducts_outputports_total,
    dataproducts_outputports_with_datacontract_percentage,
    dataproducts_outputports_with_testresults_percentage,
    dataproducts_last_updated_at,
    assets_total,
    data_platform,
    current_timestamp() as updated_at  -- fulfills the contract's required updated_at field
from {{ ref('int_customer_activity') }}
```
Input Ports
A data product typically consumes data from operational systems or other data products. In our data product, we use data that was extracted via dlt from our operational Postgres database to a raw / source-aligned data product on the data platform (Databricks).
In dbt, inputs are modeled as sources.
The input_ports/ folder contains the source definitions and simple select models that reference them:
```yaml
# models/input_ports/sources.yml
version: 2
sources:
  - name: entropydata_postgres
    description: >
      Operational database of the Entropy Data platform.
    database: entropy_data_prod
    schema: entropydata_postgres
    tables:
      - name: organization
      - name: organization_member
      - name: data_product
      - name: asset
```
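The simple select models mentioned above can be thin passthroughs over the sources, so downstream models reference the input port rather than the raw source directly. A sketch of such a model (the file name is illustrative):

```sql
-- models/input_ports/organization.sql (hypothetical passthrough model)
select * from {{ source('entropydata_postgres', 'organization') }}
```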
Sources referencing other data products require an active data contract between the provider and consumer.
Transformation
Between input and output, the internal models handle the transformation work and business logic. These are not exposed to consumers.
Staging models clean and normalize raw source data: deduplicate to the latest version, cast types, and rename columns.
```sql
-- models/staging/stg_organizations.sql
with source as (
    select
        *,
        row_number() over (partition by organization_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'organization') }}
)
select
    organization_id,
    vanity_url as organization_vanity_url,
    created_by as organization_created_by,
    cast(created_at as timestamp) as organization_created_at
from source
where _row_num = 1
```

```sql
-- models/staging/stg_members.sql
with source as (
    select
        *,
        row_number() over (partition by organization_member_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'organization_member') }}
)
select
    organization_member_id,
    organization_id,
    user_id,
    role,
    cast(created_at as timestamp) as created_at
from source
where _row_num = 1
```

```sql
-- models/staging/stg_data_products.sql
with source as (
    select
        *,
        row_number() over (partition by data_product_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'data_product') }}
)
select
    data_product_id,
    organization_id,
    name,
    status,
    specification_type,
    cast(created_at as timestamp) as created_at,
    cast(updated_at as timestamp) as updated_at
from source
where _row_num = 1
```
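The stg_assets model, which the intermediate model references, follows the same deduplication pattern. A sketch with assumed column names (the asset table's primary key is not shown in this article):

```sql
-- models/staging/stg_assets.sql (sketch; asset_id is an assumed column name)
with source as (
    select
        *,
        row_number() over (partition by asset_id order by version desc) as _row_num
    from {{ source('entropydata_postgres', 'asset') }}
)
select
    asset_id,
    organization_id,
    cast(created_at as timestamp) as created_at
from source
where _row_num = 1
```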
Intermediate models apply business logic: join tables, enrich with master data, compute derived fields.
```sql
-- models/intermediate/int_customer_activity.sql
with members_per_org as (
    select
        organization_id,
        count(*) as users_total,
        count(case when created_at >= date_sub(current_date(), 30) then 1 end) as users_added_30d,
        max(created_at) as users_last_signed_up_at
    from {{ ref('stg_members') }}
    group by organization_id
),

data_products_per_org as (
    select
        organization_id,
        count(*) as dataproducts_total,
        count(case when created_at >= date_sub(current_date(), 30) then 1 end) as dataproducts_added_30d,
        -- TODO: add output_port source table to compute these metrics
        cast(0 as bigint) as dataproducts_outputports_total,
        cast(0 as double) as dataproducts_outputports_with_datacontract_percentage,
        cast(0 as double) as dataproducts_outputports_with_testresults_percentage,
        max(updated_at) as dataproducts_last_updated_at,
        first(specification_type) as data_platform
    from {{ ref('stg_data_products') }}
    group by organization_id
),

assets_per_org as (
    select
        organization_id,
        count(*) as assets_total
    from {{ ref('stg_assets') }}
    group by organization_id
)

select
    o.organization_id,
    o.organization_vanity_url,
    o.organization_created_by,
    o.organization_created_at,
    coalesce(m.users_total, 0) as users_total,
    coalesce(m.users_added_30d, 0) as users_added_30d,
    m.users_last_signed_up_at,
    coalesce(dp.dataproducts_total, 0) as dataproducts_total,
    coalesce(dp.dataproducts_added_30d, 0) as dataproducts_added_30d,
    coalesce(dp.dataproducts_outputports_total, 0) as dataproducts_outputports_total,
    coalesce(dp.dataproducts_outputports_with_datacontract_percentage, 0) as dataproducts_outputports_with_datacontract_percentage,
    coalesce(dp.dataproducts_outputports_with_testresults_percentage, 0) as dataproducts_outputports_with_testresults_percentage,
    dp.dataproducts_last_updated_at,
    coalesce(a.assets_total, 0) as assets_total,
    dp.data_platform
from {{ ref('stg_organizations') }} o
left join members_per_org m on o.organization_id = m.organization_id
left join data_products_per_org dp on o.organization_id = dp.organization_id
left join assets_per_org a on o.organization_id = a.organization_id
```
Building the Data Product
When we execute dbt run, dbt materializes all models and creates the output table on Databricks.
The result is a customer_activity table in the dp_entropydata_customer_activity_op_v1 schema, ready to be consumed by the customer success team.
Testing
Testing is what turns a dbt project into a trustworthy data product. Without tests, consumers have no reason to trust the data — they will build their own checks, duplicate logic, or simply avoid using the data product altogether. Automated tests make quality visible and give consumers confidence that the data they rely on is correct and complete.
Testing a data product works on two levels: unit tests that verify the internal implementation, and contract tests that verify the output from a consumer's perspective.
Unit Testing
dbt tests verify individual steps across the data pipeline. They are tightly coupled to the implementation and change as the dbt models evolve.
Schema tests are declared in the YAML files and validate not_null, unique, accepted_values, and relationships on each model.
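For the output port, such schema tests could look like the following excerpt. The accepted values list and the warn severity are illustrative choices, not part of the generated model above:

```yaml
# models/output_ports/customer_activity.yml (excerpt; tests are illustrative)
version: 2
models:
  - name: customer_activity
    columns:
      - name: organization_id
        tests:
          - not_null
          - unique
      - name: data_platform
        tests:
          - accepted_values:
              values: [bigquery, snowflake, s3, databricks]
              config:
                severity: warn  # warn rather than fail when a new platform type appears
```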
Custom data tests are SQL queries in the tests/ folder that assert business rules:
```sql
-- tests/assert_users_total_non_negative.sql
select organization_id
from {{ ref('customer_activity') }}
where users_total < 0
```
If the query returns any rows, the test fails.
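Another plausible business-rule test in the same style would check that the percentage metrics stay within bounds. This test is hypothetical and not part of the project structure shown above:

```sql
-- tests/assert_percentages_in_range.sql (hypothetical additional test)
select organization_id
from {{ ref('customer_activity') }}
where dataproducts_outputports_with_datacontract_percentage not between 0 and 100
   or dataproducts_outputports_with_testresults_percentage not between 0 and 100
```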
dbt contracts (since dbt 1.5) additionally enforce column names and data types at build time.
The generated dbt model already includes contract.enforced: true.
Contract Testing
The Data Contract CLI tests the output port from the consumer's point of view, against the data contract. This is an acceptance test: it connects to the actual data platform, queries the output tables, and checks schema, row counts, and quality rules defined in the contract.
```shell
datacontract test datacontracts/customer_activity_v1.odcs.yaml
```
Contract tests are more stable than dbt tests. They don't change when you refactor the internal staging or intermediate models. As long as the output port still fulfills the contract, the tests pass. This makes them ideal for CI/CD pipelines and for building consumer confidence.
CI/CD Pipeline
Each data product has its own CI/CD pipeline that runs on every push to main and on a daily schedule:
- dbt run materializes all models
- dbt test runs all schema and custom data tests
- Publish the data product and data contract to Entropy Data
- datacontract test runs contract tests against the output port
We use a GitHub Actions workflow for this:
```yaml
# .github/workflows/data-product.yml
name: Customer Activity Data Product

on:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *"

env:
  API: https://api.entropy-data.com/api
  DBT_DATABRICKS_HOST: ${{ secrets.DBT_DATABRICKS_HOST }}
  DBT_DATABRICKS_HTTP_PATH: ${{ secrets.DBT_DATABRICKS_HTTP_PATH }}
  DBT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }}
  DATACONTRACT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }}
  DATACONTRACT_DATABRICKS_SERVER_HOSTNAME: ${{ secrets.DBT_DATABRICKS_HOST }}
  DATACONTRACT_DATABRICKS_HTTP_PATH: ${{ secrets.DBT_DATABRICKS_HTTP_PATH }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install dbt-databricks openlineage-dbt datacontract-cli[databricks]

      - name: Create profiles.yml
        run: |
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml <<EOF
          dp_entropydata_customer_activity:
            target: prod
            outputs:
              prod:
                type: databricks
                catalog: entropy_data_prod
                schema: dp_entropydata_customer_activity
                host: ${DBT_DATABRICKS_HOST}
                http_path: ${DBT_DATABRICKS_HTTP_PATH}
                token: ${DBT_DATABRICKS_TOKEN}
                threads: 4
          EOF

      - name: dbt deps
        run: dbt deps

      - name: dbt run
        run: dbt-ol run --target prod

      - name: dbt test
        run: dbt test --target prod

      - name: Publish data product
        run: |
          yq -o=json customer-activity.odps.yaml \
            | curl -s -X PUT \
              -H 'Content-Type: application/json' \
              -H "x-api-key: ${{ secrets.ENTROPY_DATA_API_KEY }}" \
              -d @- $API/dataproducts/entropydata-customer-activity

      - name: Publish data contract
        run: |
          yq -o=json datacontracts/customer_activity_v1.odcs.yaml \
            | curl -s -X PUT \
              -H 'Content-Type: application/json' \
              -H "x-api-key: ${{ secrets.ENTROPY_DATA_API_KEY }}" \
              -d @- $API/datacontracts/entropydata-customer-activity-v1

      - name: Data contract test
        run: |
          datacontract test datacontracts/customer_activity_v1.odcs.yaml \
            --server production \
            --publish $API/test-results
        env:
          ENTROPY_DATA_API_KEY: ${{ secrets.ENTROPY_DATA_API_KEY }}
```
dbt run materializes the models, and dbt test runs the unit tests.
Then, the data product and data contract metadata are published to Entropy Data via its REST API using yq to convert YAML to JSON.
Finally, datacontract test validates the output port from the consumer perspective.
Entropy Data
Entropy Data is a data product marketplace that manages data products, data contracts, and access requests. It integrates with the dbt workflow described above to provide a complete data product experience:
- Discovery: consumers browse and find data products in a self-service marketplace
- Access management: consumers request access, approvals can trigger RBAC provisioning on the data platform
- Governance: track ownership, quality, and lineage across all data products
Here is how our Customer Activity data product looks in the Entropy Data marketplace.
Business Value
With this data product in place, our customer success team (actually these are our co-founders...) can now query the customer_activity table directly on Databricks to understand each customer's engagement:
how many users they have, whether they are actively adding data products, and what data platform they use.
This enables proactive outreach to customers who are ramping up, early detection of inactive accounts, and data-driven prioritization of support efforts.
It can support CRM activities and power AI agents that proactively identify where we can help customers build better data products or set up integrations.
The data contract guarantees the schema and quality, so they can build dashboards, agents, and automations on top of it with confidence.
Sign up now for free, or explore the clickable demo of Entropy Data.