# Teradata configurations

## General
- Set `quote_columns` - to prevent a warning, make sure to explicitly set a value for `quote_columns` in your `dbt_project.yml`. See the doc on quote_columns for more information.

  ```yaml
  seeds:
    +quote_columns: false # or `true` if you have CSV column headers with spaces
  ```
## Models

### table
- `table_kind` - defines the table kind. Legal values are `MULTISET` (default for the ANSI transaction mode required by `dbt-teradata`) and `SET`, e.g.:

  - in sql materialization definition file:

    ```sql
    {{
      config(
          materialized="table",
          table_kind="SET"
      )
    }}
    ```

  - in seed configuration:

    ```yaml
    seeds:
      <project-name>:
        table_kind: "SET"
    ```

  For details, see CREATE TABLE documentation.
- `table_option` - defines table options. The config supports multiple statements. The definition below uses the Teradata syntax definition to explain what statements are allowed. Square brackets `[]` denote optional parameters. The pipe symbol `|` separates statements. Use commas to combine multiple statements as shown in the examples below:

  ```
  { MAP = map_name [COLOCATE USING colocation_name] |
    [NO] FALLBACK [PROTECTION] |
    WITH JOURNAL TABLE = table_specification |
    [NO] LOG |
    [ NO | DUAL ] [BEFORE] JOURNAL |
    [ NO | DUAL | LOCAL | NOT LOCAL ] AFTER JOURNAL |
    CHECKSUM = { DEFAULT | ON | OFF } |
    FREESPACE = integer [PERCENT] |
    mergeblockratio |
    datablocksize |
    blockcompression |
    isolated_loading
  }
  ```

  where:

  - mergeblockratio:

    ```
    { DEFAULT MERGEBLOCKRATIO |
      MERGEBLOCKRATIO = integer [PERCENT] |
      NO MERGEBLOCKRATIO
    }
    ```

  - datablocksize:

    ```
    DATABLOCKSIZE = {
      data_block_size [ BYTES | KBYTES | KILOBYTES ] |
      { MINIMUM | MAXIMUM | DEFAULT } DATABLOCKSIZE
    }
    ```

  - blockcompression:

    ```
    BLOCKCOMPRESSION = { AUTOTEMP | MANUAL | ALWAYS | NEVER | DEFAULT }
      [, BLOCKCOMPRESSIONALGORITHM = { ZLIB | ELZS_H | DEFAULT } ]
      [, BLOCKCOMPRESSIONLEVEL = { value | DEFAULT } ]
    ```

  - isolated_loading:

    ```
    WITH [NO] [CONCURRENT] ISOLATED LOADING [ FOR { ALL | INSERT | NONE } ]
    ```
  Examples:

  - in sql materialization definition file:

    ```sql
    {{
      config(
          materialized="table",
          table_option="NO FALLBACK"
      )
    }}
    ```

    ```sql
    {{
      config(
          materialized="table",
          table_option="NO FALLBACK, NO JOURNAL"
      )
    }}
    ```

    ```sql
    {{
      config(
          materialized="table",
          table_option="NO FALLBACK, NO JOURNAL, CHECKSUM = ON,
                        NO MERGEBLOCKRATIO,
                        WITH CONCURRENT ISOLATED LOADING FOR ALL"
      )
    }}
    ```

  - in seed configuration:

    ```yaml
    seeds:
      <project-name>:
        table_option: "NO FALLBACK"
    ```

    ```yaml
    seeds:
      <project-name>:
        table_option: "NO FALLBACK, NO JOURNAL"
    ```

    ```yaml
    seeds:
      <project-name>:
        table_option: "NO FALLBACK, NO JOURNAL, CHECKSUM = ON,
                       NO MERGEBLOCKRATIO,
                       WITH CONCURRENT ISOLATED LOADING FOR ALL"
    ```

  For details, see CREATE TABLE documentation.
- `with_statistics` - specifies whether statistics should be copied from the base table, for example:

  ```sql
  {{
    config(
        materialized="table",
        with_statistics="true"
    )
  }}
  ```

  For details, see CREATE TABLE documentation.
- `index` - defines table indices:

  ```
  [UNIQUE] PRIMARY INDEX [index_name] ( index_column_name [,...] ) |
  NO PRIMARY INDEX |
  PRIMARY AMP [INDEX] [index_name] ( index_column_name [,...] ) |
  PARTITION BY { partitioning_level | ( partitioning_level [,...] ) } |
  UNIQUE INDEX [ index_name ] [ ( index_column_name [,...] ) ] [loading] |
  INDEX [index_name] [ALL] ( index_column_name [,...] ) [ordering] [loading]
  [,...]
  ```

  where:

  - partitioning_level:

    ```
    { partitioning_expression |
      COLUMN [ [NO] AUTO COMPRESS |
      COLUMN [ [NO] AUTO COMPRESS ] [ ALL BUT ] column_partition ]
    } [ ADD constant ]
    ```

  - ordering:

    ```
    ORDER BY [ VALUES | HASH ] [ ( order_column_name ) ]
    ```

  - loading:

    ```
    WITH [NO] LOAD IDENTITY
    ```
  Examples:

  - in sql materialization definition file:

    ```sql
    {{
      config(
          materialized="table",
          index="UNIQUE PRIMARY INDEX ( GlobalID )"
      )
    }}
    ```

    ℹ️ Note, unlike in `table_option`, there are no commas between index statements!

    ```sql
    {{
      config(
          materialized="table",
          index="PRIMARY INDEX(id)
                 PARTITION BY RANGE_N(create_date
                              BETWEEN DATE '2020-01-01'
                              AND DATE '2021-01-01'
                              EACH INTERVAL '1' MONTH)"
      )
    }}
    ```

    ```sql
    {{
      config(
          materialized="table",
          index="PRIMARY INDEX(id)
                 PARTITION BY RANGE_N(create_date
                              BETWEEN DATE '2020-01-01'
                              AND DATE '2021-01-01'
                              EACH INTERVAL '1' MONTH)
                 INDEX index_attrA (attrA) WITH LOAD IDENTITY"
      )
    }}
    ```

  - in seed configuration:

    ```yaml
    seeds:
      <project-name>:
        index: "UNIQUE PRIMARY INDEX ( GlobalID )"
    ```

    ℹ️ Note, unlike in `table_option`, there are no commas between index statements!

    ```yaml
    seeds:
      <project-name>:
        index: "PRIMARY INDEX(id)
                PARTITION BY RANGE_N(create_date
                             BETWEEN DATE '2020-01-01'
                             AND DATE '2021-01-01'
                             EACH INTERVAL '1' MONTH)"
    ```

    ```yaml
    seeds:
      <project-name>:
        index: "PRIMARY INDEX(id)
                PARTITION BY RANGE_N(create_date
                             BETWEEN DATE '2020-01-01'
                             AND DATE '2021-01-01'
                             EACH INTERVAL '1' MONTH)
                INDEX index_attrA (attrA) WITH LOAD IDENTITY"
    ```
## Seeds
As explained in dbt seeds documentation, seeds should not be used to load raw data (for example, large CSV exports from a production database).
Since seeds are version controlled, they are best suited to files that contain business-specific logic, for example a list of country codes or user IDs of employees.
Loading CSVs using dbt's seed functionality is not performant for large files. Consider using a different tool to load these CSVs into your data warehouse.
- `use_fastload` - use FastLoad when handling the `dbt seed` command. The option will likely speed up loading when your seed files have hundreds of thousands of rows. You can set this seed configuration option in your `dbt_project.yml` file, e.g.:

  ```yaml
  seeds:
    <project-name>:
      +use_fastload: true
  ```
## Snapshots

Snapshots use the `HASHROW` function of the Teradata database to generate a unique hash value for the `dbt_scd_id` column.

To use your own hash UDF, there is a configuration option in the snapshot model called `snapshot_hash_udf`, which defaults to `HASHROW`. You can provide a value like `<database_name>.<hash_udf_name>`. If you only provide `<hash_udf_name>`, it uses the same schema that the model runs in.

For example, in the `snapshots/snapshot_example.sql` file:
```sql
{% snapshot snapshot_example %}
{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='check',
    check_cols=["c2"],
    snapshot_hash_udf='GLOBAL_FUNCTIONS.hash_md5'
  )
}}
select * from {{ ref('order_payments') }}
{% endsnapshot %}
```
## Grants

Grants are supported in the dbt-teradata adapter with release version 1.2.0 and above. You can use grants to manage access to the datasets you're producing with dbt. To implement these permissions, define grants as resource configs on each model, seed, or snapshot. Define the default grants that apply to the entire project in your `dbt_project.yml`, and define model-specific grants within each model's SQL or YAML file.
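For instance, a project-wide default grant could be sketched in `dbt_project.yml` like this (the grantee name is a placeholder):

```yaml
models:
  +grants:
    select: ['reporting_user'] # placeholder database user or role
```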
For example, in `models/schema.yml`:

```yaml
models:
  - name: model_name
    config:
      grants:
        select: ['user_a', 'user_b']
```
Another example for adding multiple grants:

```yaml
models:
  - name: model_name
    config:
      materialized: table
      grants:
        select: ["user_b"]
        insert: ["user_c"]
```
ℹ️ `copy_grants` is not supported in Teradata.
Refer to grants for more information on Grants.
## Query band

Query band in dbt-teradata can be set at three levels:

- Profiles level: In the `profiles.yml` file, the user can provide `query_band` using the following example (see the fuller `profiles.yml` sketch after this list):

  ```yaml
  query_band: 'application=dbt;'
  ```

- Project level: In the `dbt_project.yml` file, the user can provide `query_band` using the following example:

  ```yaml
  models:
    Project_name:
      +query_band: "app=dbt;model={model};"
  ```

- Model level: It can be set in the model SQL file or in the model-level configuration in YAML files:

  ```sql
  {{ config( query_band='sql={model};' ) }}
  ```
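For reference, a minimal `profiles.yml` sketch showing where the profiles-level `query_band` might sit; the profile name and all connection values are placeholders, and the exact set of connection fields depends on your setup:

```yaml
my_teradata_profile:        # placeholder profile name
  target: dev
  outputs:
    dev:
      type: teradata
      host: <hostname>
      user: <username>
      password: <password>
      schema: <schema_name>
      tmode: ANSI
      query_band: 'application=dbt;'
```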
Users can set `query_band` at any level or at all levels. With a profiles-level `query_band`, dbt-teradata sets the `query_band` once at the start of the session; the model-level and project-level query band then updates it with the respective configuration.

If a user sets a key-value pair with the value `'{model}'`, internally this `'{model}'` will be replaced with the model name, which can be useful for telemetry tracking of SQL/DBQL logging.
```yaml
models:
  Project_name:
    +query_band: "app=dbt;model={model};"
```

- For example, if the model the user is running is `stg_orders`, `{model}` will be replaced with `stg_orders` at runtime.
- If no `query_band` is set by the user, the default query band used will be `org=teradata-internal-telem;appname=dbt;`.
## Unit testing

- Unit testing is supported in dbt-teradata, allowing users to write and execute unit tests using the `dbt test` command.
- For detailed guidance, refer to the dbt unit tests documentation.
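As a rough illustration, a unit test is defined in YAML next to the model; the model and column names below are hypothetical and assume `fct_orders` simply passes these columns through from `stg_orders`:

```yaml
unit_tests:
  - name: test_fct_orders_passthrough
    model: fct_orders              # hypothetical model under test
    given:
      - input: ref('stg_orders')   # hypothetical upstream model
        rows:
          - {order_id: 1, amount: 10}
          - {order_id: 2, amount: 25}
    expect:
      rows:
        - {order_id: 1, amount: 10}
        - {order_id: 2, amount: 25}
```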
In Teradata, reusing the same alias across multiple common table expressions (CTEs) or subqueries within a single model is not permitted and results in parsing errors. Assign a unique alias to each CTE or subquery so the query executes properly, as in the sketch below.
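A minimal sketch of the constraint; the `stg_orders` and `stg_payments` sources are hypothetical:

```sql
-- Each CTE gets its own alias; naming both of them `src` would raise a
-- Teradata parsing error.
with orders_src as (
    select * from {{ ref('stg_orders') }}
),
payments_src as (
    select * from {{ ref('stg_payments') }}
)
select
    orders_src.order_id,
    payments_src.payment_id
from orders_src
join payments_src
  on orders_src.order_id = payments_src.order_id
```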
## valid_history incremental materialization strategy

This is available in early access.

This strategy is designed to manage historical data efficiently within a Teradata environment, leveraging dbt features to ensure data quality and optimal resource usage. In temporal databases, valid time is crucial for applications like historical reporting, ML training datasets, and forensic analysis.
```sql
{{
  config(
      materialized='incremental',
      unique_key='id',
      on_schema_change='fail',
      incremental_strategy='valid_history',
      valid_period='valid_period_col',
      use_valid_to_time='no',
  )
}}
```
The `valid_history` incremental strategy requires the following parameters:

- `unique_key`: The primary key of the model (excluding the valid time components), specified as a column name or list of column names.
- `valid_period`: Name of the model column indicating the period for which the record is considered to be valid. The datatype must be `PERIOD(DATE)` or `PERIOD(TIMESTAMP)`.
- `use_valid_to_time`: Whether the end bound value of the valid period in the input is considered by the strategy when building the valid timeline. Use `no` if you consider your record to be valid until changed (and supply any value greater than the begin bound for the end bound of the period; a typical convention is `9999-12-31` or `9999-12-31 23:59:59.999999`). Use `yes` if you know until when the record is valid (typically this is a correction in the history timeline).
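Putting the parameters together, a minimal model sketch; the `stg_customers` source, its columns, and the `updated_at` timestamp are hypothetical, and the far-future end bound follows the convention described above:

```sql
{{
    config(
        materialized='incremental',
        unique_key='customer_id',
        on_schema_change='fail',
        incremental_strategy='valid_history',
        valid_period='valid_period_col',
        use_valid_to_time='no'
    )
}}

select
    customer_id,
    customer_status,
    -- build the PERIOD(TIMESTAMP) column referenced by valid_period;
    -- with use_valid_to_time='no' the end bound only needs to exceed the begin bound
    period(
        cast(updated_at as timestamp(6)),
        cast('9999-12-31 23:59:59.999999' as timestamp(6))
    ) as valid_period_col
from {{ ref('stg_customers') }}
```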
The `valid_history` strategy in dbt-teradata involves several critical steps to ensure the integrity and accuracy of historical data management:

- Remove duplicates and conflicting values from the source data:
  - This step ensures that the data is clean and ready for further processing by eliminating any redundant or conflicting records.
  - Primary-key duplicates (two or more records with the same value for the `unique_key` and the BEGIN() bound of the `valid_period` fields) in the dataset produced by the model are removed. If such duplicates exist, the row with the lowest value is retained for all non-primary-key fields (in the order specified in the model). Full-row duplicates are always de-duplicated.
- Identify and adjust overlapping time slices:
  - Overlapping or adjacent time periods in the data are corrected to maintain a consistent and non-overlapping timeline. To achieve this, the macro adjusts the valid period end bound of a record to align with the begin bound of the next record (if they overlap or are adjacent) within the same `unique_key` group. If `use_valid_to_time = 'yes'`, the valid period end bound provided in the source data is used. Otherwise, a default end date is applied for missing bounds, and adjustments are made accordingly.
- Manage records needing to be adjusted, deleted, or split based on the source and target data:
  - This involves handling scenarios where records in the source data overlap with or need to replace records in the target data, ensuring that the historical timeline remains accurate.
- Compact history:
  - Normalize and compact the history by merging records of adjacent time periods with the same value, optimizing database storage and performance. We use the function TD_NORMALIZE_MEET for this purpose.
- Delete existing overlapping records from the target table:
  - Before inserting new or updated records, any existing records in the target table that overlap with the new data are removed to prevent conflicts.
- Insert the processed data into the target table:
  - Finally, the cleaned and adjusted data is inserted into the target table, ensuring that the historical data is up-to-date and accurately reflects the intended timeline.

These steps collectively ensure that the `valid_history` strategy effectively manages historical data, maintaining its integrity and accuracy while optimizing performance.
An illustration demonstrating the source sample data and its corresponding target data:

```
-- Source data
pk | valid_from                | value_txt1 | value_txt2
======================================================================
1  | 2024-03-01 00:00:00.0000  | A          | x1
1  | 2024-03-12 00:00:00.0000  | B          | x1
1  | 2024-03-12 00:00:00.0000  | B          | x2
1  | 2024-03-25 00:00:00.0000  | A          | x2
2  | 2024-03-01 00:00:00.0000  | A          | x1
2  | 2024-03-12 00:00:00.0000  | C          | x1
2  | 2024-03-12 00:00:00.0000  | D          | x1
2  | 2024-03-13 00:00:00.0000  | C          | x1
2  | 2024-03-14 00:00:00.0000  | C          | x1

-- Target data
pk | valid_period                                                        | value_txt1 | value_txt2
===================================================================================================
1  | PERIOD(TIMESTAMP)[2024-03-01 00:00:00.0, 2024-03-12 00:00:00.0]     | A          | x1
1  | PERIOD(TIMESTAMP)[2024-03-12 00:00:00.0, 2024-03-25 00:00:00.0]     | B          | x1
1  | PERIOD(TIMESTAMP)[2024-03-25 00:00:00.0, 9999-12-31 23:59:59.9999]  | A          | x2
2  | PERIOD(TIMESTAMP)[2024-03-01 00:00:00.0, 2024-03-12 00:00:00.0]     | A          | x1
2  | PERIOD(TIMESTAMP)[2024-03-12 00:00:00.0, 9999-12-31 23:59:59.9999]  | C          | x1
```
## Common Teradata-specific tasks

- collect statistics - when a table is created or modified significantly, there might be a need to tell Teradata to collect statistics for the optimizer. It can be done using the `COLLECT STATISTICS` command. You can perform this step using dbt's `post-hooks`, e.g.:

  ```sql
  {{ config(
    post_hook=[
      "COLLECT STATISTICS ON {{ this }} COLUMN (column_1, column_2 ...);"
    ]
  )}}
  ```

  See Collecting Statistics documentation for more information.