SCDType2DeltaMerge¶
It is a Delta merge operation that stores and manages both current and historical data over time.
Type¶
transform
Class¶
fire.nodes.etl.NodeSCDType2DeltaMerge
Fields¶
Name |
Title |
Description |
|---|---|---|
targetType |
Target Type |
Target Type Table or Path |
targetLocation |
Target Delta Path |
Delta Path of the target location |
tableName |
TableName |
TableName |
keyColumns |
Key Columns |
Merge Check. |
active_flag_col |
Active Flag Column |
Used to mark which record is the current active one. |
active_flag_vals |
Active Flag Value |
True or False OR 1 or 0. |
start_date_col |
Start Date Column Name |
Column Name when a particular version of a record became valid or active |
end_date_col |
End Date Column Name |
Column Name when a particular version of a record became invalid or inactive |
start_date_val |
Start Date Column Value |
date value for particular version of a record became valid or active |
end_date_val |
End Date Column Value |
date value for a particular version of a record became valid or active |
Details¶
SCDType2 Delta Merge Node Details¶
It is a Delta merge operation that stores and manages both current and historical data over time.
Parameters to be set:¶
OUTPUT STORAGE LEVEL: Choose how the data should be stored (e.g., DEFAULT, DISK_ONLY).
TARGET TYPE: Select the type of target for the merge operation. This could be a Delta Lake table or a path to a directory.
TARGET DELTA PATH: If “TARGET TYPE” is set to “Delta”, specify the directory where the Delta Lake table is located. Use the “Browse” button to navigate and select the location.
TABLENAME: If “TARGET TYPE” is set to “Table”, specify the name of the Delta Lake table.
KEY COLUMNS: Specify the column(s) that uniquely identify rows in the dimension table (e.g., customer_id, product_id). These columns are used to match rows between the source DataFrame and the target Delta Lake table. Select the key column(s) from the “Available” list and move them to the “Selected” list using the arrow buttons.
ACTIVE FLAG COLUMN: Specify the name of the column in the Delta Lake table that indicates whether a record is currently active. This column is typically a boolean or flag column (e.g., is_current, is_active).
ACTIVE FLAG VALUE: Specify the value that represents an “active” record in the ACTIVE FLAG COLUMN. This is usually true or 1, but it could be a different value depending on your data.
START DATE COLUMN NAME: Specify the name of the column in the Delta Lake table that will store the start date of a record’s validity. This column is used in SCD2 to track when a particular version of a record became active (e.g., effective_start_date, valid_from).
END DATE COLUMN NAME: Specify the name of the column in the Delta Lake table that will store the end date of a record’s validity. This column is used in SCD2 to track when a particular version of a record became inactive (e.g., effective_end_date, valid_to).
START DATE COLUMN VALUE: Specify the value to be used for the START DATE COLUMN when a new record is inserted or an existing record is updated. This is often a function like current_date() to use the current date, but it could be a literal value or an expression.
END DATE COLUMN VALUE: Specify the value to be used for the END DATE COLUMN when a new record is inserted or an existing record is updated. This is often a very large date in the future (to represent “infinity”) or a function like current_date() to use the current date (if you’re updating the end date of the previous record).
Examples¶
SCD Type 2 Delta Merge Node Examples¶
Example of Performing an SCD2 Merge¶
OUTPUT STORAGE LEVEL: DEFAULT
TARGET TYPE: Delta
TARGET DELTA PATH: /user/hive/warehouse/dim_customers
KEY COLUMNS: customer_id
ACTIVE FLAG COLUMN: is_current
ACTIVE FLAG VALUE: true
START DATE COLUMN NAME: effective_start_date
END DATE COLUMN NAME: effective_end_date
START DATE COLUMN VALUE: current_date()
END DATE COLUMN VALUE: date_add(current_date(), 10000) (adding 10000 days to represent a date far in the future)
This configuration would:
Perform an SCD2 merge on the dim_customers Delta Lake table.
Use customer_id to match rows.
Use is_current as the active flag column, with true representing active records.
Use effective_start_date and effective_end_date to track the validity period of each record.
Set the effective_start_date to the current date for new and updated records.
Set the effective_end_date to a date far in the future for new and updated records.
This setup ensures that you maintain a full history of changes in your customer dimension table, allowing you to analyze customer data over time.