Data Drift¶

This node calculates the Population Stability Index (PSI) for a set of features by comparing a reference dataset to a test dataset. It is designed to identify potential data drift in both continuous and categorical features.
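This page does not spell out the formula the node applies. For reference, the standard definition of PSI over B bins (or categories) is sketched below; the symbols are notation introduced here, with r_i the fraction of reference rows falling in bin i and t_i the fraction of test rows in the same bin.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(t_i - r_i\right) \ln\frac{t_i}{r_i}
```

Each term is non-negative, so PSI is 0 when the two distributions match bin for bin and grows as they diverge.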

Input¶

Two DataFrames: a reference dataset and a test dataset. The reference dataset serves as the baseline distribution, and the test dataset is compared against that baseline to detect drift.

Output¶

A Spark DataFrame with two columns: 'feature_name' and 'psi_value'. Each row holds one feature and its PSI value. A value of 0 means the reference and test distributions match, and larger values indicate greater drift.
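The node documentation does not define cut-offs for 'psi_value'. As an illustration only, the sketch below applies a widely used rule of thumb (below 0.1 stable, 0.1 to 0.25 moderate shift, above 0.25 significant shift) to output rows. The function name `drift_level`, the thresholds, and the sample rows are assumptions for this example, not part of the node.

```python
def drift_level(psi_value, moderate=0.1, significant=0.25):
    """Label a PSI value using rule-of-thumb thresholds (assumed, not from the node)."""
    if psi_value < moderate:
        return "stable"
    if psi_value < significant:
        return "moderate"
    return "significant"


# Hypothetical rows shaped like the node output: (feature_name, psi_value).
rows = [("age", 0.02), ("plan", 0.27)]

# Attach a label to each feature.
report = [(name, value, drift_level(value)) for name, value in rows]
```

Thresholds like these are conventions, so they are worth tuning against features whose drift behaviour is already understood.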

Type¶

ml-estimator

Class¶

fire.nodes.ml.NodeDataDrift

Fields¶

Name	Title	Description
inputCols	Input Columns	A list of feature names on which the PSI will be calculated.
categoricalCols	Categorical Columns	A list of features that are categorical. All other features are treated as continuous.
numBins	Num Bins	The number of bins to use when binning continuous features for PSI calculation.
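To make the roles of the three fields concrete, here is a minimal pure-Python sketch of a PSI calculation driven by the same parameters. It is not the node's implementation. It assumes equal-width bins derived from the reference range (the node may bin differently, for example by quantiles), clamps out-of-range test values into the edge bins, and floors empty-bin proportions at a small constant so the logarithm stays finite. The names `data_drift`, `bin_index`, `proportions`, `psi`, and `EPS` are hypothetical.

```python
import math
from collections import Counter

EPS = 1e-6  # assumption: floor for empty bins so log() stays finite


def proportions(labels, keys):
    """Fraction of rows per key, floored at EPS."""
    counts = Counter(labels)
    total = len(labels)
    return [max(counts[k] / total, EPS) for k in keys]


def psi(ref_props, test_props):
    """Standard PSI: sum over bins of (t - r) * ln(t / r)."""
    return sum((t - r) * math.log(t / r) for r, t in zip(ref_props, test_props))


def bin_index(value, lo, hi, num_bins):
    """Equal-width bin index; values outside [lo, hi] are clamped to the edge bins."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * num_bins)
    return min(max(idx, 0), num_bins - 1)


def data_drift(reference, test, input_cols, categorical_cols, num_bins):
    """reference/test: dict mapping column name to a list of values.

    Returns a list of (feature_name, psi_value) rows.
    """
    rows = []
    for col in input_cols:
        ref_vals, test_vals = reference[col], test[col]
        if col in categorical_cols:
            # Categorical: each distinct value is its own bin.
            keys = sorted(set(ref_vals) | set(test_vals))
            ref_labels, test_labels = ref_vals, test_vals
        else:
            # Continuous: bin edges come from the reference range only.
            lo, hi = min(ref_vals), max(ref_vals)
            keys = list(range(num_bins))
            ref_labels = [bin_index(v, lo, hi, num_bins) for v in ref_vals]
            test_labels = [bin_index(v, lo, hi, num_bins) for v in test_vals]
        value = psi(proportions(ref_labels, keys), proportions(test_labels, keys))
        rows.append((col, value))
    return rows


# Usage with made-up data: "age" is unchanged, "plan" shifts toward "b".
reference = {"age": [20, 30, 40, 50], "plan": ["a", "a", "b", "b"]}
test = {"age": [20, 30, 40, 50], "plan": ["a", "b", "b", "b"]}
rows = data_drift(reference, test, ["age", "plan"], ["plan"], 4)
```

In this sketch, a larger `num_bins` makes the comparison more sensitive to local shifts but leaves fewer rows per bin, so small datasets produce noisier PSI values.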