Data Drift
This node calculates the Population Stability Index (PSI) for a set of features by comparing a reference dataset to a test dataset. It is designed to identify potential data drift in both continuous and categorical features.
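For reference, PSI is usually defined as below. This is the standard textbook form, not a statement of the node's exact implementation. The symbols are assumptions introduced here: B is the number of bins or categories, p_i is the share of the reference dataset in bin i, and q_i is the share of the test dataset in the same bin.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left( q_i - p_i \right) \ln\!\left( \frac{q_i}{p_i} \right)
```

Every term is non-negative, so PSI is 0 when the two distributions match bin for bin and grows as they diverge. How empty bins (p_i or q_i equal to 0) are handled is implementation-specific and is not described here.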
Input
A reference dataset and a test dataset provided as DataFrames. The reference dataset serves as the baseline distribution, while the test dataset is used to detect any drift.
Output
A Spark DataFrame with two columns: ‘feature_name’ and ‘psi_value’. Each row holds one feature and its PSI value; a larger value means the feature's distribution in the test dataset differs more from the reference dataset.
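As a rough guide to reading the PSI values, the sketch below applies a widely used rule of thumb (below 0.1 stable, 0.1 to 0.25 moderate, above 0.25 significant). These thresholds are a common convention, not something the node defines, and the function name, labels, and sample rows are hypothetical.

```python
def drift_level(psi, minor=0.1, major=0.25):
    """Label a PSI value using rule-of-thumb thresholds (assumed, not from the node)."""
    if psi < minor:
        return "stable"
    if psi < major:
        return "moderate drift"
    return "significant drift"


# Hypothetical rows in the shape of the node's output: (feature_name, psi_value).
rows = [("age", 0.03), ("income", 0.18), ("region", 0.41)]
labels = {name: drift_level(psi) for name, psi in rows}
```

Adjust the cut-offs to suit the data; they are conventions rather than statistical tests.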
Type
ml-estimator
Class
fire.nodes.ml.NodeDataDrift
Fields
| Name | Title | Description |
|---|---|---|
| inputCols | Input Columns | A list of feature names on which the PSI will be calculated. |
| categoricalCols | Categorical Columns | A list of features that are categorical. All other features are treated as continuous. |
| numBins | Num Bins | The number of bins to use when binning continuous features for PSI calculation. |
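To illustrate how the fields relate to the calculation, here is a minimal NumPy sketch of PSI for one continuous and one categorical feature. It is not the node's implementation. The equal-width bins taken from the reference data, the clipping of out-of-range test values, the `eps` smoothing, and all function names are assumptions made for the example; `num_bins` plays the role of numBins.

```python
from collections import Counter

import numpy as np


def psi_from_counts(ref_counts, test_counts, eps=1e-6):
    """PSI from two aligned arrays of per-bin (or per-category) counts."""
    ref = np.asarray(ref_counts, dtype=float)
    test = np.asarray(test_counts, dtype=float)
    # Convert counts to proportions; eps avoids log(0) and division by zero (assumed smoothing).
    p = np.clip(ref / ref.sum(), eps, None)
    q = np.clip(test / test.sum(), eps, None)
    return float(np.sum((q - p) * np.log(q / p)))


def psi_continuous(ref_values, test_values, num_bins=10):
    """PSI for a continuous feature using equal-width bins from the reference data."""
    ref_values = np.asarray(ref_values, dtype=float)
    test_values = np.asarray(test_values, dtype=float)
    edges = np.histogram_bin_edges(ref_values, bins=num_bins)
    # Pull test values outside the reference range into the outermost bins.
    test_clipped = np.clip(test_values, edges[0], edges[-1])
    ref_counts, _ = np.histogram(ref_values, bins=edges)
    test_counts, _ = np.histogram(test_clipped, bins=edges)
    return psi_from_counts(ref_counts, test_counts)


def psi_categorical(ref_values, test_values):
    """PSI for a categorical feature: each distinct category is its own bin."""
    ref_c, test_c = Counter(ref_values), Counter(test_values)
    categories = sorted(set(ref_c) | set(test_c))
    return psi_from_counts(
        [ref_c[c] for c in categories],
        [test_c[c] for c in categories],
    )
```

A feature listed in categoricalCols would go through something like `psi_categorical`, and every other feature in inputCols through something like `psi_continuous`. Quantile-based bins are a common alternative to equal-width bins; which one the node uses is not stated here.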