Data Drift

This node calculates the Population Stability Index (PSI) for a set of features by comparing a reference dataset to a test dataset. It is designed to identify potential data drift in both continuous and categorical features.
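For reference, the commonly used definition of PSI is given below. This is the standard textbook formula, not a statement of the node's internals; the symbols B, p_i and q_i are introduced here for illustration, and how the node handles empty bins is not described on this page.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left( q_i - p_i \right) \ln\!\left( \frac{q_i}{p_i} \right)
```

Here B is the number of bins (or distinct categories), p_i is the proportion of reference rows falling in bin i, and q_i is the proportion of test rows falling in the same bin. Each term is non-negative, so PSI is 0 when the two distributions match and grows as they diverge.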

Input

A reference dataset and a test dataset provided as DataFrames. The reference dataset serves as the baseline distribution, while the test dataset is used to detect any drift.

Output

A Spark DataFrame with two columns: ‘feature_name’ and ‘psi_value’. Each row holds one feature and its PSI value; a value near 0 means the test distribution closely matches the reference, and larger values indicate stronger drift.
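As a rough guide to reading ‘psi_value’, the sketch below applies a widely quoted rule of thumb (below 0.1, 0.1 to 0.25, above 0.25). These thresholds are a common convention and an assumption here; they are not defined by the node, and the function name `interpret_psi` is hypothetical.

```python
def interpret_psi(psi_value: float) -> str:
    """Map a PSI value to a coarse drift label.

    The cut-offs 0.1 and 0.25 are a conventional rule of thumb,
    not thresholds taken from the node itself.
    """
    if psi_value < 0.1:
        return "stable"
    if psi_value < 0.25:
        return "moderate drift"
    return "significant drift"


# Example: label a few (feature_name, psi_value) pairs shaped like the node's output rows.
rows = [("age", 0.03), ("income", 0.18), ("region", 0.41)]  # illustrative values only
labels = {name: interpret_psi(psi) for name, psi in rows}
```

The cut-offs should be tuned to the use case; a feature that is critical to a model may warrant a lower alert threshold.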

Type

ml-estimator

Class

fire.nodes.ml.NodeDataDrift

Fields

| Name | Title | Description |
|---|---|---|
| inputCols | Input Columns | A list of feature names on which the PSI will be calculated. |
| categoricalCols | Categorical Columns | A list of features that are categorical. All other features are treated as continuous. |
| numBins | Num Bins | The number of bins to use when binning continuous features for PSI calculation. |
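
To make the role of the fields concrete, here is a minimal pure-Python sketch of a PSI calculation. It is not the node's implementation: it assumes equal-width bins derived from the reference data, a small smoothing constant `EPS` for empty bins, and hypothetical function names (`psi_continuous`, `psi_categorical`). The `num_bins` argument plays the role of numBins, and the choice between the two functions mirrors categoricalCols.

```python
import math
from collections import Counter

EPS = 1e-6  # assumed floor for empty bins so log() is defined; not taken from the node


def _psi(ref_props, test_props):
    """Sum (q - p) * ln(q / p) over matching bins, flooring proportions at EPS."""
    total = 0.0
    for p, q in zip(ref_props, test_props):
        p, q = max(p, EPS), max(q, EPS)
        total += (q - p) * math.log(q / p)
    return total


def psi_continuous(reference, test, num_bins=10):
    """PSI for a continuous feature using equal-width bins built from the reference data."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # constant reference column: everything lands in one bin

    def bin_index(x):
        # Values outside the reference range are clamped into the first or last bin.
        return min(max(int((x - lo) / width), 0), num_bins - 1)

    ref_counts = [0] * num_bins
    test_counts = [0] * num_bins
    for x in reference:
        ref_counts[bin_index(x)] += 1
    for x in test:
        test_counts[bin_index(x)] += 1

    ref_props = [c / len(reference) for c in ref_counts]
    test_props = [c / len(test) for c in test_counts]
    return _psi(ref_props, test_props)


def psi_categorical(reference, test):
    """PSI for a categorical feature, treating each distinct value as its own bin."""
    ref_counts, test_counts = Counter(reference), Counter(test)
    categories = sorted(set(ref_counts) | set(test_counts))
    ref_props = [ref_counts[c] / len(reference) for c in categories]
    test_props = [test_counts[c] / len(test) for c in categories]
    return _psi(ref_props, test_props)


# Usage: one continuous and one categorical feature (illustrative data only).
ref_age = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
test_age = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
psi_age = psi_continuous(ref_age, test_age, num_bins=5)
psi_region = psi_categorical(["north", "south", "south"], ["north", "north", "east"])
```

A larger num_bins captures finer distribution changes but leaves fewer rows per bin, which makes the estimate noisier on small datasets.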