XGBoost Classifier¶

Input¶

It takes a DataFrame as input and trains an XGBoost classification model on it.

Output¶

The trained XGBoost model is passed along to the next nodes. The input DataFrame is also passed along to the next nodes.

Type¶

ml-estimator

Class¶

fire.nodes.ml.NodeXGBoostClassifier

Fields¶

Name	Title	Description
featuresCol	Features Column	Features column of type vectorUDT for model fitting
labelCol	Label Column	The label column for model fitting
predictionCol	Prediction Column	The prediction column created during model scoring.
splitRatio	Split Ratio	Split Ratio
numClass	Num Class
maxDepth	Max Depth	The Maximum depth of a tree
maxBins	Max Bins	The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= number of categories in any categorical feature.
maxLeaves	Max Leaves
numRound	Num Round
numWorkers	Num Workers
objective	Objective
eta	Eta
regLambda	Reg Lambda
regAlpha	Reg Alpha
subsample	Sub Sample
sampleType	Sample Type
treeMethod	Tree Method
useExternalMemory	Use External Memory
seed	Seed
baseScore	Base Score
minChildWeight	Min Child Weight
colsampleBylevel	Col Sample By Level
colsampleBytree	Col Sample By Tree
minSplitLoss	Min Split Loss
maxDeltaStep	Max Delta Step
sketchEps	Sketch Eps
scalePosWeight	Scale Pos Weight
growPlicy	Grow Policy
normalizeType	Normalize Type
skipDrop	Skip Drop
rateDrop	Rate Drop

Details¶

Details: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#xgboost4j-spark-tutorial-version-0-9

XGBoost Classifier Node Details¶

This node implements the XGBoost algorithm for classification tasks. It can be used for a variety of classification problems, including binary classification (e.g., spam detection) and multi-class classification (e.g., image recognition).

Key Parameters:

Features Column: The name of the column containing the features used for training.

Label Column: The name of the column containing the target variable to be predicted.

Prediction Column: The name of the column where the model's predictions will be stored during scoring.

Num Class: The number of classes in the classification problem.

Max Depth: The maximum depth of each tree in the ensemble. Higher values can lead to overfitting.

Max Bins: The maximum number of bins to use for histogram-based approximations.

Max Leaves: The maximum number of leaves per tree.

Num Round: The number of boosting rounds (trees) to build.

Num Workers: The number of XGBoost workers used to train the model in parallel across the cluster.

Objective: The objective function to optimize. For example, ‘multi:softprob’ is used for multi-class classification and ‘binary:logistic’ for binary classification.

Eta: The learning rate, which controls the step size at each boosting round.
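
To make the roles of Num Round, Eta, and the ‘multi:softprob’ objective more concrete, here is a small conceptual sketch in plain Python. It does not call XGBoost or this node. The per-round tree outputs are made-up numbers, and the helper names (softmax, boosted_scores) are hypothetical. It only illustrates the general idea that each boosting round adds a contribution scaled by eta to a per-class raw score, and that the raw scores are turned into class probabilities with a softmax.

```python
import math


def softmax(scores):
    """Convert per-class raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def boosted_scores(tree_outputs, eta):
    """Accumulate per-class raw scores over boosting rounds.

    tree_outputs: one entry per round, each a list with one output per class.
    eta: learning rate; each round's output is scaled by it before being added.
    """
    num_class = len(tree_outputs[0])
    scores = [0.0] * num_class
    for round_output in tree_outputs:
        for k in range(num_class):
            scores[k] += eta * round_output[k]
    return scores


# Made-up tree outputs for a single row: 3 rounds, 3 classes (assumption).
rounds = [
    [1.0, -0.5, -0.5],
    [0.8, -0.2, -0.6],
    [0.6, -0.1, -0.5],
]

probs_small_eta = softmax(boosted_scores(rounds, eta=0.1))
probs_large_eta = softmax(boosted_scores(rounds, eta=1.0))

print(probs_small_eta)
print(probs_large_eta)
```

With the same number of rounds, the smaller eta leaves the probabilities closer to uniform, which is why a lower learning rate is usually paired with a larger Num Round.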

Examples¶

XGBoost Classifier Node Example¶

Scenario:

Let’s assume we have a dataset of customers with features such as age, income, and purchase history, where the target variable is each customer’s preferred product category.

Configuration:

Features Column: “customer_features”
Label Column: “product_category”
Prediction Column: “predicted_probabilities”
Num Class: 3 (assuming three product categories)
Max Depth: 6
Num Round: 100
Eta: 0.3
Objective: “multi:softprob”
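
The configuration above can be written down as a plain Python dictionary keyed by the field names from the Fields table. This is only an illustrative sketch: the dictionary is not the node's actual storage format, and check_config is a hypothetical helper whose rules are simple sanity checks, not the validation that the node or XGBoost performs.

```python
# Example settings keyed by the node's field names (illustrative only).
config = {
    "featuresCol": "customer_features",
    "labelCol": "product_category",
    "predictionCol": "predicted_probabilities",
    "numClass": 3,
    "maxDepth": 6,
    "numRound": 100,
    "eta": 0.3,
    "objective": "multi:softprob",
}


def check_config(cfg):
    """Return a list of problems found; an empty list means none were found.

    These rules are assumptions chosen for illustration.
    """
    problems = []
    for key in ("featuresCol", "labelCol", "predictionCol"):
        if not cfg.get(key):
            problems.append(key + " must be a non-empty column name")
    if cfg["objective"].startswith("multi:") and cfg["numClass"] < 2:
        problems.append("numClass must be at least 2 for a multi-class objective")
    if not 0.0 < cfg["eta"] <= 1.0:
        problems.append("eta is expected to be in (0, 1]")
    if cfg["numRound"] < 1:
        problems.append("numRound must be at least 1")
    return problems


print(check_config(config))
```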

Execution:

When this node is executed, the XGBoost algorithm will train a classification model using the specified parameters. The model will then be used to predict the probabilities of each product category for new customer data points.

Output:

The predicted probabilities for each product category will be stored in the “predicted_probabilities” column of the output dataset. This column will likely be a list or array containing the probabilities for each class.
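
If the column does hold one probability per class, a downstream step can pick the most likely category by taking the index of the largest value. The sketch below assumes exactly that. The category names, the example probabilities, and the helper most_likely_class are hypothetical, and the mapping from array position to category depends on how the label column was encoded before training.

```python
def most_likely_class(probabilities, class_names):
    """Return (class name, probability) for the highest-probability class."""
    if len(probabilities) != len(class_names):
        raise ValueError("expected one probability per class")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]


# Hypothetical category names, in the order the labels were encoded (assumption).
categories = ["electronics", "clothing", "groceries"]

# Hypothetical value of "predicted_probabilities" for one customer.
row_probs = [0.15, 0.70, 0.15]

label, probability = most_likely_class(row_probs, categories)
print(label, probability)
```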