XGBoost Classifier

Input

Takes a DataFrame as input and trains an XGBoost classification model on it.

Output

The trained XGBoost model is passed along to the next nodes, together with the input DataFrame.

Type

ml-estimator

Class

fire.nodes.ml.NodeXGBoostClassifier

Fields

| Name | Title | Description |
| --- | --- | --- |
| featuresCol | Features Column | Features column of type VectorUDT for model fitting. |
| labelCol | Label Column | The label column for model fitting. |
| predictionCol | Prediction Column | The prediction column created during model scoring. |
| splitRatio | Split Ratio | Split Ratio |
| numClass | Num Class | |
| maxDepth | Max Depth | The maximum depth of a tree. |
| maxBins | Max Bins | The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= number of categories in any categorical feature. |
| maxLeaves | Max Leaves | |
| numRound | Num Round | |
| numWorkers | Num Workers | |
| objective | Objective | |
| eta | Eta | |
| regLambda | Reg Lambda | |
| regAlpha | Reg Alpha | |
| subsample | Sub Sample | |
| sampleType | Sample Type | |
| treeMethod | Tree Method | |
| useExternalMemory | Use External Memory | |
| seed | Seed | |
| baseScore | Base Score | |
| minChildWeight | Min Child Weight | |
| colsampleBylevel | Col Sample By Level | |
| colsampleBytree | Col Sample By Tree | |
| minSplitLoss | Min Split Loss | |
| maxDeltaStep | Max Delta Step | |
| sketchEps | Sketch Eps | |
| scalePosWeight | Scale Pos Weight | |
| growPlicy | Grow Policy | |
| normalizeType | Normalize Type | |
| skipDrop | Skip Drop | |
| rateDrop | Rate Drop | |

Details

Details: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#xgboost4j-spark-tutorial-version-0-9

XGBoost Classifier Node Details

This node implements the XGBoost algorithm for classification tasks. It can be used for a variety of classification problems, including binary classification (e.g., spam detection) and multi-class classification (e.g., image recognition).

Key Parameters:

Features Column: The name of the column containing the features used for training.

Label Column: The name of the column containing the target variable to be predicted.

Prediction Column: The name of the column where the predicted class probabilities will be stored.

Num Class: The number of classes in the classification problem.

Max Depth: The maximum depth of each tree in the ensemble. Higher values can lead to overfitting.

Max Bins: The maximum number of bins to use for histogram-based approximations.

Max Leaves: The maximum number of leaves per tree.

Num Round: The number of boosting rounds (trees) to build.

Num Workers: The number of parallel XGBoost workers used for distributed training. In XGBoost4J-Spark, each worker runs as a Spark task.

Objective: The objective function to optimize. ‘multi:softprob’ is used for multi-class classification.

Eta: The learning rate, which controls the step size at each boosting round.
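To make the interaction between Num Round, Eta, Num Class, and the ‘multi:softprob’ objective more concrete, here is a minimal Python sketch. It is not the node's implementation: the per-round scores are hypothetical numbers standing in for real tree outputs, and the function names `softmax` and `boosted_probabilities` are made up for illustration. The sketch only shows the general idea that each round adds an eta-scaled score per class, and that the accumulated scores are turned into probabilities with a softmax.

```python
import math


def softmax(margins):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def boosted_probabilities(tree_outputs, eta, num_class):
    """Accumulate eta-scaled scores per class, then apply softmax.

    tree_outputs holds one list per boosting round, with one raw score
    per class (hypothetical values, not real tree predictions).
    """
    margins = [0.0] * num_class
    for round_scores in tree_outputs:
        if len(round_scores) != num_class:
            raise ValueError("each round needs one score per class")
        for k in range(num_class):
            margins[k] += eta * round_scores[k]
    return softmax(margins)


# Two boosting rounds, three classes, learning rate 0.3 (all assumed values).
rounds = [[1.0, 0.0, -1.0], [0.5, 0.2, -0.7]]
probs = boosted_probabilities(rounds, eta=0.3, num_class=3)
print(probs)
```

A smaller Eta shrinks every round's contribution, so more rounds are typically needed to reach the same confidence, which is why Eta and Num Round are usually tuned together.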

Examples

XGBoost Classifier Node Example

Scenario:

Assume a dataset of customer records with features such as age, income, and purchase history, where the target variable is each customer’s preferred product category.

Configuration:

  1. Features Column: “customer_features”

  2. Label Column: “product_category”

  3. Prediction Column: “predicted_probabilities”

  4. Num Class: 3 (assuming three product categories)

  5. Max Depth: 6

  6. Num Round: 100

  7. Eta: 0.3

  8. Objective: “multi:softprob”
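The configuration above can also be written down as a plain parameter dictionary, which is handy for sanity-checking values before running the workflow. The sketch below is an assumption-laden illustration: the helper `build_params` is hypothetical, the snake_case keys follow XGBoost's own parameter names, and whether the node maps its fields to exactly these keys internally is not confirmed by this page.

```python
def build_params(num_class, max_depth, num_round, eta, objective):
    """Collect the example settings into a dict after basic sanity checks.

    Hypothetical helper for illustration; not part of the node's API.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if max_depth < 0 or num_round < 1:
        raise ValueError("max_depth must be >= 0 and num_round >= 1")
    if objective.startswith("multi:") and num_class < 2:
        raise ValueError("multi-class objectives need num_class >= 2")
    return {
        "num_class": num_class,
        "max_depth": max_depth,
        "num_round": num_round,
        "eta": eta,
        "objective": objective,
    }


# Column settings from the example configuration.
columns = {
    "featuresCol": "customer_features",
    "labelCol": "product_category",
    "predictionCol": "predicted_probabilities",
}

# Model settings from the example configuration.
params = build_params(
    num_class=3,
    max_depth=6,
    num_round=100,
    eta=0.3,
    objective="multi:softprob",
)
print(columns)
print(params)
```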

Execution:

When this node is executed, the XGBoost algorithm will train a classification model using the specified parameters. The model will then be used to predict the probabilities of each product category for new customer data points.

Output:

The predicted probabilities for each product category will be stored in the “predicted_probabilities” column of the output dataset. Each entry in this column is expected to be an array or vector holding one probability per class.
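If each entry does hold one probability per class, as described above, a single predicted category can be recovered by taking the class with the highest probability. The following Python sketch illustrates that step on one row; the category names and probability values are invented for the example, and `predicted_class` is a hypothetical helper rather than anything provided by the node.

```python
def predicted_class(probabilities, class_names):
    """Return the most probable class name together with its probability."""
    if len(probabilities) != len(class_names):
        raise ValueError("need exactly one probability per class")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]


# Hypothetical category names and one row of predicted probabilities.
categories = ["electronics", "clothing", "groceries"]
row = [0.2, 0.7, 0.1]

label, confidence = predicted_class(row, categories)
print(label, confidence)
```

The order of the class names has to match the order in which the labels were encoded for training, so it is worth checking that mapping before interpreting the output.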