GBT Classifier¶

Gradient-Boosted Trees (GBTs) is a learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features. Note: Multiclass labels are not currently supported.

Input¶

It takes in a DataFrame as input and performs GBT Classification

Output¶

The GBT Model generated is passed along to the next nodes. The input DataFrame is also passed along to the next nodes

Type¶

ml-estimator

Class¶

fire.nodes.ml.NodeGBTClassifier

Fields¶

Name	Title	Description
featuresCol	Features Column	Features column of type vectorUDT for model fitting
labelCol	Label Column	The label column for model fitting
predictionCol	Prediction Column	The prediction column created during model scoring.
impurity	Impurity	The Criterion used for information gain calculation
lossType	Loss Function	The Loss function which GBT tries to minimize
splitRatio	Split Ratio	Split Ratio
maxBins	Max Bins	The maximum number of bins used for discretizing continuous features.Must be >= 2 and >= number of categories in any categorical feature.
maxDepth	Max Depth	The Maximum depth of a tree
maxIter	Max Iterations	The maximum number of iterations(>=0)(a.k.a numtrees)
minInfoGain	Min Information Gain	The Minimum information gain for a split to be considered at a tree node
minInstancesPerNode	Min Instances Per Node	The Minimum number of instances each child must have after split
subsamplingRate	Subsampling Rate	The fraction of the training data used for learning each decision tree.
seed	Seed	The random seed
stepSize	Step Size	Step size (a.k.a. learning rate), The step size to be used for each iteration of optimization.
cacheNodeIds	Cache Node Ids	The caching nodes IDs. Can speed up training of deeper trees.
checkpointInterval	Checkpoint Interval	The checkpoint interval. E.g. 10 means that the cache will get checkpointed every 10 iterations.Set checkpoint interval (>= 1) or disable checkpoint (-1)
maxMemoryInMB	Max memory	Maximum memory in MB allocated to histogram aggregation.
validationIndicatorCol	Validation Indicator Column	Param for name of the column that indicates whether each row is for training or for validation.
featureSubsetStrategy	Feature Subset Strategy	The number of features to consider for splits at each tree node
minWeightFractionPerNode	Min Weight Fraction per Node	Minimum fraction of the weighted sample count that each child must have after split.
weightCol	Weight Column	Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
gridSearch	Grid Search
minInfoGainGrid	Min Information Gain Param Grid Search	Min Information Gain Parameters for Grid Search
maxBinsGrid	Max Bins Param Grid Search	Max Bins Parameters for Grid Search
maxDepthGrid	Max Depth Param Grid Search	Max Depth Parameters for Grid Search
maxIterGrid	Max Iteration Param Grid Search	Max Iteration Parameters for Grid Search
confusionMatrix	Confusion Matrix
output_confusion_matrix_chart	Output Confusion Matrix Chart	whether to display confusion matrix chart.
cm_chart_title	Confusion Matrix Chart Title	Title name to display in Confusion Matrix Chart
cm_chart_description	Confusion Matrix Chart Description	Description to display in Confusion Matrix CHart
confusionMatrixTargetLegend	Confusion Matrix Target Legend	Legend name to display for Target in Confusion Matrix
confusionMatrixPredictedLabelLegend	Confusion Matrix PredictedLabel Legend	Legend name to display for Predicted Label in Confusion Matrix
confusionMatrixCountLegend	Confusion Matrix Count Legend	Legend name to display for Count in Confusion Matrix
Description	Confusion Matrix Description
confusionMatrixRowDescription	Confusion Matrix Outcome description	One can provide the business details of the outcome of the confusion matrix rows
ROC Curve	ROC Curve
output_roc_curve	Output ROC Curve	whether to display confusion matrix chart.
roc_title	ROC Curve Chart Title	Title name to display in ROC Curve Chart
roc_description	ROC Curve Chart Description	Add Description for ROC Curve Chart
xlabel	X Label	X label
ylabel	Y Label	Y Label

Details¶

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees.

More details are available at : http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

Examples¶

Below example is available at :https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.

val data = spark.read.format(“libsvm”).load(“data/mllib/sample_libsvm_data.txt”)

// Index labels, adding metadata to the label column.

// Fit on whole dataset to include all labels in index.

val labelIndexer = new StringIndexer()

.setInputCol(“label”)

.setOutputCol(“indexedLabel”)

.fit(data)

// Automatically identify categorical features, and index them.

// Set maxCategories so features with > 4 distinct values are treated as continuous.

val featureIndexer = new VectorIndexer()

.setInputCol(“features”)

.setOutputCol(“indexedFeatures”)

.setMaxCategories(4)

.fit(data)

// Split the data into training and test sets (30% held out for testing).

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.

val gbt = new GBTClassifier()

.setLabelCol(“indexedLabel”)

.setFeaturesCol(“indexedFeatures”)

.setMaxIter(10)

.setFeatureSubsetStrategy(“auto”)

// Convert indexed labels back to original labels.

val labelConverter = new IndexToString()

.setInputCol(“prediction”)

.setOutputCol(“predictedLabel”)

.setLabels(labelIndexer.labelsArray(0))

// Chain indexers and GBT in a Pipeline.

val pipeline = new Pipeline()

.setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

// Train model. This also runs the indexers.

val model = pipeline.fit(trainingData)

// Make predictions.

val predictions = model.transform(testData)

// Select example rows to display.

predictions.select(“predictedLabel”, “label”, “features”).show(5)

// Select (prediction, true label) and compute test error.

val evaluator = new MulticlassClassificationEvaluator()

.setLabelCol(“indexedLabel”)

.setPredictionCol(“prediction”)

.setMetricName(“accuracy”)

val accuracy = evaluator.evaluate(predictions)

println(s”Test Error = ${1.0 - accuracy}”)

val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]

println(s”Learned classification GBT model:\n ${gbtModel.toDebugString}”)