GBT Regression

GBT Regression trains a gradient-boosted tree regression model. It supports both continuous and categorical features.

Input

This node takes in a DataFrame and performs GBT Regression on it.

Output

It generates the GBT Regression model and passes it to the next Predict and ModelSave nodes. The input DataFrame is also passed along to the next nodes.

Type

ml-estimator

Class

fire.nodes.ml.NodeGBTRegression

Fields

Name

Title

Description

featuresCol

Features Column

The features column, of type VectorUDT, used for model fitting.

labelCol

Label Column

The label column for model fitting

predictionCol

Prediction Column

The prediction column created during model scoring.

splitRatio

Split Ratio

Split Ratio

impurity

Impurity

The criterion used for information gain calculation.

lossType

Loss Function

The loss function that GBT tries to minimize.

maxBins

Max Bins

The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= the number of categories in any categorical feature.

maxDepth

Max Depth

The maximum depth of a tree.

maxIter

Max Iterations

The maximum number of iterations (>= 0), a.k.a. the number of trees.

minInfoGain

Min Information Gain

The minimum information gain for a split to be considered at a tree node.

minInstancesPerNode

Min Instances Per Node

The minimum number of instances each child must have after a split.

subsamplingRate

Subsampling Rate

The fraction of the training data used for learning each decision tree.

seed

Seed

The random seed

stepSize

Step Size

The step size (a.k.a. learning rate), in the interval (0, 1], used to shrink the contribution of each tree.

cacheNodeIds

Cache Node Ids

Whether to cache node IDs for each instance. Caching can speed up training of deeper trees.

checkpointInterval

Checkpoint Interval

The checkpoint interval (>= 1), or -1 to disable checkpointing. E.g. 10 means that the cache will get checkpointed every 10 iterations.

maxMemoryInMB

Max memory

Maximum memory in MB allocated to histogram aggregation.

validationIndicatorCol

Validation Indicator Column

Name of the column that indicates whether each row is used for training or for validation. False indicates training; true indicates validation.

featureSubsetStrategy

Feature Subset Strategy

The number of features to consider for splits at each tree node

minWeightFractionPerNode

Min Weight Fraction per Node

Minimum fraction of the weighted sample count that each child must have after split.

weightCol

Weight Column

Name of the weight column. If it is not set or is empty, all instance weights are treated as 1.0.

gridSearch

Grid Search

minInfoGainGrid

Min Info Gain Grid Search

Min Info Gain for Grid Search

maxBinsGrid

Max Bins Grid Search

Max Bins for Grid Search

maxDepthGrid

Max Depth Grid Search

Max Depth for Grid Search

maxIterGrid

Max Iterations Grid Search

Max Iterations for Grid Search
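Most of the fields above share their names with parameters of Spark ML's GBTRegressor. The sketch below shows what an equivalent configuration might look like when written directly against the Spark API. It assumes Spark 3.x with spark-mllib on the classpath (for example in spark-shell). The mapping from node fields to setters is inferred from the names, not taken from the node's source, and the column names and values are placeholders.

```scala
import org.apache.spark.ml.regression.GBTRegressor

// Hypothetical configuration: each setter mirrors the node field of the same name.
// impurity is left at its default ("variance"); weightCol and
// validationIndicatorCol are optional and left unset here.
val gbt = new GBTRegressor()
  .setFeaturesCol("features")          // featuresCol: a vector column
  .setLabelCol("label")                // labelCol
  .setPredictionCol("prediction")      // predictionCol
  .setLossType("squared")              // lossType: "squared" or "absolute"
  .setMaxBins(32)                      // maxBins
  .setMaxDepth(5)                      // maxDepth
  .setMaxIter(20)                      // maxIter: one tree per iteration
  .setMinInfoGain(0.0)                 // minInfoGain
  .setMinInstancesPerNode(1)           // minInstancesPerNode
  .setSubsamplingRate(1.0)             // subsamplingRate
  .setSeed(42L)                        // seed
  .setStepSize(0.1)                    // stepSize (learning rate)
  .setCacheNodeIds(false)              // cacheNodeIds
  .setCheckpointInterval(10)           // checkpointInterval
  .setMaxMemoryInMB(256)               // maxMemoryInMB
  .setFeatureSubsetStrategy("all")     // featureSubsetStrategy
  .setMinWeightFractionPerNode(0.0)    // minWeightFractionPerNode

// Print every parameter with its documentation and current value.
println(gbt.explainParams())
```

Calling explainParams() is a quick way to check which values Spark will actually use before fitting.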

Details

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function.
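To make the iterative idea concrete, here is a toy sketch of gradient boosting for squared loss in plain Scala. It is not Spark's implementation: it handles a single numeric feature, uses one-split "stumps" instead of full decision trees, and all names in it (ToyGbt, Stump, fitStump) are made up for illustration. Each iteration fits a stump to the current residuals and adds it to the ensemble, scaled by the step size.

```scala
// Toy gradient boosting for squared loss on a single feature.
// Illustration only; the weak learner is a one-split "stump", not a Spark tree.
object ToyGbt {
  // A stump predicts `left` when x <= threshold and `right` otherwise.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Choose the threshold that minimizes the squared error against the targets.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    // Splitting at the largest x would leave the right side empty, so drop it.
    val candidates = xs.distinct.sorted.dropRight(1)
    if (candidates.isEmpty) {
      val m = mean(targets)
      Stump(xs.headOption.getOrElse(0.0), m, m)
    } else {
      candidates.map { t =>
        val (l, r) = xs.zip(targets).partition(_._1 <= t)
        val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
        val sse = xs.zip(targets).map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
        (sse, stump)
      }.minBy(_._1)._2
    }
  }

  // Returns the initial constant prediction and the stumps fitted in order.
  def fit(xs: Seq[Double], ys: Seq[Double], maxIter: Int, stepSize: Double): (Double, Seq[Stump]) = {
    val base = mean(ys)
    var current = xs.map(_ => base)
    val stumps = (1 to maxIter).map { _ =>
      // For squared loss, the negative gradient is the residual y - F(x).
      val residuals = ys.zip(current).map { case (y, f) => y - f }
      val stump = fitStump(xs, residuals)
      current = xs.zip(current).map { case (x, f) => f + stepSize * stump.predict(x) }
      stump
    }
    (base, stumps)
  }

  def predict(base: Double, stumps: Seq[Stump], stepSize: Double, x: Double): Double =
    base + stepSize * stumps.map(_.predict(x)).sum

  def mse(xs: Seq[Double], ys: Seq[Double], base: Double, stumps: Seq[Stump], stepSize: Double): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - predict(base, stumps, stepSize, x), 2) })

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val ys = Seq(1.0, 1.2, 0.9, 3.1, 2.9, 3.0)
    val (base, stumps) = fit(xs, ys, maxIter = 20, stepSize = 0.5)
    println(s"Training MSE with the mean only: ${mse(xs, ys, base, Seq.empty, 0.5)}")
    println(s"Training MSE after 20 iterations: ${mse(xs, ys, base, stumps, 0.5)}")
  }
}
```

In this sketch, a smaller stepSize makes each stump contribute less, so more iterations are needed to reach the same training error. This is the trade-off between the stepSize and maxIter fields.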

The spark.ml implementation supports GBTs for binary classification and for regression, using both continuous and categorical features.

More details are available on the Apache Spark ML docs page:

http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression

Examples

The example below is taken from the Apache Spark ML docs: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Chain indexer and GBT in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, gbt))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
println(s"Learned regression GBT model:\n ${gbtModel.toDebugString}")
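The grid search fields (minInfoGainGrid, maxBinsGrid, maxDepthGrid, maxIterGrid) suggest a search over candidate values for those four parameters. How the node runs that search internally is not documented here, so the following is only a sketch of one way to do it with Spark ML's ParamGridBuilder and CrossValidator. The candidate values, the fold count, and the column names are placeholders.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Placeholder candidates; each addGrid call mirrors one of the *Grid fields.
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.minInfoGain, Array(0.0, 0.01))
  .addGrid(gbt.maxBins, Array(16, 32))
  .addGrid(gbt.maxDepth, Array(3, 5))
  .addGrid(gbt.maxIter, Array(10, 20))
  .build()

val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

// Assumption: k-fold cross-validation picks the best combination by RMSE.
val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// The grid is the Cartesian product of all candidate lists.
println(s"Parameter combinations to evaluate: ${paramGrid.length}")

// With a training DataFrame such as trainingData from the example above:
// val cvModel = cv.fit(trainingData)
```

The number of models trained grows with the product of the candidate list lengths times the number of folds, so it is usually worth keeping each grid short.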