Random Forest Regression

The Random Forest Regression node supports both continuous and categorical features.

Input

This node takes in a DataFrame and fits a Random Forest Regression model on it.

Output

It generates a Random Forest Regression model and passes it to the downstream Predict and ModelSave nodes. The input DataFrame is also passed along to the next nodes.

Type

ml-estimator

Class

fire.nodes.ml.NodeRandomForestRegression

Fields

Name	Title	Description
featuresCol	Features Column	Features column of type VectorUDT, used for model fitting
labelCol	Label Column	The label column for model fitting
predictionCol	Prediction Column	The prediction column created during model scoring.
splitRatio	Split Ratio	Split Ratio
featureSubsetStrategy	Feature Subset Strategy	The number of features to consider for splits at each tree node.
impurity	Impurity	The criterion used for information gain calculation. For regression, Spark ML supports only variance.
maxBins	Max Bins	The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= the number of categories in any categorical feature.
maxDepth	Max Depth	The maximum depth of a tree
minInfoGain	Min Information Gain	The minimum information gain for a split to be considered at a tree node
minInstancesPerNode	Min Instances Per Node	The minimum number of instances each child must have after a split
numTrees	Num Trees	The number of trees to train
subsamplingRate	Subsampling Rate	The fraction of the training data used for learning each decision tree.
seed	Seed	The random seed
cacheNodeIds	Cache Node Ids	Whether to cache node IDs for each instance. Caching can speed up training of deeper trees.
checkpointInterval	Checkpoint Interval	The checkpoint interval. E.g. 10 means that the cache will get checkpointed every 10 iterations. Set the checkpoint interval (>= 1) or disable checkpointing (-1).
maxMemoryInMB	Max memory	Maximum memory in MB allocated to histogram aggregation
minWeightFractionPerNode	Min weight fraction per node	Minimum fraction of the weighted sample count that each child must have after a split
bootstrap	Bootstrap	Whether bootstrap samples are used when building trees
weightCol	Weight Column	Name of the weight column. If this is not set or empty, all instance weights are treated as 1.0.
gridSearch	Grid Search
minInfoGainGrid	Min Info Gain Grid Search	Min Info Gain values for Grid Search
maxBinsGrid	Max Bins Grid Search	Max Bins values for Grid Search
maxDepthGrid	Max Depth Grid Search	Max Depth values for Grid Search
numTreesGrid	Num Trees Grid Search	Number of trees values for Grid Search
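
Most of the fields above share their names with parameters of Spark ML's RandomForestRegressor. The sketch below shows how such fields could be set on the estimator directly. It assumes Spark 3.0 or later (for minWeightFractionPerNode and bootstrap), and every value is a hypothetical illustration rather than a default of this node. How the node itself wires its fields to the estimator is not shown in this page.

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

// Hypothetical values chosen for illustration only.
val rf = new RandomForestRegressor()
  .setFeaturesCol("features")          // featuresCol
  .setLabelCol("label")                // labelCol
  .setPredictionCol("prediction")      // predictionCol
  .setFeatureSubsetStrategy("auto")    // featureSubsetStrategy
  .setImpurity("variance")             // impurity (only option for regression)
  .setMaxBins(32)                      // maxBins
  .setMaxDepth(5)                      // maxDepth
  .setMinInfoGain(0.0)                 // minInfoGain
  .setMinInstancesPerNode(1)           // minInstancesPerNode
  .setNumTrees(20)                     // numTrees
  .setSubsamplingRate(1.0)             // subsamplingRate
  .setSeed(42L)                        // seed
  .setCacheNodeIds(false)              // cacheNodeIds
  .setCheckpointInterval(10)           // checkpointInterval
  .setMaxMemoryInMB(256)               // maxMemoryInMB
  .setMinWeightFractionPerNode(0.0)    // minWeightFractionPerNode (Spark 3.0+)
  .setBootstrap(true)                  // bootstrap (Spark 3.0+)

// Print every parameter with its documentation and current value.
println(rf.explainParams())
```

Constructing the estimator does not require a SparkSession; one is needed only when calling fit on a DataFrame.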

Details

Random forests are a popular family of classification and regression methods.

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.
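
For regression, a random forest combines its base trees by averaging their individual predictions. The following is a minimal, Spark-free sketch of that idea. The Tree type, the three threshold rules and the predictForest helper are hypothetical stand-ins for illustration and are not part of the Spark API.

```scala
// Each "tree" is modelled as a plain function from a feature vector to a real value.
type Tree = Array[Double] => Double

// Hypothetical stand-ins for trained trees (simple threshold rules).
val trees: Seq[Tree] = Seq(
  x => if (x(0) <= 2.0) 10.0 else 20.0,
  x => if (x(1) <= 0.5) 12.0 else 18.0,
  x => if (x(0) <= 3.0) 11.0 else 25.0
)

// Forest prediction for regression: the mean of the per-tree predictions.
def predictForest(trees: Seq[Tree], x: Array[Double]): Double =
  trees.map(t => t(x)).sum / trees.size

// The trees predict 20.0, 12.0 and 11.0, so the forest predicts their mean.
val prediction = predictForest(trees, Array(2.5, 0.1))
println(f"Forest prediction: $prediction%.4f")
```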

More details are available on the Apache Spark ML documentation page:

http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression

Examples

The example below is taken from the Apache Spark ML documentation: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

// Load and parse the data file, converting it to a DataFrame.
// `spark` is an existing SparkSession, as provided by spark-shell.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

// Extract the fitted forest (stage 1 of the pipeline) and print its trees.
val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
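
The grid-search fields listed under Fields (minInfoGainGrid, maxBinsGrid, maxDepthGrid, numTreesGrid) hold candidate values for the corresponding parameters. In plain Spark ML, such candidates can be combined with ParamGridBuilder, as in the sketch below. The candidate values are hypothetical, and whether this node builds its grid in exactly this way is an assumption.

```scala
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.ParamGridBuilder

val rf = new RandomForestRegressor()

// Hypothetical candidate values, two per parameter.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.minInfoGain, Array(0.0, 0.01))  // minInfoGainGrid
  .addGrid(rf.maxBins, Array(16, 32))         // maxBinsGrid
  .addGrid(rf.maxDepth, Array(3, 5))          // maxDepthGrid
  .addGrid(rf.numTrees, Array(20, 50))        // numTreesGrid
  .build()

// The grid is the Cartesian product of all candidates: 2 * 2 * 2 * 2 combinations.
println(s"Candidate parameter combinations: ${paramGrid.length}")
```

The resulting array can be passed to a CrossValidator or TrainValidationSplit through setEstimatorParamMaps, together with an estimator and a RegressionEvaluator. Each added candidate value multiplies the number of models trained, so small grids are a sensible starting point.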