Random Forest Regression¶
The Random Forest Regression node supports both continuous and categorical features.
Input¶
This node takes in a DataFrame and performs Random Forest Regression on it.
Output¶
It generates the Random Forest Regression Model and passes it to the next Predict and ModelSave Nodes. The input DataFrame is also passed along to the next nodes.
Type¶
ml-estimator
Class¶
fire.nodes.ml.NodeRandomForestRegression
Fields¶
| Name | Title | Description |
|---|---|---|
| featuresCol | Features Column | Features column of type VectorUDT for model fitting |
| labelCol | Label Column | The label column for model fitting |
| predictionCol | Prediction Column | The prediction column created during model scoring |
| splitRatio | Split Ratio | Split Ratio |
| featureSubsetStrategy | Feature Subset Strategy | The number of features to consider for splits at each tree node |
| impurity | Impurity | The criterion used for information gain calculation |
| maxBins | Max Bins | The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= number of categories in any categorical feature |
| maxDepth | Max Depth | The maximum depth of a tree |
| minInfoGain | Min Information Gain | The minimum information gain for a split to be considered at a tree node |
| minInstancesPerNode | Min Instances Per Node | The minimum number of instances each child must have after a split |
| numTrees | Num Trees | The number of trees to train |
| subsamplingRate | Subsampling Rate | The fraction of the training data used for learning each decision tree |
| seed | Seed | The random seed |
| cacheNodeIds | Cache Node Ids | Whether to cache node IDs for each instance. Caching can speed up training of deeper trees |
| checkpointInterval | Checkpoint Interval | The checkpoint interval (>= 1), or -1 to disable checkpointing. E.g. 10 means that the cache will get checkpointed every 10 iterations |
| maxMemoryInMB | Max memory | Maximum memory in MB allocated to histogram aggregation |
| minWeightFractionPerNode | Min weight fraction per node | Minimum fraction of the weighted sample count that each child must have after a split |
| bootstrap | Bootstrap | Whether bootstrap samples are used when building trees |
| weightCol | Weight Column | The weight column name. If this is not set or empty, all instance weights are treated as 1.0 |
| gridSearch | Grid Search | |
| minInfoGainGrid | Min Info Gain Grid Search | Min Info Gain values for Grid Search |
| maxBinsGrid | Max Bins Grid Search | Max Bins values for Grid Search |
| maxDepthGrid | Max Depth Grid Search | Max Depth values for Grid Search |
| numTreesGrid | Num Trees Grid Search | Number of trees values for Grid Search |
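
Most of the fields above share their names with parameters of Spark ML's RandomForestRegressor. The sketch below shows how those fields could be set directly on the Spark estimator. It assumes the node forwards each field to the setter of the same name, which this page does not confirm. The values shown are Spark's defaults, except the seed, which is an arbitrary illustrative value. The splitRatio and gridSearch fields have no counterpart on RandomForestRegressor and are left out.

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

// Assumption: each node field corresponds to the Spark setter of the same name.
val rfConfigured = new RandomForestRegressor()
  .setFeaturesCol("features")        // featuresCol
  .setLabelCol("label")              // labelCol
  .setPredictionCol("prediction")    // predictionCol
  .setFeatureSubsetStrategy("auto")  // featureSubsetStrategy
  .setImpurity("variance")           // impurity
  .setMaxBins(32)                    // maxBins
  .setMaxDepth(5)                    // maxDepth
  .setMinInfoGain(0.0)               // minInfoGain
  .setMinInstancesPerNode(1)         // minInstancesPerNode
  .setNumTrees(20)                   // numTrees
  .setSubsamplingRate(1.0)           // subsamplingRate
  .setSeed(42L)                      // seed (arbitrary value for illustration)
  .setCacheNodeIds(false)            // cacheNodeIds
  .setCheckpointInterval(10)         // checkpointInterval
  .setMaxMemoryInMB(256)             // maxMemoryInMB
  // The two setters below are available in Spark 3.0 and later.
  .setMinWeightFractionPerNode(0.0)  // minWeightFractionPerNode
  .setBootstrap(true)                // bootstrap

// Print every parameter with its documentation and current value.
println(rfConfigured.explainParams())
```

Calling explainParams() is a quick way to check which values are in effect before training.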
Details¶
Random forests are a popular family of classification and regression methods.
The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.
More details are available on the Apache Spark ML docs page:
http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression
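
For regression, a random forest combines its trees by averaging: each tree predicts a real value, and the forest's prediction is the mean of those values. The following is a minimal plain-Scala sketch of that combination step only. The `Tree` type, the three hand-written trees and `forestPredict` are hypothetical stand-ins for illustration and are not part of the Spark API.

```scala
// Hypothetical model of a trained tree: a function from features to a value.
type Tree = Vector[Double] => Double

// Hand-written stand-ins for trained decision trees (illustration only).
val trees: Seq[Tree] = Seq(
  features => if (features(0) <= 1.0) 10.0 else 20.0,
  features => if (features(1) <= 0.5) 12.0 else 18.0,
  features => if (features(0) + features(1) <= 2.0) 11.0 else 25.0
)

// The forest prediction is the average of the individual tree predictions.
def forestPredict(trees: Seq[Tree], features: Vector[Double]): Double = {
  require(trees.nonEmpty, "a forest needs at least one tree")
  trees.map(tree => tree(features)).sum / trees.size
}

// The three trees predict 10.0, 18.0 and 11.0 for this input.
val prediction = forestPredict(trees, Vector(0.5, 0.9))
println(prediction)
```

Averaging many trees, each trained on a different sample of the data and a different subset of features, reduces the variance of the combined prediction compared with a single tree.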
Examples¶
The example below is available at: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
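
The grid search fields listed under Fields (minInfoGainGrid, maxBinsGrid, maxDepthGrid, numTreesGrid) suggest that the node can try several values for each of these parameters. This page does not describe how the node runs the search, so the following is only a sketch of one common way to do it in Spark ML, using ParamGridBuilder and CrossValidator. The candidate values, the number of folds and the names `rfForGrid`, `gridEvaluator` and `cv` are assumptions chosen for illustration.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Estimator whose parameters will be varied by the grid.
val rfForGrid = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")

// One addGrid call per grid field; the candidate values are illustrative.
val paramGrid = new ParamGridBuilder()
  .addGrid(rfForGrid.minInfoGain, Array(0.0, 0.01)) // minInfoGainGrid
  .addGrid(rfForGrid.maxBins, Array(16, 32))        // maxBinsGrid
  .addGrid(rfForGrid.maxDepth, Array(3, 5))         // maxDepthGrid
  .addGrid(rfForGrid.numTrees, Array(20, 50))       // numTreesGrid
  .build()

// RMSE is used to compare the candidate parameter combinations.
val gridEvaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

// Cross-validation trains one model per combination and fold (3 folds assumed).
val cv = new CrossValidator()
  .setEstimator(rfForGrid)
  .setEvaluator(gridEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

println(s"Number of parameter combinations: ${paramGrid.length}")

// With a training DataFrame such as trainingData from the example above:
// val cvModel = cv.fit(trainingData)
```

The grid is the Cartesian product of the candidate lists, so training cost grows quickly as values are added: every combination is fitted once per fold.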