Random Forest Classifier

Supports both binary and multiclass labels, as well as both continuous and categorical features.

Input

Takes a DataFrame and trains a Random Forest classification model on it.

Output

The generated Random Forest classification model is passed along to the next nodes, together with the input DataFrame.

Type

ml-estimator

Class

fire.nodes.ml.NodeRandomForestClassifier

Fields

| Name | Title | Description |
| --- | --- | --- |
| featuresCol | Features Column | Features column of type VectorUDT for model fitting. |
| labelCol | Label Column | The label column for model fitting. |
| predictionCol | Prediction Column | The prediction column created during model scoring. |
| splitRatio | Split Ratio | Split ratio. |
| path | Path | Path to save the confusion matrix to. |
| featureSubsetStrategy | Feature Subset Strategy | The number of features to consider for splits at each tree node. |
| impurity | Impurity | The criterion used for information gain calculation. |
| maxBins | Max Bins | The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= the number of categories in any categorical feature. |
| maxDepth | Max Depth | The maximum depth of a tree. |
| minInfoGain | Min Information Gain | The minimum information gain for a split to be considered at a tree node. |
| minInstancesPerNode | Min Instances Per Node | The minimum number of instances each child must have after a split. |
| numTrees | Num Trees | The number of trees to train. |
| subsamplingRate | Subsampling Rate | The fraction of the training data used for learning each decision tree. |
| seed | Seed | The random seed. |
| cacheNodeIds | Cache Node Ids | Whether to cache node IDs for each instance. Caching can speed up training of deeper trees. |
| checkpointInterval | Checkpoint Interval | The checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means that the cache is checkpointed every 10 iterations. |
| maxMemoryInMB | Max memory | Maximum memory in MB allocated to histogram aggregation. |
| minWeightFractionPerNode | Min weight fraction per node | Minimum fraction of the weighted sample count that each child must have after a split. |
| bootstrap | Bootstrap | Whether bootstrap samples are used when building trees. |
| weightCol | Weight Column | The weight column name. If it is not set or is empty, all instance weights are treated as 1.0. |
| gridSearch | Grid Search | |
| minInfoGainGrid | Min Information Gain Param Grid Search | Min information gain values for grid search. |
| maxBinsGrid | Max Bins Param Grid Search | Max bins values for grid search. |
| maxDepthGrid | Max Depth Param Grid Search | Max depth values for grid search. |
| numTreesGrid | Number trees Param Grid Search | Number-of-trees values for grid search. |
| confusionMatrix | Confusion Matrix | |
| output_confusion_matrix_chart | Output Confusion Matrix Chart | Whether to display the confusion matrix chart. |
| cmChartTitle | Confusion Matrix Chart Title | Title to display in the confusion matrix chart. |
| cmChartDescription | Confusion Matrix Chart Description | Description to display in the confusion matrix chart. |
| confusionMatrixTargetLegend | Confusion Matrix Target Legend | Legend name to display for the target in the confusion matrix. |
| confusionMatrixPredictedLabelLegend | Confusion Matrix PredictedLabel Legend | Legend name to display for the predicted label in the confusion matrix. |
| Description | Confusion Matrix Description | |
| confusionMatrixRowDescription | Confusion Matrix Outcome description | Business details of the outcomes represented by the confusion matrix rows. |
| ROC Curve | ROC Curve | |
| output_roc_chart | Output ROC Curve | Whether to display the ROC curve chart. |
| roc_title | ROC Curve Chart Title | Title to display in the ROC curve chart. |
| roc_description | ROC Curve Chart Description | Description to display in the ROC curve chart. |
| xlabel | X Label | X-axis label. |
| ylabel | Y Label | Y-axis label. |

Details

Random forests are a popular family of classification and regression methods.

They support both binary and multiclass labels, as well as both continuous and categorical features.

Random forests are ensembles of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting. The spark.ml implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features.
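Several of the fields listed under Fields share their names with parameters of the spark.ml RandomForestClassifier. The sketch below assumes that the node forwards each field to the Spark setter of the same name; this page does not state that, so treat the mapping as an assumption. The values are illustrative only and are not claimed to be the node's defaults.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Assumption: each node field maps one-to-one onto the Spark ML setter of the same name.
// All values below are illustrative.
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")        // featuresCol
  .setLabelCol("label")              // labelCol
  .setPredictionCol("prediction")    // predictionCol
  .setFeatureSubsetStrategy("auto")  // featureSubsetStrategy
  .setImpurity("gini")               // impurity
  .setMaxBins(32)                    // maxBins
  .setMaxDepth(5)                    // maxDepth
  .setMinInfoGain(0.0)               // minInfoGain
  .setMinInstancesPerNode(1)         // minInstancesPerNode
  .setNumTrees(20)                   // numTrees
  .setSubsamplingRate(1.0)           // subsamplingRate
  .setSeed(42L)                      // seed
  .setCacheNodeIds(false)            // cacheNodeIds
  .setCheckpointInterval(10)         // checkpointInterval
  .setMaxMemoryInMB(256)             // maxMemoryInMB

// Print every parameter with its documentation and current value.
println(rf.explainParams())
```

Calling explainParams() is a quick way to check which values an estimator will actually use before fitting it.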

More details are available in the Apache Spark ML documentation:

http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
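The grid search fields (minInfoGainGrid, maxBinsGrid, maxDepthGrid, numTreesGrid) each take several candidate values for one parameter. One way to express such a search in spark.ml is a ParamGridBuilder combined with a CrossValidator, sketched below. Whether the node uses cross-validation, and how many folds, is not stated on this page, so both are assumptions, as are the names forest, gridEvaluator and trainingDF and the candidate values.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val forest = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// One addGrid call per grid field; the candidate values are illustrative.
val paramGrid = new ParamGridBuilder()
  .addGrid(forest.minInfoGain, Array(0.0, 0.01)) // minInfoGainGrid
  .addGrid(forest.maxBins, Array(16, 32))        // maxBinsGrid
  .addGrid(forest.maxDepth, Array(3, 5))         // maxDepthGrid
  .addGrid(forest.numTrees, Array(10, 20))       // numTreesGrid
  .build()

// The grid is the Cartesian product of the candidates: 2 x 2 x 2 x 2 = 16 combinations.
println(s"Combinations to evaluate: ${paramGrid.length}")

val gridEvaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

// Assumption: 3-fold cross-validation; the node may select the best model differently.
val cv = new CrossValidator()
  .setEstimator(forest)
  .setEvaluator(gridEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Fitting needs a DataFrame with "label" and "features" columns
// (trainingDF is a hypothetical name):
// val bestModel = cv.fit(trainingDF).bestModel
```

Every added candidate value multiplies the number of models trained, so small grids are the practical choice for forests with many trees.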

Examples

The example below is taken from the Apache Spark ML documentation: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labelsArray(0))

    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${(1.0 - accuracy)}")

    val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
    println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
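
The node can also output a confusion matrix and an ROC curve. How it computes them is not described on this page. The sketch below continues from the predictions DataFrame of the example above and shows one way to obtain comparable figures with plain spark.ml calls; it is an approximation and not the node's own implementation. The name auc is introduced here for illustration.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Confusion matrix as row counts per (true label, predicted label) pair.
predictions
  .groupBy("indexedLabel", "prediction")
  .count()
  .orderBy("indexedLabel", "prediction")
  .show()

// Area under the ROC curve. This metric applies to binary labels only;
// the sample data set used above has two classes.
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(predictions)

println(s"Area under ROC = $auc")
```

Because randomSplit is called without a seed in the example, the counts and the area under the curve vary from run to run.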