Random Forest Classifier
===========

Supports both binary and multiclass labels, as well as both continuous and categorical features.

Input
--------------
Takes in a DataFrame and performs Random Forest Classification

Output
--------------
Random Forest Classification Model generated is passed along to the next nodes. The input DataFrame is also passed along to the next nodes

Type
--------- 

ml-estimator

Class
--------- 

fire.nodes.ml.NodeRandomForestClassifier

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - featuresCol
        - Features Column
        - Features column of type vectorUDT for model fitting
      * - labelCol
        - Label Column
        - The label column for model fitting
      * - predictionCol
        - Prediction Column
        - The prediction column created during model scoring.
      * - splitRatio
        - Split Ratio
        - Split Ratio
      * - path
        - Path
        - Save Confusion Matrix to Path
      * - featureSubsetStrategy
        - Feature Subset Strategy
        - The number of features to consider for splits at each tree node.
      * - impurity
        - Impurity
        - The Criterion used for information gain calculation
      * - maxBins
        - Max Bins
        - The maximum number of bins used for discretizing continuous features.Must be >= 2 and >= number of categories in any categorical feature.
      * - maxDepth
        - Max Depth
        - The Maximum depth of a tree
      * - minInfoGain
        - Min Information Gain
        - The Minimum information gain for a split to be considered at a tree node
      * - minInstancesPerNode
        - Min Instances Per Node
        - The Minimum number of instances each child must have after split
      * - numTrees
        - Num Trees
        - The number of trees to train
      * - subsamplingRate
        - Subsampling Rate
        - The fraction of the training data used for learning each decision tree.
      * - seed
        - Seed
        - The random seed
      * - cacheNodeIds
        - Cache Node Ids
        - The caching nodes IDs. Can speed up training of deeper trees.
      * - checkpointInterval
        - Checkpoint Interval
        - The checkpoint interval. E.g. 10 means that the cache will get checkpointed every 10 iterations.Set checkpoint interval (>= 1) or disable checkpoint (-1)
      * - maxMemoryInMB
        - Max memory
        - Maximum memory in MB allocated to histogram aggregation.
      * - minWeightFractionPerNode
        - Min weight fraction per node
        - Minimum fraction of the weighted sample count that each child must have after split
      * - bootstrap
        - Bootstrap
        - Whether bootstrap samples are used when building trees.
      * - weightCol
        - Weight Column
        - Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
      * - gridSearch
        - Grid Search
        - 
      * - minInfoGainGrid
        - Min Information Gain Param Grid Search
        - Min Information Gain Parameters for Grid Search
      * - maxBinsGrid
        - Max Bins Param Grid Search
        - Max Bins Parameters for Grid Search
      * - maxDepthGrid
        - Max Depth Param Grid Search
        - Max Depth Parameters for Grid Search
      * - numTreesGrid
        - Number trees Param Grid Search
        - Total number of trees Parameters for Grid Search
      * - confusionMatrix
        - Confusion Matrix
        - 
      * - output_confusion_matrix_chart
        - Output Confusion Matrix Chart
        - Whether to display Confusion Matrix Chart.
      * - cmChartTitle
        - Confusion Matrix Chart Title
        - Title name to display in Confusion Matrix Chart
      * - cmChartDescription
        - Confusion Matrix Chart Description
        - Description to display in Confusion Matrix Chart
      * - confusionMatrixTargetLegend
        - Confusion Matrix Target Legend
        - Legend name to display for Target in Confusion Matrix
      * - confusionMatrixPredictedLabelLegend
        - Confusion Matrix PredictedLabel Legend
        - Legend name to display for Predicted Label in Confusion Matrix
      * - Description
        - Confusion Matrix Description
        - 
      * - confusionMatrixRowDescription
        - Confusion Matrix Outcome description
        - Add the business details of the outcome of the confusion matrix rows
      * - ROC Curve
        - ROC Curve
        - 
      * - output_roc_chart
        - Output ROC Curve
        - Whether to display confusion matrix chart.
      * - roc_title
        - ROC Curve Chart Title
        - Title name to display in ROC Curve Chart
      * - roc_description
        - ROC Curve Chart Description
        - Add Description for ROC Curve Chart
      * - xlabel
        - X Label
        - X label
      * - ylabel
        - Y Label
        - Y Label


Details
-------
Random forests are a popular family of classification and regression methods.

Random forests supports both binary and multiclass labels, as well as both continuous and categorical features.


Random forests are ensembles of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting. The spark.ml implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features.


More details are available at Apache Spark ML docs page:


http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier


Examples
-------
Below example is available at : https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier


import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}


// Load and parse the data file, converting it to a DataFrame.

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")


// Index labels, adding metadata to the label column.

// Fit on whole dataset to include all labels in index.

val labelIndexer = new StringIndexer()

  .setInputCol("label")

  .setOutputCol("indexedLabel")

  .fit(data)

// Automatically identify categorical features, and index them.

// Set maxCategories so features with > 4 distinct values are treated as continuous.

val featureIndexer = new VectorIndexer()

  .setInputCol("features")

  .setOutputCol("indexedFeatures")

  .setMaxCategories(4)

  .fit(data)


// Split the data into training and test sets (30% held out for testing).

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))


// Train a RandomForest model.

val rf = new RandomForestClassifier()

  .setLabelCol("indexedLabel")

  .setFeaturesCol("indexedFeatures")

  .setNumTrees(10)


// Convert indexed labels back to original labels.

val labelConverter = new IndexToString()

  .setInputCol("prediction")

  .setOutputCol("predictedLabel")

  .setLabels(labelIndexer.labelsArray(0))


// Chain indexers and forest in a Pipeline.

val pipeline = new Pipeline()

  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))


// Train model. This also runs the indexers.

val model = pipeline.fit(trainingData)


// Make predictions.

val predictions = model.transform(testData)


// Select example rows to display.

predictions.select("predictedLabel", "label", "features").show(5)


// Select (prediction, true label) and compute test error.

val evaluator = new MulticlassClassificationEvaluator()

  .setLabelCol("indexedLabel")

  .setPredictionCol("prediction")

  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)

println(s"Test Error = ${(1.0 - accuracy)}")


val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]

println(s"Learned classification forest model:\\n ${rfModel.toDebugString}")