Random Forest Regression
========================

This node trains a Random Forest Regression model. It supports both continuous and categorical features.

Input
--------------
This node takes in a DataFrame and trains a Random Forest Regression model on it.

Output
--------------
It generates the Random Forest Regression Model and passes it to the next Predict and ModelSave Nodes. The input DataFrame is also passed along to the next nodes.

Type
--------- 

ml-estimator

Class
--------- 

fire.nodes.ml.NodeRandomForestRegression

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - featuresCol
        - Features Column
        - Features column of type VectorUDT used for model fitting
      * - labelCol
        - Label Column
        - The label column for model fitting
      * - predictionCol
        - Prediction Column
        - The prediction column created during model scoring.
      * - splitRatio
        - Split Ratio
        - Split Ratio
      * - featureSubsetStrategy
        - Feature Subset Strategy
        - The number of features to consider for splits at each tree node.
      * - impurity
        - Impurity
        - The criterion used for information gain calculation. For regression, Spark ML supports only variance.
      * - maxBins
        - Max Bins
        - The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= the number of categories in any categorical feature.
      * - maxDepth
        - Max Depth
        - The maximum depth of a tree
      * - minInfoGain
        - Min Information Gain
        - The minimum information gain for a split to be considered at a tree node
      * - minInstancesPerNode
        - Min Instances Per Node
        - The minimum number of instances each child must have after a split
      * - numTrees
        - Num Trees
        - The number of trees to train
      * - subsamplingRate
        - Subsampling Rate
        - The fraction of the training data used for learning each decision tree.
      * - seed
        - Seed
        - The random seed
      * - cacheNodeIds
        - Cache Node Ids
        - Whether to cache node IDs for each instance. Caching can speed up training of deeper trees.
      * - checkpointInterval
        - Checkpoint Interval
        - The checkpoint interval. For example, 10 means that the cache is checkpointed every 10 iterations. Set a checkpoint interval (>= 1) or disable checkpointing (-1).
      * - maxMemoryInMB
        - Max memory
        - Maximum memory in MB allocated to histogram aggregation.
      * - minWeightFractionPerNode
        - Min weight fraction per node
        - Minimum fraction of the weighted sample count that each child must have after split
      * - bootstrap
        - Bootstrap
        - Whether bootstrap samples are used when building trees.
      * - weightCol
        - Weight Column
        - The weight column name. If it is not set or is empty, all instance weights are treated as 1.0.
      * - gridSearch
        - Grid Search
        - 
      * - minInfoGainGrid
        - Min Info Gain Grid Search
        - Min information gain values for Grid Search
      * - maxBinsGrid
        - Max Bins Grid Search
        - Max Bins for Grid Search
      * - maxDepthGrid
        - Max Depth Grid Search
        - Max depth values for Grid Search
      * - numTreesGrid
        - Num Trees Grid Search
        - Number of trees for Grid Search
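
Most of the fields above share their names with parameters of Spark ML's ``RandomForestRegressor``. As a rough sketch, assuming the node passes these fields through to the matching Spark setters (the source does not state this), an equivalent estimator could be configured as follows. All values are illustrative and are not the node's defaults.

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

// Sketch only: a possible mapping from the node's fields to Spark ML setters.
// The values below are illustrative, not the node's defaults.
val configuredRf = new RandomForestRegressor()
  .setFeaturesCol("features")        // featuresCol
  .setLabelCol("label")              // labelCol
  .setPredictionCol("prediction")    // predictionCol
  .setFeatureSubsetStrategy("auto")  // featureSubsetStrategy
  .setImpurity("variance")           // impurity
  .setMaxBins(32)                    // maxBins
  .setMaxDepth(5)                    // maxDepth
  .setMinInfoGain(0.0)               // minInfoGain
  .setMinInstancesPerNode(1)         // minInstancesPerNode
  .setNumTrees(20)                   // numTrees
  .setSubsamplingRate(1.0)           // subsamplingRate
  .setSeed(42L)                      // seed
  .setCacheNodeIds(false)            // cacheNodeIds
  .setCheckpointInterval(10)         // checkpointInterval
  .setMaxMemoryInMB(256)             // maxMemoryInMB

// Print every parameter with its documentation and current value.
println(configuredRf.explainParams())
```

The ``minWeightFractionPerNode``, ``bootstrap`` and ``weightCol`` fields have corresponding setters only in recent Spark 3.x releases, so they are left out of this sketch to keep it runnable on older versions.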


Details
-------
Random forests are a popular family of classification and regression methods.

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.
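
To make the "decision trees as base models" point concrete, the sketch below fits a small forest with Spark ML and lists its member trees. The dataset is synthetic and made up for this illustration, and the local ``SparkSession`` is an assumption; in ``spark-shell`` the existing ``spark`` session can be reused instead.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.SparkSession

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("rf-base-models")
  .getOrCreate()

// Tiny synthetic dataset (hypothetical): label = 2 * x.
val toyData = spark.createDataFrame(
  (1 to 20).map(i => (2.0 * i, Vectors.dense(i.toDouble)))
).toDF("label", "features")

val toyModel = new RandomForestRegressor()
  .setNumTrees(5)
  .setSeed(1L)
  .fit(toyData)

// Each member of the fitted ensemble is a spark.ml decision tree model.
val trees: Array[DecisionTreeRegressionModel] = toyModel.trees
println(s"Ensemble size: ${trees.length}")
trees.zipWithIndex.foreach { case (tree, i) =>
  println(s"tree $i: depth=${tree.depth}, nodes=${tree.numNodes}")
}
```

The depth and node count of each tree vary with the sampled data, so no specific output is shown here.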


More details are available on the Apache Spark ML documentation page:


http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression


Examples
--------
The example below is taken from the Apache Spark ML documentation: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression

.. code-block:: scala

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestRegressor()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")

    // Chain indexer and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(featureIndexer, rf))

    // Train model. This also runs the indexer.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

    // Extract the fitted forest (pipeline stage 1) and print its structure.
    val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
    println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
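
The grid-search fields listed under Fields (``minInfoGainGrid``, ``maxBinsGrid``, ``maxDepthGrid``, ``numTreesGrid``) resemble a Spark ML parameter grid. The sketch below shows how such a grid can be built with Spark's ``ParamGridBuilder`` and attached to a ``TrainValidationSplit``. It assumes the node expands its grid fields in a comparable way, which the source does not confirm; the candidate values and the 0.8 train ratio are hypothetical.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val gridRf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Illustrative candidate values standing in for the node's *Grid fields.
val paramGrid = new ParamGridBuilder()
  .addGrid(gridRf.minInfoGain, Array(0.0, 0.01)) // minInfoGainGrid
  .addGrid(gridRf.maxBins, Array(16, 32))        // maxBinsGrid
  .addGrid(gridRf.maxDepth, Array(3, 5))         // maxDepthGrid
  .addGrid(gridRf.numTrees, Array(10, 20))       // numTreesGrid
  .build()

// Evaluate every combination on a held-out validation split, scored by RMSE.
val tvs = new TrainValidationSplit()
  .setEstimator(gridRf)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

// Two candidates for each of four parameters give 2 * 2 * 2 * 2 = 16 combinations.
println(s"Parameter combinations to evaluate: ${paramGrid.length}")
```

Calling ``tvs.fit`` on a DataFrame with ``label`` and ``features`` columns would train one forest per combination and keep the best one, so the grid size directly multiplies training cost.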