GBT Regression
==============

The GBT Regression node trains a Gradient-Boosted Tree (GBT) regression model. It supports both continuous and categorical features.

Input
--------------
This node takes in a DataFrame and performs GBT Regression on it.

Output
--------------
It generates a GBT regression model and passes it to the next Predict and ModelSave nodes. The input DataFrame is also passed along to the next nodes.

Type
--------- 

ml-estimator

Class
--------- 

fire.nodes.ml.NodeGBTRegression

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - featuresCol
        - Features Column
        - Features column of type VectorUDT for model fitting
      * - labelCol
        - Label Column
        - The label column for model fitting
      * - predictionCol
        - Prediction Column
        - The prediction column created during model scoring.
      * - splitRatio
        - Split Ratio
        - Split Ratio
      * - impurity
        - Impurity
        - The Criterion used for information gain calculation
      * - lossType
        - Loss Function
        - The Loss function which GBT tries to minimize
      * - maxBins
        - Max Bins
        - The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= the number of categories in any categorical feature.
      * - maxDepth
        - Max Depth
        - The Maximum depth of a tree
      * - maxIter
        - Max Iterations
        - The maximum number of iterations (>= 0), a.k.a. the number of trees
      * - minInfoGain
        - Min Information Gain
        - The Minimum information gain for a split to be considered at a tree node
      * - minInstancesPerNode
        - Min Instances Per Node
        - The Minimum number of instances each child must have after split
      * - subsamplingRate
        - Subsampling Rate
        - The fraction of the training data used for learning each decision tree.
      * - seed
        - Seed
        - The random seed
      * - stepSize
        - Step Size
        - The step size (a.k.a. learning rate) used for each iteration of optimization.
      * - cacheNodeIds
        - Cache Node Ids
        - Whether to cache node IDs for each instance. Caching can speed up training of deeper trees.
      * - checkpointInterval
        - Checkpoint Interval
        - The checkpoint interval. E.g. 10 means that the cache will get checkpointed every 10 iterations. Set a checkpoint interval (>= 1) or disable checkpointing (-1).
      * - maxMemoryInMB
        - Max memory
        - Maximum memory in MB allocated to histogram aggregation.
      * - validationIndicatorCol
        - Validation Indicator Column
        - Name of the column that indicates whether each row is for training or for validation.
      * - featureSubsetStrategy
        - Feature Subset Strategy
        - The number of features to consider for splits at each tree node
      * - minWeightFractionPerNode
        - Min Weight Fraction per Node
        - Minimum fraction of the weighted sample count that each child must have after split.
      * - weightCol
        - Weight Column
        - Name of the weight column. If this is not set or is empty, all instance weights are treated as 1.0.
      * - gridSearch
        - Grid Search
        - 
      * - minInfoGainGrid
        - Min Info Gain Grid Search
        - Min Info Gain for Grid Search
      * - maxBinsGrid
        - Max Bins Grid Search
        - Max Bins for Grid Search
      * - maxDepthGrid
        - Max Depth Grid Search
        - Max Depth for Grid Search
      * - maxIterGrid
        - Max Iterations Grid Search
        - Max Iterations for Grid Search
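
Most of the fields above share their names with parameters of Spark ML's ``GBTRegressor``. The sketch below shows how a subset of them could be set directly in Spark, and how the ``*Grid`` fields might translate into a parameter grid. It is only an illustration: all values are made up, the node's actual defaults are not stated on this page, and how the node runs its grid search internally is an assumption.

.. code-block:: scala

    import org.apache.spark.ml.regression.GBTRegressor
    import org.apache.spark.ml.tuning.ParamGridBuilder

    // Illustrative values only; these are not the node's documented defaults.
    val gbt = new GBTRegressor()
      .setFeaturesCol("features")      // featuresCol
      .setLabelCol("label")            // labelCol
      .setPredictionCol("prediction")  // predictionCol
      .setLossType("squared")          // lossType: "squared" or "absolute"
      .setMaxBins(32)                  // maxBins
      .setMaxDepth(5)                  // maxDepth
      .setMaxIter(20)                  // maxIter
      .setMinInfoGain(0.0)             // minInfoGain
      .setMinInstancesPerNode(1)       // minInstancesPerNode
      .setSubsamplingRate(1.0)         // subsamplingRate
      .setSeed(42L)                    // seed
      .setStepSize(0.1)                // stepSize
      .setCacheNodeIds(false)          // cacheNodeIds
      .setCheckpointInterval(10)       // checkpointInterval
      .setMaxMemoryInMB(256)           // maxMemoryInMB

    // Assumed mapping of the grid fields: each one supplies candidate values
    // for the matching parameter. Two values per parameter gives 16 combinations.
    val paramGrid = new ParamGridBuilder()
      .addGrid(gbt.minInfoGain, Array(0.0, 0.01))  // minInfoGainGrid
      .addGrid(gbt.maxBins, Array(16, 32))         // maxBinsGrid
      .addGrid(gbt.maxDepth, Array(3, 5))          // maxDepthGrid
      .addGrid(gbt.maxIter, Array(10, 20))         // maxIterGrid
      .build()

    println(s"Parameter combinations to evaluate: ${paramGrid.length}")

In plain Spark, a grid like this is typically passed to ``CrossValidator`` or ``TrainValidationSplit`` through ``setEstimatorParamMaps``, together with an evaluator such as ``RegressionEvaluator``.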


Details
-------
Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. 

The spark.ml implementation supports GBTs for binary classification and for regression, using both continuous and categorical features.
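
As a rough illustration of the iterative training described above, the following plain-Scala sketch boosts depth-1 trees (stumps) on a single feature under squared loss. It is a toy written for this page, not Spark's implementation: the ``BoostingSketch`` object, its helper names, and the sample data are all hypothetical. The ``maxIter`` and ``stepSize`` arguments play the same role as the fields of the same name.

.. code-block:: scala

    // Toy gradient boosting for squared loss on one feature (not Spark's code).
    object BoostingSketch {
      // A stump predicts `left` when x <= threshold, otherwise `right`.
      final case class Stump(threshold: Double, left: Double, right: Double) {
        def predict(x: Double): Double = if (x <= threshold) left else right
      }

      private def mean(values: Seq[Double]): Double =
        if (values.isEmpty) 0.0 else values.sum / values.size

      // Pick the threshold whose two-sided means minimize squared error.
      def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
        val pairs = xs.zip(targets)
        def sse(s: Stump): Double =
          pairs.map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum
        // Drop the largest value so the right-hand side is never empty.
        val stumps = xs.distinct.sorted.dropRight(1).map { t =>
          val (l, r) = pairs.partition(_._1 <= t)
          Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
        }
        if (stumps.isEmpty) Stump(Double.PositiveInfinity, mean(targets), mean(targets))
        else stumps.minBy(sse)
      }

      // Each iteration fits a stump to the current residuals (the negative
      // gradient of squared loss) and adds it to the ensemble scaled by stepSize.
      def train(xs: Seq[Double], ys: Seq[Double], maxIter: Int, stepSize: Double): Double => Double = {
        val base = mean(ys)
        val stumps = (1 to maxIter).foldLeft(Vector.empty[Stump]) { (acc, _) =>
          val residuals = xs.zip(ys).map { case (x, y) =>
            y - (base + stepSize * acc.map(_.predict(x)).sum)
          }
          acc :+ fitStump(xs, residuals)
        }
        x => base + stepSize * stumps.map(_.predict(x)).sum
      }

      // Mean squared error of a model on the given data.
      def mse(model: Double => Double, xs: Seq[Double], ys: Seq[Double]): Double =
        mean(xs.zip(ys).map { case (x, y) => math.pow(y - model(x), 2) })

      def main(args: Array[String]): Unit = {
        val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
        val ys = Seq(1.1, 1.9, 3.2, 3.9, 5.1, 6.0)
        val few = train(xs, ys, maxIter = 1, stepSize = 0.5)
        val many = train(xs, ys, maxIter = 50, stepSize = 0.5)
        println(s"Training MSE after 1 iteration:   ${mse(few, xs, ys)}")
        println(s"Training MSE after 50 iterations: ${mse(many, xs, ys)}")
      }
    }

Running ``BoostingSketch.main`` prints the training error for a short and a long run. More iterations lower the training error here, which is also why ``maxIter`` and ``stepSize`` are usually tuned against held-out data rather than the training set.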


More details are available on the Apache Spark ML docs page:


http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression


Examples
--------
The example below is taken from the Apache Spark ML documentation: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression


.. code-block:: scala

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a GBT model.
    val gbt = new GBTRegressor()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")
      .setMaxIter(10)

    // Chain indexer and GBT in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(featureIndexer, gbt))

    // Train model. This also runs the indexer.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

    // Extract the fitted GBT model (stage 1 of the pipeline) and print its trees.
    val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
    println(s"Learned regression GBT model:\n ${gbtModel.toDebugString}")