GBT Classifier
==============

Gradient-Boosted Trees (GBTs) are a learning algorithm for classification. The classifier supports binary labels and both continuous and categorical features. Note: multiclass labels are not currently supported.

Input
--------------
It takes a DataFrame as input and performs GBT classification on it.

Output
--------------
The generated GBT model is passed along to the next nodes. The input DataFrame is also passed along to the next nodes.

Type
--------- 

ml-estimator

Class
--------- 

fire.nodes.ml.NodeGBTClassifier

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - featuresCol
        - Features Column
        - Features column of type VectorUDT for model fitting
      * - labelCol
        - Label Column
        - The label column for model fitting
      * - predictionCol
        - Prediction Column
        - The prediction column created during model scoring.
      * - impurity
        - Impurity
        - The criterion used for information gain calculation
      * - lossType
        - Loss Function
        - The loss function that GBT tries to minimize
      * - splitRatio
        - Split Ratio
        - Split Ratio
      * - maxBins
        - Max Bins
        - The maximum number of bins used for discretizing continuous features. Must be >= 2 and >= the number of categories in any categorical feature.
      * - maxDepth
        - Max Depth
        - The maximum depth of a tree
      * - maxIter
        - Max Iterations
        - The maximum number of iterations (>= 0), which is also the number of trees in the ensemble
      * - minInfoGain
        - Min Information Gain
        - The minimum information gain for a split to be considered at a tree node
      * - minInstancesPerNode
        - Min Instances Per Node
        - The minimum number of instances each child must have after a split
      * - subsamplingRate
        - Subsampling Rate
        - The fraction of the training data used for learning each decision tree.
      * - seed
        - Seed
        - The random seed
      * - stepSize
        - Step Size
        - The step size (a.k.a. learning rate), in the interval (0, 1], used to shrink the contribution of each tree.
      * - cacheNodeIds
        - Cache Node Ids
        - Whether to cache node IDs for each instance. Caching can speed up training of deeper trees.
      * - checkpointInterval
        - Checkpoint Interval
        - The checkpoint interval. For example, 10 means that the cache is checkpointed every 10 iterations. Set to a value >= 1 to enable checkpointing, or to -1 to disable it.
      * - maxMemoryInMB
        - Max memory
        - Maximum memory in MB allocated to histogram aggregation.
      * - validationIndicatorCol
        - Validation Indicator Column
        - Name of the column that indicates whether each row is used for training or for validation.
      * - featureSubsetStrategy
        - Feature Subset Strategy
        - The number of features to consider for splits at each tree node
      * - minWeightFractionPerNode
        - Min Weight Fraction per Node
        - Minimum fraction of the weighted sample count that each child must have after split.
      * - weightCol
        - Weight Column
        - Name of the weight column. If it is not set or is empty, all instance weights are treated as 1.0.
      * - gridSearch
        - Grid Search
        - 
      * - minInfoGainGrid
        - Min Information Gain Param Grid Search
        - Min Information Gain Parameters for Grid Search
      * - maxBinsGrid
        - Max Bins Param Grid Search
        - Max Bins Parameters for Grid Search
      * - maxDepthGrid
        - Max Depth Param Grid Search
        - Max Depth Parameters for Grid Search
      * - maxIterGrid
        - Max Iteration Param Grid Search
        - Max Iteration Parameters for Grid Search
      * - confusionMatrix
        - Confusion Matrix
        - 
      * - output_confusion_matrix_chart
        - Output Confusion Matrix Chart
        - Whether to display the confusion matrix chart.
      * - cm_chart_title
        - Confusion Matrix Chart Title
        - Title name to display in Confusion Matrix Chart
      * - cm_chart_description
        - Confusion Matrix Chart Description
        - Description to display in Confusion Matrix Chart
      * - confusionMatrixTargetLegend
        - Confusion Matrix Target Legend
        - Legend name to display for Target in Confusion Matrix
      * - confusionMatrixPredictedLabelLegend
        - Confusion Matrix PredictedLabel Legend
        - Legend name to display for Predicted Label in Confusion Matrix
      * - confusionMatrixCountLegend
        - Confusion Matrix Count Legend
        - Legend name to display for Count in Confusion Matrix
      * - Description
        - Confusion Matrix Description
        - 
      * - confusionMatrixRowDescription
        - Confusion Matrix Outcome description
        - One can provide the business details of the outcome of the confusion matrix rows
      * - ROC Curve
        - ROC Curve
        - 
      * - output_roc_curve
        - Output ROC Curve
        - Whether to display the ROC curve chart.
      * - roc_title
        - ROC Curve Chart Title
        - Title name to display in ROC Curve Chart
      * - roc_description
        - ROC Curve Chart Description
        - Add Description for ROC Curve Chart
      * - xlabel
        - X Label
        - X label
      * - ylabel
        - Y Label
        - Y Label
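
Most of the fields above share their names with parameters of Spark ML's ``GBTClassifier``. The sketch below assumes that the node passes these values to the corresponding setters; that mapping is an assumption, not confirmed behavior of the node. The numeric values are Spark's documented defaults, the seed is arbitrary, and the grid values are purely illustrative. How the node evaluates the grid, and how ``splitRatio`` is used, is not shown here.

.. code-block:: scala

    import org.apache.spark.ml.classification.GBTClassifier
    import org.apache.spark.ml.tuning.ParamGridBuilder

    // Estimator configured with the same parameter names as the node fields.
    // Values are Spark defaults, not tuning recommendations.
    val gbt = new GBTClassifier()
      .setFeaturesCol("features")        // featuresCol: column of type VectorUDT
      .setLabelCol("label")              // labelCol: binary label
      .setPredictionCol("prediction")    // predictionCol
      .setLossType("logistic")           // lossType
      .setMaxBins(32)                    // maxBins
      .setMaxDepth(5)                    // maxDepth
      .setMaxIter(20)                    // maxIter: one tree per iteration
      .setMinInfoGain(0.0)               // minInfoGain
      .setMinInstancesPerNode(1)         // minInstancesPerNode
      .setSubsamplingRate(1.0)           // subsamplingRate
      .setStepSize(0.1)                  // stepSize (learning rate)
      .setSeed(42L)                      // seed: arbitrary value for reproducibility
      .setCacheNodeIds(false)            // cacheNodeIds
      .setCheckpointInterval(10)         // checkpointInterval
      .setMaxMemoryInMB(256)             // maxMemoryInMB
      .setFeatureSubsetStrategy("all")   // featureSubsetStrategy

    // Candidate values mirroring the four *Grid fields (illustrative only).
    val paramGrid = new ParamGridBuilder()
      .addGrid(gbt.minInfoGain, Array(0.0, 0.01))
      .addGrid(gbt.maxBins, Array(16, 32))
      .addGrid(gbt.maxDepth, Array(3, 5))
      .addGrid(gbt.maxIter, Array(10, 20))
      .build()

    // Every combination of the candidate values becomes one model to fit.
    println(s"Parameter combinations to evaluate: ${paramGrid.length}")

The grid size is the product of the number of candidates per parameter, so adding values to several grids at once increases training time quickly.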


Details
-------
Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. 
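
To make the idea of an ensemble concrete, the following plain-Scala sketch shows how the outputs of several trees can be combined into one binary prediction. It is a simplified illustration, not Spark's code: the ``Tree`` type and the three hand-written stumps are hypothetical, and it assumes the convention that the first tree has weight 1.0 while later trees are scaled by the step size.

.. code-block:: scala

    // Hypothetical stand-in for a fitted tree: maps a feature array to a raw score.
    type Tree = Array[Double] => Double

    // Three hand-written depth-1 trees (stumps); thresholds and scores are made up.
    val trees: Seq[Tree] = Seq(
      x => if (x(0) > 0.5) 1.0 else -1.0,
      x => if (x(1) > 2.0) 0.5 else -0.5,
      x => if (x(0) > 0.8) 0.25 else -0.25
    )

    // Assumed weighting: first tree at full weight, later trees shrunk by the step size.
    val stepSize = 0.1
    val treeWeights: Seq[Double] = 1.0 +: Seq.fill(trees.length - 1)(stepSize)

    // The margin is the weighted sum of the individual tree outputs.
    def margin(x: Array[Double]): Double =
      trees.zip(treeWeights).map { case (tree, w) => w * tree(x) }.sum

    // A positive margin maps to class 1.0, anything else to class 0.0.
    def predict(x: Array[Double]): Double = if (margin(x) > 0.0) 1.0 else 0.0

    println(predict(Array(0.9, 3.0)))
    println(predict(Array(0.1, 3.0)))

A smaller step size reduces the influence of each later tree, which is why it is usually paired with a larger number of iterations.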


More details are available at: http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier


Examples
--------
The example below is available at: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier


.. code-block:: scala

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a GBT model.
    val gbt = new GBTClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setMaxIter(10)
      .setFeatureSubsetStrategy("auto")

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labelsArray(0))

    // Chain indexers and GBT in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${1.0 - accuracy}")

    // Extract the fitted GBT stage (index 2 in the pipeline) and print its trees.
    val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
    println(s"Learned classification GBT model:\n ${gbtModel.toDebugString}")
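
The node can also display a confusion matrix and an ROC curve. As a rough sketch of what those outputs summarize, the snippet below continues from the ``predictions`` DataFrame in the example above and computes the confusion-matrix counts and the area under the ROC curve with standard Spark ML calls. It is not the node's own charting code, and it assumes the fitted model writes the default ``rawPrediction`` column.

.. code-block:: scala

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Confusion matrix as one row per (actual label, predicted label) pair with its count.
    predictions
      .groupBy("indexedLabel", "prediction")
      .count()
      .orderBy("indexedLabel", "prediction")
      .show()

    // Area under the ROC curve, computed from the raw prediction scores.
    val aucEvaluator = new BinaryClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
    val auc = aucEvaluator.evaluate(predictions)
    println(s"Area under ROC = $auc")

Because the data is split randomly without a fixed seed in the example, the counts and the area under the curve vary from run to run.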