Decision Tree Regression =========== It supports both continuous and categorical features. Input -------------- This takes in a DataFrame and performs Decision Tree Regression Output -------------- The Decision Tree Regression Model generated is passed along to the next nodes. The input DataFrame is also passed along to the next nodes Type --------- ml-estimator Class --------- fire.nodes.ml.NodeDecisionTreeRegression Fields --------- .. list-table:: :widths: 10 5 10 :header-rows: 1 * - Name - Title - Description * - featuresCol - Features Column - Features column of type vectorUDT for model fitting * - labelCol - Label Column - The label column for model fitting * - predictionCol - Prediction Column - The prediction column created during model scoring. * - splitRatio - Split Ratio - Split Ratio * - impurity - Impurity - The Criterion used for information gain calculation * - maxBins - Max Bins - The maximum number of bins used for discretizing continuous features.Must be >= 2 and >= number of categories in any categorical feature. * - maxDepth - Max Depth - The Maximum depth of a tree * - minInfoGain - Min Information Gain - The Minimum information gain for a split to be considered at a tree node * - minInstancesPerNode - Min Instances Per Node - The Minimum number of instances each child must have after split * - seed - Seed - The random seed * - cacheNodeIds - Cache Node Ids - The caching nodes IDs. Can speed up training of deeper trees. * - checkpointInterval - Checkpoint Interval - The checkpoint interval. E.g. 10 means that the cache will get checkpointed every 10 iterations.Set checkpoint interval (>= 1) or disable checkpoint (-1) * - maxMemoryInMB - Max memory - Maximum memory in MB allocated to histogram aggregation. * - minWeightFractionPerNode - Min Weight Fraction per Node - Minimum fraction of the weighted sample count that each child must have after split. * - weightCol - Weight Column - Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0. * - gridSearch - Grid Search - * - minInfoGainGrid - Min Information Gain Param Grid Search - Min Information Gain Parameters for Grid Search * - maxBinsGrid - Max Bins Param Grid Search - Max Bins Parameters for Grid Search * - maxDepthGrid - Max Depth Param Grid Search - Max Depth Parameters for Grid Search Details ------- Decision tree supports both continuous and categorical features. More details are available at Apache Spark ML docs page : https://spark.apache.org/docs/1.6.0/ml-classification-regression.html#decision-tree-regression Examples ------- Below example is available at : https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression import org.apache.spark.ml.Pipeline import org.apache.spark.ml.evaluation.RegressionEvaluator import org.apache.spark.ml.feature.VectorIndexer import org.apache.spark.ml.regression.DecisionTreeRegressionModel import org.apache.spark.ml.regression.DecisionTreeRegressor // Load the data stored in LIBSVM format as a DataFrame. val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") // Automatically identify categorical features, and index them. // Here, we treat features with > 4 distinct values as continuous. val featureIndexer = new VectorIndexer() .setInputCol("features") .setOutputCol("indexedFeatures") .setMaxCategories(4) .fit(data) // Split the data into training and test sets (30% held out for testing). val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3)) // Train a DecisionTree model. val dt = new DecisionTreeRegressor() .setLabelCol("label") .setFeaturesCol("indexedFeatures") // Chain indexer and tree in a Pipeline. val pipeline = new Pipeline() .setStages(Array(featureIndexer, dt)) // Train model. This also runs the indexer. val model = pipeline.fit(trainingData) // Make predictions. val predictions = model.transform(testData) // Select example rows to display. predictions.select("prediction", "label", "features").show(5) // Select (prediction, true label) and compute test error. val evaluator = new RegressionEvaluator() .setLabelCol("label") .setPredictionCol("prediction") .setMetricName("rmse") val rmse = evaluator.evaluate(predictions) println(s"Root Mean Squared Error (RMSE) on test data = $rmse") val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel] println(s"Learned regression tree model:\\n ${treeModel.toDebugString}")