Train Validation Split

This node represents Train Validation Split from Spark ML.

Input

TrainValidationSplit takes an Estimator, a set of ParamMaps provided in the estimatorParamMaps parameter, and an Evaluator.

Output

The incoming DataFrame is passed to the output.

Type

ml-trainvalidationsplit

Class

fire.nodes.ml.NodeTrainValidationSplit

Fields

Name          Title          Description
trainRatio    Train Ratio    Training Ratio

Details

This node represents Train Validation Split from Spark ML.

In addition to CrossValidator, Spark also offers TrainValidationSplit for hyper-parameter tuning.

TrainValidationSplit only evaluates each combination of parameters once, as opposed to k times in the case of CrossValidator.

It is, therefore, less expensive, but will not produce as reliable results when the training dataset is not sufficiently large.
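The cost difference can be made concrete by counting model fits. The sketch below is plain Scala, not a Spark API: the object and method names are hypothetical, and it assumes that each tuner fits one model per ParamMap (per fold for CrossValidator) plus one final refit on the full dataset with the best ParamMap.

```scala
object TuningCost {
  // Number of ParamMaps in a grid: the product of the number of values per parameter.
  def gridSize(valuesPerParam: Seq[Int]): Int = valuesPerParam.product

  // TrainValidationSplit: one fit per ParamMap, plus the final refit on the full dataset.
  def trainValidationSplitFits(valuesPerParam: Seq[Int]): Int =
    gridSize(valuesPerParam) + 1

  // CrossValidator with numFolds folds: numFolds fits per ParamMap, plus the final refit.
  def crossValidatorFits(valuesPerParam: Seq[Int], numFolds: Int): Int =
    gridSize(valuesPerParam) * numFolds + 1
}
```

For a hypothetical grid with 2, 2, and 3 values per parameter (12 ParamMaps), TuningCost.trainValidationSplitFits(Seq(2, 2, 3)) gives 13 fits, while TuningCost.crossValidatorFits(Seq(2, 2, 3), 3) gives 37.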

Unlike CrossValidator, TrainValidationSplit creates a single (training, test) dataset pair. It splits the dataset into these two parts using the trainRatio parameter.

For example, with trainRatio=0.75, TrainValidationSplit will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
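The effect of trainRatio can be sketched with DataFrame.randomSplit. This is only an illustration: the local SparkSession, the synthetic 1,000-row DataFrame, and the seed are assumptions made for the example, and it assumes the tuner's internal split behaves like randomSplit, which appears to be the case in the Spark source. Because the split is random, the two parts are approximately, not exactly, 75% and 25% of the rows.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("train-ratio-sketch")
  .getOrCreate()

val trainRatio = 0.75

// Synthetic data: 1,000 rows with a single "id" column.
val data = spark.range(1000).toDF("id")

// Split the rows at random into a training part and a validation part.
val Array(training, validation) =
  data.randomSplit(Array(trainRatio, 1 - trainRatio), seed = 42L)

println(s"training rows: ${training.count()}, validation rows: ${validation.count()}")
```

Setting a seed makes the split reproducible across runs on the same data.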

Like CrossValidator, TrainValidationSplit finally fits the Estimator using the best ParamMap and the entire dataset.
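To see which ParamMap was selected, the fitted model can be inspected. The helper below is a sketch, and bestParams is a hypothetical name; it relies on the validationMetrics, getEstimatorParamMaps, and getEvaluator members of TrainValidationSplitModel and assumes the metrics are in the same order as the ParamMaps.

```scala
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.TrainValidationSplitModel

// Pair each candidate ParamMap with its validation metric and return the best pair.
// Whether "best" means largest or smallest depends on the evaluator's metric
// (for example, RMSE is better when smaller).
def bestParams(model: TrainValidationSplitModel): (ParamMap, Double) = {
  val scored = model.getEstimatorParamMaps.zip(model.validationMetrics)
  if (model.getEvaluator.isLargerBetter) scored.maxBy(_._2) else scored.minBy(_._2)
}
```

Given a fitted model, val (params, metric) = bestParams(model) returns the winning ParamMap and its validation metric; model.bestModel holds the model refit with those parameters.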

More details are available on the Spark MLlib/ML docs page: https://spark.apache.org/docs/latest/ml-tuning.html#train-validation-split

Examples

The example below is available at: https://spark.apache.org/docs/latest/ml-tuning.html#train-validation-split

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Prepare training and test data.
val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)

val lr = new LinearRegression()
  .setMaxIter(10)

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// TrainValidationSplit will try all combinations of values and determine best model using
// the evaluator.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// In this case the estimator is simply the linear regression.
// A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)
  // Evaluate up to 2 parameter settings in parallel
  .setParallelism(2)

// Run train validation split, and choose the best set of parameters.
val model = trainValidationSplit.fit(training)

// Make predictions on test data. model is the model with combination of parameters
// that performed best.
model.transform(test)
  .select("features", "label", "prediction")
  .show()