H2O Gradient Boosting Machine

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method.

Input

It takes a DataFrame as input.

Type

ml-estimator

Class

fire.nodes.h2o.NodeH2OGBM

Fields

Name

Title

Description

isResponseIsCategorical

Is Response Column Categorical

Specifies whether the response column is categorical (classification) or numeric (regression).

labelCol

Label Column

Response variable column.

featuresCols

Feature Columns

Features to be used for Modelling

columnsToCategorical

Columns to Categorical

Columns to be encoded as categorical.

seed

Seed

Seed for pseudo random number generator (if applicable).

balanceClasses

Balance Classes

Balance training data class counts via over/under-sampling (for imbalanced data).

splitRatio

Split Ratio

Split Ratio

nfolds

Number of Folds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

ntrees

Number of Trees

Number of trees.

maxDepth

Max Depth

Maximum tree depth (0 for unlimited).

minRows

Min Rows

Fewest allowed (weighted) observations in a leaf.

nbins

Number of Bins

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.

nbinsTopLevel

Number of bins top level

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.

nbinsCats

Number of Bins Categoricals

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

categoricalEncoding

Categorical Encoding

Specify one of the various encoding schemes for handling categorical features

ignoreConstCols

Ignore Const Columns

Ignore constant columns.

scoreEachIteration

Score Each Iteration

Whether to score during each iteration of model training.

stoppingRounds

Stopping Rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable).

maxRuntimeSecs

Max Runtime Secs

Maximum allowed runtime in seconds for model training (0 to disable).

stoppingMetric

StoppingMetric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression)

stoppingTolerance

StoppingTolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

gainsliftBins

Gains Lift Bins

Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.

withContributions

With Contributions

Enables or disables generating a sub-column of detailedPredictionCol containing Shapley values.

learnRate

Learn Rate

Learning rate (from 0.0 to 1.0).
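
The early-stopping rule described for stoppingRounds and stoppingTolerance above can be illustrated with a minimal sketch. It is a simplified reading of that description, not H2O's actual implementation: the function names are hypothetical, and the choice to compare the best of the k most recent moving averages against the moving average from k scoring events earlier is an assumption.

```python
def moving_averages(values, k):
    """Simple moving averages of window length k, oldest first."""
    return [sum(values[i - k:i]) / k for i in range(k, len(values) + 1)]


def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Return True when the stopping metric has stopped improving.

    history: metric value at each scoring event, oldest first.
    """
    k = stopping_rounds
    if k == 0:                       # 0 disables early stopping
        return False
    if len(history) < 2 * k:         # assumption: wait for 2k scoring events
        return False
    averages = moving_averages(history, k)
    reference = averages[-(k + 1)]   # moving average from k scoring events ago
    recent = averages[-k:]           # the k most recent moving averages
    if lower_is_better:              # e.g. logloss, deviance
        improvement = reference - min(recent)
    else:                            # e.g. AUC
        improvement = max(recent) - reference
    # Stop if the relative improvement is not at least the tolerance.
    return improvement < stopping_tolerance * abs(reference)


still_improving = should_stop([1.0, 0.8, 0.6, 0.5, 0.45, 0.44], 2, 0.01)
converged = should_stop([0.5, 0.5, 0.5, 0.5], 2, 0.01)
```

With these inputs the first history is still improving by more than 1% and training would continue, while the flat history triggers a stop.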

advanced

Advanced

convertUnknownCategoricalLevelsToNa

Convert Unknown Categorical Levels to NA

If set to ‘true’, the model converts unknown categorical levels to NA when making predictions.

withLeafNodeAssignments

With Node Assignments

Enables or disables computation of leaf node assignments.

withStageResults

With Stage Results

Enables or disables computation of stage results.

learnRateAnnealing

Learn Rate Annealing

Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999).

sampleRate

Sample Rate

Row sample rate per tree (from 0.0 to 1.0).

colSampleRate

Column Sample Rate

Column sample rate (from 0.0 to 1.0).

maxAbsLeafnodePred

Max Absolute Leaf Node Prediction

Maximum absolute value of a leaf node prediction.

predNoiseBandwidth

Prediction Noise Bandwidth

Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions.

maxAfterBalanceSize

Max After Balance Size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

maxConfusionMatrixSize

Max Confusion Matrix Size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.

buildTreeOneNode

Build tree one node

Run on a single node only.

colSampleRatePerTree

Column Sample Rate Per Tree

Column sample rate per tree (from 0.0 to 1.0).

colSampleRateChangePerLevel

Column Sample Rate change Per Level

Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0).

scoreTreeInterval

Score Tree Interval

Score the model after every so many trees. Disabled if set to 0.

minSplitImprovement

Minimum Split Improvement

Minimum relative improvement in squared error reduction for a split to happen.

histogramType

Histogram Type

What type of histogram to use for finding optimal split points. Possible values are

calibrateModel

Calibrate Model

Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.

checkConstantResponse

Check Constant Response

Check if the response column is constant. If enabled, an exception is thrown when the response column is a constant value. If disabled, the model trains regardless of whether the response column is constant.

keepCrossValidationModels

Keep Cross Validation Models

Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.

keepCrossValidationPredictions

Keep Cross Validation Predictions

Whether to keep the cross-validation predictions. This needs to be set to TRUE if running the same AutoML object for repeated runs because CV predictions are required to build additional Stacked Ensemble models in AutoML.

keepCrossValidationFoldAssignment

Keep Cross Validation Fold Assignment

Whether to keep cross-validation assignments.

tweediePower

Tweedie Power

Tweedie power for Tweedie regression, must be between 1 and 2.

quantileAlpha

Quantile Alpha

Desired quantile for Quantile regression, must be between 0 and 1.

huberAlpha

Huber Alpha

Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).

weightCol

Weight Column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

offsetCol

Offset Column

Offset column. This will be added to the combination of columns before applying the link function.

foldCol

Fold Column

Column with cross-validation fold index assignment per observation.

foldAssignment

Fold Assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

aucType

AUC Type

Set default multinomial AUC type.
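
As an illustration of how learnRate and learnRateAnnealing interact, the sketch below lists the learning rate applied to each tree when the rate is multiplied by the annealing factor after every tree. The function name is hypothetical and the code only mirrors the field descriptions above.

```python
def annealed_learning_rates(learn_rate, learn_rate_annealing, ntrees):
    """Learning rate used for each tree when the rate is scaled after every tree."""
    rates = []
    rate = learn_rate
    for _ in range(ntrees):
        rates.append(rate)
        rate *= learn_rate_annealing   # scale by the annealing factor after each tree
    return rates


rates = annealed_learning_rates(learn_rate=0.1, learn_rate_annealing=0.99, ntrees=5)
for tree_index, rate in enumerate(rates, start=1):
    print(f"tree {tree_index}: learning rate {rate:.5f}")
```

A factor close to 1.0, such as 0.99 or 0.999, shrinks the rate slowly, so later trees make smaller corrections than earlier ones.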

confusionMatrix

Confusion Matrix

output_confusion_matrix_chart

Output Confusion Matrix Chart

Whether to display the confusion matrix chart.

cm_chart_title

Confusion Matrix Chart Title

Title name to display in Confusion Matrix Chart

cm_chart_description

Confusion Matrix Chart Description

Description to display in Confusion Matrix Chart

confusionMatrixTargetLegend

Confusion Matrix Target Legend

Legend name to display for Target in Confusion Matrix

confusionMatrixPredictedLabelLegend

Confusion Matrix PredictedLabel Legend

Legend name to display for Predicted Label in Confusion Matrix

confusionMatrixCountLegend

Confusion Matrix Count Legend

Legend name to display for Count in Confusion Matrix

path

Save Confusion Matrix Path

Save Confusion Matrix to Path

Description

Confusion Matrix Description

confusionMatrixRowDescription

Confusion Matrix Outcome description

One can provide the business details of the outcome of the confusion matrix rows
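
The chart settings above name three quantities: the target, the predicted label, and the count. As a rough illustration of what those quantities are, the sketch below counts (target, predicted label) pairs for a small made-up set of labels. The data and function name are hypothetical and this is not the node's own charting code.

```python
from collections import Counter


def confusion_counts(targets, predictions):
    """Count how often each (target, predicted label) pair occurs."""
    return Counter(zip(targets, predictions))


# Made-up example labels, for illustration only.
targets = ["yes", "yes", "no", "no", "no"]
predictions = ["yes", "no", "no", "no", "yes"]

counts = confusion_counts(targets, predictions)
labels = sorted(set(targets) | set(predictions))

# One row per (Target, Predicted Label, Count) triple.
rows = [(t, p, counts[(t, p)]) for t in labels for p in labels]
for target, predicted, count in rows:
    print(f"Target={target}  Predicted Label={predicted}  Count={count}")
```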

ROC Curve

ROC Curve

output_roc_curve

Output ROC Curve

Whether to display the ROC curve chart.

roc_title

ROC Curve Chart Title

Title name to display in ROC Curve Chart

roc_description

ROC Curve Chart Description

Add Description for ROC Curve Chart

xlabel

X Label

X label

ylabel

Y Label

Y Label
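
For readers unfamiliar with what the ROC chart plots, here is a small sketch that computes ROC points and the area under the curve for a binary problem. By convention the X axis is the false positive rate and the Y axis is the true positive rate. The function names and data are hypothetical, and the sketch assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted probability of the positive class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank rows by score, highest first, then lower the threshold step by step.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all rows tied at this score are counted.
        last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up example data, for illustration only.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
points = roc_points(labels, scores)
auc_value = area_under_curve(points)
```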

Grid Search

Grid Search

paramKeys

Param Name

Param names, e.g. maxDepth, learnRate, nTrees, distribution

paramValues

Param Value

Enter comma-separated values, e.g. 4,5,6 or gaussian,gamma,bernoulli

gridStrategy

Grid Search Strategy

Strategy to use for model hyperparameter search. Cartesian does exhaustive search; RandomDiscrete searches randomly within given time or model limits.

gridMaxModels

Grid Max Models

Maximum number of models to build in the grid search (0 for unlimited).

gridMaxRuntimeSecs

Grid Max Runtime Seconds

Maximum runtime in seconds for the grid search (0 for unlimited).

gridStoppingRounds

Grid Stopping Rounds

Early stopping based on convergence of the metric during grid search (0 to disable).

gridStoppingTolerance

Grid Stopping Tolerance

Tolerance for metric-based stopping criterion during grid search.

gridStoppingMetric

Grid Stopping Metric

Metric to use for early stopping during grid search (AUTO: logloss for classification, deviance for regression).

gridParallelism

Grid Parallelism

Level of parallelism to use when building models in the grid.

gridSelectBestModelBy

Grid Select Best Model By

Metric used to select the best model from the grid.
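
To make the difference between the two grid strategies concrete, the sketch below expands parameter names and comma-separated values into candidate settings. It is an illustration only: the helper names are hypothetical, the parsing rules are an assumption, and runtime limits and early stopping are not modelled.

```python
import itertools
import random


def parse_values(raw):
    """Split a comma-separated string and convert numeric entries."""
    parsed = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            parsed.append(int(token))
        except ValueError:
            try:
                parsed.append(float(token))
            except ValueError:
                parsed.append(token)   # keep non-numeric values as strings
    return parsed


def cartesian_grid(param_keys, param_values):
    """Exhaustive search: every combination of the listed values."""
    value_lists = [parse_values(raw) for raw in param_values]
    return [dict(zip(param_keys, combo))
            for combo in itertools.product(*value_lists)]


def random_discrete_grid(param_keys, param_values, max_models, seed):
    """Random search: visit combinations in random order, up to max_models."""
    grid = cartesian_grid(param_keys, param_values)
    random.Random(seed).shuffle(grid)
    return grid if max_models == 0 else grid[:max_models]   # 0 means unlimited


keys = ["maxDepth", "learnRate"]
values = ["4,5,6", "0.05,0.1"]
full_grid = cartesian_grid(keys, values)
sampled = random_discrete_grid(keys, values, max_models=3, seed=42)
```

Here the Cartesian strategy yields all six combinations, while the random strategy returns three of them.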

Details

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
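
The idea of "increasingly refined approximations" can be shown with a toy example. The sketch below is a single-process illustration of gradient boosting for squared-error regression using one-split trees (stumps) on a single feature. It is not H2O's distributed implementation, and all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:        # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]                               # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, ntrees, learn_rate):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    initial = sum(ys) / len(ys)                   # start from the mean response
    predictions = [initial] * len(ys)
    stumps = []
    for _ in range(ntrees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learn_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return initial, learn_rate, stumps


def predict_gbm(model, x):
    initial, learn_rate, stumps = model
    return initial + learn_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up example data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
mse_1_tree = mean_squared_error(fit_gbm(xs, ys, ntrees=1, learn_rate=0.3), xs, ys)
mse_20_trees = mean_squared_error(fit_gbm(xs, ys, ntrees=20, learn_rate=0.3), xs, ys)
```

Each added tree corrects part of the remaining error, so the training error with 20 trees is lower than with one.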

More details are available at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html