H2O Distributed Random Forest
Distributed Random Forest (DRF) is a powerful classification and regression tool. DRF generates a forest of classification or regression trees.
Input
It takes a DataFrame as input.
Type
ml-estimator
Class
fire.nodes.h2o.NodeH2ODRF
Fields
| Name | Title | Description |
|---|---|---|
| isResponseIsCategorical | Is Response Column Categorical | Specifies the response column type (numeric or categorical). This determines whether the model performs classification or regression. |
| labelCol | Label Column | Response variable column. |
| featuresCols | Feature Columns | Features to be used for modelling. |
| columnsToCategorical | Columns to Categorical | Columns to be categorically encoded. |
| seed | Seed | Seed for pseudo random number generator (if applicable). |
| balanceClasses | Balance Classes | Balance training data class counts via over/under-sampling (for imbalanced data). |
| splitRatio | Split Ratio | Split Ratio |
| nfolds | Number of Folds | Number of folds for K-fold cross-validation (0 to disable or >= 2). |
| ntrees | Number of Trees | Number of trees. |
| maxDepth | Max Depth | Maximum tree depth (0 for unlimited). |
| minRows | Min Rows | Fewest allowed (weighted) observations in a leaf. |
| nbins | Number of Bins | For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point. |
| nbinsTopLevel | Number of bins top level | For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by a factor of two per level. |
| nbinsCats | Number of Bins Categoricals | For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. |
| categoricalEncoding | Categorical Encoding | Encoding scheme used for handling categorical features. |
| ignoreConstCols | Ignore Const Columns | Ignore constant columns. |
| scoreEachIteration | Score Each Iteration | Whether to score during each iteration of model training. |
| stoppingRounds | Stopping Rounds | Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). |
| maxRuntimeSecs | Max Runtime Secs | Maximum allowed runtime in seconds for model training (0 to disable). |
| stoppingMetric | StoppingMetric | Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). |
| stoppingTolerance | StoppingTolerance | Relative tolerance for the metric-based stopping criterion (stop if the relative improvement is not at least this much). |
| gainsliftBins | Gains Lift Bins | Number of bins in the Gains/Lift table. 0 means disabled. The default value -1 means automatic binning. |
| withContributions | With Contributions | Enables or disables generating a sub-column of detailedPredictionCol containing Shapley values. |
| advanced | Advanced | |
| convertUnknownCategoricalLevelsToNa | Convert Unknown Categorical Levels to NA | If set to ‘true’, the model converts unknown categorical levels to NA when making predictions. |
| withLeafNodeAssignments | With Node Assignments | Enables or disables computation of leaf node assignments. |
| withStageResults | With Stage Results | Enables or disables computation of stage results. |
| mtries | mtries | Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors). |
| binomialDoubleTrees | Binomial Double trees | For binary classification: build 2x as many trees (one per class). This can lead to higher accuracy. |
| sampleRate | Sample Rate | Row sample rate per tree (from 0.0 to 1.0). |
| maxAfterBalanceSize | Max After Balance Size | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. |
| maxConfusionMatrixSize | Max Confusion Matrix Size | [Deprecated] Maximum size (number of classes) for confusion matrices to be printed in the logs. |
| buildTreeOneNode | Build tree one node | Builds the trees on a single node. |
| colSampleRatePerTree | Column Sample Rate Per Tree | Column sample rate per tree (from 0.0 to 1.0). |
| colSampleRateChangePerLevel | Column Sample Rate change Per Level | Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0). |
| scoreTreeInterval | Score Tree Interval | Score the model after every so many trees. Disabled if set to 0. |
| minSplitImprovement | Minimum Split Improvement | |
| histogramType | Histogram Type | What type of histogram to use for finding optimal split points. Possible values are |
| calibrateModel | Calibrate Model | Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. |
| checkConstantResponse | Check Constant Response | Check if the response column is constant. If enabled, an exception is thrown when the response column is a constant value. If disabled, the model trains regardless of whether the response column is constant. |
| keepCrossValidationModels | Keep Cross Validation Models | Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster. |
| keepCrossValidationPredictions | Keep Cross Validation Predictions | Whether to keep the cross-validation predictions. This must be set to TRUE if the same AutoML object is run repeatedly, because CV predictions are required to build additional Stacked Ensemble models in AutoML. |
| keepCrossValidationFoldAssignment | Keep Cross Validation Fold Assignment | Whether to keep cross-validation assignments. |
| distribution | Distribution | Distribution function used by algorithms that support it; other algorithms use their defaults. |
| weightCol | Weight Column | Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0. |
| offsetCol | Offset Column | Offset column. This will be added to the combination of columns before applying the link function. |
| foldCol | Fold Column | Column with cross-validation fold index assignment per observation. |
| foldAssignment | Fold Assignment | Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option stratifies the folds based on the response variable, for classification problems. |
| aucType | AUC Type | Set default multinomial AUC type. |
| confusionMatrix | Confusion Matrix | |
| output_confusion_matrix_chart | Output Confusion Matrix Chart | Whether to display the confusion matrix chart. |
| cm_chart_title | Confusion Matrix Chart Title | Title to display in the confusion matrix chart. |
| cm_chart_description | Confusion Matrix Chart Description | Description to display in the confusion matrix chart. |
| confusionMatrixTargetLegend | Confusion Matrix Target Legend | Legend name to display for Target in the confusion matrix. |
| confusionMatrixPredictedLabelLegend | Confusion Matrix PredictedLabel Legend | Legend name to display for Predicted Label in the confusion matrix. |
| confusionMatrixCountLegend | Confusion Matrix Count Legend | Legend name to display for Count in the confusion matrix. |
| path | Save Confusion Matrix Path | Path to save the confusion matrix to. |
| Description | Confusion Matrix Description | |
| confusionMatrixRowDescription | Confusion Matrix Outcome description | Business details describing the outcome of each confusion matrix row. |
| ROC Curve | ROC Curve | |
| output_roc_curve | Output ROC Curve | Whether to display the ROC curve chart. |
| roc_title | ROC Curve Chart Title | Title to display in the ROC curve chart. |
| roc_description | ROC Curve Chart Description | Description to display in the ROC curve chart. |
| xlabel | X Label | X Label |
| ylabel | Y Label | Y Label |
| Grid Search | Grid Search | |
| paramKeys | Param Name | Parameter names, e.g. maxDepth, learnRate, nTrees, distribution. |
| paramValues | Param Value | Comma-separated values, e.g. 4,5,6 or gaussian, gamma, bernoulli. |
| gridStrategy | Grid Search Strategy | Strategy to use for model hyperparameter search. Cartesian does exhaustive search; RandomDiscrete searches randomly within given time or model limits. |
| gridMaxModels | Grid Max Models | Maximum number of models to build in the grid search (0 for unlimited). |
| gridMaxRuntimeSecs | Grid Max Runtime Seconds | Maximum runtime in seconds for the grid search (0 for unlimited). |
| gridStoppingRounds | Grid Stopping Rounds | Early stopping based on convergence of the metric during grid search (0 to disable). |
| gridStoppingTolerance | Grid Stopping Tolerance | Tolerance for metric-based stopping criterion during grid search. |
| gridStoppingMetric | Grid Stopping Metric | Metric to use for early stopping during grid search (AUTO: logloss for classification, deviance for regression). |
| gridParallelism | Grid Parallelism | Level of parallelism to use when building models in the grid. |
| gridSelectBestModelBy | Grid Select Best Model By | Metric used to select the best model from the grid. |
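
The difference between the two Grid Search strategies described above can be illustrated with a small, library-free Python sketch. It is not the node's implementation: the helper names `parse_values` and `expand_grid` are hypothetical, and applying the model cap to both strategies is an assumption made for illustration only.

```python
import itertools
import random


def parse_values(raw):
    """Split a comma-separated paramValues string, coercing numbers where possible."""
    values = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            values.append(int(token))
        except ValueError:
            try:
                values.append(float(token))
            except ValueError:
                values.append(token)  # keep non-numeric values such as "gaussian"
    return values


def expand_grid(param_grid, strategy="Cartesian", max_models=0, seed=None):
    """Return the list of parameter combinations a grid search would try.

    Cartesian enumerates every combination in order; RandomDiscrete visits the
    same combinations in random order. max_models=0 means unlimited
    (assumption: the cap is applied the same way for both strategies).
    """
    keys = list(param_grid)
    combos = [
        dict(zip(keys, combo))
        for combo in itertools.product(*(param_grid[k] for k in keys))
    ]
    if strategy == "RandomDiscrete":
        random.Random(seed).shuffle(combos)
    elif strategy != "Cartesian":
        raise ValueError(f"Unknown strategy: {strategy}")
    if max_models > 0:
        combos = combos[:max_models]
    return combos


# Two parameters, entered the way paramKeys / paramValues describe them.
grid = {"maxDepth": parse_values("4,5,6"), "ntrees": parse_values("50, 100")}

cartesian = expand_grid(grid)
sampled = expand_grid(grid, strategy="RandomDiscrete", max_models=3, seed=42)

print(len(cartesian))  # 6 combinations: 3 depths x 2 tree counts
print(len(sampled))    # 3, limited by max_models
```

With three values for maxDepth and two for ntrees, Cartesian search trains six models, while RandomDiscrete with a cap of three trains a random subset of them. This is why a model or runtime limit matters most for the random strategy.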
Examples
Input:
Label Column: “HousePrice”
Feature Columns: [“Bedrooms”, “SquareFootage”, “Neighborhood”]
Output:
A Random Forest model trained to predict house prices.
The output includes feature importance scores, helping identify the most influential predictors.
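
As a rough sketch, the example above could map onto the node's fields as follows. The field names come from the Fields table; the dictionary form, the chosen values for ntrees, maxDepth and seed, and the helper names `default_mtries` and `effective_mtries` are assumptions for illustration, not the node's actual configuration format. The helper also shows how the documented mtries default (sqrt(p) for classification, p/3 for regression) would work out for three features; rounding down and the floor of 1 are assumptions.

```python
import math

# Hypothetical field values for the house-price example (illustration only).
node_config = {
    "isResponseIsCategorical": False,  # HousePrice is numeric, so this is regression
    "labelCol": "HousePrice",
    "featuresCols": ["Bedrooms", "SquareFootage", "Neighborhood"],
    "columnsToCategorical": ["Neighborhood"],  # assumption: Neighborhood holds text labels
    "ntrees": 50,     # illustrative value
    "maxDepth": 20,   # illustrative value
    "mtries": -1,     # -1 requests the default described in the Fields table
    "seed": 42,       # illustrative value
}


def default_mtries(num_predictors, is_classification):
    """Default candidate count per split when mtries is -1.

    sqrt(p) for classification, p/3 for regression, as stated in the Fields
    table. Rounding down and never going below 1 are assumptions.
    """
    if is_classification:
        return max(1, math.isqrt(num_predictors))
    return max(1, num_predictors // 3)


def effective_mtries(config):
    """Resolve the mtries value that would be used for a given configuration."""
    p = len(config["featuresCols"])
    if config["mtries"] == -1:
        return default_mtries(p, config["isResponseIsCategorical"])
    return config["mtries"]


print(effective_mtries(node_config))  # 1 (three features, regression: 3 // 3)
```

With only three features, the regression default considers a single candidate feature per split, so setting mtries explicitly may be worth trying on small feature sets like this one.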