H2O Generalized Linear Models

Generalized Linear Models (GLM) estimate regression models for outcomes that follow exponential-family distributions.

Input

Takes a DataFrame as input.

Type

ml-estimator

Class

fire.nodes.h2o.NodeH2OGlm

Fields

Name

Title

Description

isResponseIsCategorical

Is Response Column Categorical

Specifies whether the response column is categorical or numeric. This determines whether the node performs classification or regression.

labelCol

Label Column

Response variable column.

featuresCols

Feature Columns

Features to be used for modeling.

splitRatio

Split Ratio

Split Ratio

columnsToCategorical

Columns to Categorical

Columns to be encoded as categorical.

seed

Seed

Seed for pseudo random number generator (if applicable).

balanceClasses

Balance Classes

Balance training data class counts via over/under-sampling (for imbalanced data).

nfolds

Number of Folds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

maxIterations

Max Iterations

Maximum number of iterations.

theta

Theta

Theta

solver

Solver

Specify the solver to use (AUTO, IRLSM, L_BFGS, COORDINATE_DESCENT_NAIVE, COORDINATE_DESCENT, GRADIENT_DESCENT_LH, or GRADIENT_DESCENT_SQERR)

earlyStopping

Early Stopping

Stop early when there is no more relative improvement on train or validation (if provided).

removeCollinearCols

Remove Collinear Columns

In case of linearly dependent columns, remove some of the dependent columns.

family

Family

Distribution family. Use binomial for classification with logistic regression; the other families are for regression problems.

ignoreConstCols

Ignore Const Columns

Ignore constant columns.

scoreEachIteration

Score Each Iteration

Whether to score during each iteration of model training.

stoppingRounds

Stopping Rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable).

maxRuntimeSecs

Max Runtime Secs

Maximum allowed runtime in seconds for model training (0 to disable).

stoppingMetric

StoppingMetric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression)

stoppingTolerance

StoppingTolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

gainsliftBins

Gains Lift Bins

Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.

advanced

Advanced

convertUnknownCategoricalLevelsToNa

Convert Unknown Categorical Levels to NA

If set to ‘true’, the model converts unknown categorical levels to NA when making predictions.

standardize

Standardize

Standardize numeric columns to have zero mean and unit variance.

intercept

Intercept

Include constant term in the model.

withLeafNodeAssignments

With Node Assignments

Enables or disables computation of leaf node assignments.

withStageResults

With Stage Results

Enables or disables computation of stage results.

tweedieVariancePower

Tweedie Variance Power

Tweedie variance power.

tweedieLinkPower

Tweedie Link Power

Tweedie link power.

lambdaSearch

Lambda Search

Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.

nlambdas

Number of Lambdas

Number of lambdas to be used in a search. Default indicates: if alpha is zero and lambda search is set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.

scoreIterationInterval

Score Iteration Interval

Score Iteration Interval

coldStart

Cold Start

Only applicable to multiple alpha/lambda values. If false, build the model for the next set of alpha/lambda values starting from the values provided by the current model. If true, start the GLM model from scratch.

missingValuesHandling

Missing Values Handling

Handling of missing values.

nonNegative

Non Negative

Restrict coefficients (not intercept) to be non-negative.

betaEpsilon

Beta Epsilon

Converge if beta changes less (using L-infinity norm) than beta epsilon. ONLY applies to the IRLSM solver.

objectiveEpsilon

Objective Epsilon

Converge if objective value changes less than this. Default (of -1.0) indicates: If lambda_search is set to True, the value of objective_epsilon is set to .0001. If lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001; for any other value of lambda, the default value of objective_epsilon is set to .0001.

gradientEpsilon

Gradient Epsilon

Converge if objective changes less (using L-infinity norm) than this. ONLY applies to the L-BFGS solver. Default (of -1.0) indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001; otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.

objReg

Objective Regularizer

Likelihood divider in objective value computation, default (of -1.0) will set it to 1/nobs.

link

Link

Link function

calcLike

Calc Like

If true, will return likelihood function value for HGLM.

HGLM

HGLM

If set to true, will return HGLM model. Otherwise, normal GLM model will be returned.

prior

Prior

Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.

lambdaMinRatio

Lambda Min Ratio

Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.

maxActivePredictors

Max Active Predictors

Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000; otherwise it is set to 100000000.

maxAfterBalanceSize

Max After Balance Size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

maxConfusionMatrixSize

Max Confusion Matrix Size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.

computePValues

Compute P Values

Request p-values computation. P-values work only with the IRLSM solver and no regularization.

generateScoringHistory

Generate Scoring History

If set to true, will generate scoring history for GLM. This may significantly slow down the algorithm.

keepCrossValidationModels

Keep Cross Validation Models

Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.

keepCrossValidationPredictions

Keep Cross Validation Predictions

Whether to keep the cross-validation predictions. In H2O AutoML, this needs to be set to TRUE if running the same AutoML object for repeated runs, because CV predictions are required to build additional Stacked Ensemble models.

keepCrossValidationFoldAssignment

Keep Cross Validation Fold Assignment

Whether to keep cross-validation assignments.

weightCol

Weight Column

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.

offsetCol

Offset Column

Offset column. This will be added to the combination of columns before applying the link function.

foldCol

Fold Column

Column with cross-validation fold index assignment per observation.

foldAssignment

Fold Assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

aucType

AUC Type

Set default multinomial AUC type.

confusionMatrix

Confusion Matrix

output_confusion_matrix_chart

Output Confusion Matrix Chart

Whether to display the confusion matrix chart.

cm_chart_title

Confusion Matrix Chart Title

Title name to display in Confusion Matrix Chart

cm_chart_description

Confusion Matrix Chart Description

Description to display in Confusion Matrix Chart

confusionMatrixTargetLegend

Confusion Matrix Target Legend

Legend name to display for Target in Confusion Matrix

confusionMatrixPredictedLabelLegend

Confusion Matrix PredictedLabel Legend

Legend name to display for Predicted Label in Confusion Matrix

path

Save Confusion Matrix Path

Save Confusion Matrix to Path

ROC Curve

ROC Curve

output_roc_curve

Output ROC Curve

Whether to display the ROC curve chart.

roc_title

ROC Curve Chart Title

Title name to display in ROC Curve Chart

roc_description

ROC Curve Chart Description

Add Description for ROC Curve Chart

xlabel

X Label

X label

ylabel

Y Label

Y Label

Grid Search

Grid Search

paramKeys

Param Name

Param names, e.g. maxDepth, learnRate, nTrees, distribution

paramValues

Param Value

Enter comma-separated values, e.g. 4,5,6 or gaussian, gamma, bernoulli

gridStrategy

Grid Search Strategy

Strategy to use for model hyperparameter search. Cartesian does exhaustive search; RandomDiscrete searches randomly within given time or model limits.

gridMaxModels

Grid Max Models

Maximum number of models to build in the grid search (0 for unlimited).

gridMaxRuntimeSecs

Grid Max Runtime Seconds

Maximum runtime in seconds for the grid search (0 for unlimited).

gridStoppingRounds

Grid Stopping Rounds

Early stopping based on convergence of the metric during grid search (0 to disable).

gridStoppingTolerance

Grid Stopping Tolerance

Tolerance for metric-based stopping criterion during grid search.

gridStoppingMetric

Grid Stopping Metric

Metric to use for early stopping during grid search (AUTO: logloss for classification, deviance for regression).

gridParallelism

Grid Parallelism

Level of parallelism to use when building models in the grid.

gridSelectBestModelBy

Grid Select Best Model By

Metric used to select the best model from the grid.
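The difference between the two Grid Search strategies described above can be illustrated with a minimal sketch. It is not the node's or H2O's implementation: the function names are hypothetical, the search space (two solvers and three maxIterations values) is chosen only for illustration, and the random strategy here is limited by a model count alone, not by runtime.

```python
import itertools
import random


def cartesian_grid(param_grid):
    """Yield every combination of the supplied values (exhaustive search)."""
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def random_discrete_grid(param_grid, max_models, seed=None):
    """Return randomly ordered combinations, capped at max_models (0 = unlimited)."""
    combos = list(cartesian_grid(param_grid))
    rng = random.Random(seed)          # seeded so the sample is reproducible
    rng.shuffle(combos)
    return combos if max_models == 0 else combos[:max_models]


# Hypothetical search space; keys reuse field names from the table above.
param_grid = {
    "solver": ["IRLSM", "L_BFGS"],
    "maxIterations": [50, 100, 200],
}

all_models = list(cartesian_grid(param_grid))                     # 2 x 3 combinations
sampled = random_discrete_grid(param_grid, max_models=4, seed=42)  # at most 4 of them
```

Cartesian search grows multiplicatively with every added parameter, which is why a model or runtime limit is usually paired with the random strategy.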

Details

Generalized Linear Models (GLM) estimate regression models for outcomes that follow exponential-family distributions. In addition to the Gaussian (i.e. normal) distribution, these include the Poisson, binomial, and gamma distributions. Each serves a different purpose and, depending on the choice of distribution and link function, can be used either for prediction or classification.
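How the distribution and link function work together can be sketched as follows. A GLM computes a linear predictor from the features and passes it through the inverse of the link function to obtain the predicted mean. The pairing of each family with a link below follows common convention and is an assumption of this sketch, as are the names INVERSE_LINKS and predict_mean; the links the node actually uses are controlled by the Link field.

```python
import math


def _sigmoid(eta):
    """Numerically stable inverse of the logit link."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Family -> inverse link (assumed conventional pairing, for illustration only).
INVERSE_LINKS = {
    "gaussian": lambda eta: eta,            # identity link
    "binomial": _sigmoid,                   # logit link
    "poisson": lambda eta: math.exp(eta),   # log link
    "gamma": lambda eta: 1.0 / eta,         # inverse link
}


def predict_mean(family, intercept, coefficients, row):
    """Compute the linear predictor, then map it to the mean of the response."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, row))
    return INVERSE_LINKS[family](eta)
```

With the binomial family the result is a probability between 0 and 1, which is what makes the same linear machinery usable for classification.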

More details are available at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html

Examples

H2O Generalized Linear Models Node Examples

Example 1: Binary Classification

Configuration:

IS RESPONSE COLUMN CATEGORICAL: true

LABEL COLUMN: Churn

FEATURE COLUMNS: Age, Income, AccountBalance, Tenure

Dataset:

Input DataFrame:

Age Income AccountBalance Tenure Churn

25 30000 5000 2 0

45 60000 12000 5 1

Outcome:

The node trains a binary classification model to predict whether a customer will churn.
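To make the idea behind Example 1 concrete, here is a minimal sketch of binomial (logistic) regression on the two sample rows. It is not what the node runs: plain gradient descent stands in for H2O's solvers, no regularization is applied, the 0.5 decision threshold is an assumption, and all function names are hypothetical. Features are standardized first, mirroring the Standardize field.

```python
import math
import statistics


def standardize(columns):
    """Scale each column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mean = statistics.fmean(col)
        std = statistics.pstdev(col) or 1.0  # guard against constant columns
        scaled.append([(v - mean) / std for v in col])
    return scaled


def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def score(intercept, weights, row):
    """Predicted probability that the label is 1."""
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(rows, labels, iterations=200, learning_rate=0.1):
    """Fit by gradient descent on the mean log-loss (a stand-in for H2O's solvers)."""
    n = len(rows)
    intercept, weights = 0.0, [0.0] * len(rows[0])
    for _ in range(iterations):
        errors = [score(intercept, weights, row) - y for row, y in zip(rows, labels)]
        intercept -= learning_rate * sum(errors) / n
        weights = [
            w - learning_rate * sum(e * row[j] for e, row in zip(errors, rows)) / n
            for j, w in enumerate(weights)
        ]
    return intercept, weights


# The two sample rows from Example 1: Age, Income, AccountBalance, Tenure -> Churn
features = [[25, 30000, 5000, 2], [45, 60000, 12000, 5]]
churn = [0, 1]

columns = standardize(list(zip(*features)))        # standardize column by column
rows = [list(r) for r in zip(*columns)]            # back to row-major layout
intercept, weights = fit_logistic(rows, churn)
probabilities = [score(intercept, weights, row) for row in rows]
predictions = [int(p >= 0.5) for p in probabilities]  # assumed 0.5 threshold
```

Two rows are far too few to learn anything meaningful; the sketch only shows the mechanics of turning feature columns into a churn probability.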

Example 2: Regression

Configuration:

IS RESPONSE COLUMN CATEGORICAL: false

LABEL COLUMN: HousePrice

FEATURE COLUMNS: SquareFeet, Bedrooms, Bathrooms, YearBuilt

Dataset:

Input DataFrame:

SquareFeet Bedrooms Bathrooms YearBuilt HousePrice

2000 3 2 2005 450000

1500 2 1 1995 300000

Outcome:

The node trains a regression model to predict house prices based on the features provided.
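For intuition about Example 2: with a Gaussian family and an identity link, and ignoring any regularization, a GLM reduces to ordinary least squares. The sketch below fits a line to the SquareFeet and HousePrice columns of the two sample rows only; the function name fit_line and the 1800 square foot query are hypothetical, and the node itself would use all four features and its configured solver.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# SquareFeet and HousePrice from the two sample rows in Example 2
square_feet = [2000, 1500]
house_price = [450000, 300000]

intercept, slope = fit_line(square_feet, house_price)
# slope is 300.0 (price per square foot); intercept is -150000.0

estimate = intercept + slope * 1800   # hypothetical 1800 sq ft house: 390000.0
```

With only two points the line passes through both exactly, so this illustrates the calculation, not the quality of fit to expect on real data.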