H2O Generalized Linear Models

Generalized Linear Models (GLM) estimate regression models for outcomes following exponential family distributions.

Input

This node takes a DataFrame as input.

Type

ml-estimator

Class

fire.nodes.h2o.NodeH2OGLM

Fields

Name	Title	Description
isResponseIsCategorical	Is Response Column Categorical	Specify the response column type (numeric or categorical). This determines whether the node trains a classification or a regression model.
labelCol	Label Column	Response variable column.
featuresCols	Feature Columns	Features to be used for modeling.
splitRatio	Split Ratio	Split Ratio
columnsToCategorical	Columns to Categorical	Columns to be encoded as categorical.
seed	Seed	Seed for pseudo random number generator (if applicable).
balanceClasses	Balance Classes	Balance training data class counts via over/under-sampling (for imbalanced data).
nfolds	Number of Folds	Number of folds for K-fold cross-validation (0 to disable or >= 2).
maxIterations	Max Iterations	Maximum number of iterations.
theta	Theta	Theta value, used with the negative binomial family.
solver	Solver	Specify the solver to use (AUTO, IRLSM, L_BFGS, COORDINATE_DESCENT_NAIVE, COORDINATE_DESCENT, GRADIENT_DESCENT_LH, or GRADIENT_DESCENT_SQERR)
earlyStopping	Early Stopping	Stop early when there is no more relative improvement on train or validation (if provided).
removeCollinearCols	Remove Collinear Columns	In case of linearly dependent columns, remove some of the dependent columns.
family	Family	Family. Use binomial for classification with logistic regression, others are for regression problems.
ignoreConstCols	Ignore Const Columns	Ignore constant columns.
scoreEachIteration	Score Each Iteration	Whether to score during each iteration of model training.
stoppingRounds	Stopping Rounds	Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable).
maxRuntimeSecs	Max Runtime Secs	Maximum allowed runtime in seconds for model training (0 to disable).
stoppingMetric	StoppingMetric	Metric to use for early stopping (AUTO: logloss for classification, deviance for regression)
stoppingTolerance	StoppingTolerance	Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
gainsliftBins	Gains Lift Bins	Number of bins for the Gains/Lift table. 0 means disabled. The default value of -1 means automatic binning.
advanced	Advanced
convertUnknownCategoricalLevelsToNa	Convert Unknown Categorical Levels to NA	If set to ‘true’, the model converts unknown categorical levels to NA when making predictions.
standardize	Standardize	Standardize numeric columns to have zero mean and unit variance.
intercept	Intercept	Include constant term in the model.
withLeafNodeAssignments	With Node Assignments	Enables or disables computation of leaf node assignments.
withStageResults	With Stage Results	Enables or disables computation of stage results.
tweedieVariancePower	Tweedie Variance Power	Tweedie variance power.
tweedieLinkPower	Tweedie Link Power	Tweedie link power.
lambdaSearch	Lambda Search	Use lambda search starting at lambda max, given lambda is then interpreted as lambda min.
nlambdas	Number of Lambdas	Number of lambdas to be used in a search. Default indicates: if alpha is zero and lambda search is set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.
scoreIterationInterval	Score Iteration Interval	Score Iteration Interval
coldStart	Cold Start	Only applicable to multiple alpha/lambda values. If false, build the next model for next set of alpha/lambda values starting from the values provided by current model. If true will start GLM model from scratch.
missingValuesHandling	Missing Values Handling	Handling of missing values.
nonNegative	Non Negative	Restrict coefficients (not intercept) to be non-negative.
betaEpsilon	Beta Epsilon	Converge if beta changes less (using L-infinity norm) than beta epsilon. ONLY applies to the IRLSM solver.
objectiveEpsilon	Objective Epsilon	Converge if objective value changes less than this. Default (of -1.0) indicates: If lambda_search is set to True, the value of objective_epsilon is set to .0001. If lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001; for any other value of lambda, the default value of objective_epsilon is set to .0001.
gradientEpsilon	Gradient Epsilon	Converge if objective changes less (using L-infinity norm) than this. ONLY applies to the L-BFGS solver. Default (of -1.0) indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001; otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.
objReg	Objective Regularizer	Likelihood divider in objective value computation, default (of -1.0) will set it to 1/nobs.
link	Link	Link function
calcLike	Calc Like	If true, will return likelihood function value for HGLM.
HGLM	HGLM	If set to true, will return HGLM model. Otherwise, normal GLM model will be returned.
prior	Prior	Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.
lambdaMinRatio	Lambda Min Ratio	Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
maxActivePredictors	Max Active Predictors	Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000; otherwise it is set to 100000000.
maxAfterBalanceSize	Max After Balance Size	Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
maxConfusionMatrixSize	Max Confusion Matrix Size	[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.
computePValues	Compute P Values	Request p-values computation, p-values work only with IRLSM solver and no regularization.
generateScoringHistory	Generate Scoring History	If set to true, will generate scoring history for GLM. This may significantly slow down the algo.
keepCrossValidationModels	Keep Cross Validation Models	Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.
keepCrossValidationPredictions	Keep Cross Validation Predictions	Whether to keep the cross-validation predictions. This needs to be set to TRUE if running the same AutoML object for repeated runs because CV predictions are required to build additional Stacked Ensemble models in AutoML.
keepCrossValidationFoldAssignment	Keep Cross Validation Fold Assignment	Whether to keep cross-validation assignments.
weightCol	Weight Column	Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
offsetCol	Offset Column	Offset column. This will be added to the combination of columns before applying the link function.
foldCol	Fold Column	Column with cross-validation fold index assignment per observation.
foldAssignment	Fold Assignment	Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
aucType	AUC Type	Set default multinomial AUC type.
confusionMatrix	Confusion Matrix
output_confusion_matrix_chart	Output Confusion Matrix Chart	Whether to display the confusion matrix chart.
cm_chart_title	Confusion Matrix Chart Title	Title to display in the Confusion Matrix Chart
cm_chart_description	Confusion Matrix Chart Description	Description to display in the Confusion Matrix Chart
confusionMatrixTargetLegend	Confusion Matrix Target Legend	Legend name to display for Target in Confusion Matrix
confusionMatrixPredictedLabelLegend	Confusion Matrix PredictedLabel Legend	Legend name to display for Predicted Label in Confusion Matrix
confusionMatrixCountLegend	Confusion Matrix Count Legend	Legend name to display for Count in Confusion Matrix
path	Save Confusion Matrix Path	Save Confusion Matrix to Path
Description	Confusion Matrix Description
confusionMatrixRowDescription	Confusion Matrix Outcome description	One can provide the business details of the outcome of the confusion matrix rows
ROC Curve	ROC Curve
output_roc_curve	Output ROC Curve	Whether to display the ROC curve chart.
roc_title	ROC Curve Chart Title	Title name to display in ROC Curve Chart
roc_description	ROC Curve Chart Description	Add Description for ROC Curve Chart
xlabel	X Label	X label
ylabel	Y Label	Y Label
Grid Search	Grid Search
paramKeys	Param Name	Parameter names, e.g. maxDepth, learnRate, nTrees, distribution
paramValues	Param Value	Enter comma-separated values, e.g. 4,5,6 or gaussian, gamma, bernoulli
gridStrategy	Grid Search Strategy	Strategy to use for model hyperparameter search. Cartesian does exhaustive search; RandomDiscrete searches randomly within given time or model limits.
gridMaxModels	Grid Max Models	Maximum number of models to build in the grid search (0 for unlimited).
gridMaxRuntimeSecs	Grid Max Runtime Seconds	Maximum runtime in seconds for the grid search (0 for unlimited).
gridStoppingRounds	Grid Stopping Rounds	Early stopping based on convergence of the metric during grid search (0 to disable).
gridStoppingTolerance	Grid Stopping Tolerance	Tolerance for metric-based stopping criterion during grid search.
gridStoppingMetric	Grid Stopping Metric	Metric to use for early stopping during grid search (AUTO: logloss for classification, deviance for regression).
gridParallelism	Grid Parallelism	Level of parallelism to use when building models in the grid.
gridSelectBestModelBy	Grid Select Best Model By	Metric used to select the best model from the grid.
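
The lambdaSearch, nlambdas, and lambdaMinRatio fields together describe a sequence of regularization strengths that starts at lambda_max and ends at lambda_max * lambda_min_ratio. The sketch below illustrates that idea in plain Python. It assumes log-spaced values, which is a common choice but is not confirmed by this page as the spacing H2O uses internally; the function names and the example lambda_max are hypothetical, and the default helpers only restate the rules given in the table above.

```python
import math

def default_nlambdas(alpha):
    # Rule from the field table: 30 when alpha is zero (ridge), else 100.
    return 30 if alpha == 0 else 100

def default_lambda_min_ratio(n_obs, n_vars):
    # Rule from the field table; the n_obs == n_vars case is not specified
    # there, so treating it like "fewer observations" is an assumption.
    return 1e-4 if n_obs > n_vars else 1e-2

def lambda_path(lambda_max, lambda_min_ratio, nlambdas):
    """Return nlambdas values from lambda_max down to
    lambda_max * lambda_min_ratio, log-spaced (assumed spacing)."""
    if nlambdas < 2:
        return [lambda_max]
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * lambda_min_ratio)
    step = (log_min - log_max) / (nlambdas - 1)
    return [math.exp(log_max + i * step) for i in range(nlambdas)]

# Hypothetical values chosen only to keep the output short.
path = lambda_path(lambda_max=10.0, lambda_min_ratio=0.01, nlambdas=5)
```

With lambda search enabled, a model is fitted for each value along such a path, moving from the most regularized model toward the least regularized one.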

Details

Generalized Linear Models (GLM) estimate regression models for outcomes following exponential family distributions. In addition to the Gaussian (i.e. normal) distribution, these include the Poisson, binomial, and gamma distributions. Each serves a different purpose and, depending on the choice of distribution and link function, can be used either for prediction or classification.
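
To make the role of the link function concrete, the following sketch pairs each of the four families named above with its textbook canonical link and shows how a linear predictor is turned into a predicted mean. It is an illustration in plain Python, not the node's implementation; the LINKS table, the predict_mean helper, and the coefficient values are hypothetical, and the link the node actually applies depends on the Link field.

```python
import math

# Canonical link g(mu) and its inverse g^-1(eta) for each family.
LINKS = {
    "gaussian": (lambda mu: mu,
                 lambda eta: eta),                          # identity
    "binomial": (lambda mu: math.log(mu / (1.0 - mu)),
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),  # logit
    "poisson":  (lambda mu: math.log(mu),
                 lambda eta: math.exp(eta)),                 # log
    "gamma":    (lambda mu: 1.0 / mu,
                 lambda eta: 1.0 / eta),                     # inverse
}

def predict_mean(family, coefficients, intercept, row):
    """Compute eta = intercept + sum(b_i * x_i), then map it to the
    mean of the response with the family's inverse link."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, row))
    _, inverse_link = LINKS[family]
    return inverse_link(eta)

# Hypothetical coefficients: a binomial model returns a probability.
probability = predict_mean("binomial", [0.5, -0.25], 0.1, [2.0, 4.0])
```

The same linear predictor therefore yields a probability under the binomial family and an unbounded numeric value under the Gaussian family, which is why the family choice separates classification from regression.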

More details are available at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html

Examples

H2O Generalized Linear Models Node Examples

Example 1: Binary Classification

Configuration:

IS RESPONSE COLUMN CATEGORICAL: true

LABEL COLUMN: Churn

FEATURE COLUMNS: Age, Income, AccountBalance, Tenure

Dataset:

Input DataFrame:

Age	Income	AccountBalance	Tenure	Churn
25	30000	5000	2	0
45	60000	12000	5	1

Outcome:

The node trains a binary classification model to predict whether a customer will churn.
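
For readers who want to reproduce a comparable model outside the node, the sketch below uses the H2O Python client. The mapping from node settings to H2O arguments is an assumption rather than the node's actual code: choosing binomial for a categorical response follows the Family field description, while gaussian for a numeric response is assumed. The csv_path argument and both function names are hypothetical, and running train_churn_model requires the h2o package and a reachable H2O cluster.

```python
def glm_params(is_response_categorical, nfolds=0, seed=42):
    """Map a few node settings to H2O GLM keyword arguments
    (assumed mapping, for illustration only)."""
    return {
        "family": "binomial" if is_response_categorical else "gaussian",
        "nfolds": nfolds,
        "seed": seed,
    }

def train_churn_model(csv_path):
    # Imported lazily so glm_params can be used without H2O installed.
    import h2o
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    h2o.init()
    frame = h2o.import_file(csv_path)
    # A categorical response must be a factor for classification.
    frame["Churn"] = frame["Churn"].asfactor()

    model = H2OGeneralizedLinearEstimator(**glm_params(True))
    model.train(
        x=["Age", "Income", "AccountBalance", "Tenure"],
        y="Churn",
        training_frame=frame,
    )
    return model
```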

Example 2: Regression

Configuration:

IS RESPONSE COLUMN CATEGORICAL: false

LABEL COLUMN: HousePrice

FEATURE COLUMNS: SquareFeet, Bedrooms, Bathrooms, YearBuilt

Dataset:

Input DataFrame:

SquareFeet	Bedrooms	Bathrooms	YearBuilt	HousePrice
2000	3	2	2005	450000
1500	2	1	1995	300000

Outcome:

The node trains a regression model to predict house prices based on the features provided.
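
With a Gaussian family, an identity link, and no regularization, a GLM reduces to ordinary least squares. The sketch below shows that special case in plain Python, using only the SquareFeet column and the two sample rows above. It is a simplified illustration and not how H2O fits the model (the node uses the solver selected in the Solver field); the function name and the 1800 square foot query are hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

square_feet = [2000, 1500]
house_price = [450000, 300000]

intercept, slope = fit_simple_linear(square_feet, house_price)
predicted = intercept + slope * 1800  # price estimate for a hypothetical house
```

Because there are only two rows, the line passes through both points exactly: the slope is 300 per square foot and the intercept is -150000. A real dataset would need many more rows than features for the fit to be meaningful.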