H2O K-Means

K-Means falls in the general category of clustering algorithms.

Input

It takes a DataFrame as input.

Type

ml-estimator

Class

fire.nodes.h2o.NodeH2OKMeans

Fields

Name	Title	Description
modelIdentifier	Model Identifier	Identifier for the model. It starts with $loop, followed by the column names separated by underscores. Example: $loop_columnName1_columnName2.
maxIterations	Max Iterations	Maximum number of iterations.
featuresCols	Feature Columns	Feature columns to be used for modelling.
splitRatio	Split Ratio	Split Ratio
standardize	Standardize	Standardize numeric columns to have zero mean and unit variance.
seed	Seed	Seed for pseudo random number generator (if applicable).
init	Initialization mode	Initialization mode.
estimateK	Estimate k	Whether to estimate the number of clusters (<=k) iteratively and deterministically.
k	K	The maximum number of clusters. If estimateK is disabled, the model will find k centroids; otherwise it will find up to k centroids.
dynamicClusteringCol	Dynamic Clustering Column	Column with varying K values. If empty, the default K value is used.
saveCentroidsPath	Save Centroids Path	Path where the centroids are saved as a CSV file.
clusterSizeConstraints	Cluster Size Constraint	An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters.
nfolds	Number of Folds	Number of folds for K-fold cross-validation (0 to disable or >= 2).
keepCrossValidationModels	Keep Cross Validation Models	Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.
keepCrossValidationPredictions	Keep Cross Validation Predictions	Whether to keep the predictions of the cross-validation models. This needs to be set to TRUE if running the same AutoML object for repeated runs, because CV predictions are required to build additional Stacked Ensemble models in AutoML.
keepCrossValidationFoldAssignment	Keep Cross Validation Fold Assignment	Whether to keep cross-validation assignments.
foldCol	Fold Column	Column with cross-validation fold index assignment per observation.
foldAssignment	Fold Assignment	Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
categoricalEncoding	Categorical Encoding	Encoding scheme to use for handling categorical features.
ignoreConstCols	Ignore Const Columns	Ignore constant columns.
columnsToCategorical	Columns to Categorical	Columns to be converted to categorical.
scoreEachIteration	Score Each Iteration	Whether to score during each iteration of model training.
maxRuntimeSecs	Max Runtime Secs	This argument specifies the maximum time that the AutoML process will run for. If both max_runtime_secs and max_models are specified, then the AutoML run will stop as soon as it hits either of these limits. If neither max_runtime_secs nor max_models is specified, then max_runtime_secs defaults to 3600 seconds (1 hour).
Grid Search	Grid Search
paramKeys	Param Name	Parameter names, e.g. maxDepth, learnRate, nTrees, distribution.
paramValues	Param Value	Enter comma-separated values, e.g. 4,5,6 or gaussian, gamma, bernoulli.
gridStrategy	Grid Search Strategy	Strategy to use for model hyperparameter search. Cartesian does exhaustive search; RandomDiscrete searches randomly within given time or model limits.
gridMaxModels	Grid Max Models	Maximum number of models to build in the grid search (0 for unlimited).
gridMaxRuntimeSecs	Grid Max Runtime Seconds	Maximum runtime in seconds for the grid search (0 for unlimited).
gridStoppingRounds	Grid Stopping Rounds	Early stopping based on convergence of the metric during grid search (0 to disable).
gridStoppingTolerance	Grid Stopping Tolerance	Tolerance for metric-based stopping criterion during grid search.
gridStoppingMetric	Grid Stopping Metric	Metric to use for early stopping during grid search (AUTO: logloss for classification, deviance for regression).
gridParallelism	Grid Parallelism	Level of parallelism to use when building models in the grid.
gridSelectBestModelBy	Grid Select Best Model By	Metric used to select the best model from the grid.
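
The difference between the two gridStrategy options can be illustrated with a small, library-independent sketch. This is not how the node or H2O implements grid search; the function names, the example search space, and the way values are stored are assumptions made for illustration only. Cartesian enumerates every combination of the supplied values, while RandomDiscrete draws a limited number of them, here capped by a max_models argument that mirrors the gridMaxModels field (0 meaning unlimited).

```python
import itertools
import random


def cartesian_grid(param_space):
    """Return every combination of the given hyperparameter values."""
    keys = list(param_space)
    value_lists = [param_space[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def random_discrete_grid(param_space, max_models, seed):
    """Return up to max_models combinations, drawn without replacement."""
    combos = cartesian_grid(param_space)
    rng = random.Random(seed)
    if max_models <= 0 or max_models >= len(combos):
        # 0 is treated as "unlimited": return all combinations in random order.
        rng.shuffle(combos)
        return combos
    return rng.sample(combos, max_models)


# Hypothetical search space; keys and values are illustrative only.
space = {"k": [3, 4, 5], "standardize": [True, False]}

full = cartesian_grid(space)                                  # 3 * 2 = 6 candidates
subset = random_discrete_grid(space, max_models=4, seed=42)   # 4 of those 6
print(len(full), len(subset))
```

With three values for one parameter and two for another, the Cartesian strategy builds six candidate models, which is why a model or runtime limit becomes useful as the number of parameters grows.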

Details

K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that an observation in a given group is more similar to another observation in the same group than to another observation in a different group.
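
The partitioning idea described above can be sketched in plain Python with the classic Lloyd iteration: assign each observation to its nearest centroid, then move each centroid to the mean of its members, and repeat until the assignments stop changing. This is a minimal illustration, not H2O's implementation. The function name kmeans, the sample points, and the choice of the first k points as initial centroids are assumptions made to keep the example deterministic; in the node, the init field controls initialization.

```python
import math


def kmeans(points, k, max_iterations=10):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    # Simplification: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Illustrative data: two visually separate groups in two dimensions.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The result depends on the starting centroids, so a poor initial choice can converge to a worse grouping. That sensitivity is the reason fields such as init and seed exist.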

More details are available at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html