H2O Isolation Forest¶

Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees.

Input¶

It takes a DataFrame as input.

Type¶

ml-estimator

Class¶

fire.nodes.h2o.NodeH2OIsolationForest

Fields¶

Name	Title	Description
sampleSize	Sample Size	Number of randomly sampled observations used to train each Isolation Forest tree. Only one of the parameters sample_size and sample_rate should be defined. If sample_rate is defined, sample_size is ignored.
featuresCols	Feature Columns	Features to be used for modelling.
sampleRate	Sample Rate	Row sample rate per tree (from 0.0 to 1.0).
mtries	mtries	Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).
contamination	Contamination	Contamination ratio - the proportion of anomalies in the input dataset. If undefined (-1) the predict function will not mark observations as anomalies and only anomaly score will be returned.
ntrees	Number of Trees	Number of trees.
maxDepth	Max Depth	Maximum tree depth (0 for unlimited).
minRows	Min Rows	Fewest allowed (weighted) observations in a leaf.
seed	Seed	Seed for pseudo random number generator (if applicable).
buildTreeOneNode	Build tree one node	Whether to run on a single node.
colSampleRatePerTree	Column Sample Rate Per Tree	Column sample rate per tree (from 0.0 to 1.0).
colSampleRateChangePerLevel	Column Sample Rate change Per Level	Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0).
scoreTreeInterval	Score Tree Interval	Score the model after every so many trees. Disabled if set to 0.
categoricalEncoding	Categorical Encoding	Specify one of the various encoding schemes for handling categorical features
ignoreConstCols	Ignore Const Columns	Ignore constant columns.
ignoredCols	Ignore Columns	Columns to ignore.
columnsToCategorical	Columns to Categorical	Columns to convert to categorical.
scoreEachIteration	Score Each Iteration	Whether to score during each iteration of model training.
stoppingRounds	Stopping Rounds	Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable).
maxRuntimeSecs	Max Runtime Secs	Maximum allowed runtime in seconds for model training (0 to disable).
stoppingMetric	Stopping Metric	Metric to use for early stopping (AUTO: logloss for classification, deviance for regression).
stoppingTolerance	Stopping Tolerance	Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).
Grid Search	Grid Search
paramKeys	Param Name	Parameter names, e.g. maxDepth, learnRate, nTrees, distribution.
paramValues	Param Value	Comma-separated values for the parameter, e.g. 4,5,6 or gaussian,gamma,bernoulli.
gridStrategy	Grid Search Strategy	Strategy to use for model hyperparameter search. Cartesian does exhaustive search; RandomDiscrete searches randomly within given time or model limits.
gridMaxModels	Grid Max Models	Maximum number of models to build in the grid search (0 for unlimited).
gridMaxRuntimeSecs	Grid Max Runtime Seconds	Maximum runtime in seconds for the grid search (0 for unlimited).
gridStoppingRounds	Grid Stopping Rounds	Early stopping based on convergence of the metric during grid search (0 to disable).
gridStoppingTolerance	Grid Stopping Tolerance	Tolerance for metric-based stopping criterion during grid search.
gridStoppingMetric	Grid Stopping Metric	Metric to use for early stopping during grid search (AUTO: logloss for classification, deviance for regression).
gridParallelism	Grid Parallelism	Level of parallelism to use when building models in the grid.
gridSelectBestModelBy	Grid Select Best Model By	Metric used to select the best model from the grid.
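
To illustrate the difference between the two grid search strategies listed above, here is a minimal Python sketch. It is not the node's or H2O's implementation: the helper names `cartesian_grid` and `random_discrete_grid` are hypothetical, and the random strategy is simplified to shuffling the full grid and keeping the first few combinations. The parameter names and values are illustrative only.

```python
import itertools
import random


def cartesian_grid(params):
    """Return every combination of the listed hyperparameter values."""
    keys = list(params)
    value_lists = [params[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


def random_discrete_grid(params, max_models, seed=0):
    """Return up to max_models combinations in random order (0 = no limit).

    Simplified stand-in for a RandomDiscrete-style search: combinations are
    drawn without replacement from the full Cartesian grid.
    """
    combos = cartesian_grid(params)
    rng = random.Random(seed)
    rng.shuffle(combos)
    return combos if max_models == 0 else combos[:max_models]


# Illustrative values, in the spirit of paramKeys / paramValues.
params = {"maxDepth": [4, 5, 6], "ntrees": [50, 100]}

full = cartesian_grid(params)                         # exhaustive: 3 * 2 models
subset = random_discrete_grid(params, max_models=3)   # capped random subset

print(len(full), "models in the Cartesian grid")
print(len(subset), "models in the random subset")
```

The Cartesian grid grows multiplicatively with each added parameter, which is why a model or runtime limit is usually paired with the random strategy.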

Details¶

Isolation Forest is similar in principle to Random Forest and is built on decision trees. However, instead of profiling normal data points, it explicitly identifies anomalies (outliers). It isolates an observation by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. The number of splits needed to isolate an observation is its path length: anomalies are few and different, so they tend to be isolated after fewer splits than normal points, and a short average path length across the trees indicates a likely anomaly.
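
The random-split idea described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not H2O's implementation: the functions `path_length` and `mean_path_length` are hypothetical, the data is synthetic, and the arguments `ntrees`, `sample_size`, and `max_depth` only loosely mirror the fields of the same meaning in the table.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))            # pick a random feature
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:                                   # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                    # random split in [min, max]
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, ntrees=100, sample_size=64, seed=42):
    """Average the path length of `point` over many random trees."""
    rng = random.Random(seed)
    total = 0
    for _ in range(ntrees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / ntrees


# Synthetic data: a dense cluster around the origin plus one distant point.
gen = random.Random(0)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data.append(outlier)

outlier_len = mean_path_length(outlier, data)
inlier_len = mean_path_length((0.0, 0.0), data)
print("mean path length, outlier:", outlier_len)
print("mean path length, inlier: ", inlier_len)
```

The distant point should come out with a noticeably shorter mean path length than the point at the centre of the cluster, which is the signal an anomaly score is derived from.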

More details are available at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html