H2O PCA¶
PCA is commonly used to model without regularization or perform dimensionality reduction. It can also be useful to carry out as a preprocessing step before distance-based algorithms such as K-Means since PCA guarantees that all dimensions of a manifold are orthogonal.
Input¶
It takes in a DataFrame as input
Type¶
ml-estimator
Class¶
fire.nodes.h2o.NodeH2OPCA
Fields¶
Name |
Title |
Description |
|---|---|---|
transform |
Transform |
Transformation of training data. |
pcaMethod |
PCA Method |
Specify the algorithm to use for computing the principal components: GramSVD - uses a distributed computation of the Gram matrix, followed by a local SVD; Power - computes the SVD using the power iteration method (experimental); Randomized - uses randomized subspace iteration method; GLRM - fits a generalized low-rank model with L2 loss function and no regularization and solves for the SVD using local matrix algebra (experimental). |
pcaImpl |
PCA Implementation |
Specify the implementation to use for computing PCA (via SVD or EVD): MTJ_EVD_DENSEMATRIX - eigenvalue decompositions for dense matrix using MTJ; MTJ_EVD_SYMMMATRIX - eigenvalue decompositions for symmetric matrix using MTJ; MTJ_SVD_DENSEMATRIX - singular-value decompositions for dense matrix using MTJ; JAMA - eigenvalue decompositions for dense matrix using JAMA. References: JAMA - http://math.nist.gov/javanumerics/jama/; MTJ - https://github.com/fommil/matrix-toolkits-java/. |
splitRatio |
Split Ratio |
Split Ratio |
k |
K |
Rank of matrix approximation. |
maxIterations |
Max Iterations |
Maximum training iterations. |
seed |
Seed |
Seed for pseudo random number generator (if applicable). |
useAllFactorLevels |
Use All Factor Levels |
Whether first factor level is included in each categorical expansion. |
computeMetrics |
Compute Metrics |
Whether to compute metrics on the training data. |
imputeMissing |
Impute Missing |
Whether to impute missing entries with the column mean. |
ignoreConstCols |
Ignore Const Columns |
Ignore constant columns. |
scoreEachIteration |
Score Each Iteration |
Whether to score during each iteration of model training. |
maxRuntimeSecs |
Max Runtime Secs |
his argument specifies the maximum time that the AutoML process will run for. If both max_runtime_secs and max_models are specified, then the AutoML run will stop as soon as it hits either of these limits. If neither max_runtime_secs nor max_models are specified, then max_runtime_secs defaults to 3600 seconds (1 hour). |
ignoredCols |
Ignored Columns |
Features to be ignored for Modelling |
columnsToCategorical |
Columns to Categorical |
Columns to be Categorical encoded |
Details¶
Principal Components Analysis (PCA) is closely related to Principal Components Regression. The algorithm is carried out on a set of possibly collinear features and performs a transformation to produce a new set of uncorrelated features.
More details are available at : http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html