H2O PCA
===========

PCA is commonly used to model without regularization or to perform dimensionality reduction. It can also be a useful preprocessing step before distance-based algorithms such as K-Means, since the principal components it produces are mutually orthogonal.

Input
--------------

It takes in a DataFrame as input.

Type
---------

ml-estimator

Class
---------

fire.nodes.h2o.NodeH2OPCA

Fields
---------

.. list-table::
   :widths: 10 5 10
   :header-rows: 1

   * - Name
     - Title
     - Description
   * - transform
     - Transform
     - Transformation of training data.
   * - pcaMethod
     - PCA Method
     - Specify the algorithm to use for computing the principal components: GramSVD - uses a distributed computation of the Gram matrix, followed by a local SVD; Power - computes the SVD using the power iteration method (experimental); Randomized - uses the randomized subspace iteration method; GLRM - fits a generalized low-rank model with an L2 loss function and no regularization, and solves for the SVD using local matrix algebra (experimental).
   * - inputCols
     - Input Columns
     - Features to be used for modeling.
   * - pcaImpl
     - PCA Implementation
     - Specify the implementation to use for computing PCA (via SVD or EVD): MTJ_EVD_DENSEMATRIX - eigenvalue decomposition for a dense matrix using MTJ; MTJ_EVD_SYMMMATRIX - eigenvalue decomposition for a symmetric matrix using MTJ; MTJ_SVD_DENSEMATRIX - singular-value decomposition for a dense matrix using MTJ; JAMA - eigenvalue decomposition for a dense matrix using JAMA. References: JAMA - http://math.nist.gov/javanumerics/jama/; MTJ - https://github.com/fommil/matrix-toolkits-java/.
   * - k
     - K
     - Rank of matrix approximation.
   * - maxIterations
     - Max Iterations
     - Maximum training iterations.
   * - splitRatio
     - Split Ratio
     - Split ratio.
   * - seed
     - Seed
     - Seed for the pseudo-random number generator (if applicable).
   * - useAllFactorLevels
     - Use All Factor Levels
     - Whether the first factor level is included in each categorical expansion.
   * - computeMetrics
     - Compute Metrics
     - Whether to compute metrics on the training data.
   * - imputeMissing
     - Impute Missing
     - Whether to impute missing entries with the column mean.
   * - ignoreConstCols
     - Ignore Const Columns
     - Ignore constant columns.
   * - scoreEachIteration
     - Score Each Iteration
     - Whether to score during each iteration of model training.
   * - maxRuntimeSecs
     - Max Runtime Secs
     - Maximum allowed runtime in seconds for model training. Use 0 to disable the limit.
   * - columnsToCategorical
     - Columns to Categorical
     - Columns to be encoded as categorical.

Details
-------

Principal Components Analysis (PCA) is closely related to Principal Components Regression. The algorithm is carried out on a set of possibly collinear features and performs a transformation to produce a new set of uncorrelated features.

More details are available at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html
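
The transformation described in the Details section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not H2O's implementation: it runs power iteration with deflation on the sample covariance matrix (in the spirit of the Power option of pcaMethod), and the helper names ``center``, ``covariance``, ``mat_vec``, ``power_iteration`` and ``pca`` are hypothetical, as is the synthetic data.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(mat, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in mat]


def power_iteration(mat, iters=500, seed=0):
    """Approximate the dominant eigenpair of a symmetric PSD matrix."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in mat]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(mat, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(mat, v)))
    return eigenvalue, v


def pca(rows, k):
    """Return (components, variances, scores) for the top k components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, variances = [], []
    for _ in range(k):
        lam, v = power_iteration(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the direction just found so the next run
        # converges to the next-largest component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    # Project the centered data onto the components to get the new features.
    scores = [[sum(a * b for a, b in zip(r, c)) for c in components]
              for r in centered]
    return components, variances, scores


# Two strongly collinear features: the second is roughly twice the first.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    data.append([x, 2.0 * x + rng.gauss(0.0, 0.1)])

components, variances, scores = pca(data, k=2)
print("covariance of the input features:", covariance(center(data))[0][1])
print("covariance of the new features:", covariance(center(scores))[0][1])
print("variance along each component:", variances)
```

The input columns are strongly correlated, while the covariance between the two projected columns is zero up to floating-point error, and nearly all of the variance falls on the first component. Keeping only that component (k = 1) is the dimensionality-reduction use mentioned at the top of this page.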