PCA

Trains a model to project vectors to a low-dimensional space using PCA.

Input

This node takes a DataFrame as input.

Output

The output DataFrame contains the vectors of the incoming DataFrame projected onto a low-dimensional space using PCA.

Type

ml-transformer

Class

fire.nodes.ml.NodePCA

Fields

Name         Title            Description

inputCol     Input Column     The input column name

outputCol    Output Column    The output column name

k            K                The number of principal components

Details

Principal component analysis (PCA) is a statistical method that finds a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible among the directions orthogonal to the preceding ones.

The columns of the rotation matrix are called principal components.
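
To make the definition concrete, here is a minimal sketch in plain Scala (no Spark) that finds the principal components of a handful of 2-D points. It builds the sample covariance matrix and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix. The object name Pca2dSketch, its helper functions, and the sample points are illustrative assumptions, not part of this node or of Spark.

```scala
object Pca2dSketch {
  type Point = (Double, Double)

  // Entries (a, b, c) of the sample covariance matrix [[a, b], [b, c]].
  def covariance(points: Seq[Point]): (Double, Double, Double) = {
    val n = points.size
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    val dx = points.map(_._1 - meanX)
    val dy = points.map(_._2 - meanY)
    val a = dx.map(d => d * d).sum / (n - 1)
    val b = dx.zip(dy).map { case (x, y) => x * y }.sum / (n - 1)
    val c = dy.map(d => d * d).sum / (n - 1)
    (a, b, c)
  }

  // Returns (largest variance, unit vector of the first component, smallest variance).
  def principalComponents(points: Seq[Point]): (Double, Point, Double) = {
    val (a, b, c) = covariance(points)
    // Eigenvalues of a symmetric 2x2 matrix: mid +/- radius.
    val mid = (a + c) / 2
    val radius = math.sqrt(math.pow((a - c) / 2, 2) + b * b)
    val (l1, l2) = (mid + radius, mid - radius)
    // Eigenvector for l1; when b == 0 the matrix is already diagonal.
    val (vx, vy) =
      if (b != 0.0) (l1 - c, b)
      else if (a >= c) (1.0, 0.0)
      else (0.0, 1.0)
    val norm = math.hypot(vx, vy)
    (l1, (vx / norm, vy / norm), l2)
  }

  def main(args: Array[String]): Unit = {
    val points = Seq((2.0, 0.0), (0.0, 2.0), (3.0, 3.0), (-1.0, -1.0))
    val (l1, (vx, vy), l2) = principalComponents(points)
    println(f"first component:  ($vx%.4f, $vy%.4f), variance $l1%.4f")
    // In 2-D the second component is the first one rotated by 90 degrees.
    println(f"second component: (${-vy}%.4f, $vx%.4f), variance $l2%.4f")
  }
}
```

For these four points the first component lies along the diagonal (1, 1)/sqrt(2) and carries a variance of 16/3, while the orthogonal direction carries 4/3. Keeping only the first coordinate is what choosing k = 1 would amount to in this toy case.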

More details are available in the Spark MLlib documentation: https://spark.apache.org/docs/2.0.0/mllib-dimensionality-reduction.html#principal-component-analysis-pca

Examples

The example below is taken from the Spark MLlib documentation: https://spark.apache.org/docs/2.0.0/mllib-dimensionality-reduction.html#principal-component-analysis-pca

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// `sc` is the SparkContext (available by default in spark-shell).
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))

val dataRDD = sc.parallelize(data, 2)

val mat: RowMatrix = new RowMatrix(dataRDD)

// Compute the top 4 principal components.
// Principal components are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(4)

// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
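
The example above uses the RDD-based API (org.apache.spark.mllib). Since this node works on DataFrames, a DataFrame-based sketch may be closer to how its fields are used. The following assumes that inputCol, outputCol and k correspond to the setters of the same name on org.apache.spark.ml.feature.PCA; the node's internals are not shown here, so treat that mapping as an assumption. The object name PcaDataFrameSketch, the helpers sampleData and project, and the column names "features" and "pcaFeatures" are illustrative only.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object PcaDataFrameSketch {
  // Hypothetical helper: a one-column DataFrame holding 5-dimensional vectors.
  def sampleData(spark: SparkSession): DataFrame = {
    val data = Array(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
    spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  }

  // Hypothetical helper: fits a PCA model on inputCol and appends the
  // k-dimensional projection as outputCol (assumed mapping of the node's fields).
  def project(df: DataFrame, inputCol: String, outputCol: String, k: Int): DataFrame = {
    val model = new PCA()
      .setInputCol(inputCol)
      .setOutputCol(outputCol)
      .setK(k)
      .fit(df)
    model.transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PcaDataFrameSketch")
      .master("local[*]")
      .getOrCreate()

    val result = project(sampleData(spark), "features", "pcaFeatures", 3)
    result.select("pcaFeatures").show(false)

    spark.stop()
  }
}
```

The input column must hold vectors, and k cannot exceed the dimension of those vectors (5 in this sample data).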