PCA

Trains a model to project vectors to a low-dimensional space using PCA.

Input

This node takes a DataFrame as input.

Output

The output DataFrame contains the projection of the vectors in the incoming DataFrame onto a low-dimensional space computed by PCA.

Type

ml-transformer

Class

fire.nodes.ml.NodePCA

Fields

Name	Title	Description
inputCol	Input Column	The input column name
outputCol	Output Column	The output column name
k	K	The number of principal components
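The three fields above have the same names as the parameters of Spark ML's `PCA` estimator. The following is a minimal sketch of how they might be used with that estimator directly. It assumes that this node wraps `org.apache.spark.ml.feature.PCA`, which this page does not state; the object name `PcaFieldsSketch`, the column names, and the sample data are illustrative only.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the node's own API.
object PcaFieldsSketch {

  // Fits a PCA model on a small sample DataFrame and returns the projection.
  def project(spark: SparkSession, k: Int): DataFrame = {
    val data = Seq(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    // One vector column named "features" (assumed value for inputCol).
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    val pca = new PCA()
      .setInputCol("features")     // inputCol
      .setOutputCol("pcaFeatures") // outputCol (assumed name)
      .setK(k)                     // k: number of principal components

    // fit() learns the components; transform() appends the projected column.
    pca.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PcaFieldsSketch")
      .master("local[*]")
      .getOrCreate()
    project(spark, 3).select("pcaFeatures").show(false)
    spark.stop()
  }
}
```

Each vector in the output column has `k` entries, so `k` must not exceed the dimension of the input vectors (5 in this sample).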

Details

Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible.

The columns of the rotation matrix are called principal components.
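To make the idea of a variance-maximizing rotation concrete, here is a small sketch in plain Scala (no Spark) for two-dimensional points. It is an illustration under simplifying assumptions, not how Spark computes PCA: for a 2x2 covariance matrix with entries a, b, b, c, the direction of largest variance has angle 0.5 * atan2(2b, a - c), so no general eigensolver is needed. The object name `PcaRotation2D` and the sample points are made up for this example.

```scala
object PcaRotation2D {
  type Point = (Double, Double)

  // Sample variance (divides by n - 1).
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
  }

  // Sample covariance of two equally long sequences.
  def covariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.size - 1)
  }

  // Angle of the direction along which the points have the largest variance.
  def principalAngle(points: Seq[Point]): Double = {
    val xs = points.map(_._1)
    val ys = points.map(_._2)
    val a = variance(xs)
    val b = covariance(xs, ys)
    val c = variance(ys)
    0.5 * math.atan2(2 * b, a - c)
  }

  // Rotates the points so that the first new coordinate lies along theta.
  def rotate(points: Seq[Point], theta: Double): Seq[Point] = {
    val (cos, sin) = (math.cos(theta), math.sin(theta))
    points.map { case (x, y) => (x * cos + y * sin, -x * sin + y * cos) }
  }

  def main(args: Array[String]): Unit = {
    val points = Seq((1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8))
    val rotated = rotate(points, principalAngle(points))
    println(f"variance along first coordinate:  ${variance(rotated.map(_._1))}%.4f")
    println(f"variance along second coordinate: ${variance(rotated.map(_._2))}%.4f")
  }
}
```

After the rotation, the first coordinate carries at least as much variance as either original axis, the two new coordinates are uncorrelated, and the total variance is unchanged. The two unit vectors that define the rotation play the role of the principal components.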

More details are available in the Spark MLlib/ML documentation: https://spark.apache.org/docs/2.0.0/mllib-dimensionality-reduction.html#principal-component-analysis-pca

Examples

The example below is available at: https://spark.apache.org/docs/2.0.0/mllib-dimensionality-reduction.html#principal-component-analysis-pca

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumes an existing SparkContext named sc, as provided by spark-shell.
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))

val dataRDD = sc.parallelize(data, 2)

val mat: RowMatrix = new RowMatrix(dataRDD)

// Compute the top 4 principal components.
// Principal components are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(4)

// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)