ChiSq Selector

Performs Chi-Squared feature selection: selects the categorical features to use for predicting a categorical label.

Type

ml-transformer

Class

fire.nodes.ml.NodeChiSqSelector

Fields

Name	Title	Description
featuresCol	Features Column	The features column name.
outputCol	Output Column	The output column name.
labelCol	Label Column	The label column name.
numTopFeatures	NumTopFeatures	Number of features that the selector will select, ordered by statistic value descending. Only applicable when selectorType is numTopFeatures.
percentile	Percentile	Percentile (a fraction between 0 and 1) of features that the selector will select, ordered by statistic value descending. Only applicable when selectorType is percentile.
fpr	FPR	The highest p-value for features to be kept. Only applicable when selectorType is fpr.
fdr	FDR	The upper bound of the expected false discovery rate. Only applicable when selectorType is fdr.
fwe	FWE	The upper bound of the expected family-wise error rate. Only applicable when selectorType is fwe.
selectorType	Selector Type	The selector type of the ChiSqSelector. Supported values: numTopFeatures, percentile, fpr, fdr, fwe.
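
As a rough illustration of how selectorType pairs with its matching parameter, the sketch below configures the underlying Spark ChiSqSelector in two ways. It assumes a spark-shell session with Spark ML on the classpath; the column names and threshold values are placeholders chosen for illustration, not defaults of this node.

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Keep features according to the fpr threshold (highest p-value to be kept).
// Column names and the 0.05 threshold are assumed for illustration only.
val fprSelector = new ChiSqSelector()
  .setSelectorType("fpr")
  .setFpr(0.05)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

// Alternatively, keep the top 25% of features ranked by chi-squared statistic.
val percentileSelector = new ChiSqSelector()
  .setSelectorType("percentile")
  .setPercentile(0.25)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
```

Only the parameter that matches the chosen selectorType takes effect; the other threshold parameters are ignored.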

Details

ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose.
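
To make the test of independence concrete, here is a minimal plain-Scala sketch of Pearson's chi-squared statistic for a single categorical feature against a categorical label. It is not Spark's implementation; the object name ChiSqSketch and the method chiSquared are hypothetical names introduced for illustration.

```scala
// Illustrative sketch only, not Spark's implementation.
// ChiSqSketch and chiSquared are hypothetical names.
object ChiSqSketch {

  /** Pearson's chi-squared statistic from paired (feature, label) observations. */
  def chiSquared(pairs: Seq[(Double, Double)]): Double = {
    val n = pairs.size.toDouble
    // Marginal counts per feature value and per label value.
    val featureCounts = pairs.groupBy(_._1).map { case (f, ps) => f -> ps.size.toDouble }
    val labelCounts = pairs.groupBy(_._2).map { case (l, ps) => l -> ps.size.toDouble }
    // Observed count of each (feature, label) combination.
    val observed = pairs.groupBy(identity).map { case (k, ps) => k -> ps.size.toDouble }

    // Sum (observed - expected)^2 / expected over every cell of the contingency table.
    (for {
      (f, fc) <- featureCounts.toSeq
      (l, lc) <- labelCounts.toSeq
    } yield {
      val expected = fc * lc / n
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }).sum
  }
}
```

A feature that fully determines the label, such as Seq((0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)), gives a statistic of 4.0, while a feature spread evenly across labels, such as Seq((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)), gives 0.0. A larger statistic indicates stronger dependence between the feature and the label, which is the ordering the selector fields refer to.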

More details are available at: http://spark.apache.org/docs/latest/ml-features.html#chisqselector

Examples

The example below is taken from: http://spark.apache.org/docs/latest/ml-features.html#chisqselector

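import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
// Assumes an active SparkSession named spark (as in spark-shell); needed by createDataset.
import spark.implicits._

// Each row: (id, feature vector, label)
val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

// Keep only the single top-ranked feature by chi-squared statistic.
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

// Fit the selector, then append the selectedFeatures column to the input.
val result = selector.fit(df).transform(df)
result.show()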