ChiSq Selector
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
Type
ml-transformer
Class
fire.nodes.ml.NodeChiSqSelector
Fields
| Name | Title | Description |
|---|---|---|
| featuresCol | Features Column | The features column name |
| outputCol | Output Column | The output column name |
| labelCol | Label Column | The label column name |
| numTopFeatures | NumTopFeatures | Number of features that the selector will select, ordered by statistic value descending. Only applicable when selectorType is numTopFeatures |
| percentile | Percentile | Percentile of features that the selector will select, ordered by statistic value descending. Only applicable when selectorType is percentile |
| fpr | FPR | The highest p-value for features to be kept. Only applicable when selectorType is fpr |
| fdr | FDR | The upper bound of the expected false discovery rate. Only applicable when selectorType is fdr |
| fwe | FWE | The upper bound of the expected family-wise error rate. Only applicable when selectorType is fwe |
| selectorType | Selector Type | The selector type of the ChiSqSelector: one of numTopFeatures, percentile, fpr, fdr, or fwe |
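
As a rough illustration of how the selector type interacts with the threshold fields, the sketch below configures Spark's `ChiSqSelector` in two different modes. It assumes that the node's fields map directly to the Spark parameters of the same name, which this page does not state explicitly; the column names and threshold values are placeholders.

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Assumption: "features", "clicked" and "selectedFeatures" are placeholder column names.

// selectorType = percentile: keep the top 50% of features.
// Only the percentile threshold is consulted in this mode.
val percentileSelector = new ChiSqSelector()
  .setSelectorType("percentile")
  .setPercentile(0.5)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

// selectorType = fpr: keep every feature whose p-value is below 0.05.
// Only the fpr threshold is consulted in this mode.
val fprSelector = new ChiSqSelector()
  .setSelectorType("fpr")
  .setFpr(0.05)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
```

Only the threshold that matches the chosen selector type takes effect; values set for the other thresholds are ignored.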
Details
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose.
More details are available at: http://spark.apache.org/docs/latest/ml-features.html#chisqselector
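
To make the Chi-Squared test of independence more concrete, here is a minimal plain-Scala sketch that computes the Pearson chi-squared statistic for one categorical feature against a categorical label. `ChiSqSketch` and `chiSqStatistic` are hypothetical names used only for illustration; they are not part of Spark or of this node, and the sketch stops at the statistic rather than deriving a p-value.

```scala
// Hypothetical helper for illustration only; not part of Spark's API.
object ChiSqSketch {

  // Pearson chi-squared statistic for one categorical feature versus a categorical label.
  def chiSqStatistic(feature: Seq[Double], label: Seq[Double]): Double = {
    require(feature.length == label.length, "feature and label must have the same length")
    val n = feature.length.toDouble

    // Observed count for every (feature value, label value) pair that occurs.
    val observed = feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size.toDouble }

    // Marginal counts for each feature value and each label value.
    val featureCounts = feature.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val labelCounts = label.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }

    // Sum (observed - expected)^2 / expected over every cell of the
    // contingency table, including cells with an observed count of zero.
    val cells = for {
      (f, fCount) <- featureCounts.toSeq
      (l, lCount) <- labelCounts.toSeq
    } yield {
      val expected = fCount * lCount / n
      val obs = observed.getOrElse((f, l), 0.0)
      math.pow(obs - expected, 2) / expected
    }
    cells.sum
  }
}
```

For example, `ChiSqSketch.chiSqStatistic(Seq(0.0, 0.0, 1.0, 1.0), Seq(0.0, 0.0, 1.0, 1.0))` gives 4.0 because the feature determines the label completely, while pairing the label with `Seq(0.0, 1.0, 0.0, 1.0)` gives 0.0 because the two are independent in that sample. A larger statistic indicates a stronger dependence between feature and label.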
Examples
The example below is available at: http://spark.apache.org/docs/latest/ml-features.html#chisqselector
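import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
// `spark` is an existing SparkSession; spark-shell imports its implicits automatically.
import spark.implicits._

// Each row: (id, feature vector, label).
val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
val df = spark.createDataset(data).toDF("id", "features", "clicked")

// Keep the single feature ranked highest by the chi-squared test against "clicked".
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

// Fit the selector, then append the "selectedFeatures" column.
val result = selector.fit(df).transform(df)
result.show()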