ChiSq Selector =========== Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. Type --------- ml-transformer Class --------- fire.nodes.ml.NodeChiSqSelector Fields --------- .. list-table:: :widths: 10 5 10 :header-rows: 1 * - Name - Title - Description * - featuresCol - Features Column - The features column name * - outputCol - Output Column - The output column name * - labelCol - Label Column - The label column name * - numTopFeatures - NumTopFeatures - Number of features that selector will select (ordered by statistic value descending). * - percentile - Percentile - Percentile of features that selector will select, ordered by statistics value descending. Only applicable when selectorType is percentile * - fpr - FPR - The highest p-value for features to be kept. Only applicable when selectorType is fpr * - fdr - FDR - The upper bound of the expected false discovery rate. Only applicable when selectorType is fdr * - fwe - FWE - The upper bound of the expected family-wise error rate. Only applicable when selectorType is fwe * - selectorType - Selector Type - The selector type of the ChisqSelector. Details ------- ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. More details are available at : http://spark.apache.org/docs/latest/ml-features.html#chisqselector Examples ------- The below example is available at : http://spark.apache.org/docs/latest/ml-features.html#chisqselector +++++++++++++++ import org.apache.spark.ml.feature.ChiSqSelector import org.apache.spark.ml.linalg.Vectors val data = Seq( (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0), (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0), (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0) ) val df = spark.createDataset(data).toDF("id", "features", "clicked") val selector = new ChiSqSelector() .setNumTopFeatures(1) .setFeaturesCol("features") .setLabelCol("clicked") .setOutputCol("selectedFeatures") val result = selector.fit(df).transform(df) result.show()