Sklearn TF-IDF Vectorizer
Applies scikit-learn’s TfidfVectorizer to a text column. Converts text documents into TF-IDF feature vectors and stores them as an ARRAY column in the Spark DataFrame.
Input
Takes a DataFrame with at least one text column.
Output
Adds a new column containing TF-IDF vectors as ARRAY and passes the DataFrame to downstream nodes.
Type
transform
Class
fire.nodes.sklearn.preprocessing.NodeTFIDFVectorizerFitTransform
Fields
| Name | Title | Description |
|---|---|---|
| columnToVectorize | Text Column to Vectorize | Name of the text column on which TF-IDF should be computed. |
| outputCol | Output Column Name | Name of the output column that will store TF-IDF vectors as ARRAY. If left empty, defaults to ‘tfidf_<columnToVectorize>’. |
| max_df | Max Document Frequency (max_df) | Ignore terms that appear in more than this proportion of documents. For example, 0.9 drops terms appearing in more than 90% of documents. |
| min_df | Min Document Frequency (min_df) | Ignore terms that appear in fewer than this number of documents. For example, 2 keeps only terms that appear in at least 2 documents. |
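
To make the effect of `max_df` and `min_df` concrete, here is a small pure-Python sketch of document-frequency filtering. It is an illustration only, not the library's implementation: the `filter_vocabulary` helper is hypothetical, and it splits on whitespace, which is simpler than scikit-learn's default tokenizer.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.9, min_df=2):
    """Return the sorted terms kept after document-frequency filtering."""
    n_docs = len(docs)
    # For each term, count the number of documents that contain it at least once.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    # max_df is a proportion, so convert it to an absolute document count.
    max_count = max_df * n_docs
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_count
    )

docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]
kept = filter_vocabulary(docs, max_df=0.9, min_df=2)
print(kept)  # ['cat', 'flew', 'sat']
```

Here "the" appears in all 4 documents, which exceeds 0.9 × 4 = 3.6, so `max_df` drops it; "dog" and "bird" each appear in only 1 document, so `min_df` drops them.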