Sklearn TF-IDF Vectorizer

Applies scikit-learn’s TfidfVectorizer to a text column. Converts text documents into TF-IDF feature vectors and stores them as an ARRAY column in the Spark DataFrame.

Input

Takes a DataFrame with at least one text column.

Output

Adds a new column containing the TF-IDF vectors as an ARRAY and passes the resulting DataFrame to downstream nodes.
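The following is a minimal sketch of the transformation described above, using scikit-learn directly on plain Python rows rather than a Spark DataFrame. The sample rows and the column name `review` are made up for illustration, and the node's actual implementation may differ; only `TfidfVectorizer` and the `tfidf_<columnToVectorize>` default naming come from this page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input rows; "review" stands in for the configured text column.
rows = [
    {"review": "good movie"},
    {"review": "bad movie"},
    {"review": "good plot good cast"},
]
column_to_vectorize = "review"
# Default output name when outputCol is left empty.
output_col = f"tfidf_{column_to_vectorize}"

# Fit on the whole column and transform it in one step.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([row[column_to_vectorize] for row in rows])

# Convert the sparse matrix to dense lists so each row carries one array of floats.
for row, vector in zip(rows, matrix.toarray().tolist()):
    row[output_col] = vector
```

Each row ends up with one array whose length equals the size of the learned vocabulary, so every vector in the output column has the same length.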

Type

transform

Class

fire.nodes.sklearn.preprocessing.NodeTFIDFVectorizerFitTransform

Fields

Name	Title	Description
columnToVectorize	Text Column to Vectorize	Name of the text column on which TF-IDF should be computed.
outputCol	Output Column Name	Name of the output column that will store TF-IDF vectors as ARRAY. If left empty, defaults to ‘tfidf_<columnToVectorize>’.
max_df	Max Document Frequency (max_df)	Ignore terms that appear in more than this proportion of documents. For example, 0.9 drops terms appearing in more than 90% of documents.
min_df	Min Document Frequency (min_df)	Ignore terms that appear in fewer than this number of documents. For example, 2 keeps only terms that appear in at least 2 documents.
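To illustrate how max_df and min_df prune the vocabulary, here is a small sketch that calls scikit-learn's `TfidfVectorizer` directly with the example values from the table. The four sample documents are invented for illustration, and the sketch assumes the node passes these values to scikit-learn unchanged.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents: "the" appears in all four, "spark", "job" and
# "cluster" in two each, and every other term in only one.
docs = [
    "the spark job failed",
    "the spark cluster restarted",
    "the job finished",
    "the cluster is idle",
]

# max_df=0.9 drops terms found in more than 90% of documents ("the");
# min_df=2 drops terms found in fewer than 2 documents.
vectorizer = TfidfVectorizer(max_df=0.9, min_df=2)
matrix = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # ['cluster', 'job', 'spark']
print(matrix.shape)                    # (4, 3)
```

Tightening either threshold shrinks the vocabulary and therefore the length of each output vector.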