Sklearn TF-IDF Vectorizer

Applies scikit-learn’s TfidfVectorizer to a text column, converting each text document into a TF-IDF feature vector and storing the vectors as an ARRAY column in the Spark DataFrame.

Input

Takes a DataFrame with at least one text column.

Output

Adds a new column containing TF-IDF vectors as ARRAY and passes the DataFrame to downstream nodes.

Type

transform

Class

fire.nodes.sklearn.preprocessing.NodeTFIDFVectorizerFitTransform

Fields

Name

Title

Description

columnToVectorize

Text Column to Vectorize

Name of the text column on which TF-IDF should be computed.

outputCol

Output Column Name

Name of the output column that will store TF-IDF vectors as ARRAY. If left empty, defaults to ‘tfidf_<columnToVectorize>’.

max_df

Max Document Frequency (max_df)

Ignore terms that appear in more than this proportion of documents. For example, 0.9 drops terms appearing in more than 90% of documents. In scikit-learn, an integer value is instead read as an absolute document count.

min_df

Min Document Frequency (min_df)

Ignore terms that appear in fewer than this number of documents. For example, 2 keeps only terms that appear in at least 2 documents. In scikit-learn, a float value is instead read as a proportion of documents.