Count Vectorizer¶
Extracts the vocabulary from a given collection of documents and generates a vector of token counts for each document.
Input¶
It takes in a DataFrame as input and transforms it to another DataFrame
Output¶
It adds a new column to the incoming DataFrame containing the vector of token counts in the input column, to generate the output DataFrame
Type¶
ml-transformer
Class¶
fire.nodes.ml.NodeCountVectorizer
Fields¶
Name |
Title |
Description |
|---|---|---|
inputCol |
Input Column |
Input column name |
outputCol |
Output Column |
Output column name |
vocabularySize |
Vocabulary Size |
Max size of the vocabulary. |
minDF |
Min DF |
Specifies the minimum number of different documents a term must appear in to be included in the vocabulary |
maxDF |
Max DF |
Specifies the maximum number of different documents a term could appear in to be included in the vocabulary |
minTF |
Min TF |
Filter to ignore rare words in a document |
binary |
Binary |
Binary toggle to control the output vector values. |
Details¶
Count Vectorizer Node Details¶
Machines cannot understand characters and words. So when dealing with text data we need to represent it in numbers to be understood by the machine. The Count Vectorizer Node is used to convert text to numerical data.
When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.
Input Parameters¶
OUTPUT STORAGE LEVEL : Keep this as DEFAULT.
INPUT COLUMN : Select the token column from the input schema .
OUTPUT COLUMN : The name of the output field that contains the transformed features.
VOCABULARY SIZE : The maximum size of the vocabulary. If this value is smaller than the total number of different terms, the vocabulary will contain the top terms ordered by term frequency across the corpus. Default is 3.
MIN DF : Specifies the minimum nonnegative number of different documents a term must appear in to be included in the vocabulary. Default is 1.
MAX DF : Specifies the maximum number of different documents a term could appear in to be included in the vocabulary. A term that appears more than the threshold will be ignored. Default (2^63) - 1’).
MIN TF : Filter to ignore rare words in a document. For each document, terms with frequency (or count) less than the given threshold are ignored. Default is 1.
BINARY : By setting ‘binary = True’, the node no more takes into consideration the frequency of the term/word. If it occurs it’s set to 1 otherwise 0. By default, binary is set to False.
Examples¶
Count Vectorizer Node Example¶
Assume that we have the following DataFrame with columns id and texts:
id | texts
----|------------------------------
0 | Array("a", "b", "c")
1 | Array("a", "b", "b", "c", "a")
each row in texts is a document of type Array[String]. Invoking the Count Vectorizer node produces the output column vector as shown below:
id | texts | vector
----|---------------------------------|--------------------------
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
Each vector represents the token counts of the document over the vocabulary.