Tokenizer¶

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

Input¶

It takes in a DataFrame and transforms it to another DataFrame

Output¶

It adds a new column containing the results of tokenization of the input column, to the incoming DataFrame.

Type¶

ml-transformer

Class¶

fire.nodes.ml.NodeTokenizer

Fields¶

Name	Title	Description
inputCol	Input Column	Column containing text (such as sentence)
outputCol	Output Column	Output column name

Details¶

Tokenizer Node Details¶

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units are called tokens. The Tokenizer node accepts an string input and breaks the string into an array of tokens.

Input Parameters¶

OUTPUT STORAGE LEVEL : Keep this as DEFAULT.
INPUT COLUMN : Select the required string column for whom tokenization has to be done .
OUTPUT COLUMN : The name of the output tokenized column.

Examples¶

Tokenizer Node Example¶

Assume that we have a DataFrame with the column <b>strText<b>:

             strText             |
   ----------------------------- |
    Sparkflows is cool to learn. |


If we set Tokenizer's **INPUT COLUMN** to strText and **OUTPUT COLUMN** to tokens, after transformation we should get the following DataFrame:

                  tokens                       |
---------------------------------------------- |
WrappedArray(sparkflows, is, cool, to, learn.) |