Regex Tokenizer¶
This node creates a new DataFrame by splitting text (such as a sentence) into individual terms (usually words) using a regular expression.
Type¶
transform
Class¶
fire.nodes.etl.NodeRegexTokenizer
Fields¶
| Name | Title | Description |
|---|---|---|
| inputCol | Column | Input column for tokenizing |
| outputCol | Tokenized Column | New output column after tokenization |
| pattern | Pattern | The regex pattern used to match delimiters |
| gaps | Gaps | Indicates whether the regex splits on gaps |
Details¶
Regex Tokenizer Node¶
Overview:¶
The Regex Tokenizer node splits text data into tokens based on a regular expression pattern. This is useful for tasks like text preprocessing, natural language processing, and information extraction.
Input:¶
Column: The column containing the text data to be tokenized.
Tokenized Column: The name of the new column to store the tokenized text.
Pattern: The regular expression pattern to use for tokenization.
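Gaps: A flag indicating how the pattern is applied. When true, the pattern matches the gaps (delimiters) between tokens and the text is split on them; when false, the pattern matches the tokens themselves.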
Output:¶
The node creates a new column containing the tokenized text.
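The two modes of the Gaps flag can be illustrated with a minimal sketch in plain Python. This is not the node's implementation: `regex_tokenize` is a hypothetical helper built on the standard `re` module, and it assumes that empty tokens are discarded and that the pattern contains no capturing groups.

```python
import re

def regex_tokenize(text, pattern, gaps=True):
    """Hypothetical helper that mimics the two modes of the Gaps flag."""
    if gaps:
        # The pattern describes the delimiters: split the text on every match.
        candidates = re.split(pattern, text)
    else:
        # The pattern describes the tokens: collect every match.
        candidates = [m.group(0) for m in re.finditer(pattern, text)]
    # Assumption: empty strings are not useful tokens, so drop them.
    return [tok for tok in candidates if tok]

# Delimiter mode: split on a comma or semicolon plus optional whitespace.
print(regex_tokenize("red, green;blue", r"[,;]\s*", gaps=True))

# Token mode: match runs of lowercase letters directly.
print(regex_tokenize("red, green;blue", r"[a-z]+", gaps=False))
```

Both calls yield the same three tokens, which shows that the pattern must be written for the mode selected by Gaps: a delimiter pattern when the flag is on, a token pattern when it is off.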
Examples¶
Example:¶
Let’s assume we have a column named text containing the following text:
This is a sample text.
Configure the Node:
Column: text
Tokenized Column: tokens
Pattern: \w+ (matches runs of word characters)
Gaps: False (the pattern matches the tokens themselves, not the gaps between them)
Node Execution:
The node extracts each run of word characters as a token and creates a new column tokens containing the following (the trailing period is not a word character, so it is dropped):
This,is,a,sample,text
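As a rough cross-check of this result, the sample sentence can be tokenized with Python's standard `re` module. This sketch only mirrors the regular-expression logic and is not the node's own code; the variable names are illustrative. It also shows an alternative configuration, assumed here to be equivalent for this input, that splits on runs of non-word characters (`\W+`) instead of matching word characters.

```python
import re

text = "This is a sample text."

# Token-matching mode: each run of word characters becomes a token.
tokens_by_match = re.findall(r"\w+", text)

# Delimiter mode: split on runs of non-word characters and drop the
# empty string left behind by the trailing period.
tokens_by_split = [t for t in re.split(r"\W+", text) if t]

print(",".join(tokens_by_match))  # This,is,a,sample,text
print(tokens_by_match == tokens_by_split)  # True
```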