Regex Tokenizer

This node creates a new DataFrame by taking text (such as a sentence) and breaking it into individual terms (usually words) based on a regular expression.

Type

transform

Class

fire.nodes.etl.NodeRegexTokenizer

Fields

Name	Title	Description
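inputCol	Column	Input column containing the text to tokenize
outputCol	Tokenized Column	New output column that holds the tokens
pattern	Pattern	The regex pattern used for tokenizing; it matches delimiters when gaps is true and tokens when gaps is false
gaps	Gaps	Indicates whether the regex splits on gaps (true) or matches tokens (false)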

Details

Regex Tokenizer Node

Overview:

The Regex Tokenizer node splits text data into tokens based on a regular expression pattern. This is useful for tasks like text preprocessing, natural language processing, and information extraction.

Input:

Column: The column containing the text data to be tokenized.

Tokenized Column: The name of the new column to store the tokenized text.

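Pattern: The regular expression used for tokenization. Depending on Gaps, it describes either the delimiters between tokens or the tokens themselves.

Gaps: A flag indicating how the pattern is applied. When true, the pattern matches the gaps (delimiters) and the text is split on them; when false, the pattern matches the tokens and every match is kept.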
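The interaction between Pattern and Gaps can be sketched in plain Python. This is a minimal illustration, not the node's implementation: it assumes the node follows the semantics of Spark ML's RegexTokenizer, where Gaps set to true means the pattern matches delimiters and Gaps set to false means the pattern matches the tokens. The helper regex_tokenize and its parameters are hypothetical names, and Python's re module differs in some details from the Java regex engine that Spark uses.

```python
import re


def regex_tokenize(text, pattern, gaps=True, min_token_length=1):
    """Hypothetical helper illustrating the assumed Pattern/Gaps behaviour."""
    if gaps:
        # The pattern describes the delimiters: split wherever it matches.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens: keep every match.
        tokens = [m.group(0) for m in re.finditer(pattern, text)]
    # Drop empty strings and tokens shorter than the minimum length.
    return [t for t in tokens if len(t) >= min_token_length]


sentence = "Hello, world! 42"

# Gaps = True: split on runs of whitespace, punctuation stays attached.
split_on_delims = regex_tokenize(sentence, r"\s+", gaps=True)

# Gaps = False: keep only runs of word characters.
match_tokens = regex_tokenize(sentence, r"\w+", gaps=False)

print(split_on_delims)
print(match_tokens)
```

With Gaps on, the whitespace pattern leaves punctuation attached to the tokens ("Hello," and "world!"); with Gaps off, the word-character pattern returns only "Hello", "world" and "42".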

Output:

The node outputs a DataFrame with a new column that contains, for each row, the list of tokens extracted from the input column.

Examples

Example:

Let’s assume we have a column named text containing the following text:

This is a sample text.

Configure the Node:

Column: text

Tokenized Column: tokens

Pattern: \w+ (matches runs of word characters)

Gaps: False (the pattern matches the tokens, not the delimiters between them)

Node Execution:

The node extracts each run of word characters as a token, drops the whitespace and the trailing period, and creates a new column tokens containing the following:

This,is,a,sample,text
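The same result can be reproduced outside the node with Python's standard re module. This is only a sketch under assumptions: it treats the pattern \w+ as matching tokens (Gaps off) and shows that splitting on the complementary pattern \W+ (Gaps on) gives the same tokens. The variable names are illustrative, and the node itself is not invoked.

```python
import re

text = "This is a sample text."

# Gaps = False: the pattern \w+ matches the tokens themselves.
tokens_by_match = re.findall(r"\w+", text)

# Gaps = True: the pattern \W+ matches the delimiters between tokens.
# Splitting leaves an empty string after the final period, so filter it out.
tokens_by_split = [t for t in re.split(r"\W+", text) if t]

print(",".join(tokens_by_match))
print(tokens_by_match == tokens_by_split)
```

Spark ML's RegexTokenizer lowercases tokens by default; this sketch keeps the original case so that it matches the output shown above.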