Regex Tokenizer

This node creates a new DataFrame by splitting text (such as a sentence) into individual terms (usually words) using a regular expression.

Type

transform

Class

fire.nodes.etl.NodeRegexTokenizer

Fields

Name        Title              Description

inputCol    Column             Input column to tokenize

outputCol   Tokenized Column   New output column created by tokenization

pattern     Pattern            The regex pattern, matching either delimiters or tokens depending on gaps

gaps        Gaps               Indicates whether the regex splits on gaps (true) or matches tokens (false)

Details

Regex Tokenizer Node

Overview:

The Regex Tokenizer node splits text data into tokens based on a regular expression pattern. This is useful for tasks like text preprocessing, natural language processing, and information extraction.

Input:

Column: The column containing the text data to be tokenized.

Tokenized Column: The name of the new column to store the tokenized text.

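Pattern: The regular expression used for tokenization. It describes the delimiters when Gaps is true and the tokens themselves when Gaps is false.

Gaps: A flag indicating whether the pattern matches the gaps between tokens (true) or the tokens themselves (false).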

Output:

The node adds a new column containing the tokens extracted from each row of the input column.
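
The effect of the Gaps flag on how Pattern is interpreted can be sketched in a few lines of plain Python. This is only an illustration, not the node's actual code: the function name regex_tokenize is hypothetical, Python's re module stands in for the regex engine the node uses, and the behaviour is assumed to follow the description of the gaps field above.

```python
import re


def regex_tokenize(text, pattern, gaps=True):
    """Hypothetical stand-in for the Pattern/Gaps behaviour (not the node's code)."""
    if gaps:
        # The pattern describes the delimiters: split on every match
        # and drop the empty strings left at the edges.
        # (Use patterns without capturing groups here, since re.split
        # would otherwise return the captured delimiters as well.)
        return [tok for tok in re.split(pattern, text) if tok]
    # The pattern describes the tokens: keep every non-overlapping match.
    return [m.group(0) for m in re.finditer(pattern, text)]


# The same input tokenized both ways.
print(regex_tokenize("a,b;;c", r"[,;]+", gaps=True))   # delimiters: runs of , or ;
print(regex_tokenize("a,b;;c", r"[a-z]+", gaps=False))  # tokens: runs of letters
```

Both calls yield the same three tokens, which shows that a delimiter pattern with Gaps enabled and a token pattern with Gaps disabled are two ways of expressing the same split.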

Examples

Example:

Let’s assume we have a column named text containing the following text:

This is a sample text.

Configure the Node:

Column: text

Tokenized Column: tokens

Pattern: \w+ (matches runs of word characters)

Gaps: False (the pattern matches the tokens, not the delimiters between them)

Node Execution:

The node extracts each run of word characters as a token and creates a new column tokens containing the following:

This,is,a,sample,text
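
The expected tokens can be checked outside the node with Python's re module. The sketch below is an approximation under the assumption that Python's regex behaviour matches the node's for this simple input. It shows two configurations that produce these tokens: \w+ used as a token pattern (Gaps false) and \W+ used as a delimiter pattern (Gaps true).

```python
import re

text = "This is a sample text."

# Gaps = False: the pattern \w+ matches the tokens themselves.
tokens_by_match = re.findall(r"\w+", text)

# Gaps = True: the pattern \W+ matches the delimiters between tokens.
# Splitting leaves an empty string after the trailing ".", which is dropped.
tokens_by_split = [tok for tok in re.split(r"\W+", text) if tok]

print(",".join(tokens_by_match))
print(",".join(tokens_by_split))
```

Note that the final period is not a word character, so it is discarded in both configurations.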