Regex Tokenizer
===============

This node creates a new DataFrame by splitting the text in a column (such as a sentence) into individual terms (usually words) using a regular expression.

Type
--------- 

transform

Class
--------- 

fire.nodes.etl.NodeRegexTokenizer

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
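      * - inputCol
        - Column
        - Input column containing the text to tokenize
      * - outputCol
        - Tokenized Column
        - Name of the new output column that stores the tokens
      * - pattern
        - Pattern
        - The regex pattern that matches delimiters when gaps is true, or tokens when gaps is false
      * - gaps
        - Gaps
        - Whether the pattern splits on gaps between tokens (true) or matches the tokens themselves (false)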


Details
-------
Regex Tokenizer Node
++++++++++++++++++++


Overview:
+++++++++++++++


The Regex Tokenizer node splits text data into tokens based on a regular expression pattern. This is useful for tasks like text preprocessing, natural language processing, and information extraction.


Input:
+++++++++++++++


Column: The column containing the text data to be tokenized.

Tokenized Column: The name of the new column to store the tokenized text.

Pattern: The regular expression pattern to use for tokenization.

Gaps: A flag indicating whether the pattern matches the gaps between tokens (true) or the tokens themselves (false).
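
The interplay between Pattern and Gaps can be sketched in plain Python. This is only an emulation under stated assumptions, not the node's implementation: ``regex_tokenize`` and ``min_token_length`` are hypothetical names, and the behavior is modeled on Spark ML's ``RegexTokenizer``, which this node's field names suggest it wraps. With gaps enabled the pattern describes separators; with gaps disabled it describes the tokens themselves.

```python
import re


def regex_tokenize(text, pattern, gaps=True, min_token_length=1):
    """Hypothetical helper emulating the Pattern/Gaps semantics (not the node's code)."""
    if gaps:
        # The pattern describes the separators: split the text on every match.
        # Note: avoid capturing groups here, since re.split would return them too.
        pieces = re.split(pattern, text)
    else:
        # The pattern describes the tokens: keep every full match.
        pieces = [m.group(0) for m in re.finditer(pattern, text)]
    # Drop empty strings and anything shorter than the minimum length.
    return [p for p in pieces if len(p) >= min_token_length]


sample = "Fire Insights, version 3"

# Gaps = True: whitespace is the delimiter, punctuation stays attached.
split_on_gaps = regex_tokenize(sample, r"\s+", gaps=True)

# Gaps = False: only runs of letters are kept as tokens.
match_tokens = regex_tokenize(sample, r"[A-Za-z]+", gaps=False)

print(split_on_gaps)
print(match_tokens)
```

The same pattern therefore produces very different tokens depending on the flag, so the two settings should always be chosen together.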


Output:
+++++++++++++++


The node adds a new column, named by Tokenized Column, containing the tokens extracted from each row of the input column.


Examples
--------
Example:
+++++++++++++++


Let's assume we have a column named text containing the following text:


This is a sample text.

Configure the Node:


Column: text

Tokenized Column: tokens

Pattern: ``\w+`` (matches runs of word characters)

Gaps: False (the pattern matches the tokens themselves, not the gaps between them)

Node Execution:


The node will extract every run of word characters from the text column and create a new column tokens containing the following:


This,is,a,sample,text
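
As a rough check of this example, the tokens can be reproduced with Python's ``re`` module. This is a sketch of the matching logic only, not the node itself. It also shows that splitting on the complementary pattern ``\W+`` with gaps enabled yields the same tokens, while splitting on ``\w+`` would return the separators instead. One assumption to note: Spark ML's ``RegexTokenizer``, which this node appears to wrap, lowercases text by default; the sketch keeps the original case to match the output listed above.

```python
import re

text = "This is a sample text."

# Gaps = False: the pattern \w+ matches the tokens themselves.
tokens_matched = re.findall(r"\w+", text)

# Gaps = True: the pattern \W+ matches the separators between tokens.
# The empty string left after the final "." is dropped.
tokens_split = [t for t in re.split(r"\W+", text) if t]

# Gaps = True with \w+ splits on the words and keeps only the separators.
separators = [t for t in re.split(r"\w+", text) if t]

print(tokens_matched)
print(tokens_split)
print(separators)
```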