Regex Advanced
==============

Advanced regex operations for text processing, similar to the Alteryx Regex Tool, with auto-detection of capturing groups.

Input
--------------
Accepts a DataFrame as input from the previous node.

Output
--------------
Returns a DataFrame with extracted patterns, marked matches, replaced text, or tokenized data based on the selected regex mode

Type
--------- 

transform

Class
--------- 

fire.nodes.etl.NodeRegexAdvanced

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - general
        - General
        - 
      * - inputCol
        - Input Column
        - Column to apply regex operations on
      * - regexPattern
        - Regular Expression Pattern
        - Enter the regex pattern
      * - regexMode
        - Regex Mode
        - Select the regex operation mode
      * - caseSensitive
        - Case Sensitive
        - Enable case-sensitive pattern matching
      * - errorHandling
        - Error Handling
        - How to handle errors: FAIL (stop execution), SKIP (remove row), IGNORE
      * - replacementText
        - Replacement Text
        - Text to replace matched patterns with (REPLACE mode only)
      * - tokenSplit
        - Split to Columns
        - Number of columns to split the text into (TOKENIZE_COL mode only)
      * - inputMatchCol
        - Match Column Name
        - Enter Column Name for Match Status
      * - newColName
        - Target Column
        - Enter New Column Name
      * - regularexpression
        - Expression
        - Regular expression that extracts the data to place in this column
      * - schema
        - InferSchema
        - 
      * - outputColNames
        - Column Names of the Table
        - Output column names of the table
      * - outputColTypes
        - Column Types of the Table
        - Output Column Types of the Table
      * - outputColFormats
        - Column Formats
        - Output Column Formats


Details
-------
Regex Advanced Node
+++++++++++++++++++


Overview:
+++++++++++++++


The Regex Advanced node provides powerful text processing capabilities using regular expressions, similar to the Alteryx Regex Tool. It allows parsing, tokenizing, matching, and replacing text in a DataFrame column. Users can also auto-detect capturing groups for parsing operations and control case-sensitivity and error handling.


Input:
+++++++++++++++


* Input Column: The column from the input DataFrame on which regex operations will be applied.

* Regex Pattern: The regular expression pattern to extract, match, replace, or tokenize text.

* Regex Mode: Select the operation mode:

  * **PARSE** – Extract data into new columns based on capturing groups.
  * **TOKENIZE_COL** – Split text into multiple columns.
  * **TOKENIZE_ROW** – Split text into multiple rows.
  * **REPLACE** – Replace matched patterns with specified text.
  * **MATCH** – Create a column marking whether the pattern matches.

* Case Sensitivity: Specify whether pattern matching should be case-sensitive.

* Error Handling: Choose how errors should be handled – FAIL, SKIP, or IGNORE.
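
The effect of the Case Sensitivity option can be illustrated outside the node with Python's standard ``re`` module. This is only a sketch: ``compile_pattern`` is a hypothetical helper, and the node's own engine may map the option to its regex flags differently.

.. code-block:: python

    import re

    def compile_pattern(pattern: str, case_sensitive: bool) -> re.Pattern:
        # Hypothetical helper: translate a caseSensitive-style flag
        # into Python regex flags (not the node's actual implementation).
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.compile(pattern, flags)

    strict = compile_pattern(r"^[A-Z]{2}\d{3}$", case_sensitive=True)
    relaxed = compile_pattern(r"^[A-Z]{2}\d{3}$", case_sensitive=False)

    print(strict.search("ab123") is not None)   # False
    print(relaxed.search("ab123") is not None)  # True

With case-insensitive matching, a character class such as ``[A-Z]`` also accepts lowercase letters, so it is worth checking validation patterns against both settings.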


Output:
+++++++++++++++


Returns a transformed DataFrame with new columns or updated values based on the selected regex mode. Output may include:


* Parsed columns from capturing groups (PARSE mode).
* Tokenized columns or rows (TOKENIZE_COL/TOKENIZE_ROW).
* Replaced text in the input column (REPLACE mode).
* Match status column indicating success/failure (MATCH mode).


Advanced Options:
+++++++++++++++++


* Replacement Text: Text to replace matches (REPLACE mode).
* Split to Columns: Number of columns to split text into (TOKENIZE_COL mode).
* Match Column Name: Name of the column storing match status (MATCH mode).
* Target Column Names & Expressions (Parse tab): Map capturing groups to new column names with corresponding regex expressions.
* Infer Schema (Schema tab): Define output column names, types, and formats.
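
The difference between TOKENIZE_COL and TOKENIZE_ROW can be sketched in plain Python. The sketch assumes that the pattern describes the tokens themselves (the Alteryx convention) rather than the delimiters, and that surplus tokens are dropped and missing ones filled with ``None``; the node may behave differently on both points. The function names are hypothetical.

.. code-block:: python

    import re

    def tokenize_to_columns(text, pattern, num_columns):
        # Assumption: the pattern matches tokens, not delimiters.
        tokens = re.findall(pattern, text)[:num_columns]
        # Pad with None so every row yields the same number of columns.
        return tokens + [None] * (num_columns - len(tokens))

    def tokenize_to_rows(row_id, text, pattern):
        # One output row per token, repeating the row identifier.
        return [(row_id, token) for token in re.findall(pattern, text)]

    print(tokenize_to_columns("red,green,blue", r"[^,]+", 2))
    # ['red', 'green']
    print(tokenize_to_columns("red", r"[^,]+", 2))
    # ['red', None]
    print(tokenize_to_rows(1, "red,green,blue", r"[^,]+"))
    # [(1, 'red'), (1, 'green'), (1, 'blue')]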


Examples
--------
Regex Advanced Node Examples
++++++++++++++++++++++++++++


Example 1 – Parse Mode
++++++++++++++++++++++


**Input DataFrame:**

::

    | id | info                |
    | -- | ------------------- |
    | 1  | Name: John Age: 25  |
    | 2  | Name: Alice Age: 30 |
    | 3  | Name: Bob Age: 22   |


**Node Configuration:**

* Input Column: info
* Regex Pattern: ``Name:\s*(\w+)\s+Age:\s*(\d+)``
* Regex Mode: PARSE
* Target Column Names: ["name", "age"]
* Expressions: ``["(\w+)", "(\d+)"]``

**Output DataFrame:**

::

    | id | info                | name  | age |
    | -- | ------------------- | ----- | --- |
    | 1  | Name: John Age: 25  | John  | 25  |
    | 2  | Name: Alice Age: 30 | Alice | 30  |
    | 3  | Name: Bob Age: 22   | Bob   | 22  |
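
The same extraction can be reproduced with Python's ``re`` module to check a pattern before configuring the node. This is a minimal sketch, not the node's implementation: ``parse_row`` is a hypothetical helper, rows are plain dictionaries rather than a DataFrame, and filling unmatched rows with ``None`` is an assumption.

.. code-block:: python

    import re

    PATTERN = re.compile(r"Name:\s*(\w+)\s+Age:\s*(\d+)")
    TARGET_COLUMNS = ["name", "age"]

    rows = [
        {"id": 1, "info": "Name: John Age: 25"},
        {"id": 2, "info": "Name: Alice Age: 30"},
        {"id": 3, "info": "Name: Bob Age: 22"},
    ]

    def parse_row(row):
        # Map each capturing group, in order, to a target column.
        match = PATTERN.search(row["info"])
        # Assumption: rows that do not match get None in every new column.
        values = match.groups() if match else (None,) * len(TARGET_COLUMNS)
        return {**row, **dict(zip(TARGET_COLUMNS, values))}

    parsed = [parse_row(row) for row in rows]
    print(parsed[0])
    # {'id': 1, 'info': 'Name: John Age: 25', 'name': 'John', 'age': '25'}

Captured values are strings in this sketch; in the node, the output types are set through the Schema tab.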


Example 2 – Replace Mode
++++++++++++++++++++++++


**Input DataFrame:**

::

    | id | email                 |
    | -- | --------------------- |
    | 1  | john.doe@gmail.com    |
    | 2  | alice.smith@yahoo.com |


**Node Configuration:**

* Input Column: email
* Regex Pattern: ``@.*``
* Regex Mode: REPLACE
* Replacement Text: ``@example.com``

**Output DataFrame:**

::

    | id | email                   |
    | -- | ----------------------- |
    | 1  | john.doe@example.com    |
    | 2  | alice.smith@example.com |
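
The replacement above behaves like a regex substitution. As a rough equivalent in Python's ``re`` module (a sketch for checking the pattern, not the node's own code):

.. code-block:: python

    import re

    emails = ["john.doe@gmail.com", "alice.smith@yahoo.com"]

    # "@.*" matches from the first "@" to the end of the string,
    # so the whole domain part is swapped for the replacement text.
    replaced = [re.sub(r"@.*", "@example.com", email) for email in emails]

    print(replaced)
    # ['john.doe@example.com', 'alice.smith@example.com']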


Example 3 – Match Mode
++++++++++++++++++++++


**Input DataFrame:**

::

    | id | code  |
    | -- | ----- |
    | 1  | AB123 |
    | 2  | XY789 |
    | 3  | 1234  |


**Node Configuration:**

* Input Column: code
* Regex Pattern: ``^[A-Z]{2}\d{3}$``
* Regex Mode: MATCH
* Match Column Name: is_valid

**Output DataFrame:**

::

    | id | code  | is_valid |
    | -- | ----- | -------- |
    | 1  | AB123 | true     |
    | 2  | XY789 | true     |
    | 3  | 1234  | false    |
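
The match status in this example can be verified with a short Python check. It is a sketch under the assumption that the node reports a boolean per row; the ``is_valid`` list stands in for the Match Column.

.. code-block:: python

    import re

    PATTERN = re.compile(r"^[A-Z]{2}\d{3}$")
    codes = ["AB123", "XY789", "1234"]

    # The ^ and $ anchors force the whole value to be two uppercase
    # letters followed by three digits.
    is_valid = [PATTERN.search(code) is not None for code in codes]

    print(is_valid)
    # [True, True, False]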