String Indexer

StringIndexer encodes a string column of labels to a column of label indices

Input

It takes in a DataFrame and transforms it to another DataFrame

Output

It adds a new column containing the encoding of the string column of labels to a column of label indices, to the incoming DataFrame.

Type

ml-transformer

Class

fire.nodes.ml.NodeStringIndexer

Fields

Name

Title

Description

handleInvalid

Handle Invalid

Invalid entries to be skipped or thrown error

inputCols

Input Columns

Input columns for encoding

outputCols

Output Columns

Output columns

stringOrderType

String Order Type

Param for how to order labels of string column

Details

String Indexer Node Details

The String Indexer node encodes a string column of labels to a column of label indices. The indices are in [0, numLabels).

By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting STRING ORDER TYPE. Its default value is ‘frequencyDesc’. In case of equal frequency when under frequencyDesc/Asc, the strings are further sorted alphabetically

Input Parameters

  • OUTPUT STORAGE LEVEL : Keep this as DEFAULT.

  • HANDLE INVALID : Specifies how to handle invalid data (unseen or NULL values) in features and label column of string type. Options are ‘skip’ (filter out rows with invalid data), or error (throw an error).

  • VARIABLES : Allows multiple string columns to be selected for conversion.

  • Input Columns : Select the column which needs to be converted.

  • Output Columns : The name of the output converted column.

  • STRING ORDER TYPE : Specifies how to order labels of string column. (default = “frequencyDesc”)

  • Supported options are:

    “frequencyDesc”: descending order by label frequency (most frequent label assigned 0)

    “frequencyAsc”: ascending order by label frequency (least frequent label assigned 0)

    “alphabetDesc”: descending alphabetical order

    “alphabetAsc”: ascending alphabetical order

Examples

String Indexer Node Example

Assume that we have the following DataFrame with columns id and category:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.