String Indexer¶

StringIndexer encodes a string column of labels as a column of label indices.

Input¶

It takes a DataFrame as input and transforms it into another DataFrame.

Output¶

For each selected string column, it adds a new column of label indices to the incoming DataFrame.

Type¶

ml-transformer

Class¶

fire.nodes.ml.NodeStringIndexer

Fields¶

Name	Title	Description
handleInvalid	Handle Invalid	How to handle invalid entries: skip them or throw an error
inputCols	Input Columns	Input columns for encoding
outputCols	Output Columns	Output columns
stringOrderType	String Order Type	How to order the labels of a string column

Details¶

String Indexer Node Details¶

The String Indexer node encodes a string column of labels to a column of label indices. The indices are in [0, numLabels).

By default, labels are ordered by frequency, so the most frequent label gets index 0. The ordering is controlled by STRING ORDER TYPE, whose default value is ‘frequencyDesc’. When two labels have equal frequency under frequencyDesc or frequencyAsc, they are further sorted alphabetically.
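The ordering rules above can be sketched in plain Python. This is only an illustration of the logic, not the node's or Spark's actual implementation; the function name `build_label_index` is hypothetical, and the sketch assumes that ties under both frequency orderings are broken in ascending alphabetical order.

```python
from collections import Counter


def build_label_index(values, order_type="frequencyDesc"):
    """Return a dict mapping each distinct label to its index (as a float).

    Hypothetical helper for illustration; it mimics the ordering rules
    described in the text, not the real StringIndexer code.
    """
    counts = Counter(values)
    if order_type == "frequencyDesc":
        # Most frequent label first; equal counts fall back to alphabetical order.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
    elif order_type == "frequencyAsc":
        # Least frequent label first; equal counts fall back to alphabetical order.
        ordered = sorted(counts, key=lambda label: (counts[label], label))
    elif order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported order type: {order_type}")
    # Indices run from 0 up to (but not including) the number of labels.
    return {label: float(i) for i, label in enumerate(ordered)}


# "x" and "y" both occur twice, so the tie is broken alphabetically.
print(build_label_index(["x", "y", "y", "x", "z"]))
```

With the default order type, "x" and "y" tie on frequency, so "x" is placed first and "z", the least frequent label, gets the highest index.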

Input Parameters¶

OUTPUT STORAGE LEVEL : Keep this as DEFAULT.
HANDLE INVALID : Specifies how to handle invalid data (unseen or NULL values) in string-type feature and label columns. Options are ‘skip’ (filter out rows with invalid data) or ‘error’ (throw an error).
VARIABLES : Allows multiple string columns to be selected for conversion.
Input Columns : Select the column to be converted.
Output Columns : The name of the converted output column.
STRING ORDER TYPE : Specifies how to order the labels of a string column (default = “frequencyDesc”).
Supported options are:

“frequencyDesc”: descending order by label frequency (most frequent label assigned 0)

“frequencyAsc”: ascending order by label frequency (least frequent label assigned 0)

“alphabetDesc”: descending alphabetical order

“alphabetAsc”: ascending alphabetical order
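To make the two HANDLE INVALID options concrete, here is a minimal plain-Python sketch. It is not the node's implementation: the function `apply_index` and the sample mapping are hypothetical, and "invalid" is taken to mean a value that is NULL (`None`) or was not seen when the label-to-index mapping was built.

```python
def apply_index(values, label_index, handle_invalid="error"):
    """Map each value to its index, treating unseen or None values as invalid.

    Hypothetical helper for illustration only.
    """
    if handle_invalid not in ("skip", "error"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid}")
    rows = []
    for value in values:
        if value in label_index:
            rows.append((value, label_index[value]))
        elif handle_invalid == "skip":
            # 'skip': the row with invalid data is filtered out.
            continue
        else:
            # 'error': the first invalid value aborts the transformation.
            raise ValueError(f"unseen or null label: {value!r}")
    return rows


# Mapping assumed to have been learned from earlier data.
label_index = {"a": 0.0, "c": 1.0, "b": 2.0}

# "d" was never seen and None stands in for NULL, so both rows are dropped.
print(apply_index(["a", "d", None, "b"], label_index, handle_invalid="skip"))
```

Calling the same function with `handle_invalid="error"` on this input raises a `ValueError` at "d" instead of returning a filtered result.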

Examples¶

String Indexer Node Example¶

Assume that we have the following DataFrame with columns id and category:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.
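The arithmetic behind this example can be checked with a few lines of plain Python. This is a sketch of the default frequencyDesc rule applied to the category values above, not a call into the node itself; the variable names are illustrative.

```python
from collections import Counter

# The category column from the example, in row order.
categories = ["a", "b", "c", "a", "a", "c"]

# Count each label: "a" occurs 3 times, "c" twice, "b" once.
counts = Counter(categories)

# Default ordering: descending frequency, with alphabetical order as tie-breaker.
ordered = sorted(counts, key=lambda label: (-counts[label], label))
label_index = {label: float(i) for i, label in enumerate(ordered)}

# Pair every row's category with its index, as in the categoryIndex column.
indexed = [(category, label_index[category]) for category in categories]
for category, index in indexed:
    print(category, index)
```

The resulting mapping is "a" to 0.0, "c" to 1.0, and "b" to 2.0, which matches the categoryIndex column in the table.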