Hashing TF¶

Maps a sequence of terms to term frequencies using the hashing trick.

Input¶

It takes a DataFrame as input and transforms it into another DataFrame.

Output¶

A new column is added to the input DataFrame. It contains the feature vector produced by hashing the bag of words.

Type¶

ml-transformer

Class¶

fire.nodes.ml.NodeHashingTF

Fields¶

Name	Title	Description
inputCol	Input Column	Column containing the sets of terms. In text processing, a ‘set of terms’ might be a bag of words.
outputCol	Output Column	Name of the output column.

Details¶

Hashing TF Node Details¶

The Hashing TF Node maps a sequence of terms to their term frequencies using the hashing trick. It currently uses Austin Appleby’s MurmurHash 3 algorithm (MurmurHash3_x86_32) to compute the hash code of each term.

Hashing TF converts each document to a vector of fixed size. The default feature dimension is 262,144 (2^18). Each term is mapped to a vector index by a hash function, and term frequencies are counted per mapped index, so distinct terms that hash to the same index share one count.
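
The mapping from terms to indices can be illustrated with a minimal Python sketch. This is not the node's implementation: the function name `hashing_tf` is hypothetical, and `zlib.crc32` stands in for MurmurHash3_x86_32, so the indices it produces will not match the node's output. It only shows how a hash value, reduced modulo the feature dimension, selects the slot whose count is incremented.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=262_144):
    """Return a sparse term-frequency vector as (size, indices, values).

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for term in terms:
        # Assumption: CRC32 is a stand-in hash, NOT the MurmurHash3_x86_32
        # that the Hashing TF Node uses.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        # Terms that land on the same index share one count (a collision).
        counts[index] += 1.0
    indices = sorted(counts)
    return num_features, indices, [counts[i] for i in indices]


# A small feature dimension is used here only to keep the example readable.
size, indices, values = hashing_tf(
    ["learning", "sparkflows", "learning"], num_features=1000
)
print(size, indices, values)
```

A smaller feature dimension makes collisions more likely, which is why the default is large.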

Input Parameters¶

OUTPUT STORAGE LEVEL : Keep this as DEFAULT.
INPUT COLUMN : Select the input token column from the incoming schema.
OUTPUT COLUMN : Set the name of the transformed output column.

Examples¶

Hashing TF Node Example¶

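The table below shows the Hashing TF output (hashTF) for the tokenizer column. The output is a sparse vector written as (size, [indices], [values]). Here the vector has size 1000, with a frequency of 1.0 at indices 72 and 990.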

|-----------------------------------------------------------------------------|
|       Title         |       tokenizer         |           hashTF            |
|-----------------------------------------------------------------------------|
| Learning Sparkflows | (learning, sparkflows)  |   (1000,[72,990],[1.0,1.0]) |
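
To read the hashTF value in the row above, it can help to expand the sparse form into a dense list. The sketch below is plain Python with a hypothetical helper name (`to_dense`); it is not part of the node and only demonstrates the (size, [indices], [values]) layout.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) vector into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Values taken from the example row: (1000,[72,990],[1.0,1.0])
dense = to_dense(1000, [72, 990], [1.0, 1.0])

# Positions that hold a non-zero term frequency.
nonzero = [i for i, v in enumerate(dense) if v != 0.0]
print(len(dense), nonzero)
```

Every position other than the listed indices is zero, which is why the sparse form is far more compact for large feature dimensions.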