Word2Vec

Transforms sequences of words into numeric feature vectors for further processing by NLP or machine learning algorithms.

Input

Takes a DataFrame as input and transforms it into another DataFrame.

Output

A new column containing the feature vector is added to the incoming DataFrame.

Type

ml-transformer

Class

fire.nodes.ml.NodeWord2Vec

Fields

Name | Title | Description
--- | --- | ---
inputCol | Input Column | Column that contains the sequences of words
inputColStringArrCol | Text Array Column | Name of the text array column that the node produces
outputCol | Output Column | Name of the output vector column
vectorSize | Vector Size | Dimension of the vector that each word is mapped to
minCount | Min Count | Minimum number of times a token must appear to be included in the vocabulary
numPartitions | Num Partitions | Number of partitions for sentences of words
windowSize | Window Size | The window size (context words from [-window, window])
maxSentenceLength | Max Sentence Length | Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of up to maxSentenceLength words

Details

Word2Vec Node Details

The Word2Vec Node is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc.
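The averaging step described above can be illustrated with a minimal pure-Python sketch. It is not the node's implementation: the function name `document_vector`, the toy word vectors, and the choice to skip unknown words and divide by the number of known words are all assumptions made for illustration.

```python
from typing import Dict, List


def document_vector(
    words: List[str],
    word_vectors: Dict[str, List[float]],
    vector_size: int,
) -> List[float]:
    """Average the vectors of the known words in one document.

    Assumption for this sketch: words without a vector are skipped, and a
    document with no known words maps to the zero vector.
    """
    total = [0.0] * vector_size
    known = 0
    for word in words:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # word is not in the (hypothetical) vocabulary
        for i in range(vector_size):
            total[i] += vector[i]
        known += 1
    if known == 0:
        return total
    return [value / known for value in total]


# Hypothetical word vectors with vector_size = 3 (not real model output).
toy_vectors = {
    "learning": [1.0, 2.0, 3.0],
    "sparkflows": [3.0, 4.0, 5.0],
}
doc = document_vector(["learning", "sparkflows"], toy_vectors, 3)
```

With these toy values, each component of `doc` is the mean of the corresponding components of the two word vectors, so the document vector has the same length as the word vectors.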

The word2vec algorithm uses a model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
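One common way to look for related words in a trained model is to rank the other words by the cosine similarity of their vectors. The sketch below assumes that approach; the helper names `cosine_similarity` and `most_similar` and the two-dimensional toy vectors are hypothetical and are not part of the node.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    word: str,
    word_vectors: Dict[str, List[float]],
    top_n: int = 2,
) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to `word`, best first."""
    target = word_vectors[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in word_vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical vectors chosen so that "fast" and "quick" point the same way.
toy_vectors = {
    "fast": [1.0, 0.0],
    "quick": [0.9, 0.1],
    "slow": [-1.0, 0.0],
}
neighbours = most_similar("fast", toy_vectors, top_n=2)
```

In this toy setup "quick" ranks above "slow" for the query "fast", because its vector points in nearly the same direction.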

Input Parameters

  • OUTPUT STORAGE LEVEL : Keep this as DEFAULT.

  • INPUT COLUMN : The name of the field in the input schema that contains the features to build the model from.

  • TEXT ARRAY COLUMN : Set a name for the output transformed array column.

  • OUTPUT COLUMN : Set a name for the output vector column.

  • VECTOR SIZE : The dimension of the vector that each word is mapped to. Defaults to 3.

  • MIN COUNT : The minimum number of times a token must appear to be included in the word2vec model’s vocabulary.

  • NUM PARTITIONS : Number of partitions for sentences of words.

  • WINDOW SIZE : The size of the context window used during training (context words from [-window, window] around each word). Defaults to 5.

  • MAX SENTENCE LENGTH : Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of up to this length.
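The effect of MIN COUNT, MAX SENTENCE LENGTH and WINDOW SIZE can be sketched in plain Python. This is an illustration of what the parameters mean, under simplifying assumptions, and not the node's training code; the function names `build_vocabulary`, `chunk_sentence` and `context_words` are hypothetical.

```python
from collections import Counter
from typing import List, Set


def build_vocabulary(sentences: List[List[str]], min_count: int) -> Set[str]:
    """Keep only the tokens that appear at least `min_count` times overall."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def chunk_sentence(words: List[str], max_sentence_length: int) -> List[List[str]]:
    """Split one sentence into chunks of at most `max_sentence_length` words."""
    return [
        words[i:i + max_sentence_length]
        for i in range(0, len(words), max_sentence_length)
    ]


def context_words(words: List[str], position: int, window_size: int) -> List[str]:
    """Words within `window_size` positions of `position`, excluding the word itself.

    Simplification: the full window is always used.
    """
    start = max(0, position - window_size)
    end = min(len(words), position + window_size + 1)
    return [words[i] for i in range(start, end) if i != position]


sentences = [["a", "b", "a"], ["a", "c", "b"]]
vocabulary = build_vocabulary(sentences, min_count=2)       # "c" appears once
chunks = chunk_sentence(["a", "b", "c", "d", "e"], 2)       # last chunk is shorter
context = context_words(["a", "b", "c", "d", "e"], 2, 1)    # neighbours of "c"
```

A larger MIN COUNT shrinks the vocabulary, a smaller MAX SENTENCE LENGTH produces more and shorter chunks, and a larger WINDOW SIZE pairs each word with more of its neighbours.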

Examples

Word2Vec Node Example

Consider the following Word2Vec output for the input string column Title.

  • Title : ‘Learning Sparkflows’

  • textarr : (Learning, Sparkflows)

  • Word2Vec : [-0.12955643981695175,0.0201990008354187,0.13060205057263374]