Word2Vec
Transforms sequences of words into fixed-size numeric vectors for further processing by NLP or machine learning algorithms.
Input
Takes a DataFrame as input and transforms it into another DataFrame.
Output
A new column containing the feature vector is added to the incoming DataFrame.
Type
ml-transformer
Class
fire.nodes.ml.NodeWord2Vec
Fields
| Name | Title | Description |
|---|---|---|
| inputCol | Input Column | Contains sequences of words |
| inputColStringArrCol | Text Array Column | The text array column which is produced |
| outputCol | Output Column | Output column name |
| vectorSize | Vector Size | Vector Size |
| minCount | Min Count | Min Count |
| numPartitions | Num Partitions | Number of partitions for sentences of words |
| windowSize | Window Size | The window size (context words from [-window, window]) |
| maxSentenceLength | Max Sentence Length | Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to maxSentenceLength |
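The window size notation [-window, window] can be illustrated with a short plain-Python sketch. The function name `context_pairs` and the sample tokens are hypothetical, and the sketch uses a fixed window for every position, which is a simplification; it is not the node's implementation.

```python
def context_pairs(tokens, window_size):
    """Pair each word with its neighbours at offsets in [-window_size, window_size]."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sample sentence, already split into words.
pairs = context_pairs(["spark", "trains", "word", "vectors"], window_size=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window gives each word more context words to learn from, at the cost of more training pairs per sentence.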
Details
Word2Vec Node Details
The Word2Vec Node is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc.
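The averaging step can be sketched in plain Python. The function name `document_vector` and the toy `word_vectors` values are hypothetical. Skipping words that are missing from the vocabulary is an assumption made for this illustration, and the node may handle unknown words differently.

```python
def document_vector(tokens, word_vectors, vector_size):
    """Average the vectors of the known words in a document."""
    # Assumption: words without a learned vector are skipped.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known words: fall back to an all-zero vector.
        return [0.0] * vector_size
    return [sum(vec[d] for vec in known) / len(known) for d in range(vector_size)]


# Toy vectors with vector_size = 3 (values invented for illustration).
word_vectors = {
    "hello": [1.0, 0.0, 2.0],
    "world": [3.0, 2.0, 0.0],
}

print(document_vector(["hello", "world"], word_vectors, 3))
```

Every document produces a vector of the same length, regardless of how many words it contains, which is what lets the output column feed directly into downstream models.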
The word2vec algorithm uses a model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
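One common way to look for related words once vectors exist is to rank them by cosine similarity. The sketch below assumes that approach; the helper names `cosine` and `most_similar` and the two-dimensional toy vectors are hypothetical and are not taken from a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


def most_similar(word, word_vectors, top_n=2):
    """Rank every other word by cosine similarity to `word`."""
    target = word_vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in word_vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented vectors: "big" and "large" point in nearly the same direction.
toy_vectors = {
    "big": [1.0, 0.0],
    "large": [0.9, 0.1],
    "tiny": [-1.0, 0.0],
}

print(most_similar("big", toy_vectors, top_n=1))
```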
Input Parameters
OUTPUT STORAGE LEVEL : Keep this as DEFAULT.
INPUT COLUMN : The name of the field in the input schema that contains the sequences of words to build the model from.
TEXT ARRAY COLUMN : Set a name for the output transformed array column.
OUTPUT COLUMN : Set a name for the output vector column.
VECTOR SIZE : The dimension of the vector that each word is mapped to. Defaults to 3.
MIN COUNT : The minimum number of times a token must appear to be included in the word2vec model’s vocabulary.
NUM PARTITIONS : Number of partitions for sentences of words.
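WINDOW SIZE : The number of context words considered on each side of a word during training (context words from [-window, window]). Defaults to 5.
MAX SENTENCE LENGTH : Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of up to this length.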
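The effect of MIN COUNT and MAX SENTENCE LENGTH can be sketched in plain Python. The function names `build_vocabulary` and `chunk_sentence` and the sample data are hypothetical; this is an illustration of the two rules as described above, not the node's implementation.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Keep only tokens that appear at least min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


def chunk_sentence(tokens, max_sentence_length):
    """Split a long sentence into consecutive chunks of up to max_sentence_length words."""
    return [
        tokens[i:i + max_sentence_length]
        for i in range(0, len(tokens), max_sentence_length)
    ]


# Invented sample data.
sentences = [["a", "b", "a"], ["b", "c"]]
print(sorted(build_vocabulary(sentences, min_count=2)))
print(chunk_sentence(["a", "b", "c", "d", "e"], max_sentence_length=2))
```

Raising MIN COUNT shrinks the vocabulary by dropping rare tokens, so words that appear only a few times in the corpus get no vector at all.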