SMOTE¶
Implementation of SMOTE - Synthetic Minority Over-sampling Technique.
Input¶
This type of node takes in a DataFrame and transforms it to another DataFrame
Output¶
Output DataFrame
Type¶
ml-transformer
Class¶
fire.nodes.ml.NodeSMOTE
Fields¶
Name |
Title |
Description |
|---|---|---|
labelCol |
Label Column |
Label Column. |
featureCol |
Feature Column |
Feature Column with vector type. |
bucketLength |
BucketLength |
Bucket Length for LSH - The length of each hash bucket, a larger bucket lowers the false negative rate. |
Details¶
SMOTE Node Details¶
Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way.
The component works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.
Synthetic examples are generated in the following way:
Take the difference between the feature vector (sample) under consideration and its nearest neighbour (Using Locality Sensitive Hashing (LSH) model)
Multiply this difference by a random number between 0 and 1, and add it to the sample
For discrete attributes, the synthetic example randomly picks either the sample or the neighbour, and copies that value.
By forcing the decision region of the minority class to become more general, SMOTE reduces overfitting.
Input Parameters¶
OUTPUT STORAGE LEVEL : Keep this as DEFAULT.
LABEL COLUMN : Select the field in the input schema that contains the label.
FEATURE COLUMN : Select the field in the input schema that contains the feature vector.
BUCKETLENGTH : Set the length of each hash bucket, a larger bucket lowers the false negative rate. Defaults to 100.
Examples¶
SMOTE Node Example¶
Assume a dataset that consists of transactions made by credit cards. This dataset has 492 fraud transactions out of 284,807 transactions.
That makes it highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
Before OverSampling, counts of label ‘1’: [492]
Before OverSampling, counts of label ‘0’: [284315]
After running the dataset through the SMOTE node we can expect the below results:
After SMOTE OverSampling, counts of label ‘1’: 284315
After SMOTE OverSampling, counts of label ‘0’: 284315
The SMOTE node has oversampled the minority instances and made it equal to majority class. Both categories have equal amount of records. More specifically, the minority class has been increased to the total number of majority class.