Dedup
===========

This node addresses entity resolution, also called data matching: the problem of finding and linking different mentions of the same entity within a single data source or across multiple data sources.

Input
--------------
It takes a DataFrame as input.

Output
--------------
A DataFrame with a confidence score field and the other selected scores for the entities.

Type
--------- 

transform

Class
--------- 

fire.nodes.ml.NodeDedup

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - confidenceScore
        - Confidence Score
        - Confidence Score
      * - lhsCols
        - LHS Columns
        - LHS columns for matching
      * - rhsCols
        - RHS Columns
        - RHS columns for matching
      * - matchingAlgorithms
        - Algorithm to use
        - Algorithm to use for matching
      * - matchingWeights
        - Weights
        - Weights to use in the confidence score calculation. The total weight should be less than 1.
      * - outputCols
        - Output Column
        - Output Column
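
The fields above suggest that each pair of LHS and RHS columns gets its own similarity score and that the weights combine those scores into one confidence score. The sketch below assumes a plain weighted sum, which this page does not confirm; ``confidence_score`` and its arguments are hypothetical names used only for illustration.

```python
def confidence_score(similarities, weights):
    """Combine per-column similarity scores into one score.

    Assumption: the combination is a weighted sum. The node's actual
    formula is not documented on this page.
    """
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity score")
    return sum(score * weight for score, weight in zip(similarities, weights))


# Two column pairs: an exact match (1.0) and a partial match (0.5).
print(confidence_score([1.0, 0.5], [0.6, 0.3]))
```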


Details
-------
Levenshtein
+++++++++++++++


The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.


In other words, it counts how many characters must change to make two strings equal.
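
The definition can be sketched with the standard dynamic-programming approach. This is a minimal illustration in plain Python, not the node's own implementation; the function name ``levenshtein`` is chosen here for the example.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```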


JaroWinkler
+++++++++++++++


The higher the Jaro–Winkler score for two strings, the more similar the strings are. The Jaro–Winkler metric is designed for, and best suited to, short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.


It works well for short words, typos, and nicknames.
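
To make the scoring concrete, here is a minimal sketch of the textbook Jaro–Winkler similarity in plain Python. It assumes the commonly used prefix scale of 0.1 and a maximum prefix length of 4; the node's actual parameters are not stated on this page, and the function names are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```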


Fullmatch
+++++++++++++++


Fullmatch checks whether two strings match exactly. The score is 1 for an exact match and 0 otherwise.


Jaccard
+++++++++++++++


The Jaccard similarity measures similarity between finite sample sets, and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. For two sets A and B, the Jaccard similarity is the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B.


Sparkflows provides a default 3-gram Jaccard similarity measure.
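
A 3-gram Jaccard similarity can be sketched as follows. The sketch assumes character 3-grams with no padding and treats a string shorter than three characters as a single gram; how Sparkflows tokenizes strings is not described here, so those choices and the function names are assumptions.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams (assumption: no padding is applied)."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(s1: str, s2: str, n: int = 3) -> float:
    """|A ∩ B| / |A ∪ B| over the n-gram sets of the two strings."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


# "abcd" -> {abc, bcd}, "bcde" -> {bcd, cde}: one shared gram out of three.
print(jaccard_similarity("abcd", "bcde"))
```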


Longest Common Subsequence (LCS)
++++++++++++++++++++++++++++++++


The LCS distance between strings s1 and s2 is computed as ``|s1| + |s2| - 2 * |LCSfunction(s1, s2)|``, and the distance is normalized between 0 and 1.


LCSfunction returns the length of the Longest Common Subsequence (LCS) between strings s1 and s2.
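
The LCS distance can be sketched in plain Python as shown here. The raw distance follows the formula given in this section; normalizing by dividing by the combined length of both strings is an assumption, since the normalization method is not spelled out. The function names are illustrative.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    previous = [0] * (len(s2) + 1)
    for c1 in s1:
        current = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(s1: str, s2: str) -> float:
    """(|s1| + |s2| - 2 * LCS) scaled to [0, 1].

    Assumption: the scaling divides by |s1| + |s2|.
    """
    total = len(s1) + len(s2)
    if total == 0:
        return 0.0
    return (total - 2 * lcs_length(s1, s2)) / total


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4
```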


Notional distance
+++++++++++++++++


The notional distance between two numbers X and Y is computed as ``abs(X - Y) / (abs(X) + abs(Y))``.
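
As a small worked sketch, the function below reads the denominator as the sum ``abs(X) + abs(Y)``, which keeps the result between 0 and 1. That reading, the handling of two zeros, and the function name are assumptions made for illustration.

```python
def notional_distance(x: float, y: float) -> float:
    """abs(x - y) / (abs(x) + abs(y)), a value between 0 and 1."""
    denominator = abs(x) + abs(y)
    if denominator == 0:
        return 0.0  # assumption: two zeros are treated as identical
    return abs(x - y) / denominator


print(notional_distance(100, 110))  # 10 / 210
```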


Date Difference
+++++++++++++++


Date Difference gives the number of days between two dates in yyyy-MM-dd format.
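
A minimal sketch of the same calculation using Python's standard library follows. It assumes the result is an absolute number of days regardless of argument order, which this page does not specify; the function name is illustrative.

```python
from datetime import datetime


def date_difference(d1: str, d2: str) -> int:
    """Number of days between two yyyy-MM-dd dates.

    Assumption: the result is absolute, so argument order does not matter.
    """
    fmt = "%Y-%m-%d"  # Python equivalent of yyyy-MM-dd
    first = datetime.strptime(d1, fmt).date()
    second = datetime.strptime(d2, fmt).date()
    return abs((second - first).days)


print(date_difference("2024-01-01", "2024-03-01"))  # 60 (2024 is a leap year)
```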


Examples
--------
Dedup Node Example
++++++++++++++++++


Given the following dataset:


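.. list-table::
      :widths: 10 10 10 10 10
      :header-rows: 1

      * - PartID
        - SupplierID
        - PartName
        - PartCategory
        - PartQualityScore
      * - A1
        - S1
        - Bolt
        - Hardware
        - 8
      * - B2
        - S2
        - Nut
        - Hardware
        - 7
      * - A1
        - S1
        - Bolt
        - Hardware
        - 8
      * - C3
        - S3
        - Screw
        - Hardware
        - 9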


If we configure the Dedup node to remove duplicates based on the following columns:


* PartID
* SupplierID
* PartName
* PartCategory
* PartQualityScore

The resulting dataset would be:


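.. list-table::
      :widths: 10 10 10 10 10
      :header-rows: 1

      * - PartID
        - SupplierID
        - PartName
        - PartCategory
        - PartQualityScore
      * - A1
        - S1
        - Bolt
        - Hardware
        - 8
      * - B2
        - S2
        - Nut
        - Hardware
        - 7
      * - C3
        - S3
        - Screw
        - Hardware
        - 9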
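
The same result can be reproduced outside the node with a few lines of plain Python. This sketch only illustrates exact-duplicate removal on all five columns, keeping the first occurrence of each row; it is not how the node itself is implemented, and ``drop_exact_duplicates`` is a hypothetical helper.

```python
rows = [
    ("A1", "S1", "Bolt", "Hardware", 8),
    ("B2", "S2", "Nut", "Hardware", 7),
    ("A1", "S1", "Bolt", "Hardware", 8),
    ("C3", "S3", "Screw", "Hardware", 9),
]


def drop_exact_duplicates(records):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


for row in drop_exact_duplicates(rows):
    print(row)
```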