Dedup¶

This node is used for problems such as entity resolution, also known as data matching: finding and linking different mentions of the same entity within a single data source or across multiple data sources.

Input¶

Takes a DataFrame as input.

Output¶

DataFrame with a confidence score field and the other selected score fields for the matched entities.

Type¶

transform

Class¶

fire.nodes.ml.NodeDedup

Fields¶

Name	Title	Description
confidenceScore	Confidence Score	Confidence Score
lhsCols	LHS Columns	LHS columns for matching
rhsCols	RHS Columns	RHS columns for matching
matchingAlgorithms	Algorithm to use	Algorithm to use for matching
matchingWeights	Weights	Weights to use in the confidence score calculation; the total weight should be less than 1.
outputCols	Output Column	Output Column
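
The page does not spell out how the weights enter the confidence score. As a minimal sketch, assuming the confidence score is a weighted sum of the per-column similarity scores (each in the range 0 to 1), the calculation could look like the following. The function name `confidence_score` and the sample values are hypothetical and are not part of the node's API.

```python
def confidence_score(scores, weights):
    """Combine per-column similarity scores into one confidence score.

    Assumption: the score is a plain weighted sum. The node's actual
    formula is not documented here.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(score * weight for score, weight in zip(scores, weights))


# Hypothetical similarity scores for three LHS/RHS column pairs.
scores = [1.0, 0.8, 0.5]
# Hypothetical weights; their total (0.9) stays below 1.
weights = [0.5, 0.3, 0.1]

print(round(confidence_score(scores, weights), 2))
```

Under this assumption, a column pair with a larger weight contributes more to the final score, so the most reliable columns would typically get the largest weights.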

Details¶

Levenshtein¶

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

In short: the number of single-character changes needed to make two strings equal.
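
To make the definition concrete, here is a small sketch of the standard dynamic-programming computation of the edit distance. It illustrates the metric itself, not the node's internal implementation, and the function name `levenshtein` is chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Turning "kitten" into "sitting" takes two substitutions (k to s, e to i) and one insertion (g), hence a distance of 3.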

JaroWinkler¶

The higher the Jaro–Winkler score for two strings, the more similar the strings are. The Jaro–Winkler metric is designed for, and best suited to, short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

Good for short words, typos, and nicknames.
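
As an illustration of how such a score can be computed, the sketch below follows the textbook Jaro–Winkler definition with a prefix scale of 0.1 and a common prefix capped at four characters. These parameter choices and the function names are assumptions for this example; the node may use different settings.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost is what makes the metric forgiving of typos near the end of a name while still rewarding strings that start the same way.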

Fullmatch¶

The Fullmatch score indicates whether two strings match exactly. The score is 1 for an exact match and 0 otherwise.

Jaccard¶

The Jaccard similarity measures the similarity between finite sample sets, and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. For two sets A and B, it is the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B, that is, |A ∩ B| / |A ∪ B|.

Sparkflows provides a 3-gram Jaccard similarity measure by default.
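
A 3-gram Jaccard similarity can be sketched as follows. The example assumes character 3-grams with no padding and no case folding; how the node tokenizes strings is not documented here, so treat these details and the helper names as illustrative.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of text (no padding)."""
    if len(text) < n:
        # Assumption: a short string is treated as a single gram.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """|A ∩ B| / |A ∪ B| over the n-gram sets of the two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "hardware" and "hardwear" share 3 of 9 distinct 3-grams.
print(round(jaccard_similarity("hardware", "hardwear"), 3))  # 0.333
```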

Longest Common Subsequence (LCS)¶

The LCS distance between strings s1 and s2 is computed as |s1| + |s2| - 2 * |LCSfunction(s1, s2)|, and the distance is normalized to a value between 0 and 1.

LCSfunction returns the length of the longest common subsequence (LCS) of strings s1 and s2.
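
The formula can be sketched as follows. The page does not say how the distance is normalized; this example assumes division by |s1| + |s2|, the largest value the raw distance can take, and the function names are hypothetical.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    previous = [0] * (len(s2) + 1)
    for c1 in s1:
        current = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(s1: str, s2: str) -> int:
    """Raw distance: |s1| + |s2| - 2 * LCS length."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def normalized_lcs_distance(s1: str, s2: str) -> float:
    """Assumed normalization: divide by |s1| + |s2| to land in [0, 1]."""
    total = len(s1) + len(s2)
    return lcs_distance(s1, s2) / total if total else 0.0


# The LCS of these strings is "GTAB" (length 4), so the raw distance is
# 6 + 7 - 2 * 4 = 5.
print(lcs_distance("AGGTAB", "GXTXAYB"))  # 5
```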

Notional distance¶

The notional distance between two numbers X and Y is computed as abs(X - Y) / (abs(X) + abs(Y)).
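
A minimal sketch of this calculation, assuming the denominator is the sum abs(X) + abs(Y) so that the result falls between 0 and 1. The handling of two zero inputs is also an assumption made for this example.

```python
def notional_distance(x: float, y: float) -> float:
    """abs(x - y) / (abs(x) + abs(y)), a relative distance in [0, 1]."""
    denominator = abs(x) + abs(y)
    if denominator == 0:
        # Assumption: two zeros are identical, so the distance is 0.
        return 0.0
    return abs(x - y) / denominator


print(round(notional_distance(100, 110), 4))  # 0.0476
```

With this form, equal values give 0 and values of opposite sign (or one zero value) give 1.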

Date Difference¶

Date Difference gives the number of days between two dates (in yyyy-MM-dd format).
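
For illustration, the same calculation in plain Python might look like this. The `%Y-%m-%d` pattern is the Python equivalent of yyyy-MM-dd. Returning the absolute number of days is an assumption, as the page does not say whether the result is signed.

```python
from datetime import datetime


def date_difference(first: str, second: str) -> int:
    """Number of days between two yyyy-MM-dd date strings."""
    pattern = "%Y-%m-%d"
    first_date = datetime.strptime(first, pattern).date()
    second_date = datetime.strptime(second, pattern).date()
    # Assumption: the order of the arguments does not matter.
    return abs((second_date - first_date).days)


print(date_difference("2024-01-01", "2024-03-01"))  # 60
```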

Examples¶

Dedup Node Example¶

Given the following dataset:

PartID	SupplierID	PartName	PartCategory	PartQualityScore
A1	S1	Bolt	Hardware	8
B2	S2	Nut	Hardware	7
A1	S1	Bolt	Hardware	8
C3	S3	Screw	Hardware	9

If we configure the Dedup node to remove duplicates based on the following columns:

PartID

SupplierID

PartName

PartCategory

PartQualityScore

The resulting dataset would be:

PartID	SupplierID	PartName	PartCategory	PartQualityScore
A1	S1	Bolt	Hardware	8
B2	S2	Nut	Hardware	7
C3	S3	Screw	Hardware	9
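
The effect of this example can be reproduced with a few lines of plain Python. This is only a sketch of exact-match deduplication on all five columns, keeping the first occurrence of each row; it is not the node's implementation, and the function name `drop_exact_duplicates` is hypothetical.

```python
rows = [
    ("A1", "S1", "Bolt", "Hardware", 8),
    ("B2", "S2", "Nut", "Hardware", 7),
    ("A1", "S1", "Bolt", "Hardware", 8),
    ("C3", "S3", "Screw", "Hardware", 9),
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


for row in drop_exact_duplicates(rows):
    print(row)
```

The second A1 row is dropped because every compared column equals the first A1 row, which corresponds to a Fullmatch score of 1 on each column.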