Dedup

This node is used for problems such as entity resolution, also called data matching. Entity resolution is the problem of finding and linking different mentions of the same entity within a single data source or across multiple data sources.

Input

Takes a DataFrame as input.

Output

A DataFrame with a confidence score field and the other selected scores for the entities.

Type

transform

Class

fire.nodes.ml.NodeDedup

Fields

Name                Title             Description
confidenceScore     Confidence Score  Confidence Score
lhsCols             LHS Columns       LHS columns for matching
rhsCols             RHS Columns       RHS columns for matching
matchingAlgorithms  Algorithm to use  Algorithm to use for matching
matchingWeights     Weights           Weights to consider for the confidence score calculation; the total weight should be less than 1.
outputCols          Output Column     Output Column
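The page does not state how the weights are combined into the confidence score. A minimal sketch, assuming the score is a weighted sum of the per-column similarity scores, might look like the following. The function name confidence_score and the weighted-sum rule are hypothetical, not the node's documented behavior.

```python
def confidence_score(scores, weights):
    """Combine per-column similarity scores into one value.

    ASSUMPTION: a plain weighted sum; the node's actual formula is not
    documented on this page.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(score * weight for score, weight in zip(scores, weights))


# Two matched columns: an exact match (1.0) and a partial match (0.5),
# with weights whose total stays below 1.
example_score = confidence_score([1.0, 0.5], [0.6, 0.3])
```

With these inputs the first column contributes 0.6 and the second 0.15.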

Details

Levenshtein

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

In short: how many single-character changes are needed to make two strings equal.
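The definition above can be sketched with the standard dynamic-programming approach. This is an illustrative implementation, not the node's own code; the function name levenshtein is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]
```

For example, levenshtein("kitten", "sitting") is 3: two substitutions and one insertion.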

JaroWinkler

The higher the Jaro–Winkler score for two strings, the more similar the strings are. The Jaro–Winkler metric is designed for, and best suited to, short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

Good for short words, typos, and nicknames.
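A minimal sketch of the textbook Jaro–Winkler score is shown below. It assumes the commonly used prefix scale of 0.1 and a maximum common prefix of 4 characters; whether the node uses the same parameters is not stated here. The function names are chosen for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

The prefix boost is why the metric suits names and typos: "MARTHA" and "MARHTA" share the prefix "MAR" and score about 0.961.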

Fullmatch

Fullmatch checks whether two strings match exactly. The score is 1 for an exact match and 0 otherwise.

Jaccard

The Jaccard similarity measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. For two sets A and B, it is the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B.

Sparkflows provides a 3-gram Jaccard similarity measure by default.
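As an illustration, the sketch below computes Jaccard similarity over character 3-grams. Treating the 3-grams as character n-grams, and the handling of strings shorter than 3 characters, are assumptions made for this example; the function names are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams in text.

    ASSUMPTION: a non-empty string shorter than n is kept as a single gram.
    """
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """|A ∩ B| / |A ∪ B| over the n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 1.0  # ASSUMPTION: two empty strings count as identical
    return len(grams_a & grams_b) / len(union)
```

For "abcd" and "bcde" the 3-gram sets are {abc, bcd} and {bcd, cde}, so one shared gram out of three gives a similarity of 1/3.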

Longest Common Subsequence (LCS)

The LCS distance between strings s1 and s2 is computed as |s1| + |s2| - 2 * LCSfunction(s1, s2), and the distance is normalized between 0 and 1.

LCSfunction returns the length of the Longest Common Subsequence (LCS) between strings s1 and s2.
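The LCS distance can be sketched as follows. The normalization step is an assumption: the page says only that the distance is normalized between 0 and 1, so this example divides by |s1| + |s2|, the largest value the raw distance can take. The function names are chosen for this example.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(s1: str, s2: str) -> int:
    """Raw distance: |s1| + |s2| - 2 * LCS length."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def normalized_lcs_distance(s1: str, s2: str) -> float:
    """ASSUMPTION: normalize by |s1| + |s2| so the result lies in [0, 1]."""
    total = len(s1) + len(s2)
    if total == 0:
        return 0.0
    return lcs_distance(s1, s2) / total
```

For "AGGTAB" and "GXTXAYB" the longest common subsequence is "GTAB" (length 4), so the raw distance is 6 + 7 - 8 = 5.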

Notional distance

The notional distance between two numbers X and Y is computed as abs(X - Y) / (abs(X) + abs(Y)).
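A minimal sketch of this formula follows, reading the denominator as abs(X) + abs(Y) so that the result falls between 0 and 1. The handling of two zeros is an assumption added to avoid division by zero; the function name is chosen for this example.

```python
def notional_distance(x: float, y: float) -> float:
    """abs(x - y) / (abs(x) + abs(y)), a relative difference in [0, 1]."""
    denominator = abs(x) + abs(y)
    if denominator == 0:
        return 0.0  # ASSUMPTION: two zeros are treated as identical
    return abs(x - y) / denominator
```

For example, 100 and 110 give 10 / 210, roughly 0.048, while 10 and -10 give the maximum distance of 1.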

Date Difference

Date Difference gives the number of days between two dates (yyyy-MM-dd).
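The day count can be sketched with the Python standard library. Returning an absolute (unsigned) number of days is an assumption; the page does not say whether the node's result is signed. The function name is chosen for this example.

```python
from datetime import datetime


def date_difference_days(first: str, second: str) -> int:
    """Number of days between two yyyy-MM-dd dates.

    ASSUMPTION: the result is unsigned, so argument order does not matter.
    """
    fmt = "%Y-%m-%d"  # Python's equivalent of the yyyy-MM-dd pattern
    start = datetime.strptime(first, fmt).date()
    end = datetime.strptime(second, fmt).date()
    return abs((end - start).days)
```

For example, 2024-01-01 and 2024-03-01 are 60 days apart, because 2024 is a leap year.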

Examples

Dedup Node Example

Given the following dataset:

PartID  SupplierID  PartName  PartCategory  PartQualityScore
A1      S1          Bolt      Hardware      8
B2      S2          Nut       Hardware      7
A1      S1          Bolt      Hardware      8
C3      S3          Screw     Hardware      9

If we configure the Dedup node to remove duplicates based on the following columns:

PartID

SupplierID

PartName

PartCategory

PartQualityScore

The resulting dataset would be:

PartID  SupplierID  PartName  PartCategory  PartQualityScore
A1      S1          Bolt      Hardware      8
B2      S2          Nut       Hardware      7
C3      S3          Screw     Hardware      9
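The exact-match case in this example can be reproduced in plain Python. This is only an illustration of the expected result, not how the node runs on a Spark DataFrame; the function name drop_exact_duplicates is hypothetical.

```python
COLUMNS = ["PartID", "SupplierID", "PartName", "PartCategory", "PartQualityScore"]

rows = [
    dict(zip(COLUMNS, ["A1", "S1", "Bolt", "Hardware", 8])),
    dict(zip(COLUMNS, ["B2", "S2", "Nut", "Hardware", 7])),
    dict(zip(COLUMNS, ["A1", "S1", "Bolt", "Hardware", 8])),
    dict(zip(COLUMNS, ["C3", "S3", "Screw", "Hardware", 9])),
]


def drop_exact_duplicates(records, columns):
    """Keep the first record for each distinct combination of column values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


deduplicated = drop_exact_duplicates(rows, COLUMNS)
```

The second A1 row repeats every selected column, so only three rows remain.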