Remove Duplicate Rows¶

This node takes a list of columns and compares the rows on those columns. From each group of matching rows it keeps one row, either the first or the last depending on the configured order, and drops the rest.

Input¶

It accepts a DataFrame as input from the previous node.

Output¶

The output DataFrame is the same as the input DataFrame with the duplicate rows removed.

Type¶

transform

Class¶

fire.nodes.etl.NodeRemoveDuplicateRows

Fields¶

Name	Title	Description
order	Order	Whether to take the first or last matching record when removing duplicates
inputCols	Columns	The columns whose values are compared to decide whether two rows are duplicates

Details¶

This node drops duplicate rows from the incoming DataFrame.

Specific columns can be selected to be used when comparing two rows.

Of each set of matching rows, one row, either the first or the last depending on the Order setting, is included in the outgoing DataFrame.
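
The behavior described above can be sketched in plain Python. This is only an illustration of the first/last semantics on an in-memory list of rows, not the node's actual implementation; the function name `remove_duplicate_rows` and its parameters are hypothetical, chosen to mirror the `inputCols` and `order` fields.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def remove_duplicate_rows(rows: List[Row],
                          input_cols: Sequence[str],
                          order: str = "first") -> List[Row]:
    """Keep one row per distinct combination of values in input_cols.

    Hypothetical helper: 'order' selects whether the first or the last
    matching row of each group is kept.
    """
    if order not in ("first", "last"):
        raise ValueError("order must be 'first' or 'last'")

    kept: Dict[tuple, Row] = {}
    for row in rows:
        # Rows are duplicates when they agree on every selected column.
        key = tuple(row[col] for col in input_cols)
        # 'first': store only the first row seen for a key.
        # 'last': overwrite, so the final row seen for a key survives.
        if order == "last" or key not in kept:
            kept[key] = row
    # Groups come out in the order in which each key first appeared.
    return list(kept.values())
```

The sketch relies on the input list having a well-defined row order. In a distributed DataFrame, which row counts as first or last may depend on how the data is ordered upstream.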

Examples¶

The incoming DataFrame has the following rows:

EMP_CD    |    EMP_NAME    |    DEPT       |    AGE
-------------------------------------------------------
E01       |    DAVID       |    HR         |    25
E05       |    DAVID       |    HR         |    25
E02       |    JOHN        |    SALES      |    35
E03       |    JOHN        |    MARKETING  |    40
E04       |    JOHN        |    MARKETING  |    45

If the RemoveDuplicateRows node is configured to retain the first matching row and drop the other rows that have duplicate values in [EMP_NAME] and [DEPT]¶

then the outgoing DataFrame would be as below:

EMP_CD    |    EMP_NAME    |    DEPT       |    AGE
-------------------------------------------------------
E01       |    DAVID       |    HR         |    25
E02       |    JOHN        |    SALES      |    35
E03       |    JOHN        |    MARKETING  |    40

If the RemoveDuplicateRows node is configured to retain the last matching row and drop the other rows that have duplicate values in [EMP_NAME] and [DEPT]¶

then the outgoing DataFrame would be as below:

EMP_CD    |    EMP_NAME    |    DEPT       |    AGE
-------------------------------------------------------
E05       |    DAVID       |    HR         |    25
E02       |    JOHN        |    SALES      |    35
E04       |    JOHN        |    MARKETING  |    45

If the RemoveDuplicateRows node is configured to retain the first matching row and drop the other rows that have duplicate values in [EMP_NAME], [DEPT] and [AGE]¶

then the outgoing DataFrame would be as below:

EMP_CD    |    EMP_NAME    |    DEPT       |    AGE
-------------------------------------------------------
E01       |    DAVID       |    HR         |    25
E02       |    JOHN        |    SALES      |    35
E03       |    JOHN        |    MARKETING  |    40
E04       |    JOHN        |    MARKETING  |    45
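
The three cases above can be reproduced with a small, self-contained Python sketch. It only models the matching logic on the sample rows; the helper names `dedupe` and `emp_codes` are hypothetical and are not part of the node's API.

```python
COLUMNS = ("EMP_CD", "EMP_NAME", "DEPT", "AGE")

# Sample rows from the incoming DataFrame shown above.
INCOMING = [
    ("E01", "DAVID", "HR", 25),
    ("E05", "DAVID", "HR", 25),
    ("E02", "JOHN", "SALES", 35),
    ("E03", "JOHN", "MARKETING", 40),
    ("E04", "JOHN", "MARKETING", 45),
]


def dedupe(rows, match_cols, order):
    """Keep the first or last row for each distinct value of match_cols."""
    positions = [COLUMNS.index(col) for col in match_cols]
    kept = {}
    for row in rows:
        key = tuple(row[i] for i in positions)
        # Overwrite on 'last'; otherwise keep the row stored first.
        if order == "last" or key not in kept:
            kept[key] = row
    return list(kept.values())


def emp_codes(rows):
    """Return only the EMP_CD value of each row, for compact comparison."""
    return [row[0] for row in rows]


first_on_two = emp_codes(dedupe(INCOMING, ["EMP_NAME", "DEPT"], "first"))
last_on_two = emp_codes(dedupe(INCOMING, ["EMP_NAME", "DEPT"], "last"))
first_on_three = emp_codes(dedupe(INCOMING, ["EMP_NAME", "DEPT", "AGE"], "first"))

print(first_on_two)    # ['E01', 'E02', 'E03']
print(last_on_two)     # ['E05', 'E02', 'E04']
print(first_on_three)  # ['E01', 'E02', 'E03', 'E04']
```

Adding [AGE] to the match columns separates E03 and E04, because their ages differ, so both rows are kept in the third case.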