Group By RFM Features¶
This node computes RFM-style (recency, frequency, monetary) features per group: frequency, recency, average days between purchases, total value of purchases, and customer age. Only the operations you enable are computed, and each one is computed per user, that is, per distinct combination of the group-by columns.
Input¶
It takes DataFrame(s) as input for processing.
Output¶
Returns engineered features as a DataFrame.
Type¶
pyspark
Class¶
fire.nodes.fe.NodeRFMFeatures
Fields¶
| Name | Title | Description |
|---|---|---|
| groupByCols | Group By Columns | Columns to group by |
| frequency | Enable Frequency | Enable frequency count per user ID |
| recency | Enable Recency | Enable recency computation (last purchase date) per user ID |
| recencyDateCol | Recency Date Column | Date column to compute recency from |
| avgDaysBetween | Enable Avg Days Between | Enable computation of average days between purchases |
| avgDaysDateCol | Avg Days Date Column | Date column to compute average days between |
| valueOfPurchase | Enable Value of Purchase | Enable total value of purchase per user |
| sumCols | Columns to Sum | Numeric columns to sum per user |
| customerAge | Enable Customer Age | Compute age of customer from DOB |
| dobCol | Date of Birth Column | Column containing date of birth |
Details¶
Feature Engineering Node Details¶
The Feature Engineering node is designed to compute user-level features by applying various analytical operations such as frequency, recency, average days between events, total value of purchases, and customer age. These features are generated per group (e.g., per user ID) using the selected input columns and operations.
General:¶
Group By Columns:¶
Specifies the column(s) to group the input data by. Typically, this would be a user or customer identifier (e.g., user_id, customer_id). All other feature engineering computations are performed within each group.
Enable Frequency:¶
When enabled, computes the frequency (i.e., count) of rows per group. Useful for understanding user activity levels.
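The frequency feature is a plain row count per group. A minimal plain-Python sketch of that idea follows; the node itself runs on PySpark, and the helper name `frequency_per_group` and the sample rows are hypothetical, chosen to mirror the example on this page.

```python
from collections import Counter

# Hypothetical rows mirroring the example on this page; one dict per event.
rows = [
    {"userId": "U1"},
    {"userId": "U1"},
    {"userId": "U2"},
    {"userId": "U2"},
    {"userId": "U2"},
]

def frequency_per_group(rows, group_col):
    """Return the number of rows for each distinct value of group_col."""
    return dict(Counter(row[group_col] for row in rows))

frequency = frequency_per_group(rows, "userId")
print(frequency)  # {'U1': 2, 'U2': 3}
```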
Enable Recency:¶
When enabled, calculates the number of days since the last recorded activity or event per group. Requires a date column to compute this from.
Recency Date Column:¶
Specifies the column containing the date values for recency computation. Recency is calculated as the number of days between the maximum date per group and the current date.
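To make the recency definition concrete, here is a small plain-Python sketch: take the latest date in each group and subtract it from a reference date. This is an illustration, not the node's PySpark implementation. The helper `recency_days` is hypothetical, and the reference date is passed in explicitly (instead of reading the system clock) so the result is reproducible.

```python
from datetime import date

# Hypothetical rows mirroring the example on this page.
rows = [
    {"userId": "U1", "eventDate": "2023-01-01"},
    {"userId": "U1", "eventDate": "2023-01-10"},
    {"userId": "U2", "eventDate": "2023-01-05"},
    {"userId": "U2", "eventDate": "2023-01-15"},
    {"userId": "U2", "eventDate": "2023-01-25"},
]

def recency_days(rows, group_col, date_col, today):
    """Days between each group's latest date and the reference date `today`."""
    latest = {}
    for row in rows:
        event_date = date.fromisoformat(row[date_col])
        key = row[group_col]
        # Keep only the maximum date seen so far for this group.
        if key not in latest or event_date > latest[key]:
            latest[key] = event_date
    return {key: (today - last).days for key, last in latest.items()}

# Assumed reference date; the node uses the current date at run time.
recency = recency_days(rows, "userId", "eventDate", today=date(2023, 2, 1))
print(recency)  # {'U1': 22, 'U2': 7}
```

Because the value depends on the run date, the same input produces a larger `recency_days` each day the node is re-run.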
Enable Avg Days Between:¶
When enabled, computes the average number of days between consecutive activities or transactions per group.
Avg Days Date Column:¶
Specifies the date column to use for computing the average time gap between consecutive records within each group.
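The average gap can be sketched in plain Python as: sort each group's dates, take the differences between neighbours, and average them. The helper `avg_days_between` is hypothetical, and returning `None` for a group with a single record is an assumption; this page does not say what the node emits in that case.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows mirroring the example on this page.
rows = [
    {"userId": "U1", "eventDate": "2023-01-01"},
    {"userId": "U1", "eventDate": "2023-01-10"},
    {"userId": "U2", "eventDate": "2023-01-05"},
    {"userId": "U2", "eventDate": "2023-01-15"},
    {"userId": "U2", "eventDate": "2023-01-25"},
]

def avg_days_between(rows, group_col, date_col):
    """Average gap in days between consecutive dates within each group."""
    dates_by_group = defaultdict(list)
    for row in rows:
        dates_by_group[row[group_col]].append(date.fromisoformat(row[date_col]))

    result = {}
    for key, dates in dates_by_group.items():
        dates.sort()
        # Gaps between each date and the next one in chronological order.
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        # Assumption: a group with one record has no gap, so report None.
        result[key] = sum(gaps) / len(gaps) if gaps else None
    return result

avg_gap = avg_days_between(rows, "userId", "eventDate")
print(avg_gap)  # {'U1': 9.0, 'U2': 10.0}
```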
Enable Value of Purchase:¶
When enabled, computes the total value of purchases per group by summing selected numeric columns.
Columns to Sum:¶
Specifies one or more numeric columns whose values are summed per group to calculate total value of purchases.
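As a rough illustration of the per-group sum, the plain-Python sketch below totals each selected column within each group and names the result `sum_<col_name>`, following the output naming described on this page. The helper `sum_per_group` is hypothetical and is not the node's PySpark code.

```python
from collections import defaultdict

# Hypothetical rows mirroring the example on this page.
rows = [
    {"userId": "U1", "amount": 100},
    {"userId": "U1", "amount": 150},
    {"userId": "U2", "amount": 200},
    {"userId": "U2", "amount": 300},
    {"userId": "U2", "amount": 500},
]

def sum_per_group(rows, group_col, sum_cols):
    """Sum each column in sum_cols per group, keyed as sum_<col_name>."""
    totals = defaultdict(lambda: {f"sum_{col}": 0 for col in sum_cols})
    for row in rows:
        for col in sum_cols:
            totals[row[group_col]][f"sum_{col}"] += row[col]
    return dict(totals)

purchase_value = sum_per_group(rows, "userId", ["amount"])
print(purchase_value)  # {'U1': {'sum_amount': 250}, 'U2': {'sum_amount': 1000}}
```

Selecting several columns yields one `sum_<col_name>` entry per column; the columns are not added together into a single total in this sketch.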
Enable Customer Age:¶
When enabled, calculates the current age of the customer based on their date of birth.
Date of Birth Column:¶
Specifies the column containing the customer’s date of birth, which is used to compute their current age in years.
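One common way to compute age in whole years is to subtract the birth year from the current year and take off one if this year's birthday has not yet occurred. The sketch below assumes that definition; this page does not state the node's exact rounding rule, and the helper `age_in_years` and the reference date are hypothetical.

```python
from datetime import date

def age_in_years(dob, today):
    """Completed years between dob and today."""
    # True once this year's birthday has been reached.
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

# Assumed reference date; the node uses the current date at run time.
reference_date = date(2025, 6, 1)
ages = {
    "U1": age_in_years(date(1990, 4, 1), reference_date),
    "U2": age_in_years(date(1985, 8, 20), reference_date),
}
print(ages)  # {'U1': 35, 'U2': 39}
```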
Output:¶
The node outputs a DataFrame with the group-by columns and one or more additional columns, depending on the selected features:
frequency: count
recency: recency_days
avgDaysBetween: avg_days_between
valueOfPurchase: sum_<col_name> for each column selected in Columns to Sum
customerAge: customer_age
Examples¶
Feature Engineering Node Examples¶
Input:¶
A DataFrame contains the following data:
userId: ["U1", "U1", "U2", "U2", "U2"]
eventDate: ["2023-01-01", "2023-01-10", "2023-01-05", "2023-01-15", "2023-01-25"]
amount: [100, 150, 200, 300, 500]
dob: ["1990-04-01", "1990-04-01", "1985-08-20", "1985-08-20", "1985-08-20"]
The Feature Engineering node is configured as follows:
Group By Columns: userId
Enable Frequency: true
Enable Recency: true
Recency Date Column: eventDate
Enable Avg Days Between: true
Avg Days Date Column: eventDate
Enable Value of Purchase: true
Columns to Sum: amount
Enable Customer Age: true
Date of Birth Column: dob
Output:¶
The node processes the DataFrame and produces the following result:
userId: "U1"
frequency: 2
recency_days: number of days from "2023-01-10" to the current date
avg_days_between: 9.0
sum_amount: 250
customer_age: 35 (varies with the current date)
userId: "U2"
frequency: 3
recency_days: number of days from "2023-01-25" to the current date
avg_days_between: 10.0
sum_amount: 1000
customer_age: 39 (varies with the current date)