Aggregate¶
The most powerful and flexible aggregation node – combines Group By, multiple aggregations, Pivot, conditional expressions, and column propagation in one easy-to-use interface. Perfect for business reporting, dashboards, KPIs, and analytics summaries.
Input¶
Accepts one or more DataFrames as input.
Output¶
Returns an aggregated DataFrame.
Type¶
transform
Class¶
fire.nodes.etl.NodeAggregateColumns
Fields¶
Name |
Title |
Description |
|---|---|---|
Aggregate |
Aggregate Columns |
|
aggregatetargetColumns |
Target Column |
Name of the new column that will hold the aggregated result (e.g., total_sales, order_count, avg_price). This is the column name users will see in reports. |
aggregateexpressions |
Expression |
Spark SQL aggregation expression. Examples: • sum(amount) • count(*) • avg(price) • max(date) • min(temperature) • count(distinct customer_id) • sum(amount * quantity) • stddev(pop) • corr(price, volume) |
propagateAllInputColumns |
Propagate All Input Columns |
When true → all non-grouped columns are automatically kept in the output using first() (or last() if you change it later). Great for keeping customer name, latest order date, etc., alongside totals. |
GroupBy |
Group By Columns |
|
groupByTargetColumns |
Target Column |
Output column name for the grouping key (e.g., region, department, year_month). Use this to rename or transform grouping columns (e.g., year(order_date) → order_year). |
groupByExpressions |
Expression |
How to group the data. Usually just the column name, but can be complex: • region • year(order_date) • concat(year, ‘-‘, month) • substr(postal_code, 1, 3) • date_trunc(‘month’, order_date) |
Pivot |
Pivot |
|
pivotCol |
Pivot Column |
Column whose distinct values become new columns (e.g., product_category, region, status). Turns long data into wide crosstab format – ideal for Excel reports and dashboards. |
uniqueValues |
Unique Values |
Comma-separated list of expected values (e.g., Electronics,Clothing,Food or Jan,Feb,Mar,Apr,…,Dec). Dramatically improves performance and guarantees column order. Leave empty to let Spark discover values (slower on big data). |
schema |
InferSchema |
|
outputColNames |
Column Names of the Table |
Force exact output column names. Useful when downstream systems expect fixed headers. |
outputColTypes |
Column Types of the Table |
Force column data types (Integer, Double, String, Date, etc.). Critical for correct joins and visualizations. |
outputColFormats |
Column Formats |
Date/number format strings (e.g., yyyy-MM-dd, #,##0.00). Used for display in reports. |
Details¶
Aggregate Node – The Ultimate Business Aggregation Tool¶
The Aggregate node is the Swiss Army knife of summarization. It replaces multiple simple GroupBy + Pivot + Select nodes with one intuitive interface that business analysts and data engineers love.
When to Use It¶
Monthly sales by region/product (pivot + sum)
Customer lifetime value with latest name & join date (group by + propagate)
Dashboard KPIs (global aggregates)
Year-over-year comparisons (group by year/month)
Top N per category (with window functions in expressions)
Any Excel-style crosstab report
Key Advantages¶
Full Spark SQL expression support in Group By and Aggregate
Optional Pivot with known values → blazing fast on huge datasets
Automatic propagation of descriptive columns (customer name, category, etc.)
One node does what used to take 3–5 nodes
Explicit schema control for downstream compatibility
Pro Tips¶
Always provide Unique Values in Pivot when you know them → 10x–100x faster
Use Propagate All Input Columns = true to keep metadata without writing first() manually
Combine with Filter node before for conditional aggregation
Use date_trunc(‘month’, date) or year(date) for clean time grouping
Examples¶
Aggregate Node – Real-World Business Examples¶
Example 1 – Monthly Sales Dashboard (Classic Pivot)¶
Goal: Sales by Product Category and Month (wide format for Excel)¶
Configuration¶
Group By → Target: month, Expression: date_trunc(‘month’, order_date)
Pivot Column: category
Unique Values: Electronics,Clothing,Food,Books
Aggregate → Target: total_sales, Expression: sum(amount)
Propagate All Input Columns: false
Result¶
| month | Electronics | Clothing | Food | Books |
|------------|-------------|----------|------|-------|
| 2025-01-01 | 45000 | 28000 | 12000| 8000 |
| 2025-02-01 | 52000 | 31000 | 13500| 9200 |
Example 2 – Customer Summary with Latest Info¶
Goal: Total spend per customer + latest order date & name¶
Configuration¶
Group By → Target: customer_id, Expression: customer_id
Aggregate → Target: total_spent, Expression: sum(amount)
Aggregate → Target: order_count, Expression: count(*)
Propagate All Input Columns: true
Result¶
| customer_id | total_spent | order_count | customer_name | latest_order_date |
|-------------|-------------|-------------|---------------|-------------------|
| 101 | 2500 | 12 | John Doe | 2025-03-15 |
| 102 | 1800 | 8 | Jane Smith | 2025-03-10 |
Example 3 – Regional KPIs (No Group By = Global)¶
Configuration¶
Aggregate → Target: total_revenue, Expression: sum(amount)
Aggregate → Target: unique_customers, Expression: count(distinct customer_id)
Aggregate → Target: avg_order_value, Expression: avg(amount)
Propagate All Input Columns: false
Result¶
| total_revenue | unique_customers | avg_order_value |
|---------------|------------------|-----------------|
| 1,250,000 | 8,421 | 148.32 |
Example 4 – Year-over-Year Growth by Category¶
Configuration¶
Group By → Target: year, Expression: year(order_date)
Group By → Target: category, Expression: category
Aggregate → Target: sales, Expression: sum(amount)
Propagate All Input Columns: false
Result (ready for % change calculation downstream)¶
| year | category | sales |
|------|--------------|-----------|
| 2024 | Electronics | 420000 |
| 2025 | Electronics | 580000 |
| 2024 | Clothing | 310000 |
| 2025 | Clothing | 410000 |
Example 5 – Top 3 Products per Region (using window functions)¶
Configuration¶
Group By → Target: region, Expression: region
Aggregate → Target: product_sales, Expression: sum(amount)
Aggregate → Target: rank_in_region, Expression: rank() over (partition by region order by sum(amount) desc)
Then add a Filter node after: rank_in_region <= 3
Result¶
Only the top 3 products per region appear.
Example 6 – Distinct Combinations Only (no aggregation)¶
Goal: List of unique Country + Channel combinations¶
Configuration¶
Group By → Target: country, Expression: country
Group By → Target: channel, Expression: channel
No Aggregate columns
Propagate All Input Columns: false
Result¶
| country | channel |
|-------------|-----------|
| USA | Online |
| USA | Retail |
| Germany | Online |
| France | Retail |