Aggregate¶

The most powerful and flexible aggregation node – combines Group By, multiple aggregations, Pivot, conditional expressions, and column propagation in one easy-to-use interface. Perfect for business reporting, dashboards, KPIs, and analytics summaries.

Input¶

Accepts one or more DataFrames as input.

Output¶

Returns an aggregated DataFrame.

Type¶

transform

Class¶

fire.nodes.etl.NodeAggregateColumns

Fields¶

Name	Title	Description
Aggregate	Aggregate Columns
aggregatetargetColumns	Target Column	Name of the new column that will hold the aggregated result (e.g., total_sales, order_count, avg_price). This is the column name users will see in reports.
aggregateexpressions	Expression	Spark SQL aggregation expression. Examples: • sum(amount) • count() • avg(price) • max(date) • min(temperature) • count(distinct customer_id) • sum(amount quantity) • stddev(pop) • corr(price, volume)
propagateAllInputColumns	Propagate All Input Columns	When true → all non-grouped columns are automatically kept in the output using first() (or last() if you change it later). Great for keeping customer name, latest order date, etc., alongside totals.
GroupBy	Group By Columns
groupByTargetColumns	Target Column	Output column name for the grouping key (e.g., region, department, year_month). Use this to rename or transform grouping columns (e.g., year(order_date) → order_year).
groupByExpressions	Expression	How to group the data. Usually just the column name, but can be complex: • region • year(order_date) • concat(year, ‘-‘, month) • substr(postal_code, 1, 3) • date_trunc(‘month’, order_date)
Pivot	Pivot
pivotCol	Pivot Column	Column whose distinct values become new columns (e.g., product_category, region, status). Turns long data into wide crosstab format – ideal for Excel reports and dashboards.
uniqueValues	Unique Values	Comma-separated list of expected values (e.g., Electronics,Clothing,Food or Jan,Feb,Mar,Apr,…,Dec). Dramatically improves performance and guarantees column order. Leave empty to let Spark discover values (slower on big data).
schema	InferSchema
outputColNames	Column Names of the Table	Force exact output column names. Useful when downstream systems expect fixed headers.
outputColTypes	Column Types of the Table	Force column data types (Integer, Double, String, Date, etc.). Critical for correct joins and visualizations.
outputColFormats	Column Formats	Date/number format strings (e.g., yyyy-MM-dd, #,##0.00). Used for display in reports.

Details¶

Aggregate Node – The Ultimate Business Aggregation Tool¶

The Aggregate node is the Swiss Army knife of summarization. It replaces multiple simple GroupBy + Pivot + Select nodes with one intuitive interface that business analysts and data engineers love.

When to Use It¶

Monthly sales by region/product (pivot + sum)
Customer lifetime value with latest name & join date (group by + propagate)
Dashboard KPIs (global aggregates)
Year-over-year comparisons (group by year/month)
Top N per category (with window functions in expressions)
Any Excel-style crosstab report

Key Advantages¶

Full Spark SQL expression support in Group By and Aggregate
Optional Pivot with known values → blazing fast on huge datasets
Automatic propagation of descriptive columns (customer name, category, etc.)
One node does what used to take 3–5 nodes
Explicit schema control for downstream compatibility

Pro Tips¶

Always provide Unique Values in Pivot when you know them → 10x–100x faster
Use Propagate All Input Columns = true to keep metadata without writing first() manually
Combine with Filter node before for conditional aggregation
Use date_trunc(‘month’, date) or year(date) for clean time grouping

Examples¶

Aggregate Node – Real-World Business Examples¶

Example 1 – Monthly Sales Dashboard (Classic Pivot)¶

Goal: Sales by Product Category and Month (wide format for Excel)¶

Configuration¶

Group By → Target: month, Expression: date_trunc(‘month’, order_date)
Pivot Column: category
Unique Values: Electronics,Clothing,Food,Books
Aggregate → Target: total_sales, Expression: sum(amount)
Propagate All Input Columns: false

Result¶

| month      | Electronics | Clothing | Food | Books |
|------------|-------------|----------|------|-------|
| 2025-01-01 | 45000       | 28000    | 12000| 8000  |
| 2025-02-01 | 52000       | 31000    | 13500| 9200  |

Example 2 – Customer Summary with Latest Info¶

Goal: Total spend per customer + latest order date & name¶

Configuration¶

Group By → Target: customer_id, Expression: customer_id
Aggregate → Target: total_spent, Expression: sum(amount)
Aggregate → Target: order_count, Expression: count(*)
Propagate All Input Columns: true

Result¶

| customer_id | total_spent | order_count | customer_name | latest_order_date |
|-------------|-------------|-------------|---------------|-------------------|
| 101         | 2500        | 12          | John Doe      | 2025-03-15        |
| 102         | 1800        | 8           | Jane Smith    | 2025-03-10        |

Example 3 – Regional KPIs (No Group By = Global)¶

Configuration¶

Aggregate → Target: total_revenue, Expression: sum(amount)
Aggregate → Target: unique_customers, Expression: count(distinct customer_id)
Aggregate → Target: avg_order_value, Expression: avg(amount)
Propagate All Input Columns: false

Result¶

| total_revenue | unique_customers | avg_order_value |
|---------------|------------------|-----------------|
| 1,250,000     | 8,421            | 148.32          |

Example 4 – Year-over-Year Growth by Category¶

Configuration¶

Group By → Target: year, Expression: year(order_date)
Group By → Target: category, Expression: category
Aggregate → Target: sales, Expression: sum(amount)
Propagate All Input Columns: false

Result (ready for % change calculation downstream)¶

| year | category     | sales     |
|------|--------------|-----------|
| 2024 | Electronics  | 420000    |
| 2025 | Electronics  | 580000    |
| 2024 | Clothing     | 310000    |
| 2025 | Clothing     | 410000    |

Example 5 – Top 3 Products per Region (using window functions)¶

Configuration¶

Group By → Target: region, Expression: region
Aggregate → Target: product_sales, Expression: sum(amount)
Aggregate → Target: rank_in_region, Expression: rank() over (partition by region order by sum(amount) desc)
Then add a Filter node after: rank_in_region <= 3

Result¶

Only the top 3 products per region appear.

Example 6 – Distinct Combinations Only (no aggregation)¶

Goal: List of unique Country + Channel combinations¶

Configuration¶

Group By → Target: country, Expression: country
Group By → Target: channel, Expression: channel
No Aggregate columns
Propagate All Input Columns: false

Result¶

| country     | channel   |
|-------------|-----------|
| USA         | Online    |
| USA         | Retail    |
| Germany     | Online    |
| France      | Retail    |