Aggregate

The most powerful and flexible aggregation node – combines Group By, multiple aggregations, Pivot, conditional expressions, and column propagation in one easy-to-use interface. Perfect for business reporting, dashboards, KPIs, and analytics summaries.

Input

Accepts one or more DataFrames as input.

Output

Returns an aggregated DataFrame.

Type

transform

Class

fire.nodes.etl.NodeAggregateColumns

Fields

Name

Title

Description

Aggregate

Aggregate Columns

aggregatetargetColumns

Target Column

Name of the new column that will hold the aggregated result (e.g., total_sales, order_count, avg_price). This is the column name users will see in reports.

aggregateexpressions

Expression

Spark SQL aggregation expression. Examples: • sum(amount) • count(*) • avg(price) • max(date) • min(temperature) • count(distinct customer_id) • sum(amount * quantity) • stddev(pop) • corr(price, volume)

propagateAllInputColumns

Propagate All Input Columns

When true → all non-grouped columns are automatically kept in the output using first() (or last() if you change it later). Great for keeping customer name, latest order date, etc., alongside totals.

GroupBy

Group By Columns

groupByTargetColumns

Target Column

Output column name for the grouping key (e.g., region, department, year_month). Use this to rename or transform grouping columns (e.g., year(order_date) → order_year).

groupByExpressions

Expression

How to group the data. Usually just the column name, but can be complex: • region • year(order_date) • concat(year, ‘-‘, month) • substr(postal_code, 1, 3) • date_trunc(‘month’, order_date)

Pivot

Pivot

pivotCol

Pivot Column

Column whose distinct values become new columns (e.g., product_category, region, status). Turns long data into wide crosstab format – ideal for Excel reports and dashboards.

uniqueValues

Unique Values

Comma-separated list of expected values (e.g., Electronics,Clothing,Food or Jan,Feb,Mar,Apr,…,Dec). Dramatically improves performance and guarantees column order. Leave empty to let Spark discover values (slower on big data).

schema

InferSchema

outputColNames

Column Names of the Table

Force exact output column names. Useful when downstream systems expect fixed headers.

outputColTypes

Column Types of the Table

Force column data types (Integer, Double, String, Date, etc.). Critical for correct joins and visualizations.

outputColFormats

Column Formats

Date/number format strings (e.g., yyyy-MM-dd, #,##0.00). Used for display in reports.

Details

Aggregate Node – The Ultimate Business Aggregation Tool

The Aggregate node is the Swiss Army knife of summarization. It replaces multiple simple GroupBy + Pivot + Select nodes with one intuitive interface that business analysts and data engineers love.

When to Use It

  • Monthly sales by region/product (pivot + sum)

  • Customer lifetime value with latest name & join date (group by + propagate)

  • Dashboard KPIs (global aggregates)

  • Year-over-year comparisons (group by year/month)

  • Top N per category (with window functions in expressions)

  • Any Excel-style crosstab report

Key Advantages

  • Full Spark SQL expression support in Group By and Aggregate

  • Optional Pivot with known values → blazing fast on huge datasets

  • Automatic propagation of descriptive columns (customer name, category, etc.)

  • One node does what used to take 3–5 nodes

  • Explicit schema control for downstream compatibility

Pro Tips

  • Always provide Unique Values in Pivot when you know them → 10x–100x faster

  • Use Propagate All Input Columns = true to keep metadata without writing first() manually

  • Combine with Filter node before for conditional aggregation

  • Use date_trunc(‘month’, date) or year(date) for clean time grouping

Examples

Aggregate Node – Real-World Business Examples

Example 1 – Monthly Sales Dashboard (Classic Pivot)

Goal: Sales by Product Category and Month (wide format for Excel)

Configuration

  • Group By → Target: month, Expression: date_trunc(‘month’, order_date)

  • Pivot Column: category

  • Unique Values: Electronics,Clothing,Food,Books

  • Aggregate → Target: total_sales, Expression: sum(amount)

  • Propagate All Input Columns: false

Result

| month      | Electronics | Clothing | Food | Books |
|------------|-------------|----------|------|-------|
| 2025-01-01 | 45000       | 28000    | 12000| 8000  |
| 2025-02-01 | 52000       | 31000    | 13500| 9200  |

Example 2 – Customer Summary with Latest Info

Goal: Total spend per customer + latest order date & name

Configuration

  • Group By → Target: customer_id, Expression: customer_id

  • Aggregate → Target: total_spent, Expression: sum(amount)

  • Aggregate → Target: order_count, Expression: count(*)

  • Propagate All Input Columns: true

Result

| customer_id | total_spent | order_count | customer_name | latest_order_date |
|-------------|-------------|-------------|---------------|-------------------|
| 101         | 2500        | 12          | John Doe      | 2025-03-15        |
| 102         | 1800        | 8           | Jane Smith    | 2025-03-10        |

Example 3 – Regional KPIs (No Group By = Global)

Configuration

  • Aggregate → Target: total_revenue, Expression: sum(amount)

  • Aggregate → Target: unique_customers, Expression: count(distinct customer_id)

  • Aggregate → Target: avg_order_value, Expression: avg(amount)

  • Propagate All Input Columns: false

Result

| total_revenue | unique_customers | avg_order_value |
|---------------|------------------|-----------------|
| 1,250,000     | 8,421            | 148.32          |

Example 4 – Year-over-Year Growth by Category

Configuration

  • Group By → Target: year, Expression: year(order_date)

  • Group By → Target: category, Expression: category

  • Aggregate → Target: sales, Expression: sum(amount)

  • Propagate All Input Columns: false

Result (ready for % change calculation downstream)

| year | category     | sales     |
|------|--------------|-----------|
| 2024 | Electronics  | 420000    |
| 2025 | Electronics  | 580000    |
| 2024 | Clothing     | 310000    |
| 2025 | Clothing     | 410000    |

Example 5 – Top 3 Products per Region (using window functions)

Configuration

  • Group By → Target: region, Expression: region

  • Aggregate → Target: product_sales, Expression: sum(amount)

  • Aggregate → Target: rank_in_region, Expression: rank() over (partition by region order by sum(amount) desc)

  • Then add a Filter node after: rank_in_region <= 3

Result

Only the top 3 products per region appear.

Example 6 – Distinct Combinations Only (no aggregation)

Goal: List of unique Country + Channel combinations

Configuration

  • Group By → Target: country, Expression: country

  • Group By → Target: channel, Expression: channel

  • No Aggregate columns

  • Propagate All Input Columns: false

Result

| country     | channel   |
|-------------|-----------|
| USA         | Online    |
| USA         | Retail    |
| Germany     | Online    |
| France      | Retail    |