Aggregate
===========

The most powerful and flexible aggregation node – combines Group By, multiple aggregations, Pivot, conditional expressions, and column propagation in one easy-to-use interface. Perfect for business reporting, dashboards, KPIs, and analytics summaries.

Input
--------------
Accepts one or more DataFrames as input.

Output
--------------
Returns an aggregated DataFrame.

Type
--------- 

transform

Class
--------- 

fire.nodes.etl.NodeAggregateColumns

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - Aggregate
        - Aggregate Columns
        - 
      * - aggregatetargetColumns
        - Target Column
        - Name of the new column that will hold the aggregated result (e.g., total_sales, order_count, avg_price). This is the column name users will see in reports.
      * - aggregateexpressions
        - Expression
        - Spark SQL aggregation expression. Examples: • sum(amount) • count(*) • avg(price) • max(date) • min(temperature) • count(distinct customer_id) • sum(amount * quantity) • stddev(pop) • corr(price, volume)
      * - propagateAllInputColumns
        - Propagate All Input Columns
        - When true → all non-grouped columns are automatically kept in the output using first() (or last() if you change it later). Great for keeping customer name, latest order date, etc., alongside totals.
      * - GroupBy
        - Group By Columns
        - 
      * - groupByTargetColumns
        - Target Column
        - Output column name for the grouping key (e.g., region, department, year_month). Use this to rename or transform grouping columns (e.g., year(order_date) → order_year).
      * - groupByExpressions
        - Expression
        - How to group the data. Usually just the column name, but can be complex: • region • year(order_date) • concat(year, '-', month) • substr(postal_code, 1, 3) • date_trunc('month', order_date)
      * - Pivot
        - Pivot
        - 
      * - pivotCol
        - Pivot Column
        - Column whose distinct values become new columns (e.g., product_category, region, status). Turns long data into wide crosstab format – ideal for Excel reports and dashboards.
      * - uniqueValues
        - Unique Values
        - Comma-separated list of expected values (e.g., Electronics,Clothing,Food or Jan,Feb,Mar,Apr,...,Dec). Dramatically improves performance and guarantees column order. Leave empty to let Spark discover values (slower on big data).
      * - schema
        - InferSchema
        - 
      * - outputColNames
        - Column Names of the Table
        - Force exact output column names. Useful when downstream systems expect fixed headers.
      * - outputColTypes
        - Column Types of the Table
        - Force column data types (Integer, Double, String, Date, etc.). Critical for correct joins and visualizations.
      * - outputColFormats
        - Column Formats
        - Date/number format strings (e.g., yyyy-MM-dd, #,##0.00). Used for display in reports.


Details
-------
Aggregate Node – The Ultimate Business Aggregation Tool
+++++++++++++++

The Aggregate node is the Swiss Army knife of summarization. It replaces multiple simple GroupBy + Pivot + Select nodes with one intuitive interface that business analysts and data engineers love.


When to Use It
+++++++++++++++


* Monthly sales by region/product (pivot + sum)
* Customer lifetime value with latest name & join date (group by + propagate)
* Dashboard KPIs (global aggregates)
* Year-over-year comparisons (group by year/month)
* Top N per category (with window functions in expressions)
* Any Excel-style crosstab report


Key Advantages
+++++++++++++++

* Full Spark SQL expression support in Group By and Aggregate
* Optional Pivot with known values → blazing fast on huge datasets
* Automatic propagation of descriptive columns (customer name, category, etc.)
* One node does what used to take 3–5 nodes
* Explicit schema control for downstream compatibility


Pro Tips
+++++++++++++++

* Always provide Unique Values in Pivot when you know them → 10x–100x faster
* Use Propagate All Input Columns = true to keep metadata without writing first() manually
* Combine with Filter node before for conditional aggregation
* Use date_trunc('month', date) or year(date) for clean time grouping


Examples
-------
Aggregate Node – Real-World Business Examples
+++++++++++++++


Example 1 – Monthly Sales Dashboard (Classic Pivot)
+++++++++++++++


Goal: Sales by Product Category and Month (wide format for Excel)
+++++++++++++++


Configuration
+++++++++++++++


* Group By → Target: month, Expression: date_trunc('month', order_date)
* Pivot Column: category
* Unique Values: Electronics,Clothing,Food,Books
* Aggregate → Target: total_sales, Expression: sum(amount)
* Propagate All Input Columns: false

Result
+++++++++++++++

::

    | month      | Electronics | Clothing | Food | Books |
    |------------|-------------|----------|------|-------|
    | 2025-01-01 | 45000       | 28000    | 12000| 8000  |
    | 2025-02-01 | 52000       | 31000    | 13500| 9200  |


Example 2 – Customer Summary with Latest Info
+++++++++++++++


Goal: Total spend per customer + latest order date & name
+++++++++++++++


Configuration
+++++++++++++++

* Group By → Target: customer_id, Expression: customer_id
* Aggregate → Target: total_spent, Expression: sum(amount)
* Aggregate → Target: order_count, Expression: count(*)
* Propagate All Input Columns: true

Result
+++++++++++++++

::

    | customer_id | total_spent | order_count | customer_name | latest_order_date |
    |-------------|-------------|-------------|---------------|-------------------|
    | 101         | 2500        | 12          | John Doe      | 2025-03-15        |
    | 102         | 1800        | 8           | Jane Smith    | 2025-03-10        |


Example 3 – Regional KPIs (No Group By = Global)
+++++++++++++++


Configuration
+++++++++++++++

* Aggregate → Target: total_revenue, Expression: sum(amount)
* Aggregate → Target: unique_customers, Expression: count(distinct customer_id)
* Aggregate → Target: avg_order_value, Expression: avg(amount)
* Propagate All Input Columns: false

Result
+++++++++++++++

::

    | total_revenue | unique_customers | avg_order_value |
    |---------------|------------------|-----------------|
    | 1,250,000     | 8,421            | 148.32          |


Example 4 – Year-over-Year Growth by Category
+++++++++++++++


Configuration
+++++++++++++++

* Group By → Target: year, Expression: year(order_date)
* Group By → Target: category, Expression: category
* Aggregate → Target: sales, Expression: sum(amount)
* Propagate All Input Columns: false

Result (ready for % change calculation downstream)
+++++++++++++++

::

    | year | category     | sales     |
    |------|--------------|-----------|
    | 2024 | Electronics  | 420000    |
    | 2025 | Electronics  | 580000    |
    | 2024 | Clothing     | 310000    |
    | 2025 | Clothing     | 410000    |


Example 5 – Top 3 Products per Region (using window functions)
+++++++++++++++


Configuration
+++++++++++++++

* Group By → Target: region, Expression: region
* Aggregate → Target: product_sales, Expression: sum(amount)
* Aggregate → Target: rank_in_region, Expression: rank() over (partition by region order by sum(amount) desc)
* Then add a Filter node after: rank_in_region <= 3

Result
+++++++++++++++


Only the top 3 products per region appear.


Example 6 – Distinct Combinations Only (no aggregation)
+++++++++++++++


Goal: List of unique Country + Channel combinations
+++++++++++++++


Configuration
+++++++++++++++


* Group By → Target: country, Expression: country
* Group By → Target: channel, Expression: channel
* No Aggregate columns
* Propagate All Input Columns: false

Result
+++++++++++++++

::

    | country     | channel   |
    |-------------|-----------|
    | USA         | Online    |
    | USA         | Retail    |
    | Germany     | Online    |
    | France      | Retail    |