Read Excel Advanced

Advanced Excel Reader Node – Load one or more sheets, named ranges, or specific cell ranges from .xlsx/.xls files with full control over headers, data types, and metadata columns. Perfect for business reports, financial models, and mixed-format Excel sources.

Input

Reads data from Excel files (.xlsx, .xls) with support for sheets, ranges, and custom schemas.

Output

Returns a DataFrame containing the Excel data with proper column names and types.

Type

dataset

Class

fire.nodes.dataset.NodeDatasetExcelAdvanced

Fields

Name

Title

Description

path

Path

Full path to the Excel file or folder. Supports local paths, HDFS, S3, GCS, Azure, etc. Example: /data/sales.xlsx or s3a://my-bucket/reports/

boxConnection

Box Connection

accessToken

Access Token

recdirectorysearch

Read subdirectory recursively

If the path is a directory, recursively scan all subfolders for Excel files. Useful when processing many files at once.

readAllSheets

Read All Sheets

Automatically read every sheet in the workbook and union them into one DataFrame. Great for files where each sheet has the same structure (e.g., monthly tabs).

selectedSheets

Select Sheets

Comma-separated list of sheet names or indices to import (e.g., Sheet1,Summary,2023 Data or 0,2,5). Leave empty if reading only one sheet or using Read All Sheets.

dataAddressMode

Data Source Mode

Choose how to specify which part of the sheet to read: FullSheet Read the entire sheet starting from the defined start line (default & fastest) CellRange – Import only a specific cell range (e.g., A5:Z1000). Ideal for very large sheets or when you only need a table inside the sheet NamedRange – Use a predefined named range from Excel (e.g., SalesData, ReportTable). Most reliable when the table position changes

startLine

Start Data Import on Line

Row number (1 = first row) where actual data begins. Useful when Excel files have titles, logos, or blank rows at the top.

range

Range

Excel-style cell range in A1 notation (e.g., A1:F1000, B5:Z500). Limits reading to only these cells – perfect for very large sheets.

header_option

Header Handling Mode

How to treat headers: • NO_HEADER – columns become _c0, _c1, etc. • FIRST_ROW – first row at Start Line becomes column names • PREVIOUS_ROW – row before Start Line is header • CUSTOM_ROW – specify exact header row(s)

customHeaderRow

Custom Header Row

Row number or range (e.g., 5 or 5:7 for multi-row headers) to use as header. Only active when Header Handling Mode = CUSTOM_ROW.

importSheetNamesOnly

Import Only Sheet Names

Instead of data, output a simple list of all sheet names in the workbook. Ideal for discovery or dynamic sheet selection workflows.

dropSpecialCharacterInColumnName

Drop Special Character In Column Name

Automatically clean column names by replacing spaces, #, *, /, etc. with underscores and removing invalid characters. Recommended for downstream compatibility.

schema

InferSchema

outputFileName

Output File Name as Field

Add source file info as a new column: • no – nothing added • filename – just the file name (e.g., report.xlsx) • fullpath – complete path (useful for traceability)

sheetNameDisplay

Output Sheet Name as Field

Adds a new column sheet_name containing the source sheet for each row. Essential when reading multiple sheets together.

enforceSchema

Enforce Schema

Forcefully apply the defined schema (column names & types). Ignores Excel headers completely and casts all data – use when Excel headers are unreliable or missing.

outputColNames

Column Names for the Excel

Explicitly define output column names. Overrides any header from Excel. Leave empty to use Excel headers (or auto-generated names).

outputColTypes

Column Types for the Excel

Force column data types (String, Integer, Long, Double, Boolean, Date, Timestamp). Critical for correct calculations and joins.

outputColFormats

Column Formats for the Excel

Date/timestamp format patterns (e.g., yyyy-MM-dd, dd/MM/yyyy HH:mm:ss). Only needed when Excel stores dates as text or custom formats.

interactiveRunConfiguration

Interactive Run Configuration

limitInputRecords

Limit Input Records

Maximum rows to preview in interactive/design mode (default 20). Does not affect batch execution.

dataSampling

Data Sampling

How to sample data in interactive mode: • Random – picks random rows • Selective – use a filter expression to choose exactly which rows to preview

selectiveSamplingExpression

Selective Filter Expression

SQL-like filter to pull specific rows in interactive mode (e.g., “Region = ‘North’ AND Year = 2024”). Only active when Data Sampling = Selective.

Details

Read Excel Advanced Node – Complete Guide

The Read Excel Advanced node is the most powerful and flexible way to import Excel files into your pipeline. It handles real-world business Excel files with multiple sheets, merged cells, titles, footers, and inconsistent formatting – all while giving you full control over schema, metadata, and performance.

When to Use This Node

  • Business reports with headers starting on row 5–10

  • Workbooks where each month/year is a separate sheet with identical layout

  • Need to read only a specific table from a huge sheet

  • Requirement to trace source file and sheet name for audit/compliance

  • Excel files with messy column names (spaces, special chars)

Key Features

  • Read single sheet, multiple sheets, all sheets

  • Precise cell range selection (A1 notation)

  • Multi-row or custom-row headers

  • Automatic or forced schema with type casting

  • Adds filename and sheet_name columns for traceability

  • Recursively processes folders of Excel files

  • Cleans column names automatically

  • High performance even on large workbooks

Core Parameters Explained

Path

Full location of the Excel file or folder (supports HDFS, S3, GCS, ADLS, local, etc.).

Read All Sheets + Select Sheets

Combine both for advanced use: read only specific sheets from many files, or read all sheets and union them.

Start Data Import on Line

Skip logos, titles, or blank rows at the top – common in finance and reporting files.

Range

Read only the exact table you need – ignore charts, comments, and extra data.

Header Handling Mode

Flexible options for any header layout Excel users create.

Enforce Schema

Critical when Excel headers change or are missing – guarantees stable column names and types downstream.

Output File Name / Sheet Name as Field

Essential for audit trails and debugging when processing hundreds of files.

Schema Tab (InferSchema)

Define exact column names, data types, and date formats. When Enforce Schema = true, these override everything in the Excel file.

Examples

Read Excel Advanced – Real-World Examples

Example 1 – Monthly Sales Files (One Sheet per Month)

Scenario

Folder contains 12 files: Sales_Jan.xlsx, Sales_Feb.xlsx, …, each with a sheet “Data” starting at row 6 (rows 1–5 are titles).

Configuration

  • Path: /data/sales_monthly/

  • Read subdirectory recursively: false

  • Selected Sheets: Data

  • Start Data Import on Line: 6

  • Header Handling Mode: FIRST_ROW

  • Output File Name as Field: filename

  • Drop Special Character In Column Name: true

Result

One unified DataFrame with a filename column showing which month each row came from.

Example 2 – All Sheets Have Same Structure (Yearly Workbook)

Scenario

File “2024_Sales.xlsx” has 12 sheets: Jan, Feb, …, Dec – all with identical columns.

Configuration

  • Path: /data/2024_Sales.xlsx

  • Read All Sheets: true

  • Start Data Import on Line: 1

  • Header Handling Mode: FIRST_ROW

  • Output Sheet Name as Field: true

  • Drop Special Character In Column Name: true

Result

Single DataFrame with extra column sheet_name = “Jan”, “Feb”, etc.

Example 3 – Extract Only Named Table from Dashboard File

Scenario

Configuration

  • Path: s3://reports/dashboard_2024.xlsx

  • Header Handling Mode: FIRST_ROW

  • Enforce Schema: true (with predefined column names & types)

Result

Clean, typed DataFrame containing only the intended table – ignores everything else.

Example 4 – Read Specific Range with Custom Multi-Row Header

Scenario

Header spans rows 3–5, data starts at row 6, only need columns B to P.

Configuration

  • Range: B6:P1000

  • Start Data Import on Line: 6

  • Header Handling Mode: CUSTOM_ROW

  • Custom Header Row: 3:5

  • Drop Special Character In Column Name: true

Result

Proper multi-level column names (concatenated) and only the required columns.

Example 5 – Discover All Sheets in a New File

Configuration

  • Path: /incoming/new_report.xlsx

  • Import Only Sheet Names: true

Result

| sheet_name   |
|--------------|
| Cover        |
| Summary      |
| Raw Data     |
| Pivot Tables |

Example 6 – Force Strict Schema (Finance Use Case)

Scenario

Excel files come from different regions with different header languages, but business needs consistent column names and types.

Configuration

  • Enforce Schema: true

  • Output Column Names: [“Transaction_ID”, “Date”, “Amount”, “Currency”, “Region”, “Category”]

  • Output Column Types: [“string”, “date”, “double”, “string”, “string”, “string”]

  • Output Column Formats: [“”, “dd-MM-yyyy”, “”, “”, “”, “”]

Result

Every file outputs exactly the same schema regardless of original headers or formats.