Read Excel Advanced¶

Advanced Excel Reader Node – Load one or more sheets, named ranges, or specific cell ranges from .xlsx/.xls files with full control over headers, data types, and metadata columns. Perfect for business reports, financial models, and mixed-format Excel sources.

Input¶

Reads data from Excel files (.xlsx, .xls) with support for sheets, ranges, and custom schemas.

Output¶

Returns a DataFrame containing the Excel data with proper column names and types.

Type¶

dataset

Class¶

fire.nodes.dataset.NodeDatasetExcelAdvanced

Fields¶

Name	Title	Description
path	Path	Full path to the Excel file or folder. Supports local paths, HDFS, S3, GCS, Azure, etc. Example: /data/sales.xlsx or s3a://my-bucket/reports/
boxConnection	Box Connection
accessToken	Access Token
recdirectorysearch	Read subdirectory recursively	If the path is a directory, recursively scan all subfolders for Excel files. Useful when processing many files at once.
readAllSheets	Read All Sheets	Automatically read every sheet in the workbook and union them into one DataFrame. Great for files where each sheet has the same structure (e.g., monthly tabs).
selectedSheets	Select Sheets	Comma-separated list of sheet names or indices to import (e.g., Sheet1,Summary,2023 Data or 0,2,5). Leave empty if reading only one sheet or using Read All Sheets.
dataAddressMode	Data Source Mode	Choose how to specify which part of the sheet to read: FullSheet Read the entire sheet starting from the defined start line (default & fastest) CellRange – Import only a specific cell range (e.g., A5:Z1000). Ideal for very large sheets or when you only need a table inside the sheet NamedRange – Use a predefined named range from Excel (e.g., SalesData, ReportTable). Most reliable when the table position changes
startLine	Start Data Import on Line	Row number (1 = first row) where actual data begins. Useful when Excel files have titles, logos, or blank rows at the top.
range	Range	Excel-style cell range in A1 notation (e.g., A1:F1000, B5:Z500). Limits reading to only these cells – perfect for very large sheets.
header_option	Header Handling Mode	How to treat headers: • NO_HEADER – columns become _c0, _c1, etc. • FIRST_ROW – first row at Start Line becomes column names • PREVIOUS_ROW – row before Start Line is header • CUSTOM_ROW – specify exact header row(s)
customHeaderRow	Custom Header Row	Row number or range (e.g., 5 or 5:7 for multi-row headers) to use as header. Only active when Header Handling Mode = CUSTOM_ROW.
importSheetNamesOnly	Import Only Sheet Names	Instead of data, output a simple list of all sheet names in the workbook. Ideal for discovery or dynamic sheet selection workflows.
dropSpecialCharacterInColumnName	Drop Special Character In Column Name	Automatically clean column names by replacing spaces, #, *, /, etc. with underscores and removing invalid characters. Recommended for downstream compatibility.
schema	InferSchema
outputFileName	Output File Name as Field	Add source file info as a new column: • no – nothing added • filename – just the file name (e.g., report.xlsx) • fullpath – complete path (useful for traceability)
sheetNameDisplay	Output Sheet Name as Field	Adds a new column sheet_name containing the source sheet for each row. Essential when reading multiple sheets together.
enforceSchema	Enforce Schema	Forcefully apply the defined schema (column names & types). Ignores Excel headers completely and casts all data – use when Excel headers are unreliable or missing.
outputColNames	Column Names for the Excel	Explicitly define output column names. Overrides any header from Excel. Leave empty to use Excel headers (or auto-generated names).
outputColTypes	Column Types for the Excel	Force column data types (String, Integer, Long, Double, Boolean, Date, Timestamp). Critical for correct calculations and joins.
outputColFormats	Column Formats for the Excel	Date/timestamp format patterns (e.g., yyyy-MM-dd, dd/MM/yyyy HH:mm:ss). Only needed when Excel stores dates as text or custom formats.
interactiveRunConfiguration	Interactive Run Configuration
limitInputRecords	Limit Input Records	Maximum rows to preview in interactive/design mode (default 20). Does not affect batch execution.
dataSampling	Data Sampling	How to sample data in interactive mode: • Random – picks random rows • Selective – use a filter expression to choose exactly which rows to preview
selectiveSamplingExpression	Selective Filter Expression	SQL-like filter to pull specific rows in interactive mode (e.g., “Region = ‘North’ AND Year = 2024”). Only active when Data Sampling = Selective.

Details¶

Read Excel Advanced Node – Complete Guide¶

The Read Excel Advanced node is the most powerful and flexible way to import Excel files into your pipeline. It handles real-world business Excel files with multiple sheets, merged cells, titles, footers, and inconsistent formatting – all while giving you full control over schema, metadata, and performance.

When to Use This Node¶

Business reports with headers starting on row 5–10
Workbooks where each month/year is a separate sheet with identical layout
Need to read only a specific table from a huge sheet
Requirement to trace source file and sheet name for audit/compliance
Excel files with messy column names (spaces, special chars)

Key Features¶

Read single sheet, multiple sheets, all sheets
Precise cell range selection (A1 notation)
Multi-row or custom-row headers
Automatic or forced schema with type casting
Adds filename and sheet_name columns for traceability
Recursively processes folders of Excel files
Cleans column names automatically
High performance even on large workbooks

Core Parameters Explained¶

Path¶

Full location of the Excel file or folder (supports HDFS, S3, GCS, ADLS, local, etc.).

Read All Sheets + Select Sheets¶

Combine both for advanced use: read only specific sheets from many files, or read all sheets and union them.

Start Data Import on Line¶

Skip logos, titles, or blank rows at the top – common in finance and reporting files.

Range¶

Read only the exact table you need – ignore charts, comments, and extra data.

Header Handling Mode¶

Flexible options for any header layout Excel users create.

Enforce Schema¶

Critical when Excel headers change or are missing – guarantees stable column names and types downstream.

Output File Name / Sheet Name as Field¶

Essential for audit trails and debugging when processing hundreds of files.

Schema Tab (InferSchema)¶

Define exact column names, data types, and date formats. When Enforce Schema = true, these override everything in the Excel file.

Examples¶

Read Excel Advanced – Real-World Examples¶

Example 1 – Monthly Sales Files (One Sheet per Month)¶

Scenario¶

Folder contains 12 files: Sales_Jan.xlsx, Sales_Feb.xlsx, …, each with a sheet “Data” starting at row 6 (rows 1–5 are titles).

Configuration¶

Path: /data/sales_monthly/
Read subdirectory recursively: false
Selected Sheets: Data
Start Data Import on Line: 6
Header Handling Mode: FIRST_ROW
Output File Name as Field: filename
Drop Special Character In Column Name: true

Result¶

One unified DataFrame with a filename column showing which month each row came from.

Example 2 – All Sheets Have Same Structure (Yearly Workbook)¶

Scenario¶

File “2024_Sales.xlsx” has 12 sheets: Jan, Feb, …, Dec – all with identical columns.

Configuration¶

Path: /data/2024_Sales.xlsx
Read All Sheets: true
Start Data Import on Line: 1
Header Handling Mode: FIRST_ROW
Output Sheet Name as Field: true
Drop Special Character In Column Name: true

Result¶

Single DataFrame with extra column sheet_name = “Jan”, “Feb”, etc.

Example 3 – Extract Only Named Table from Dashboard File¶

Scenario¶

Configuration¶

Path: s3://reports/dashboard_2024.xlsx
Header Handling Mode: FIRST_ROW
Enforce Schema: true (with predefined column names & types)

Result¶

Clean, typed DataFrame containing only the intended table – ignores everything else.

Example 4 – Read Specific Range with Custom Multi-Row Header¶

Scenario¶

Header spans rows 3–5, data starts at row 6, only need columns B to P.

Configuration¶

Range: B6:P1000
Start Data Import on Line: 6
Header Handling Mode: CUSTOM_ROW
Custom Header Row: 3:5
Drop Special Character In Column Name: true

Result¶

Proper multi-level column names (concatenated) and only the required columns.

Example 5 – Discover All Sheets in a New File¶

Configuration¶

Path: /incoming/new_report.xlsx
Import Only Sheet Names: true

Result¶

| sheet_name   |
|--------------|
| Cover        |
| Summary      |
| Raw Data     |
| Pivot Tables |

Example 6 – Force Strict Schema (Finance Use Case)¶

Scenario¶

Excel files come from different regions with different header languages, but business needs consistent column names and types.

Configuration¶

Enforce Schema: true
Output Column Names: [“Transaction_ID”, “Date”, “Amount”, “Currency”, “Region”, “Category”]
Output Column Types: [“string”, “date”, “double”, “string”, “string”, “string”]
Output Column Formats: [“”, “dd-MM-yyyy”, “”, “”, “”, “”]

Result¶

Every file outputs exactly the same schema regardless of original headers or formats.