Read Excel Advanced¶
Advanced Excel Reader Node – Load one or more sheets, named ranges, or specific cell ranges from .xlsx/.xls files with full control over headers, data types, and metadata columns. Perfect for business reports, financial models, and mixed-format Excel sources.
Input¶
Reads data from Excel files (.xlsx, .xls) with support for sheets, ranges, and custom schemas.
Output¶
Returns a DataFrame containing the Excel data with proper column names and types.
Type¶
dataset
Class¶
fire.nodes.dataset.NodeDatasetExcelAdvanced
Fields¶
Name |
Title |
Description |
|---|---|---|
path |
Path |
Full path to the Excel file or folder. Supports local paths, HDFS, S3, GCS, Azure, etc. Example: /data/sales.xlsx or s3a://my-bucket/reports/ |
boxConnection |
Box Connection |
|
accessToken |
Access Token |
|
recdirectorysearch |
Read subdirectory recursively |
If the path is a directory, recursively scan all subfolders for Excel files. Useful when processing many files at once. |
readAllSheets |
Read All Sheets |
Automatically read every sheet in the workbook and union them into one DataFrame. Great for files where each sheet has the same structure (e.g., monthly tabs). |
selectedSheets |
Select Sheets |
Comma-separated list of sheet names or indices to import (e.g., Sheet1,Summary,2023 Data or 0,2,5). Leave empty if reading only one sheet or using Read All Sheets. |
dataAddressMode |
Data Source Mode |
Choose how to specify which part of the sheet to read: FullSheet Read the entire sheet starting from the defined start line (default & fastest) CellRange – Import only a specific cell range (e.g., A5:Z1000). Ideal for very large sheets or when you only need a table inside the sheet NamedRange – Use a predefined named range from Excel (e.g., SalesData, ReportTable). Most reliable when the table position changes |
startLine |
Start Data Import on Line |
Row number (1 = first row) where actual data begins. Useful when Excel files have titles, logos, or blank rows at the top. |
range |
Range |
Excel-style cell range in A1 notation (e.g., A1:F1000, B5:Z500). Limits reading to only these cells – perfect for very large sheets. |
header_option |
Header Handling Mode |
How to treat headers: • NO_HEADER – columns become _c0, _c1, etc. • FIRST_ROW – first row at Start Line becomes column names • PREVIOUS_ROW – row before Start Line is header • CUSTOM_ROW – specify exact header row(s) |
customHeaderRow |
Custom Header Row |
Row number or range (e.g., 5 or 5:7 for multi-row headers) to use as header. Only active when Header Handling Mode = CUSTOM_ROW. |
importSheetNamesOnly |
Import Only Sheet Names |
Instead of data, output a simple list of all sheet names in the workbook. Ideal for discovery or dynamic sheet selection workflows. |
dropSpecialCharacterInColumnName |
Drop Special Character In Column Name |
Automatically clean column names by replacing spaces, #, *, /, etc. with underscores and removing invalid characters. Recommended for downstream compatibility. |
schema |
InferSchema |
|
outputFileName |
Output File Name as Field |
Add source file info as a new column: • no – nothing added • filename – just the file name (e.g., report.xlsx) • fullpath – complete path (useful for traceability) |
sheetNameDisplay |
Output Sheet Name as Field |
Adds a new column sheet_name containing the source sheet for each row. Essential when reading multiple sheets together. |
enforceSchema |
Enforce Schema |
Forcefully apply the defined schema (column names & types). Ignores Excel headers completely and casts all data – use when Excel headers are unreliable or missing. |
outputColNames |
Column Names for the Excel |
Explicitly define output column names. Overrides any header from Excel. Leave empty to use Excel headers (or auto-generated names). |
outputColTypes |
Column Types for the Excel |
Force column data types (String, Integer, Long, Double, Boolean, Date, Timestamp). Critical for correct calculations and joins. |
outputColFormats |
Column Formats for the Excel |
Date/timestamp format patterns (e.g., yyyy-MM-dd, dd/MM/yyyy HH:mm:ss). Only needed when Excel stores dates as text or custom formats. |
interactiveRunConfiguration |
Interactive Run Configuration |
|
limitInputRecords |
Limit Input Records |
Maximum rows to preview in interactive/design mode (default 20). Does not affect batch execution. |
dataSampling |
Data Sampling |
How to sample data in interactive mode: • Random – picks random rows • Selective – use a filter expression to choose exactly which rows to preview |
selectiveSamplingExpression |
Selective Filter Expression |
SQL-like filter to pull specific rows in interactive mode (e.g., “Region = ‘North’ AND Year = 2024”). Only active when Data Sampling = Selective. |
Details¶
Read Excel Advanced Node – Complete Guide¶
The Read Excel Advanced node is the most powerful and flexible way to import Excel files into your pipeline. It handles real-world business Excel files with multiple sheets, merged cells, titles, footers, and inconsistent formatting – all while giving you full control over schema, metadata, and performance.
When to Use This Node¶
Business reports with headers starting on row 5–10
Workbooks where each month/year is a separate sheet with identical layout
Need to read only a specific table from a huge sheet
Requirement to trace source file and sheet name for audit/compliance
Excel files with messy column names (spaces, special chars)
Key Features¶
Read single sheet, multiple sheets, all sheets
Precise cell range selection (A1 notation)
Multi-row or custom-row headers
Automatic or forced schema with type casting
Adds filename and sheet_name columns for traceability
Recursively processes folders of Excel files
Cleans column names automatically
High performance even on large workbooks
Core Parameters Explained¶
Path¶
Full location of the Excel file or folder (supports HDFS, S3, GCS, ADLS, local, etc.).
Read All Sheets + Select Sheets¶
Combine both for advanced use: read only specific sheets from many files, or read all sheets and union them.
Start Data Import on Line¶
Skip logos, titles, or blank rows at the top – common in finance and reporting files.
Range¶
Read only the exact table you need – ignore charts, comments, and extra data.
Header Handling Mode¶
Flexible options for any header layout Excel users create.
Enforce Schema¶
Critical when Excel headers change or are missing – guarantees stable column names and types downstream.
Output File Name / Sheet Name as Field¶
Essential for audit trails and debugging when processing hundreds of files.
Schema Tab (InferSchema)¶
Define exact column names, data types, and date formats. When Enforce Schema = true, these override everything in the Excel file.
Examples¶
Read Excel Advanced – Real-World Examples¶
Example 1 – Monthly Sales Files (One Sheet per Month)¶
Scenario¶
Folder contains 12 files: Sales_Jan.xlsx, Sales_Feb.xlsx, …, each with a sheet “Data” starting at row 6 (rows 1–5 are titles).
Configuration¶
Path: /data/sales_monthly/
Read subdirectory recursively: false
Selected Sheets: Data
Start Data Import on Line: 6
Header Handling Mode: FIRST_ROW
Output File Name as Field: filename
Drop Special Character In Column Name: true
Result¶
One unified DataFrame with a filename column showing which month each row came from.
Example 2 – All Sheets Have Same Structure (Yearly Workbook)¶
Scenario¶
File “2024_Sales.xlsx” has 12 sheets: Jan, Feb, …, Dec – all with identical columns.
Configuration¶
Path: /data/2024_Sales.xlsx
Read All Sheets: true
Start Data Import on Line: 1
Header Handling Mode: FIRST_ROW
Output Sheet Name as Field: true
Drop Special Character In Column Name: true
Result¶
Single DataFrame with extra column sheet_name = “Jan”, “Feb”, etc.
Example 3 – Extract Only Named Table from Dashboard File¶
Scenario¶
Configuration¶
Path: s3://reports/dashboard_2024.xlsx
Header Handling Mode: FIRST_ROW
Enforce Schema: true (with predefined column names & types)
Result¶
Clean, typed DataFrame containing only the intended table – ignores everything else.
Example 4 – Read Specific Range with Custom Multi-Row Header¶
Scenario¶
Header spans rows 3–5, data starts at row 6, only need columns B to P.
Configuration¶
Range: B6:P1000
Start Data Import on Line: 6
Header Handling Mode: CUSTOM_ROW
Custom Header Row: 3:5
Drop Special Character In Column Name: true
Result¶
Proper multi-level column names (concatenated) and only the required columns.
Example 5 – Discover All Sheets in a New File¶
Configuration¶
Path: /incoming/new_report.xlsx
Import Only Sheet Names: true
Result¶
| sheet_name |
|--------------|
| Cover |
| Summary |
| Raw Data |
| Pivot Tables |
Example 6 – Force Strict Schema (Finance Use Case)¶
Scenario¶
Excel files come from different regions with different header languages, but business needs consistent column names and types.
Configuration¶
Enforce Schema: true
Output Column Names: [“Transaction_ID”, “Date”, “Amount”, “Currency”, “Region”, “Category”]
Output Column Types: [“string”, “date”, “double”, “string”, “string”, “string”]
Output Column Formats: [“”, “dd-MM-yyyy”, “”, “”, “”, “”]
Result¶
Every file outputs exactly the same schema regardless of original headers or formats.