Read Excel Advanced =========== Advanced Excel Reader Node – Load one or more sheets, named ranges, or specific cell ranges from .xlsx/.xls files with full control over headers, data types, and metadata columns. Perfect for business reports, financial models, and mixed-format Excel sources. Input -------------- Reads data from Excel files (.xlsx, .xls) with support for sheets, ranges, and custom schemas. Output -------------- Returns a DataFrame containing the Excel data with proper column names and types. Type --------- dataset Class --------- fire.nodes.dataset.NodeDatasetExcelAdvanced Fields --------- .. list-table:: :widths: 10 5 10 :header-rows: 1 * - Name - Title - Description * - path - Path - Full path to the Excel file or folder. Supports local paths, HDFS, S3, GCS, Azure, etc. Example: /data/sales.xlsx or s3a://my-bucket/reports/ * - boxConnection - Box Connection - * - accessToken - Access Token - * - recdirectorysearch - Read subdirectory recursively - If the path is a directory, recursively scan all subfolders for Excel files. Useful when processing many files at once. * - readAllSheets - Read All Sheets - Automatically read every sheet in the workbook and union them into one DataFrame. Great for files where each sheet has the same structure (e.g., monthly tabs). * - selectedSheets - Select Sheets - Comma-separated list of sheet names or indices to import (e.g., Sheet1,Summary,2023 Data or 0,2,5). Leave empty if reading only one sheet or using Read All Sheets. * - dataAddressMode - Data Source Mode - Choose how to specify which part of the sheet to read: FullSheet Read the entire sheet starting from the defined start line (default & fastest) CellRange – Import only a specific cell range (e.g., A5:Z1000). Ideal for very large sheets or when you only need a table inside the sheet NamedRange – Use a predefined named range from Excel (e.g., SalesData, ReportTable). Most reliable when the table position changes * - startLine - Start Data Import on Line - Row number (1 = first row) where actual data begins. Useful when Excel files have titles, logos, or blank rows at the top. * - range - Range - Excel-style cell range in A1 notation (e.g., A1:F1000, B5:Z500). Limits reading to only these cells – perfect for very large sheets. * - header_option - Header Handling Mode - How to treat headers: • NO_HEADER – columns become _c0, _c1, etc. • FIRST_ROW – first row at Start Line becomes column names • PREVIOUS_ROW – row before Start Line is header • CUSTOM_ROW – specify exact header row(s) * - customHeaderRow - Custom Header Row - Row number or range (e.g., 5 or 5:7 for multi-row headers) to use as header. Only active when Header Handling Mode = CUSTOM_ROW. * - importSheetNamesOnly - Import Only Sheet Names - Instead of data, output a simple list of all sheet names in the workbook. Ideal for discovery or dynamic sheet selection workflows. * - dropSpecialCharacterInColumnName - Drop Special Character In Column Name - Automatically clean column names by replacing spaces, #, *, /, etc. with underscores and removing invalid characters. Recommended for downstream compatibility. * - schema - InferSchema - * - outputFileName - Output File Name as Field - Add source file info as a new column: • no – nothing added • filename – just the file name (e.g., report.xlsx) • fullpath – complete path (useful for traceability) * - sheetNameDisplay - Output Sheet Name as Field - Adds a new column sheet_name containing the source sheet for each row. Essential when reading multiple sheets together. * - enforceSchema - Enforce Schema - Forcefully apply the defined schema (column names & types). Ignores Excel headers completely and casts all data – use when Excel headers are unreliable or missing. * - outputColNames - Column Names for the Excel - Explicitly define output column names. Overrides any header from Excel. Leave empty to use Excel headers (or auto-generated names). * - outputColTypes - Column Types for the Excel - Force column data types (String, Integer, Long, Double, Boolean, Date, Timestamp). Critical for correct calculations and joins. * - outputColFormats - Column Formats for the Excel - Date/timestamp format patterns (e.g., yyyy-MM-dd, dd/MM/yyyy HH:mm:ss). Only needed when Excel stores dates as text or custom formats. * - interactiveRunConfiguration - Interactive Run Configuration - * - limitInputRecords - Limit Input Records - Maximum rows to preview in interactive/design mode (default 20). Does not affect batch execution. * - dataSampling - Data Sampling - How to sample data in interactive mode: • Random – picks random rows • Selective – use a filter expression to choose exactly which rows to preview * - selectiveSamplingExpression - Selective Filter Expression - SQL-like filter to pull specific rows in interactive mode (e.g., "Region = 'North' AND Year = 2024"). Only active when Data Sampling = Selective. Details ------- Read Excel Advanced Node – Complete Guide +++++++++++++++ The Read Excel Advanced node is the most powerful and flexible way to import Excel files into your pipeline. It handles real-world business Excel files with multiple sheets, merged cells, titles, footers, and inconsistent formatting – all while giving you full control over schema, metadata, and performance. When to Use This Node +++++++++++++++ * Business reports with headers starting on row 5–10 * Workbooks where each month/year is a separate sheet with identical layout * Need to read only a specific table from a huge sheet * Requirement to trace source file and sheet name for audit/compliance * Excel files with messy column names (spaces, special chars) Key Features +++++++++++++++ * Read single sheet, multiple sheets, all sheets * Precise cell range selection (A1 notation) * Multi-row or custom-row headers * Automatic or forced schema with type casting * Adds filename and sheet_name columns for traceability * Recursively processes folders of Excel files * Cleans column names automatically * High performance even on large workbooks Core Parameters Explained +++++++++++++++ Path +++++++++++++++ Full location of the Excel file or folder (supports HDFS, S3, GCS, ADLS, local, etc.). Read All Sheets + Select Sheets +++++++++++++++ Combine both for advanced use: read only specific sheets from many files, or read all sheets and union them. Start Data Import on Line +++++++++++++++ Skip logos, titles, or blank rows at the top – common in finance and reporting files. Range +++++++++++++++ Read only the exact table you need – ignore charts, comments, and extra data. Header Handling Mode +++++++++++++++ Flexible options for any header layout Excel users create. Enforce Schema +++++++++++++++ Critical when Excel headers change or are missing – guarantees stable column names and types downstream. Output File Name / Sheet Name as Field +++++++++++++++ Essential for audit trails and debugging when processing hundreds of files. Schema Tab (InferSchema) +++++++++++++++ Define exact column names, data types, and date formats. When Enforce Schema = true, these override everything in the Excel file. Examples ------- Read Excel Advanced – Real-World Examples +++++++++++++++ Example 1 – Monthly Sales Files (One Sheet per Month) +++++++++++++++ Scenario +++++++++++++++ Folder contains 12 files: Sales_Jan.xlsx, Sales_Feb.xlsx, ..., each with a sheet "Data" starting at row 6 (rows 1–5 are titles). Configuration +++++++++++++++ * Path: /data/sales_monthly/ * Read subdirectory recursively: false * Selected Sheets: Data * Start Data Import on Line: 6 * Header Handling Mode: FIRST_ROW * Output File Name as Field: filename * Drop Special Character In Column Name: true Result +++++++++++++++ One unified DataFrame with a filename column showing which month each row came from. Example 2 – All Sheets Have Same Structure (Yearly Workbook) +++++++++++++++ Scenario +++++++++++++++ File "2024_Sales.xlsx" has 12 sheets: Jan, Feb, ..., Dec – all with identical columns. Configuration +++++++++++++++ * Path: /data/2024_Sales.xlsx * Read All Sheets: true * Start Data Import on Line: 1 * Header Handling Mode: FIRST_ROW * Output Sheet Name as Field: true * Drop Special Character In Column Name: true Result +++++++++++++++ Single DataFrame with extra column sheet_name = "Jan", "Feb", etc. Example 3 – Extract Only Named Table from Dashboard File +++++++++++++++ Scenario +++++++++++++++ Configuration +++++++++++++++ * Path: s3://reports/dashboard_2024.xlsx * Header Handling Mode: FIRST_ROW * Enforce Schema: true (with predefined column names & types) Result +++++++++++++++ Clean, typed DataFrame containing only the intended table – ignores everything else. Example 4 – Read Specific Range with Custom Multi-Row Header +++++++++++++++ Scenario +++++++++++++++ Header spans rows 3–5, data starts at row 6, only need columns B to P. Configuration +++++++++++++++ * Range: B6:P1000 * Start Data Import on Line: 6 * Header Handling Mode: CUSTOM_ROW * Custom Header Row: 3:5 * Drop Special Character In Column Name: true Result +++++++++++++++ Proper multi-level column names (concatenated) and only the required columns. Example 5 – Discover All Sheets in a New File +++++++++++++++ Configuration +++++++++++++++ * Path: /incoming/new_report.xlsx * Import Only Sheet Names: true Result +++++++++++++++ :: | sheet_name | |--------------| | Cover | | Summary | | Raw Data | | Pivot Tables | Example 6 – Force Strict Schema (Finance Use Case) +++++++++++++++ Scenario +++++++++++++++ Excel files come from different regions with different header languages, but business needs consistent column names and types. Configuration +++++++++++++++ * Enforce Schema: true * Output Column Names: ["Transaction_ID", "Date", "Amount", "Currency", "Region", "Category"] * Output Column Types: ["string", "date", "double", "string", "string", "string"] * Output Column Formats: ["", "dd-MM-yyyy", "", "", "", ""] Result +++++++++++++++ Every file outputs exactly the same schema regardless of original headers or formats.