Document To Text

The DocumentToText node extracts text content from documents, including PDF, TXT, DOCX, and image files, located in a specified file path or directory. It processes either a single file or an entire directory, producing a structured DataFrame with columns for the extracted text, file name, page numbers, and optionally, base64-encoded data for PDFs and images.

Input

Takes a file path or a directory path as input.

Output

Outputs a DataFrame with file name, page content, and page number columns.

Type

pyspark

Class

fire.nodes.gai.NodeDocumentToText

Fields

Name            | Title               | Description
----------------|---------------------|--------------------------------------------------------------------------
filePath        | Directory/File Path | Select a PDF, DOCX, or image file, or a directory.
fileType        | Document Type       | Choose a document type. If empty, all three types of files are processed.
recursive       | Recursive           | Recursively process the documents in the given directory.
isImage         | Image Encoding      | Adds a column for base64-encoded pages.
columns         | Rename Output Cols  |
fileNameCol     | File Name Column    | Rename the file name column. Defaults to fileName.
contentCol      | Content Column      | Rename the content column. Defaults to content.
pageNumberCol   | Page Number Column  | Rename the page number column. Defaults to pageNumber.
base64ImageCol  | Base64 Image Column | Rename the image column. Defaults to base64ImageCol.

Details

DocumentToText Node Details


General:

Directory/File Path: Specifies the path to a single document file or a directory containing multiple documents. This field is required and must be accessible to the PySpark engine.

Document Type: Selects the types of documents to process. Options include:

  • pdf: Processes PDF files, extracting text and optionally converting pages to base64-encoded images.

  • docx: Processes Microsoft Word documents, extracting text only.

  • image: Processes image files (e.g., PNG, JPEG) for text extraction via OCR, with optional base64 encoding.

If left empty, the node processes all supported file types (PDF, DOCX, and images) in the specified path.
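
As a rough illustration of the Document Type filter, the sketch below maps each option to a set of file extensions and falls back to all of them when no type is selected. The extension sets and the `select_files` helper are assumptions made for this example; the node's actual extension list is not documented here.

```python
from pathlib import PurePath

# Hypothetical mapping from Document Type options to file extensions.
# The node's real extension list may differ.
EXTENSIONS = {
    "pdf": {".pdf"},
    "docx": {".docx"},
    "image": {".png", ".jpg", ".jpeg"},
}


def select_files(file_names, file_types=None):
    """Keep only the files whose extension matches the chosen types.

    An empty or missing file_types selects every type in EXTENSIONS,
    mirroring the "if left empty" behaviour described above.
    """
    chosen = file_types or list(EXTENSIONS)
    allowed = set()
    for file_type in chosen:
        allowed |= EXTENSIONS[file_type]
    return [name for name in file_names if PurePath(name).suffix.lower() in allowed]


files = ["report.pdf", "proposal.docx", "chart.png", "notes.csv"]
pdf_only = select_files(files, ["pdf"])
all_types = select_files(files)
print(pdf_only)
print(all_types)
```

Files with an extension outside the mapping, such as notes.csv here, are skipped in both calls.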

Image Encoding: Determines whether to include a column with base64-encoded data for PDFs and images. Options are:

  • true: Adds a column with base64-encoded representations of PDF pages and image files.

  • false: Does not include base64-encoded data (default).

Note: TXT and DOCX files are not converted to base64 encodings, even if this option is enabled.
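
To make the Image Encoding option concrete, here is a minimal sketch of base64-encoding raw bytes with Python's standard `base64` module. The `encode_page` helper and its `file_type` argument are hypothetical; how the node renders PDF pages to images before encoding is not covered here.

```python
import base64


def encode_page(raw_bytes, file_type):
    """Return a base64 string for PDF pages and images, or None otherwise.

    Hypothetical helper: it only mirrors the rule that TXT and DOCX
    content is never base64-encoded.
    """
    if file_type not in ("pdf", "image"):
        return None
    return base64.b64encode(raw_bytes).decode("ascii")


# The 8-byte signature that every PNG file starts with.
png_signature = b"\x89PNG\r\n\x1a\n"

encoded = encode_page(png_signature, "image")
skipped = encode_page(b"plain text", "docx")
print(encoded)
print(skipped)
```

Base64 strings that begin with iVBORw0KGgo correspond to PNG data, since that prefix is the encoded PNG signature.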

Recursive Processing:

Recursive: Controls whether the node processes documents in subdirectories. Options are:

  • true: Recursively processes all documents in the specified directory and its subdirectories.

  • false: Processes only documents directly in the specified directory (default).
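
The difference between the two Recursive settings can be sketched with Python's `pathlib`. This is an illustration of the listing behaviour only, not the node's implementation; `list_documents` is a hypothetical helper, and the directory tree is created in a temporary folder.

```python
import tempfile
from pathlib import Path


def list_documents(root, recursive=False):
    """Return the files under root, descending into subdirectories only if asked."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(p.relative_to(root).as_posix() for p in candidates if p.is_file())


# Build a throwaway directory tree to show the difference.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "report.pdf").touch()
    (base / "archive").mkdir()
    (base / "archive" / "old.pdf").touch()

    top_level = list_documents(base, recursive=False)
    everything = list_documents(base, recursive=True)

print(top_level)   # ['report.pdf']
print(everything)  # ['archive/old.pdf', 'report.pdf']
```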

Output Storage:

Output:

The node outputs a DataFrame with the following default columns:

  • fileName: The name of the source file.

  • content: The extracted text content from the document.

  • pageNumber: The page number of the extracted content (for multi-page documents like PDFs; single-page documents like TXT, DOCX, and images use page number 1).

  • If Image Encoding is set to true, a base64ImageData column is included for PDFs and images, containing base64-encoded representations of the pages or images. TXT and DOCX files will have null in this column.
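
The Rename Output Cols fields replace these default names. The sketch below shows the effect on a single row, using a plain Python dictionary in place of a DataFrame row. The `rename_columns` helper and the custom names `source_file` and `text` are hypothetical and chosen only for illustration.

```python
def rename_columns(row, renames):
    """Return a copy of row with keys replaced according to renames.

    Columns without an entry in renames keep their default name.
    """
    return {renames.get(name, name): value for name, value in row.items()}


# One row using the default column names documented above.
row = {
    "fileName": "report.pdf",
    "content": "This is page 1 of the report...",
    "pageNumber": 1,
}

# Hypothetical settings: fileNameCol = source_file, contentCol = text.
renamed = rename_columns(row, {"fileName": "source_file", "content": "text"})
print(renamed)
```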

Examples

Example: DocumentToText Node

Input:

A directory /data/documents/ contains the following files:

  • report.pdf (a 2-page PDF document)

  • proposal.docx (a Microsoft Word document)

  • chart.png (an image file with text)

The DocumentToText node is configured as follows:

  • Directory/File Path: /data/documents/

  • Document Type: [“pdf”, “docx”, “image”] (process all supported types)

  • Image Encoding: true (includes base64-encoded data for PDFs and images)

  • Recursive: false (processes only files in the specified directory)

Output:

The node processes the files and produces a DataFrame with the following structure:

fileName       | content                              | pageNumber | base64ImageData
---------------|--------------------------------------|------------|----------------------------------
report.pdf     | This is page 1 of the report...      | 1          | iVBORw0KGgoAAAANSUhEUg...
report.pdf     | This is page 2 of the report...      | 2          | iVBORw0KGgoAAAANSUhEUg...
proposal.docx  | Proposal for new project...          | 1          | null
chart.png      | Sales: Q1 2025...                    | 1          | iVBORw0KGgoAAAANSUhEUg...

Explanation:

  • The report.pdf file is processed, extracting text from both pages, resulting in two rows (one per page). With Image Encoding set to true, each page is also converted to a base64-encoded image in the base64ImageData column.

  • The proposal.docx file is processed, extracting its text content into a single row. No base64 encoding is applied, so base64ImageData is null.

  • The chart.png file is processed using OCR to extract text, and its base64-encoded image data is included in the base64ImageData column.

  • Since Recursive is set to false, only files directly in /data/documents/ are processed.
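
Because the output holds one row per page, a downstream step may want to stitch pages back into whole documents. The following is a minimal sketch of that step in plain Python, assuming the rows have been collected as dictionaries with the default column names; `join_pages` is a hypothetical helper, and the base64 column is omitted for brevity.

```python
from collections import defaultdict

# Rows copied from the example output above, without the base64 column.
rows = [
    {"fileName": "report.pdf", "content": "This is page 1 of the report...", "pageNumber": 1},
    {"fileName": "report.pdf", "content": "This is page 2 of the report...", "pageNumber": 2},
    {"fileName": "proposal.docx", "content": "Proposal for new project...", "pageNumber": 1},
    {"fileName": "chart.png", "content": "Sales: Q1 2025...", "pageNumber": 1},
]


def join_pages(rows):
    """Concatenate page contents per file, ordered by page number."""
    pages = defaultdict(list)
    for row in rows:
        pages[row["fileName"]].append((row["pageNumber"], row["content"]))
    return {
        name: "\n".join(text for _, text in sorted(items))
        for name, items in pages.items()
    }


documents = join_pages(rows)
for name, text in documents.items():
    print(name, "->", repr(text))
```

Sorting by page number before joining keeps the text in reading order even if the rows arrive out of order.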