PDF Image OCR

Reads in PDF Files from a given path, extracts the images from them, and converts them to text with Tesseract

Input

It reads in a PDF file or a directory containing PDF files

Output

It creates a DataFrame from the data read and sends it to its output

Type

dataset

Class

fire.nodes.dataset.NodeDatasetPDFImageOCR

Fields

Name

Title

Description

path

Path of the PDF files

Path of the PDF file/directory

fileNameCol

File Name Column

File Name Column in the Output DataFrame

outputCol

Column Name which contains the result of OCR

OCR output column in the Output DataFrame