PDF Image OCR

Reads PDF files from a given path, extracts the images embedded in them, and converts those images to text with Tesseract.

Input

It reads a single PDF file or a directory containing PDF files.

It creates a DataFrame from the data it reads and sends that DataFrame to its output.

dataset

fire.nodes.dataset.NodeDatasetPDFImageOCR

Name	Title	Description
path	Path of the PDF files	Path of the PDF file/directory
fileNameCol	File Name Column	File Name Column in the Output DataFrame
outputCol	Column Name which contains the result of OCR	OCR output column in the Output DataFrame
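
To make the role of the three parameters concrete, the following is a minimal Python sketch of the same idea. It is not the implementation of fire.nodes.dataset.NodeDatasetPDFImageOCR. It assumes one output row per PDF file, that only files directly inside the directory are read, and that PyMuPDF (`fitz`), `pytesseract`, and Pillow are installed. The function names `list_pdf_files`, `tesseract_ocr_pdf`, and `ocr_pdfs`, and the default column names, are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List


def list_pdf_files(path: str) -> List[Path]:
    """Return `path` itself if it is a file, else the PDFs directly inside it."""
    p = Path(path)
    if p.is_dir():
        # Assumption: no recursion into subdirectories.
        return sorted(f for f in p.iterdir() if f.suffix.lower() == ".pdf")
    return [p]


def tesseract_ocr_pdf(pdf_path: Path) -> str:
    """Extract every embedded image from one PDF and OCR it with Tesseract."""
    # Third-party imports are kept local so the rest of the sketch
    # can be used (and tested) without them.
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple item is the image's xref number
                image_bytes = doc.extract_image(xref)["image"]
                image = Image.open(io.BytesIO(image_bytes))
                texts.append(pytesseract.image_to_string(image))
    return "\n".join(texts)


def ocr_pdfs(
    path: str,
    ocr_pdf: Callable[[Path], str] = tesseract_ocr_pdf,
    file_name_col: str = "fileName",  # plays the role of fileNameCol
    output_col: str = "ocrText",      # plays the role of outputCol
) -> List[Dict[str, str]]:
    """Build one record per PDF: the file name and the OCR text."""
    return [
        {file_name_col: f.name, output_col: ocr_pdf(f)}
        for f in list_pdf_files(path)
    ]
```

Calling `ocr_pdfs("/path/to/pdfs")` returns a list of records that can be passed to a DataFrame constructor such as `pandas.DataFrame(rows)`. The path here is a placeholder. The OCR step is passed in as a callable so that a different engine can be swapped in without changing how the rows are assembled.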