PDF Image OCR¶
Reads PDF files from a given path, extracts the images embedded in them, and converts those images to text with Tesseract OCR.
Input¶
It reads a single PDF file or a directory containing PDF files.
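As a rough illustration of the two accepted input shapes, the sketch below resolves a path to a list of PDF files. The helper name `list_pdf_files` is hypothetical, and the non-recursive directory scan and case-insensitive `.pdf` match are assumptions; the node's actual behavior may differ.

```python
from pathlib import Path


def list_pdf_files(path):
    """Return the PDF files denoted by `path`, sorted by name.

    Hypothetical helper: `path` may be a single PDF file or a directory.
    Assumption: directories are scanned non-recursively.
    """
    p = Path(path)
    if p.is_dir():
        # Keep only regular files whose extension is .pdf (any case).
        return sorted(
            child for child in p.iterdir()
            if child.is_file() and child.suffix.lower() == ".pdf"
        )
    if p.is_file() and p.suffix.lower() == ".pdf":
        return [p]
    raise FileNotFoundError(f"No PDF file or directory at: {path}")
```

Calling `list_pdf_files("/data/scans")` on a directory would return every PDF directly inside it, while passing a single file returns a one-element list.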
Output¶
It creates a DataFrame from the extracted text, with the file name and OCR result stored in the columns named by the fields below, and sends it to its output.
Type¶
dataset
Class¶
fire.nodes.dataset.NodeDatasetPDFImageOCR
Fields¶
| Name | Title | Description |
|---|---|---|
| path | Path of the PDF files | Path of the PDF file/directory |
| fileNameCol | File Name Column | File Name Column in the Output DataFrame |
| outputCol | Column Name which contains the result of OCR | OCR output column in the Output DataFrame |
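
To make the role of the three fields concrete, here is a minimal Python sketch of the same idea outside Fire. It is not the node's implementation: the function names, the default column names `fileName` and `ocrText`, storing only the base file name, and the choice of PyMuPDF, Pillow, and pytesseract are all assumptions made for illustration. The OCR step is passed in as a callable so the row-building logic can be exercised without Tesseract installed.

```python
import io
from pathlib import Path


def ocr_pdf_images(pdf_path):
    """OCR every image embedded in one PDF and return the combined text.

    Assumption: PyMuPDF, Pillow and pytesseract are installed and the
    Tesseract binary is on PATH. Imports are local so the rest of the
    sketch works without them.
    """
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            # Each entry describes one embedded image; item 0 is its xref.
            for image_info in page.get_images(full=True):
                raw = doc.extract_image(image_info[0])["image"]
                image = Image.open(io.BytesIO(raw))
                texts.append(pytesseract.image_to_string(image))
    finally:
        doc.close()
    return "\n".join(texts)


def build_rows(pdf_paths, ocr=ocr_pdf_images,
               file_name_col="fileName", output_col="ocrText"):
    """Build one row per PDF, keyed by the configured column names.

    `file_name_col` and `output_col` play the role of the node's
    fileNameCol and outputCol fields (default values are hypothetical).
    """
    return [
        {file_name_col: Path(p).name, output_col: ocr(str(p))}
        for p in pdf_paths
    ]
```

The returned list of dictionaries can be passed to `pandas.DataFrame(rows)` to obtain a two-column DataFrame whose column names follow `fileNameCol` and `outputCol`.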