ocrmypdf

OCR PDF files to make them searchable

TLDR

Create a new searchable PDF/A file from a scanned PDF or image file

$ ocrmypdf [path/to/input] [path/to/output.pdf]

Skip pages of a mixed-format input PDF file that already contain text
$ ocrmypdf --skip-text [path/to/input.pdf] [path/to/output.pdf]

Clean, de-skew, and rotate pages of a poor scan
$ ocrmypdf --clean --deskew --rotate-pages [path/to/input.pdf] [path/to/output.pdf]

Perform lossy optimization on a PDF without performing any OCR
$ ocrmypdf --tesseract-timeout 0 --optimize 2 --skip-text [path/to/input.pdf] [path/to/output.pdf]

Set the metadata of a searchable PDF file
$ ocrmypdf --title "[title]" --author "[author]" --subject "[subject]" --keywords "[keyword; key phrase; ...]" [path/to/input.pdf] [path/to/output.pdf]

Display help
$ ocrmypdf --help

SYNOPSIS

ocrmypdf [OPTIONS] input_pdf_or_image output_pdf

PARAMETERS

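-l LANGUAGE, --language LANGUAGE
    Specify the language(s) for OCR using Tesseract language codes, joined with '+' for multiple languages (e.g., 'eng+fra'). Requires the corresponding Tesseract language data to be installed.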

--force-ocr
    Forces OCR to run on all pages, even if they appear to contain text already.

--redo-ocr
    Re-runs OCR on pages that already have an OCR text layer, replacing the previous hidden text. Useful for correcting recognition errors or applying different OCR options.

INPUT OUTPUT (same path)
    To process a file in place, pass the same path as both input and output (e.g., ocrmypdf file.pdf file.pdf); the input is overwritten with the OCR-processed version.

--deskew
    Attempts to detect and correct skewed pages in the input PDF, improving OCR accuracy.

--optimize N
    Sets the optimization level for the output PDF. N ranges from 0 (no optimization) to 3 (most aggressive compression); levels 2 and 3 use lossy compression.

--sidecar FILE
    Saves the recognized text in a separate plain-text file alongside the output PDF.

--output-type TYPE
    Specifies the output type: 'pdfa' (the default, which produces PDF/A), 'pdfa-1', 'pdfa-2', 'pdfa-3', or 'pdf' for a standard PDF without PDF/A conversion.

-j N, --jobs N
    Specifies the number of parallel jobs (pages processed concurrently) to make use of multiple CPU cores.
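
The options above can be combined in a small batch script. The following is a minimal sketch, not part of ocrmypdf itself: the function name ocr_dir, the directory layout, and the choice of 'eng' and two jobs are assumptions made for illustration. It runs OCR on every PDF in a source directory and writes the output PDF and a sidecar text file to a destination directory.

```shell
# ocr_dir SRC DST: hypothetical helper, not shipped with ocrmypdf.
# Runs ocrmypdf on each *.pdf in SRC and writes results to DST.
ocr_dir() {
    src=$1
    dst=$2
    failed=0
    mkdir -p "$dst" || return 1
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue              # glob matched nothing
        name=$(basename "$f" .pdf)
        # Language, deskewing and job count are example values.
        ocrmypdf -l eng --deskew --jobs 2 \
            --sidecar "$dst/$name.txt" \
            "$f" "$dst/$name.pdf" || failed=$((failed + 1))
    done
    echo "files that failed: $failed"
    [ "$failed" -eq 0 ]                      # non-zero status if any file failed
}
```

A call such as `ocr_dir scans searchable` would then process scans/*.pdf into the searchable directory; a file that fails is counted and the loop continues with the next one.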

DESCRIPTION

ocrmypdf is a free and open-source command-line tool that adds an OCR (Optical Character Recognition) text layer to scanned PDF documents and images. It processes each page with the Tesseract OCR engine and embeds the recognized text as an invisible layer beneath the original page image. The result is searchable and selectable, while its visual appearance stays unchanged. ocrmypdf also handles mixed input in which some pages already contain text, and offers image optimization, deskewing, and automatic page rotation to produce compact, searchable PDFs. It is widely used for digitizing and archiving paper documents.

CAVEATS

OCR quality depends heavily on the image quality of the input. Poor scans, low resolution, or unusual fonts can lead to inaccurate text recognition. Processing can be CPU- and memory-intensive, especially for large documents or many parallel jobs. ocrmypdf adds a searchable text layer but does not convert the PDF into an editable document. The Tesseract OCR engine and the language data for every language being recognized must be installed for ocrmypdf to work.

PDF/A CONFORMANCE

ocrmypdf can generate output PDFs conforming to PDF/A (Portable Document Format for Archiving) standards, such as PDF/A-1b, PDF/A-2b, or PDF/A-3b. This is crucial for long-term archival of documents, ensuring their accessibility and readability across different software and future technologies.
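
As a rough illustration of requesting a specific conformance level, the sketch below wraps a single ocrmypdf call that asks for PDF/A-2 output and reports the exit status on failure. The wrapper name archive_scan and its messages are hypothetical; only the --output-type option comes from ocrmypdf.

```shell
# archive_scan INPUT OUTPUT: hypothetical wrapper, not part of ocrmypdf.
# Requests PDF/A-2 output and passes ocrmypdf's exit status through.
archive_scan() {
    if ocrmypdf --output-type pdfa-2 "$1" "$2"; then
        echo "wrote PDF/A-2 file: $2"
    else
        status=$?                            # exit status of the ocrmypdf call
        echo "ocrmypdf failed with exit status $status" >&2
        return "$status"
    fi
}
```

Passing the status through lets a calling script decide whether to retry, skip the file, or stop.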

PREREQUISITES

For ocrmypdf to function, the Tesseract OCR engine must be installed on your system. Additionally, you need to install the specific Tesseract language data packs for any languages you intend to OCR (e.g., 'tesseract-ocr-eng' for English on Debian/Ubuntu-based systems).
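
Whether the required language data is present can be checked before starting a long job. The sketch below is one possible approach, assuming that `tesseract --list-langs` prints one installed language code per line; the function name check_ocr_langs and its messages are made up for this example.

```shell
# check_ocr_langs LANG...: hypothetical helper, not part of ocrmypdf.
# Reports each requested Tesseract language that is not installed.
check_ocr_langs() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH" >&2
        return 1
    fi
    # Some Tesseract versions print the list on stderr, so capture both streams.
    installed=$(tesseract --list-langs 2>&1)
    missing=0
    for lang in "$@"; do
        if ! printf '%s\n' "$installed" | grep -qxF "$lang"; then
            echo "missing language data: $lang"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_ocr_langs eng fra && ocrmypdf -l eng+fra in.pdf out.pdf` would only start OCR when both languages are available.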

HISTORY

ocrmypdf was originally created by the GitHub user fritz-hh, with early releases appearing around 2014, and has since been maintained by James R. Barlow. It emerged as a practical way to automate adding OCR text layers to scanned PDFs using the Tesseract OCR engine. It remains an active open-source project on GitHub, with new features, optimizations, and bug fixes contributed by a community of developers, and is used for digitizing documents in both personal and professional contexts.

SEE ALSO

tesseract(1), qpdf(1), pdftk(1), gs(1)
