LinuxCommandLibrary

pdftotext

Convert PDF to plain text

TLDR

Convert filename.pdf to plain text and print it to stdout

$ pdftotext [filename.pdf] -

Convert filename.pdf to plain text and save it as filename.txt
$ pdftotext [filename.pdf]

Convert filename.pdf to plain text and preserve the layout
$ pdftotext -layout [filename.pdf]

Convert input.pdf to plain text and save it as output.txt
$ pdftotext [input.pdf] [output.txt]

Convert pages 2, 3 and 4 of input.pdf to plain text and save them as output.txt
$ pdftotext -f [2] -l [4] [input.pdf] [output.txt]

SYNOPSIS

pdftotext [options] <PDF-file> [<text-file>]

PARAMETERS

-f <int>
    Specifies the first page to convert.

-l <int>
    Specifies the last page to convert.

-layout
    Maintains the original physical layout of the text on the page, useful for human readability.

-raw
    Keeps text in content stream order (the order in which it is stored in the PDF) instead of reconstructing reading order. This often undoes column formatting. Less readable, but can be useful for some automation.

-nopgbrk
    Suppresses page breaks (form feeds) between pages in the output text file.

-q
    Suppresses all error and warning messages.

-opw <password>
    Provides the owner password for encrypted PDF files.

-upw <password>
    Provides the user password for encrypted PDF files.

-enc <name>
    Sets the output text encoding (e.g., 'UTF-8', 'Latin1'). Defaults to 'UTF-8'.

-eol <convention>
    Sets the end-of-line convention: 'unix' (LF), 'dos' (CRLF), or 'mac' (CR).

-r <float>
    Sets the resolution, in DPI. The default is 72.
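
Several of the options above are commonly combined. The following is a minimal sketch of a shell helper that prints a page range to stdout without form feeds; pdf_pages is a hypothetical function name chosen for illustration, not part of Poppler or Xpdf.

```shell
# pdf_pages FILE FIRST LAST
# Hypothetical helper: print pages FIRST..LAST of FILE to stdout.
pdf_pages() {
    # -q        suppress warnings and errors
    # -nopgbrk  omit the form feed between pages
    # -eol unix force LF line endings
    # "-"       write the text to stdout instead of a file
    pdftotext -q -nopgbrk -eol unix -f "$2" -l "$3" "$1" -
}
```

For example, pdf_pages report.pdf 2 4 would print pages 2 through 4 of a file named report.pdf (an assumed file name).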

DESCRIPTION

The pdftotext command is part of the Poppler and Xpdf tool suites. It extracts plain text from Portable Document Format (PDF) files, writing the result to an output file or to standard output. Typical uses include:

Data extraction: For programmatic processing or analysis of document content.
Indexing: Preparing PDF text for search engines or document management systems.
Scripting: Automating text-based tasks involving PDF documents.
Accessibility: Converting PDFs to more accessible plain text formats.

pdftotext works well on text-heavy documents, but output quality varies with complex layouts such as tables or multi-column designs. It cannot extract text from scanned PDFs (which are essentially images of text) unless Optical Character Recognition (OCR) has been applied first. Options control page ranges, layout preservation, output encoding, and the handling of password-protected files.
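
As an illustration of the scripting and indexing uses listed above, the sketch below converts every PDF in a directory to a text file with the same base name. pdf_dir_to_text is a hypothetical helper name, and the choice of -layout is an assumption that may not suit every document.

```shell
# pdf_dir_to_text [DIR]
# Hypothetical helper: convert each DIR/*.pdf to DIR/*.txt (DIR defaults to ".").
pdf_dir_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If the directory holds no PDFs the glob stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        # Strip the .pdf suffix and write the text next to the source file.
        pdftotext -q -layout "$pdf" "${pdf%.pdf}.txt" ||
            printf 'failed: %s\n' "$pdf" >&2
    done
}
```

Files that fail to convert are reported on stderr, and the loop continues with the next file.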

CAVEATS

Scanned PDFs: pdftotext operates on embedded text. It cannot extract text from scanned PDFs that are essentially images; OCR (Optical Character Recognition) is required first.
Complex Layouts: Documents with intricate layouts, such as tables, multi-column designs, or text flowing around graphics, can result in misaligned or garbled output, even with the -layout option.
Font Issues: Problems with specific embedded fonts or non-standard character encodings within the PDF might lead to incorrect character output.
Password Protection: Encrypted PDFs require the correct user or owner password to be provided using the -upw or -opw options for successful text extraction.
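
The scanned-PDF caveat can be checked before further processing. The sketch below is a heuristic only: it assumes that an image-only PDF yields no non-whitespace characters, which may not hold for files that carry a stray text element. has_text_layer is a hypothetical helper name.

```shell
# has_text_layer FILE
# Hypothetical helper: succeed if pdftotext extracts any non-whitespace
# character from FILE, fail otherwise.
has_text_layer() {
    # Form feeds and newlines belong to [:space:], so a PDF that produces
    # only page breaks counts as having no text.
    pdftotext -q "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

A file that fails this check probably needs OCR before pdftotext can return anything useful.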

ENCODING AND CHARACTER SETS

By default, Poppler's pdftotext writes text in UTF-8, which can represent the characters of virtually all languages and is the standard encoding on modern systems. For compatibility with older text processing tools or specific system requirements, the -enc option selects another encoding such as 'Latin1' (ISO-8859-1) or 'UCS-2'. If the output is later read with a different encoding than it was written in, non-ASCII text appears as 'mojibake' (garbled or unreadable characters). If characters display incorrectly, check that the encoding passed to -enc matches the one the consuming program expects.
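
The round trip below is a small sketch of the encoding behaviour described above: one function writes Latin1 output for a legacy tool, and the other converts such a file back to UTF-8 for display. Both function names are hypothetical, and the second assumes that iconv is installed.

```shell
# pdf_to_latin1 INPUT.pdf OUTPUT.txt
# Hypothetical helper: extract text encoded as Latin1 (ISO-8859-1).
pdf_to_latin1() {
    pdftotext -enc Latin1 "$1" "$2"
}

# show_latin1 FILE
# Hypothetical helper: print a Latin1 file as UTF-8. Reading the file
# directly on a UTF-8 terminal would garble non-ASCII characters.
show_latin1() {
    iconv -f ISO-8859-1 -t UTF-8 "$1"
}
```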

HISTORY

pdftotext originated as a utility within the Xpdf project, a free PDF viewer and toolkit developed by Derek Noonburg and first released in 1995. Xpdf provided a set of command-line tools for various PDF manipulations. In 2005, the rendering code and utilities of Xpdf 3.0 were forked to create Poppler, hosted at freedesktop.org, so that PDF rendering could be maintained as a shared library and integrated more closely with Unix desktop software. Poppler became the de facto standard PDF rendering library on most Linux distributions and is used by document viewers such as GNOME's Evince and KDE's Okular. The pdftotext command shipped by most distributions today comes from the Poppler project.

SEE ALSO

pdfinfo(1), pdftoppm(1), pdfimages(1), pdfgrep(1), qpdf(1)
