LinuxCommandLibrary

pdftotext

Convert PDF to plain text

TLDR

Convert filename.pdf to plain text and print it to stdout

$ pdftotext [filename.pdf] -

Convert filename.pdf to plain text and save it as filename.txt
$ pdftotext [filename.pdf]

Convert filename.pdf to plain text and preserve the layout
$ pdftotext -layout [filename.pdf]

Convert input.pdf to plain text and save it as output.txt
$ pdftotext [input.pdf] [output.txt]

Convert pages 2, 3 and 4 of input.pdf to plain text and save them as output.txt
$ pdftotext -f [2] -l [4] [input.pdf] [output.txt]

SYNOPSIS

pdftotext [options] [PDF-file [text-file]]

PARAMETERS

-f number
    First page to convert

-l number
    Last page to convert

-layout
    Maintain original physical layout

-table
    Maintain table formatting, detect table and output text

-fixed number
    Assume fixed-pitch (monospace) text with the given character width (in points)

-enc encoding-name
    Output text encoding (default: UTF-8)

-eol unix | dos | mac
    End-of-line convention

-nopgbrk
    Don't insert page breaks between pages

-opw password
    Owner password for an encrypted file

-upw password
    User password for an encrypted file (if different from the owner password)

-bbox
    Output an XHTML file containing the bounding box of each word

-htmlmeta
    Output a simple HTML file with meta information

-v
    Print copyright and version info

-h
    Print usage information

-help
    Print usage information

-version
    Print copyright and version info
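
As a minimal sketch of how several of these options combine, the helper below converts a page range to UTF-8 text with Unix line endings and no page-break characters, and writes it to stdout. The function name pdf_pages_to_text and its argument order are assumptions made for illustration; they are not part of pdftotext.

```shell
# Hypothetical helper (not part of pdftotext): print pages FIRST..LAST as text.
# Usage: pdf_pages_to_text FIRST LAST FILE.pdf
pdf_pages_to_text() {
    # -layout keeps the physical layout, -nopgbrk omits the page breaks
    # between pages, and the final "-" sends the text to stdout.
    pdftotext -f "$1" -l "$2" -layout -enc UTF-8 -eol unix -nopgbrk "$3" -
}
```

For example, pdf_pages_to_text 2 4 input.pdf > output.txt saves pages 2 to 4 of input.pdf.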

DESCRIPTION

pdftotext extracts the text content of a Portable Document Format (PDF) file and writes it as plain text, which is useful for indexing, searching, or editing. It is distributed with the Poppler utilities. By default the text is emitted in reading order; the -layout option instead tries to preserve the physical layout of the page. If the output file is given as -, the text goes to stdout and can be piped to other commands; if it is omitted, the text is saved to a file named after the input with a .txt extension.
Options control the page range to extract, the output encoding, the end-of-line convention, and the layout mode.
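
Piping the output into another command might look like the sketch below, which searches a PDF for a pattern with grep. The helper name pdf_grep is hypothetical and only wraps the two commands.

```shell
# Hypothetical helper (not part of pdftotext): grep inside a PDF.
# Usage: pdf_grep PATTERN FILE.pdf
pdf_grep() {
    # "-" makes pdftotext write to stdout; grep -n prefixes each match
    # with its line number in the extracted text (not the PDF page number).
    pdftotext "$2" - | grep -n -- "$1"
}
```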

CAVEATS

Complex layouts (multiple columns, tables, text placed over graphics) can produce poorly ordered or poorly formatted output. Scanned PDFs that contain only page images have no text to extract, because pdftotext does not perform OCR. Encrypted PDFs may require a password (-upw or -opw). Depending on the complexity of the original PDF, the output may need further cleanup.
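
One possible cleanup pass is sketched below: a filter that removes form-feed page breaks, strips trailing whitespace, and collapses runs of blank lines into one. The name clean_text and the choice of cleanup steps are assumptions for illustration; real documents may need different rules.

```shell
# Hypothetical filter (not part of pdftotext): tidy extracted text on stdin.
# Usage: pdftotext input.pdf - | clean_text > output.txt
clean_text() {
    tr -d '\f' | awk '
        { sub(/[ \t\r]+$/, "") }        # strip trailing whitespace
        NF { blank = 0; print; next }   # non-empty line: print it as is
        !blank++ { print }              # print only the first blank line of a run
    '
}
```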

RETURN VALUE

pdftotext returns 0 on success and a non-zero exit code on failure. The Poppler man page documents 1 (error opening the PDF file), 2 (error opening the output file), 3 (error related to PDF permissions), and 99 (other error).
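
A script can branch on the exit status, as in the sketch below. The wrapper name convert_or_report and its messages are hypothetical.

```shell
# Hypothetical wrapper (not part of pdftotext): convert and report failures.
# Usage: convert_or_report INPUT.pdf OUTPUT.txt
convert_or_report() {
    if pdftotext "$1" "$2"; then
        echo "wrote $2"
    else
        # In the else branch, $? still holds the exit status of pdftotext.
        status=$?
        echo "pdftotext failed with exit code $status for $1" >&2
        return "$status"
    fi
}
```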

EXAMPLES

Convert a PDF file named 'document.pdf' to a text file named 'document.txt':
pdftotext document.pdf document.txt
Convert only pages 3 to 5 of 'report.pdf':
pdftotext -f 3 -l 5 report.pdf report.txt

HISTORY

pdftotext originated in the Xpdf project. The version shipped by most Linux distributions comes from Poppler, an open-source PDF rendering library forked from Xpdf to provide a free, shared rendering engine and a set of tools for working with PDF files. Over time, pdftotext has improved in extraction accuracy and layout fidelity, and it is widely used for document processing on Linux.
