pdftotext
Convert PDF to plain text
TLDR
Convert filename.pdf to plain text and print it to stdout:
pdftotext filename.pdf -
Convert filename.pdf to plain text and save it as filename.txt:
pdftotext filename.pdf
Convert filename.pdf to plain text and preserve the layout:
pdftotext -layout filename.pdf
Convert input.pdf to plain text and save it as output.txt:
pdftotext input.pdf output.txt
Convert pages 2, 3 and 4 of input.pdf to plain text and save them as output.txt:
pdftotext -f 2 -l 4 input.pdf output.txt
SYNOPSIS
pdftotext [options] [PDF-file [text-file]]
PARAMETERS
-f
First page to convert
-l
Last page to convert
-layout
Maintain original physical layout
-table
Table mode: like -layout, but tuned to keep the rows and columns of tabular data aligned (provided by Xpdf's pdftotext; Poppler builds may not accept it)
-fixed
Assume fixed-pitch (monospace) text with the given character width, in points
-enc
Output text encoding name
-eol
End-of-line convention (unix, dos, mac)
-nopgbrk
Don't insert page breaks (form feed characters) between pages
-opw
Owner password (for encrypted files)
-upw
User password (for encrypted files, if different from owner)
-bbox
Output an XHTML file containing the bounding box of each word
-htmlmeta
Output a simple HTML file with meta information
-v
Print copyright and version info
-h
Print usage information
-help
Print usage information
-version
Print copyright and version info
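The options above can be combined on one command line. As a minimal sketch, the hypothetical wrapper function extract_layout (an illustration, not part of pdftotext) converts a page range while keeping the physical layout, requesting UTF-8 output, and suppressing form feeds:

```shell
# extract_layout is a hypothetical helper, not part of pdftotext.
# Usage: extract_layout FIRST LAST INPUT.pdf OUTPUT.txt
extract_layout() {
    if [ "$#" -ne 4 ]; then
        echo "usage: extract_layout FIRST LAST INPUT.pdf OUTPUT.txt" >&2
        return 2
    fi
    # -f/-l select the page range, -layout keeps the physical layout,
    # -enc sets the output encoding, -nopgbrk drops page-break characters.
    pdftotext -f "$1" -l "$2" -layout -enc UTF-8 -nopgbrk "$3" "$4"
}
```

For example, extract_layout 2 4 input.pdf output.txt would convert pages 2 to 4 of input.pdf.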
DESCRIPTION
pdftotext extracts the text content of a Portable Document Format (PDF) file and writes it as plain text, which is useful for indexing, searching, or editing. It is one of the command-line utilities shipped with the Poppler PDF library. By default the text is emitted in reading order; the -layout option instead tries to keep the original physical layout as closely as possible. If no output file is given, the text is saved to a file named after the input with a .txt extension; giving - as the output file sends the text to stdout, so it can be piped to other commands for further processing.
The command supports options that control the extraction, such as the page range to convert, the output encoding, and layout handling. This makes pdftotext a lightweight way to access and reuse the text stored in PDF documents.
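Piping the output to another command might look like the sketch below. The function name pdf_grep is hypothetical and chosen for illustration; the only pdftotext feature it relies on is that an output file of - means stdout.

```shell
# pdf_grep is a hypothetical helper, not part of pdftotext.
# Usage: pdf_grep PATTERN FILE.pdf
pdf_grep() {
    # "-" as the output file sends the text to stdout so it can be piped;
    # grep -n prefixes each match with its line number in the extracted
    # text, and -i makes the match case-insensitive.
    pdftotext "$2" - | grep -n -i -- "$1"
}
```

The line numbers refer to the extracted text, not to pages of the PDF.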
CAVEATS
Complex layouts or graphics-heavy documents may produce poorly formatted output. Encrypted PDFs may require the correct password, supplied with -opw or -upw. Depending on the complexity of the original PDF, the output may need further cleanup and reformatting.
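What cleanup is needed depends on the document, so the following is only a sketch of one common pass. The hypothetical filter clean_pdf_text (not part of pdftotext) reads extracted text on stdin, removes form feeds, strips trailing whitespace, and collapses runs of blank lines into one:

```shell
# clean_pdf_text is a hypothetical filter, not part of pdftotext.
# It reads extracted text on stdin and writes the cleaned text to stdout.
clean_pdf_text() {
    # 1. tr deletes form feed characters (page breaks).
    # 2. sed strips trailing whitespace from every line.
    # 3. awk prints non-blank lines and only the first of each blank run.
    tr -d '\f' |
        sed 's/[[:space:]]*$//' |
        awk 'NF { blank = 0; print; next } !blank++ { print }'
}
```

A possible use is pdftotext -layout report.pdf - | clean_pdf_text > report.txt, where report.pdf is a placeholder file name.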
RETURN VALUE
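pdftotext exits with status 0 on success and with a non-zero status on failure, such as file access problems or invalid parameters. The Poppler manual page lists the codes as 1 (error opening the PDF file), 2 (error opening the output file), 3 (error related to PDF permissions), and 99 (other error).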
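A script can branch on the exit status. The sketch below assumes only that 0 means success; the function name convert_or_report is hypothetical and not part of pdftotext.

```shell
# convert_or_report is a hypothetical helper, not part of pdftotext.
# Usage: convert_or_report INPUT.pdf OUTPUT.txt
convert_or_report() {
    if pdftotext "$1" "$2"; then
        echo "wrote $2"
    else
        # In the else branch, $? still holds pdftotext's exit status.
        status=$?
        echo "pdftotext failed on $1 (exit code $status)" >&2
        return "$status"
    fi
}
```

The function passes the original exit code back to the caller, so a calling script can still tell the failure cases apart.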
EXAMPLES
Convert a PDF file named 'document.pdf' to a text file named 'document.txt':
pdftotext document.pdf document.txt
Convert only pages 3 to 5 of 'report.pdf':
pdftotext -f 3 -l 5 report.pdf report.txt
HISTORY
pdftotext originated in Xpdf, the open-source PDF viewer and toolkit. Poppler, an open-source PDF rendering library, was forked from the Xpdf 3.0 code base and ships its own version of pdftotext among its utilities. The development of Poppler, and with it pdftotext, has been driven by the need for a free and open alternative to proprietary PDF readers and utilities. Over time, pdftotext has improved in text extraction accuracy and layout fidelity, and it is now widely packaged by Linux distributions and used by other applications for document processing.