pdfgrep

Search text within PDF files

TLDR

Find lines that match pattern in a PDF

$ pdfgrep [pattern] [file.pdf]

Include file name and page number for each matched line

$ pdfgrep [[-H|--with-filename]] [[-n|--page-number]] [pattern] [file.pdf]

Do a case-insensitive search for lines that begin with file_name and return the first 3 matches

$ pdfgrep [[-m|--max-count]] [3] [[-i|--ignore-case]] '^[file_name]' [file.pdf]

Find pattern in files with a .pdf extension in the current directory recursively

$ pdfgrep [[-r|--recursive]] [pattern]

Find pattern in files that match a specific glob in the current directory recursively

$ pdfgrep [[-r|--recursive]] --include '[*book.pdf]' [pattern]
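
When file names and page numbers are both printed, each output line is assumed here to have the shape file:page:text, with a colon as separator. The sketch below condenses such lines into one entry per distinct page. The helper name matched_pages is hypothetical, and the approach assumes file names contain no colons.

```shell
#!/bin/sh
# Sketch: summarise "file:page:text" lines (the assumed output shape of
# pdfgrep -H -n) into one "file page" line per distinct page.
# matched_pages is a hypothetical helper name, not part of pdfgrep.
matched_pages() {
    # Split on ":" and print a file/page pair only the first time it is seen.
    awk -F: '!seen[$1 FS $2]++ { print $1, $2 }'
}

# Demonstration with canned input instead of a real pdfgrep run.
printf 'a.pdf:3:first hit\na.pdf:3:second hit\nb.pdf:1:x\n' | matched_pages
```

With real files this could be used as: pdfgrep -H -n [pattern] [file.pdf] | matched_pages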

SYNOPSIS

pdfgrep [OPTION]... PATTERN [FILE]...

PATTERN
    The regular expression pattern to search for within the PDF content.

FILE
    One or more PDF files or directories to search. pdfgrep does not read PDF data from standard input; if FILE is omitted, -r must be given, in which case the current directory is searched.

-i, --ignore-case
    Ignore case distinctions in the pattern and input data.

-n, --page-number
    Prefix each match with the page number where it was found.

-H, --with-filename
    Print the filename for each match (default when multiple files are given).

-h, --no-filename
    Suppress the prefixing of filenames on output.

-c, --count
    Print only the number of matches for each input file. Unlike grep, multiple matches on the same line are counted individually.

-l, --files-with-matches
    Print only the names of files that contain at least one match.

-L, --files-without-match
    Print only the names of files that contain no matches.

-r, --recursive
    Recursively search directories specified on the command line.

-A NUM, --after-context=NUM
    Print NUM lines of trailing context after a match.

-B NUM, --before-context=NUM
    Print NUM lines of leading context before a match.

-C NUM, --context=NUM
    Print NUM lines of both leading and trailing context around a match.

-o, --only-matching
    Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
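
The per-file counts printed by -c can be totalled in a script. The sketch below assumes the output is one file:count line per file (or a bare count when only one file is searched) and sums the last colon-separated field. The helper name sum_counts is hypothetical.

```shell
#!/bin/sh
# Sketch: add up per-file counts from lines shaped like "file:count"
# (the assumed output of pdfgrep -c when several files are searched).
# sum_counts is a hypothetical helper name, not part of pdfgrep.
sum_counts() {
    # $NF is the last ":"-separated field, so a bare count also works.
    awk -F: '{ total += $NF } END { print total + 0 }'
}

# Demonstration with canned input instead of a real pdfgrep run.
printf 'a.pdf:3\nb.pdf:0\ndocs/c.pdf:4\n' | sum_counts
```

With real files this could be used as: pdfgrep -r -c [pattern] | sum_counts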

DESCRIPTION

pdfgrep is a command-line utility that searches for regular expression patterns in PDF files. It works like grep, but extracts the textual content of each PDF before applying the pattern, so information spread across many PDF files can be found without opening each one.

pdfgrep supports many grep-like options, including case-insensitive search, page-number prefixes, recursive directory scanning, and printing only the names of matching files or match counts. Text extraction is handled by the Poppler library, so only text embedded in the PDF can be searched; scanned pages are searchable only if Optical Character Recognition (OCR) has added a text layer.

CAVEATS

pdfgrep relies on the text layer embedded within PDF documents. It cannot search text that is part of scanned images within a PDF unless Optical Character Recognition (OCR) has already been applied to convert the image text into a searchable text layer.

Its performance can also vary significantly based on PDF complexity and size. Text extraction quality may differ depending on the PDF's internal structure, fonts, and layout.
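
One rough way to tell whether a PDF has a usable text layer is to extract its text with pdftotext and check that anything alphanumeric comes out. The sketch below only covers the check on the extracted text; the helper name has_searchable_text is hypothetical, and an empty result is a hint rather than proof that OCR is needed.

```shell
#!/bin/sh
# has_searchable_text is a hypothetical helper: it reads already-extracted
# text on stdin and succeeds if at least one letter or digit is present.
has_searchable_text() {
    grep -q '[[:alnum:]]'
}

# Canned demonstration: real text passes, a bare form feed does not.
printf 'Chapter 1\n' | has_searchable_text && echo "text layer present"
printf '\f\n' | has_searchable_text || echo "no text found, OCR may be needed"
```

With a real file this could be used as: pdftotext [file.pdf] - | has_searchable_text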

DEPENDENCIES

pdfgrep uses the Poppler PDF rendering library (through its poppler-cpp interface) for text extraction. Poppler must be installed for pdfgrep to build and run.

EXIT STATUS

Like grep, pdfgrep reports the search result through its exit status: 0 if at least one match was found, 1 if no match was found, and 2 if an error occurred. This lets shell scripts branch on the result.
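
A script can map these three statuses to actions. The following is a minimal sketch; describe_status is a hypothetical helper name and the messages are illustrative only.

```shell
#!/bin/sh
# describe_status is a hypothetical helper: it translates a pdfgrep exit
# status (0, 1, or anything else) into a short message.
describe_status() {
    case "$1" in
        0) echo "match found" ;;
        1) echo "no match" ;;
        *) echo "error" ;;
    esac
}

# Demonstration with literal status values instead of a real pdfgrep run.
describe_status 0
describe_status 1
describe_status 2
```

With a real file this could be used as: pdfgrep [pattern] [file.pdf] > /dev/null; describe_status $?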

HISTORY

pdfgrep was developed out of the need for a grep-like tool specifically tailored for PDF documents. Before its existence, users often had to convert PDFs to plain text using tools like pdftotext and then grep the resulting text files. pdfgrep automates this process by integrating the text extraction and searching capabilities into a single command, providing a more convenient and efficient workflow for PDF content searching. It was first released in 2009.