LinuxCommandLibrary

pdfgrep

Search text within PDF files

TLDR

Find lines that match pattern in a PDF

$ pdfgrep [pattern] [file.pdf]

Include file name and page number for each matched line
$ pdfgrep [[-H|--with-filename]] [[-n|--page-number]] [pattern] [file.pdf]

Do a case-insensitive search for lines that begin with "foo" and return the first 3 matches
$ pdfgrep [[-m|--max-count]] [3] [[-i|--ignore-case]] ['^foo'] [file.pdf]

Find pattern in files with a .pdf extension in the current directory recursively
$ pdfgrep [[-r|--recursive]] [pattern]

Find pattern in files that match a specific glob, searching the current directory recursively
$ pdfgrep [[-r|--recursive]] --include ['*book.pdf'] [pattern]

SYNOPSIS

pdfgrep [options] pattern [file(s)]

PARAMETERS

-i, --ignore-case
    Ignore case distinctions in both the pattern and the input files.

-r, --recursive
    Recursively search directories for PDF files.

-l, --files-with-matches
    Only print the names of files containing matches, not the matching lines.

-n, --page-number
    Prefix each match with the number of the page on which it was found.

-H, --with-filename
    Print the file name for each match.

-h, --no-filename
    Suppress the prefixing of file names on output.

-c, --count
    Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines.

-o, --only-matching
    Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

-f FILE, --file=FILE
    Obtain patterns from FILE, one per line.

--password=PASSWORD
    Use PASSWORD to decrypt password-protected PDF files.

-q, --quiet, --silent
    Suppress all normal output.

-v, --invert-match
    Select non-matching lines.

-m NUM, --max-count=NUM
    Stop reading a file after NUM matching lines.

-F, --fixed-strings
    Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.

-G, --basic-regexp
    Interpret PATTERN as a basic regular expression (BRE).

-E, --extended-regexp
    Interpret PATTERN as an extended regular expression (ERE).

-P, --perl-regexp
    Interpret PATTERN as a Perl regular expression (PCRE).

--help
    Display help message and exit.

--version
    Display version information and exit.
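As a rough illustration of the --password option listed above, the sketch below reads the password from an environment variable instead of typing it literally on the command line. The helper name search_protected_pdf and the variable name PDF_PASSWORD are hypothetical and not part of pdfgrep.

```shell
# Sketch: search a password-protected PDF.
# search_protected_pdf and PDF_PASSWORD are hypothetical names for illustration.
search_protected_pdf() {
    pattern=$1
    file=$2
    # Refuse to run without a password rather than passing an empty one.
    if [ -z "${PDF_PASSWORD:-}" ]; then
        echo "PDF_PASSWORD is not set" >&2
        return 2
    fi
    pdfgrep --password="$PDF_PASSWORD" "$pattern" "$file"
}

# Example call (secret.pdf is a placeholder file name):
# PDF_PASSWORD='...' search_protected_pdf 'invoice' secret.pdf
```

The password is still passed to pdfgrep as a command-line argument, so other users on the same machine may be able to see it in the process list while the search runs.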

DESCRIPTION

pdfgrep is a command-line tool for searching text within PDF files. It works much like the widely used grep command, but it extracts the text content of each PDF itself, so documents do not have to be converted to plain text first.

It supports case-insensitive matching, regular expressions, recursive directory searching, and several output formatting options, and it can search password-protected PDFs when the password is provided. Because it reads the text stored in the PDF, it can often handle documents with complex layouts or embedded fonts, which makes it useful for anyone working with large collections of PDF documents.
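The options described above can be combined. The following is a minimal sketch that wraps a recursive, case-insensitive search printing file names and page numbers; the helper name find_in_pdfs is hypothetical, and the long options are the ones used in the TLDR examples.

```shell
# Sketch: recursive, case-insensitive search with file names and page numbers.
# find_in_pdfs is a hypothetical helper name, not part of pdfgrep.
find_in_pdfs() {
    pattern=$1
    dir=${2:-.}   # default to the current directory
    pdfgrep --recursive --ignore-case --with-filename --page-number \
        "$pattern" "$dir"
}

# Example call (docs/ is a placeholder directory):
# find_in_pdfs 'license' docs/
```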

CAVEATS

The accuracy of the extracted text depends on the PDF itself. Scanned images and poorly formatted PDFs may not be searchable.
Some functionality depends on the PDF library in use and can give inconsistent results.
Performance can degrade when searching large or complex PDF files.
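One rough way to spot files affected by the first caveat is to search for a pattern that matches any character. This sketch assumes that a PDF without extractable text (for example, a scan without a text layer) produces no matches at all; the helper name has_text_layer is hypothetical.

```shell
# Sketch: guess whether a PDF contains any extractable text.
# has_text_layer is a hypothetical helper name, not part of pdfgrep.
has_text_layer() {
    # '.' matches any single character, so any extracted text yields a match.
    # -q suppresses output; only the exit status is used.
    pdfgrep -q '.' "$1"
}

# Example call (scan.pdf is a placeholder file name):
# has_text_layer scan.pdf || echo "scan.pdf has no searchable text"
```

A non-zero result can also mean that pdfgrep failed to read the file, so treat it as a hint rather than proof that the file is a scan.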

EXIT STATUS

The exit status is 0 if selected lines are found, and 1 if not found. If an error occurred the exit status is 2 or greater.
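A script can branch on these values. The sketch below maps the three documented cases to a short message; search_pdf is a hypothetical helper name.

```shell
# Sketch: branch on pdfgrep's documented exit status.
# search_pdf is a hypothetical helper name, not part of pdfgrep.
search_pdf() {
    pattern=$1
    file=$2
    rc=0
    # -q suppresses normal output, so only the exit status is used.
    pdfgrep -q "$pattern" "$file" || rc=$?
    case $rc in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error (exit status $rc)" ;;
    esac
    return "$rc"
}

# Example call (report.pdf is a placeholder file name):
# search_pdf 'invoice' report.pdf
```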

HISTORY

pdfgrep was developed as a specialized tool for searching text within PDF files, a common requirement in document management and information retrieval. It builds on the principles of grep, a fundamental command-line utility for searching text in files, and extends them to handle the structure and encoding of PDF documents.
Over time, pdfgrep has gained features such as regular expression support, password handling, and recursive directory searching, making it a versatile tool for working with PDF archives.

SEE ALSO

grep(1), find(1)
