LinuxCommandLibrary

pdfgrep

TLDR

Search for pattern in PDF

$ pdfgrep "[pattern]" [file.pdf]
Search recursively in directory
$ pdfgrep -r "[pattern]" [/path/to/pdfs/]
Case-insensitive search
$ pdfgrep -i "[pattern]" [file.pdf]
Show page numbers
$ pdfgrep -n "[pattern]" [file.pdf]
Show context lines
$ pdfgrep -C [2] "[pattern]" [file.pdf]
Count matches
$ pdfgrep -c "[pattern]" [file.pdf]
Search for either of two patterns (extended regex is the default)
$ pdfgrep "[pattern1|pattern2]" [file.pdf]
Print only filenames with matches
$ pdfgrep -l "[pattern]" [*.pdf]

SYNOPSIS

pdfgrep [-inrcl] [-C num] [--page-range range] pattern [file...]

DESCRIPTION

pdfgrep searches for text patterns in PDF files, similar to grep but for PDFs. It extracts text from PDF content and applies regular expression matching.
The tool handles the complexity of PDF text extraction transparently. Text from multiple columns, pages, and formatting is processed into searchable strings. Results show the matching text with optional context.
Page number display (-n) helps locate matches in documents. Page range limiting (--page-range) speeds up searches in large documents. Context lines (-C) show the text surrounding each match.
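As a rough illustration of how these options combine, the sketch below wraps pdfgrep in a small shell function. The function name search_pages and its argument order are hypothetical, chosen only for this example, and it assumes a pdfgrep version that provides --page-range.

```shell
# search_pages FILE RANGE PATTERN
# Hypothetical helper: search only the pages in RANGE, print page
# numbers (-n), and show two lines of context around each match (-C 2).
search_pages() {
    pdfgrep -n -C 2 --page-range="$2" -- "$3" "$1"
}

# Example call (report.pdf is a placeholder file name):
#   search_pages report.pdf 1-10,15 'revenue'
```

The -- before the pattern keeps a pattern that starts with a dash from being read as an option.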
Recursive search (-r) processes directory trees of PDFs. Combined with --include patterns, this enables searching document collections. Output modes include filenames only, counts, and quiet mode for scripting.
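The following is a minimal sketch of recursive search in a script, assuming pdfgrep follows the grep convention of exiting with 0 when a match is found and non-zero otherwise. The function names has_match and list_matching are hypothetical.

```shell
# has_match DIR PATTERN
# Hypothetical helper: succeed if any *.pdf under DIR contains PATTERN.
# -q suppresses output, so only the exit status is used.
has_match() {
    pdfgrep -r -q --include='*.pdf' -- "$2" "$1"
}

# list_matching DIR PATTERN
# Print only the names of the PDFs under DIR that contain PATTERN (-l),
# ignoring case (-i).
list_matching() {
    pdfgrep -r -l -i --include='*.pdf' -- "$2" "$1"
}

# Example (./papers is a placeholder directory):
#   if has_match ./papers 'confidential'; then
#       list_matching ./papers 'confidential'
#   fi
```

Quoting the glob stops the shell from expanding it before pdfgrep sees it.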
Patterns are treated as extended regular expressions by default. -F switches to fixed-string matching and -P to Perl-compatible regular expressions (PCRE). This enables complex pattern matching beyond simple string search.
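To make the difference between the pattern types concrete, here is a small sketch. It assumes pdfgrep was built with PCRE support for the -P case; the function names and the example patterns are hypothetical.

```shell
# find_keywords FILE
# Alternation with | works without extra flags in an extended regex.
find_keywords() {
    pdfgrep -n 'error|warning' "$1"
}

# find_dates FILE
# \b and \d are Perl-style syntax, so this pattern needs -P.
# It matches ISO-style dates such as 2024-01-31.
find_dates() {
    pdfgrep -n -P '\b\d{4}-\d{2}-\d{2}\b' "$1"
}

# Example calls (log.pdf is a placeholder file name):
#   find_keywords log.pdf
#   find_dates log.pdf
```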

PARAMETERS

-i, --ignore-case

Case-insensitive matching.
-n, --page-number
Print page numbers.
-c, --count
Print match count only.
-l, --files-with-matches
Print only matching filenames.
-L, --files-without-match
Print only non-matching filenames.
-r, --recursive
Search directories recursively.
-R, --dereference-recursive
Search directories recursively, following all symlinks.
-F, --fixed-strings
Treat the pattern as fixed strings instead of a regular expression (extended regular expressions are the default).
-P, --perl-regexp
Use Perl-compatible regular expressions.
-C NUM, --context NUM
Print NUM lines of context.
-A NUM, --after-context NUM
Print NUM lines after match.
-B NUM, --before-context NUM
Print NUM lines before match.
--page-range RANGE
Limit search to page range (e.g., 1-10,15).
-m NUM, --max-count NUM
Stop after NUM matches.
--include GLOB
Only search files matching pattern.
--password PASS
PDF password.
--color WHEN
Colorize output: auto, always, never.
-q, --quiet
Suppress output.
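
Several of the options above can be combined in one call. The sketch below is one possible combination for script use; the function name first_hits and the limit of 5 matches are hypothetical choices for this example.

```shell
# first_hits FILE PASSWORD PATTERN
# Hypothetical helper: open a password-protected PDF, stop after the
# first 5 matches (-m 5), print page numbers (-n), and disable color
# so the output is easy to parse.
first_hits() {
    pdfgrep --password="$2" -m 5 -n --color=never -- "$3" "$1"
}

# Example call (secret.pdf and the password are placeholders):
#   first_hits secret.pdf 'hunter2' 'invoice'
```

A password passed on the command line may be visible to other users in the process list, so this form is best kept to non-sensitive documents.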

CAVEATS

Text extraction quality depends on the PDF's structure. Scanned PDFs contain only page images and must go through OCR before they can be searched. Complex layouts, such as multi-column pages, may not extract cleanly. Large PDFs can be slow to process. Encrypted PDFs need a password (--password). Some PDF features may not be supported.
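
One way to spot PDFs without a text layer before searching them is to check whether any text can be extracted at all. The sketch below uses pdftotext (listed under SEE ALSO) for that check; the function name needs_ocr is hypothetical, and the test is only a heuristic.

```shell
# needs_ocr FILE
# Succeeds (exit 0) when pdftotext extracts no letters or digits from
# FILE, which usually means the pages are scanned images.
# It also succeeds if pdftotext itself fails, e.g. on an unreadable file.
needs_ocr() {
    ! pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example (scan.pdf is a placeholder file name):
#   if needs_ocr scan.pdf; then
#       echo "scan.pdf has no searchable text" >&2
#   fi
```

Files flagged this way would need an OCR pass before pdfgrep can find anything in them.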

HISTORY

pdfgrep was developed by Hans-Peter Deifel starting around 2010. It fills the gap between general-purpose grep and PDF-specific tools, providing a familiar interface for PDF text search. The project uses the Poppler library for PDF handling.

SEE ALSO

grep(1), pdftotext(1), ripgrep(1), pdfinfo(1)
