pdfgrep
Search PDF files for text patterns, in the style of grep
SYNOPSIS
pdfgrep [OPTIONS] PATTERN FILE...
pdfgrep [OPTIONS] {-e PATTERN|-f FILE}... FILE...
pdfgrep [OPTIONS] -r|-R PATTERN [FILE|DIR...]
DESCRIPTION
pdfgrep searches for text patterns in PDF files, using the Poppler library for text extraction. It provides a familiar grep-like interface for PDF documents.
Text is extracted from each page and matched against the given regular expression. By default, patterns are interpreted as POSIX extended regular expressions. Perl-compatible regular expressions (PCRE2) are available via -P, and fixed-string matching via -F.
Page number output (-n) helps locate matches within a document. Restricting the search to a page range (--page-range) speeds up searches on large files. Context lines (-C) show the text surrounding a match.
Recursive search (-r) processes entire directory trees. Combined with --include and --exclude, this enables targeted searches across document collections. Multiple patterns can be specified with repeated -e options or read from a file with -f.
The --unac option is useful when PDFs use typographic ligatures or accented characters that differ from the search term. The --cache option stores extracted text to accelerate repeated searches.
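The options described above combine in the usual grep fashion. The following is a minimal sketch, assuming pdfgrep is installed and on the PATH; the function names and the `*draft*` glob are hypothetical and chosen only for illustration.

```shell
# Show page numbers and one line of context, searching pages 1-20 only.
search_pages() {
  # $1 = pattern, $2 = PDF file
  pdfgrep -n -C 1 --page-range=1-20 "$1" "$2"
}

# Case-insensitively search every PDF below a directory, skipping
# files whose names match the (hypothetical) glob *draft*.
search_tree() {
  # $1 = pattern, $2 = directory
  pdfgrep -r -i -n --exclude='*draft*' "$1" "$2"
}

# Match any of several patterns by repeating -e.
search_either() {
  # $1, $2 = patterns, $3 = PDF file
  pdfgrep -n -e "$1" -e "$2" "$3"
}
```

For example, `search_pages 'gradient descent' thesis.pdf` would search the first twenty pages of a file named thesis.pdf (a placeholder name).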
PARAMETERS
-e PATTERN, --regexp=PATTERN
Specify a search pattern. Can be used multiple times to match any of several patterns.
-f FILE, --file=FILE
Read patterns from a file, one per line.
-i, --ignore-case
Case-insensitive matching.
-F, --fixed-strings
Treat the pattern as a fixed string (no regular expression interpretation).
-P, --perl-regexp
Use Perl-compatible regular expressions (PCRE2).
-n, --page-number[=TYPE]
Prefix each match with its page number. TYPE is `index` (default) or `label`.
-c, --count
Print match count per file instead of matched lines.
-p, --page-count
Print match count per page (implies -n).
-l, --files-with-matches
Print only filenames that contain a match.
-L, --files-without-match
Print only filenames that contain no match.
-o, --only-matching
Print only the matched portion of each line.
-H, --with-filename
Print the filename with each match (default when searching multiple files).
-h, --no-filename
Suppress filename prefix in output.
-Z, --null
Use a null byte instead of a colon to separate the filename from the rest of the output line. Useful for filenames containing colons or spaces.
--match-prefix-separator SEP
Use SEP as the separator between the match prefix (filename, page number) and the matched line, instead of the default colon.
-r, --recursive
Search all PDF files under each directory recursively. Symlinks are followed only when specified on the command line.
-R, --dereference-recursive
Like -r, but follow all symlinks.
--include=GLOB
Only search files whose names match GLOB (default: `*.pdf`).
--exclude=GLOB
Skip files whose names match GLOB.
-A NUM, --after-context=NUM
Print NUM lines of context after each match.
-B NUM, --before-context=NUM
Print NUM lines of context before each match.
-C NUM, --context=NUM
Print NUM lines of context before and after each match.
--page-range=RANGE
Limit the search to the specified page range (e.g., `1-10,15`).
-m NUM, --max-count=NUM
Stop after NUM matches per file.
--password=PASSWORD
Use PASSWORD to decrypt a password-protected PDF.
--color WHEN
Colorize output: `auto` (default), `always`, or `never`.
--cache
Cache rendered page text to speed up repeated searches on the same files.
--unac
Remove accents and ligatures from both the search pattern and the document text. Useful for matching words like "ae" against the ligature "æ".
--warn-empty
Warn when a PDF contains no searchable text (e.g., scanned images without OCR).
-q, --quiet
Suppress all output. Exit status indicates whether a match was found.
-V, --version
Print version information.
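Several of these options are mainly useful in scripts. The sketch below assumes pdfgrep is installed and that `xargs` supports `-0` (true of the GNU and BSD versions, though not required by POSIX). The function names are hypothetical.

```shell
# Copy every PDF that mentions a pattern into a destination directory.
# -l prints only filenames; -Z ends each name with a NUL byte, so
# xargs -0 copes with spaces, colons, and newlines in filenames.
collect_matches() {
  # $1 = pattern, $2 = directory to search, $3 = destination directory
  pdfgrep -r -l -Z "$1" "$2" | xargs -0 -I{} cp -- {} "$3"
}

# Print only the matched text, prefixed with page labels instead of
# page indexes.
list_matches() {
  # $1 = pattern, $2 = PDF file
  pdfgrep -o --page-number=label "$1" "$2"
}
```

Pairing -l with -Z avoids parsing colon-separated output, which is ambiguous when a filename itself contains a colon.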
EXIT STATUS
0
One or more matches were found.
1
No matches were found.
2
An error occurred.
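Because the three statuses are distinct, a script can tell "no match" apart from a failed search. This is a minimal sketch, assuming pdfgrep is installed; `pdf_has` is a hypothetical helper name, and any status other than 0 or 1 is treated as an error.

```shell
# Run a quiet search and report the outcome from the exit status.
pdf_has() {
  # $1 = pattern, $2 = PDF file
  pdfgrep -q "$1" "$2"
  case $? in
    0) echo "match" ;;
    1) echo "no match" ;;
    *) echo "error" ;;   # 2, or any other unexpected status
  esac
}
```

Using -q keeps the matched lines out of the output, so only the status decides the result.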
CAVEATS
Text extraction quality depends on the PDF's internal structure. Scanned PDFs without embedded text require OCR preprocessing before pdfgrep can search them (use --warn-empty to detect such files). Complex multi-column layouts may not extract in reading order. Encrypted PDFs require the correct --password.
HISTORY
pdfgrep was written by Hans-Peter Deifel starting around 2010. It uses the Poppler library for PDF parsing and provides a grep-compatible interface for searching PDF text content.
