
pdfgrep

Searches PDF files for text patterns, like grep for PDFs.

TLDR

Search for pattern in PDF
$ pdfgrep "[pattern]" [file.pdf]
Case-insensitive search showing page numbers
$ pdfgrep -in "[pattern]" [file.pdf]
Search recursively in directory
$ pdfgrep -r "[pattern]" [/path/to/pdfs/]
Count matches per file
$ pdfgrep -c "[pattern]" [*.pdf]
Print only filenames with matches
$ pdfgrep -l "[pattern]" [*.pdf]
Search with multiple patterns
$ pdfgrep -e "[pattern1]" -e "[pattern2]" [file.pdf]
Limit search to a page range
$ pdfgrep --page-range=[1-10] "[pattern]" [file.pdf]
Print only the matched text
$ pdfgrep -o "[pattern]" [file.pdf]

SYNOPSIS

pdfgrep [OPTIONS] PATTERN FILE...
pdfgrep [OPTIONS] {-e PATTERN|-f FILE}... FILE...
pdfgrep [OPTIONS] -r|-R PATTERN [FILE|DIR...]

DESCRIPTION

pdfgrep searches for text patterns in PDF files, using the Poppler library for text extraction. It provides a familiar grep-like interface for PDF documents.

Text is extracted from each page and matched against the given regular expression. By default the pattern is interpreted as a POSIX extended regular expression. Perl-compatible regular expressions are available via -P, and fixed-string matching via -F.

Page number output (-n) helps locate matches within a document. Restricting the search to a page range (--page-range) speeds up searches on large files. Context lines (-C) show the text surrounding a match.

Recursive search (-r) processes entire directory trees. Combined with --include and --exclude, this enables targeted searches across document collections. Multiple patterns can be specified with repeated -e options or read from a file with -f.

The --unac option is useful when PDFs use typographic ligatures or accented characters that differ from the search term. The --cache option stores extracted text to accelerate repeated searches.
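
As an illustration of combining recursive search with several patterns, the following is a minimal sketch of a wrapper function. The name `search_tree` and the `*draft*` exclusion glob are hypothetical choices for this example, not part of pdfgrep. It only uses options described on this page (-r, -i, -n, --exclude, -e).

```shell
# Hypothetical helper: search a directory tree for any of several terms.
# Usage: search_tree DIR TERM [TERM...]
search_tree() {
    if [ "$#" -lt 2 ]; then
        echo "usage: search_tree DIR TERM [TERM...]" >&2
        return 2
    fi
    dir=$1
    shift
    n=$#
    # Append one "-e TERM" pair per term, then drop the bare terms,
    # so the positional parameters become: -e TERM1 -e TERM2 ...
    for term in "$@"; do
        set -- "$@" -e "$term"
    done
    shift "$n"
    # -r: recurse, -i: ignore case, -n: show page numbers.
    # The exclusion glob is an assumption made for this example.
    pdfgrep -r -i -n --exclude='*draft*' "$@" "$dir"
}

# Example call (paths and terms are placeholders):
# search_tree ~/Documents/papers "neural network" "gradient descent"
```

Passing each term through -e keeps terms that contain spaces intact, since each one stays a single quoted argument.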

PARAMETERS

-e PATTERN, --regexp=PATTERN
Specify a search pattern. Can be used multiple times to match any of several patterns.
-f FILE, --file=FILE
Read patterns from a file, one per line.
-i, --ignore-case
Case-insensitive matching.
-F, --fixed-strings
Treat the pattern as a fixed string (no regular expression interpretation).
-P, --perl-regexp
Interpret the pattern as a Perl-compatible regular expression (PCRE2) instead of the default extended regular expression.
-n, --page-number[=TYPE]
Prefix each match with its page number. TYPE is `index` (default) or `label`.
-c, --count
Print match count per file instead of matched lines.
-p, --page-count
Print match count per page (implies -n).
-l, --files-with-matches
Print only filenames that contain a match.
-L, --files-without-match
Print only filenames that contain no match.
-o, --only-matching
Print only the matched portion of each line.
-H, --with-filename
Print the filename with each match (default when searching multiple files).
-h, --no-filename
Suppress filename prefix in output.
-Z, --null
Use a null byte instead of a colon to separate the filename from the rest of the output line. Useful for filenames containing colons or newlines.
--match-prefix-separator SEP
Use SEP as the separator between the match prefix (filename, page number) and the matched line, instead of the default colon.
-r, --recursive
Search all PDF files under each directory recursively. Symlinks are followed only when specified on the command line.
-R, --dereference-recursive
Like -r, but follow all symlinks.
--include=GLOB
Only search files whose names match GLOB (default: `*.pdf`).
--exclude=GLOB
Skip files whose names match GLOB.
-A NUM, --after-context=NUM
Print NUM lines of context after each match.
-B NUM, --before-context=NUM
Print NUM lines of context before each match.
-C NUM, --context=NUM
Print NUM lines of context before and after each match.
--page-range=RANGE
Limit the search to the specified page range (e.g., `1-10,15`).
-m NUM, --max-count=NUM
Stop after NUM matches per file.
--password=PASSWORD
Use PASSWORD to decrypt a password-protected PDF.
--color WHEN
Colorize output: `auto` (default), `always`, or `never`.
--cache
Cache rendered page text to speed up repeated searches on the same files.
--unac
Remove accents and ligatures from both the search pattern and the document text. Useful for matching words like "ae" against the ligature "æ".
--warn-empty
Warn when a PDF contains no searchable text (e.g., scanned images without OCR).
-q, --quiet
Suppress all output. Exit status indicates whether a match was found.
-V, --version
Print version information.
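
The options above can be combined in scripts. The following is a rough sketch that totals the -c counts over several files. The function name `total_matches` is hypothetical, and the sketch assumes that -c together with -H prints one grep-style `file:count` line per file, with the default colon separator.

```shell
# Hypothetical helper: sum the per-file match counts for one pattern.
# Usage: total_matches PATTERN FILE...
total_matches() {
    pattern=$1
    shift
    # -c prints a count per file; -H forces the filename prefix even for
    # a single file. The count is taken as the last colon-separated field,
    # so filenames that themselves contain colons do not break the sum.
    pdfgrep -c -H -e "$pattern" "$@" |
        awk -F: '{ sum += $NF } END { print sum + 0 }'
}

# Example call (file names are placeholders):
# total_matches "invoice" january.pdf february.pdf
```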

EXIT STATUS

0
One or more matches were found.
1
No matches were found.
2
An error occurred.
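
A script can branch on these three values. The sketch below is one way to do it. The function name `pdf_mentions` is hypothetical, and -q is used so that only the exit status matters.

```shell
# Hypothetical helper: say whether FILE contains a match for PATTERN.
# Usage: pdf_mentions PATTERN FILE
pdf_mentions() {
    status=0
    # -q suppresses output; only the exit status is inspected.
    pdfgrep -q -e "$1" "$2" || status=$?
    case $status in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;   # 2 is documented; other values are treated alike
    esac
}

# Example call (file name is a placeholder):
# pdf_mentions "confidential" contract.pdf
```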

CAVEATS

Text extraction quality depends on the PDF's internal structure. Scanned PDFs without embedded text require OCR preprocessing before pdfgrep can search them (use --warn-empty to detect such files). Complex multi-column layouts may not extract in reading order. Encrypted PDFs require the correct --password.
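
One way to spot files that probably need OCR is to search for any letter or digit and treat exit status 1 as "no extractable text". This is a heuristic sketch, not a feature of pdfgrep. The function name `list_textless` is hypothetical, and the sketch assumes that a PDF without a text layer yields no alphanumeric characters at all.

```shell
# Hypothetical helper: print the files in which no letter or digit is found.
# Usage: list_textless FILE...
list_textless() {
    for f in "$@"; do
        status=0
        # [[:alnum:]] matches any single letter or digit.
        pdfgrep -q -e '[[:alnum:]]' "$f" || status=$?
        # Exit status 1 means no match; errors (status 2) are not reported here.
        if [ "$status" -eq 1 ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example call (directory is a placeholder):
# list_textless scans/*.pdf
```

Files reported this way are candidates for OCR preprocessing. Encrypted files without the right --password would show up as errors instead and are skipped by this sketch.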

HISTORY

pdfgrep was written by Hans-Peter Deifel starting around 2010. It uses the Poppler library for PDF parsing and provides a grep-compatible interface for searching PDF text content.

SEE ALSO

grep(1), pdftotext(1), ripgrep(1), pdfinfo(1)
