pdfgrep
Search text within PDF files
TLDR
Find lines that match pattern in a PDF
Include file name and page number for each matched line
Do a case-insensitive search for lines that begin with "foo" and return the first 3 matches
Find pattern in files with a .pdf extension in the current directory recursively
Find pattern in files that match a specific glob in the current directory recursively
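The five tasks above could be run roughly as follows. This is a sketch rather than a definitive reference: the values of `pattern` and `file` are placeholders chosen for illustration, and the guard around the commands is only there so that the snippet does nothing when pdfgrep or the placeholder file is missing.

```shell
# Placeholders (assumptions): replace with your own search term and file.
pattern='invoice'
file='report.pdf'

# Skip the examples when pdfgrep or the placeholder file is missing.
if command -v pdfgrep >/dev/null 2>&1 && [ -f "$file" ]; then
    # Find lines that match the pattern in a PDF.
    pdfgrep "$pattern" "$file"

    # Include file name and page number for each matched line.
    pdfgrep --with-filename --page-number "$pattern" "$file"

    # Case-insensitive search for lines that begin with "foo";
    # stop after the first 3 matches.
    pdfgrep --max-count 3 --ignore-case '^foo' "$file"

    # Search the PDF files under the current directory recursively.
    pdfgrep --recursive "$pattern" .

    # Search only files whose names match a glob (here *book.pdf).
    pdfgrep --recursive --include '*book.pdf' "$pattern" .
fi
```

The long option names are used here for readability; the short forms listed under PARAMETERS (for example -i and -r) behave the same way.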
SYNOPSIS
pdfgrep [options] pattern [file(s)]
PARAMETERS
-i, --ignore-case
Ignore case distinctions in both the pattern and the input files.
-r, --recursive
Recursively search directories for PDF files.
-l, --files-with-matches
Only print the names of files containing matches, not the matching lines.
-n, --page-number
Prefix each match with the number of the page on which it was found.
-H, --with-filename
Print the file name for each match.
-h, --no-filename
Suppress the prefixing of file names on output.
-c, --count
Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line.
--password=PASSWORD
Use PASSWORD to decrypt password-protected PDF files.
-q, --quiet, --silent
Suppress all normal output.
-v, --invert-match
Select non-matching lines.
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
-G, --basic-regexp
Interpret PATTERN as a basic regular expression (BRE).
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE).
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE).
--help
Display help message and exit.
--version
Display version information and exit.
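As a rough sketch of how several of the options above combine, the helper functions below wrap a few common invocations. The function names and the date pattern are illustrative assumptions and not part of pdfgrep; the -P example also assumes a pdfgrep build with PCRE support.

```shell
# Hypothetical helper: list the PDFs under a directory that contain
# a literal string (-F treats the pattern as a fixed string).
pdf_files_with() {
    # $1 = literal string, $2 = directory
    pdfgrep -r -l -F "$1" "$2"
}

# Hypothetical helper: print a match count for each PDF under a directory.
pdf_match_counts() {
    # $1 = pattern, $2 = directory
    pdfgrep -r -c "$1" "$2"
}

# Hypothetical helper: print only the parts of a PDF that look like
# ISO-style dates (YYYY-MM-DD), one per output line.
pdf_extract_dates() {
    # $1 = PDF file
    pdfgrep -o -P '\d{4}-\d{2}-\d{2}' "$1"
}
```

For example, `pdf_files_with 'Q3 revenue' ./reports` would print the names of the files under ./reports that contain that exact phrase.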
DESCRIPTION
pdfgrep is a command-line tool for searching text within PDF files. It works much like the widely used grep command, but it extracts the text content of a PDF directly, so documents do not have to be converted to plain text first. This makes it a convenient way to find keywords, phrases, or patterns in PDF documents. Options control case-insensitive matching, the regular expression syntax, recursive directory searching, and output formatting. pdfgrep can also search password-protected PDFs if the password is provided. It can usually extract text from PDFs with complex layouts or embedded fonts, which makes it useful for anyone working with large collections of PDF documents.
CAVEATS
The accuracy of pdfgrep's text extraction depends on how the PDF was produced. Scanned documents that contain only page images have no text layer and cannot be searched, and poorly formatted PDFs may yield incomplete or jumbled text.
Some functionality depends on the underlying PDF library and can give inconsistent results.
Performance can degrade when searching large or complex PDF files.
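One way to spot files affected by the first caveat is to ask pdfgrep whether it can find any alphanumeric character at all. The sketch below is a heuristic under stated assumptions: the function name is hypothetical, and a file with text in a script that `[[:alnum:]]` does not cover in the current locale could be reported incorrectly.

```shell
# Hypothetical helper: print the names of the given PDFs in which
# pdfgrep finds no alphanumeric character (likely image-only scans).
list_pdfs_without_text() {
    for pdf_file in "$@"; do
        rc=0
        pdfgrep -q '[[:alnum:]]' "$pdf_file" || rc=$?
        # Exit status 1 means "no match"; 2 or greater signals an error,
        # so files that could not be read are not reported here.
        if [ "$rc" -eq 1 ]; then
            printf '%s\n' "$pdf_file"
        fi
    done
}
```

Calling `list_pdfs_without_text *.pdf` in a directory would then list the files that probably need OCR before they can be searched.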
EXIT STATUS
The exit status is 0 if selected lines are found and 1 if none are found. If an error occurs, the exit status is 2 or greater.
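In a script, the three cases can be told apart by inspecting the exit status. The following is a minimal sketch; the function name `pdf_contains` and the messages it prints are assumptions made for illustration.

```shell
# Hypothetical helper: report whether a PDF contains a pattern,
# based only on pdfgrep's exit status.
pdf_contains() {
    # $1 = pattern, $2 = PDF file
    pdf_contains_rc=0
    pdfgrep -q "$1" "$2" || pdf_contains_rc=$?
    case $pdf_contains_rc in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error (exit status $pdf_contains_rc)" >&2 ;;
    esac
    # Pass pdfgrep's status through so callers can branch on it.
    return "$pdf_contains_rc"
}
```

A caller can then write `if pdf_contains 'total due' invoice.pdf >/dev/null; then ...; fi` and treat any status above 1 as a failure rather than as a missing match.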
HISTORY
pdfgrep was developed as a specialized tool for searching text within PDF files, a common requirement in document management and information retrieval. It builds on the principles of grep, the standard command-line utility for searching text in files, and extends them to handle the structure and encoding of PDF documents.
Over time, pdfgrep has gained features such as regular expression support, password handling, and recursive directory searching, making it a versatile tool for working with PDF archives.