pdf-parser
Analyze and extract data from PDF files
TLDR
Display statistics for a PDF file
Display objects of a specific type (/Font, /URI, ...) in a PDF file
Search for strings in indirect objects
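The three tasks above correspond to the --stats, --type and --search options. A minimal Python sketch of how the command lines could be assembled follows; the script location, the helper names build_command and run, and the file name sample.pdf are assumptions made for illustration and are not part of the tool.

```python
import subprocess
import sys

PDF_PARSER = "pdf-parser.py"  # assumed location of the downloaded script


def build_command(pdf_path, stats=False, obj_type=None, search=None):
    """Return the argv list for one pdf-parser invocation."""
    cmd = [sys.executable, PDF_PARSER]
    if stats:
        cmd.append("--stats")             # display statistics
    if obj_type is not None:
        cmd.append("--type=" + obj_type)  # e.g. /Font or /URI
    if search is not None:
        cmd.append("--search=" + search)  # search in indirect objects
    cmd.append(pdf_path)
    return cmd


def run(cmd):
    """Run pdf-parser and return its text output (needs the script on disk)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# sample.pdf is a placeholder file name.
stats_cmd = build_command("sample.pdf", stats=True)
fonts_cmd = build_command("sample.pdf", obj_type="/Font")
search_cmd = build_command("sample.pdf", search="javascript")
print(" ".join(stats_cmd[1:]))
```

Calling run(stats_cmd) would execute the first command once pdf-parser.py and a real PDF are present.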
SYNOPSIS
pdf-parser.py [options] <pdf-file>
PARAMETERS
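-a, --stats
display statistics for the PDF document
-c, --content
display the content of objects without streams, or with streams that have no filters
-d <file>, --dump=<file>
dump stream content to the given file
-D, --debug
display debug information
-e <elements>, --elements=<elements>
select the element types to display (any of c, x, t, s, i: comment, xref, trailer, startxref, indirect object)
-f, --filter
pass stream objects through their filters (FlateDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode and RunLengthDecode only)
-g, --generate
generate a Python program that recreates the parsed PDF file
-H, --hash
display hashes of objects
-k <key>, --key=<key>
search for a key in dictionaries
-n, --nocanonicalizedoutput
do not canonicalize the output
-o <id>, --object=<id>
select indirect object(s) by id; separate several ids with commas
-O, --objstm
parse the streams of /ObjStm objects
-r <id>, --reference=<id>
select indirect objects that reference the given object id
-s <string>, --search=<string>
search for a string in indirect objects (streams excluded)
--searchstream=<string>
search for a string in streams
-t <type>, --type=<type>
select indirect objects of the given type (for example /Font or /URI)
-v, --verbose
display malformed PDF elements
-w, --raw
raw output for data and filters
-x <file>, --extract=<file>
extract malformed content to the given file
-h, --help
show help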
DESCRIPTION
pdf-parser is a specialized Python-based command-line tool developed by Didier Stevens for dissecting PDF files at a low level. It parses the internal structure of PDF documents, extracting and displaying objects, streams, cross-reference tables, and trailers without rendering the file. This makes it invaluable for digital forensics, malware analysis, and security research, particularly when investigating malicious PDFs that exploit vulnerabilities.
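To make "indirect objects" concrete, the sketch below scans a tiny hand-written PDF fragment for "N G obj ... endobj" blocks and reports each object's /Type and the objects it references. It is only an illustration of the file structure, not pdf-parser's own implementation; the sample bytes and the function name list_objects are made up for this example, and a regular expression like this would not cope with every real-world file.

```python
import re

# A tiny hand-written PDF fragment; real files also carry an xref table and trailer.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*(/\w+)")
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def list_objects(data):
    """Return (object id, generation, /Type, referenced ids) for each object."""
    found = []
    for match in OBJ_RE.finditer(data):
        body = match.group(3)
        type_match = TYPE_RE.search(body)
        obj_type = type_match.group(1).decode() if type_match else None
        refs = [int(ref[0]) for ref in REF_RE.findall(body)]
        found.append((int(match.group(1)), int(match.group(2)), obj_type, refs))
    return found


objects = list_objects(SAMPLE)
for obj_id, gen, obj_type, refs in objects:
    print(obj_id, gen, obj_type, refs)
```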
Key capabilities include searching for strings in indirect objects and in streams, selecting objects by id or type, listing the objects that reference a given object, dumping raw stream content, and decoding streams through their filters. Unlike rendering tools such as evince or text extractors such as pdftotext, pdf-parser works directly on the file's raw structure, which exposes hidden or obfuscated content such as JavaScript, embedded files, or anomalous structures.
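Content is often hidden inside compressed streams, which is why decoding them matters. The sketch below builds one indirect object with a FlateDecode stream and inflates it with zlib. It is a simplified illustration under stated assumptions: the payload and object number are invented, /Length is assumed to be a direct integer (real files may use an indirect reference), and this is not how pdf-parser itself is written.

```python
import re
import zlib

# Hypothetical payload standing in for script code hidden in a compressed stream.
payload = b"app.alert('hello');"
compressed = zlib.compress(payload)

# Build one indirect object whose stream uses the FlateDecode filter.
pdf_object = (
    b"5 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)


def decode_flate_stream(obj_bytes):
    """Slice the stream using /Length and inflate it with zlib."""
    length = int(re.search(rb"/Length\s+(\d+)", obj_bytes).group(1))
    start = obj_bytes.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj_bytes[start:start + length])


decoded = decode_flate_stream(pdf_object)
print(decoded.decode())
```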
It is lightweight and scriptable, and its human-readable text output of objects and their dictionaries aids reverse engineering. It is commonly used in incident response to identify PDF-based attacks.
CAVEATS
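pdf-parser is a Python script and needs a Python interpreter (Python 2 or 3). Most mainstream distribution repositories do not package it, so it is usually downloaded from Didier Stevens' site. Large PDFs may consume a lot of memory. It does not render PDFs; it performs structural analysis only.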
INSTALLATION
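Download pdf-parser.py from Didier Stevens' blog and run it with python pdf-parser.py. Most mainstream distributions offer no package for it, although security-focused distributions such as Kali Linux and REMnux include it.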
OUTPUT FORMAT
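Plain-text listing of the parsed elements. Each indirect object starts with a header like 'obj 1 0', followed by details such as its type, the objects it references, and its dictionary. Use -w (--raw) for raw output of data and filters, and -d <file> to dump stream content to a file.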
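Because every indirect object begins with an 'obj N G' header, the output is easy to split per object in a script. The sketch below does so on an illustrative sample; the sample text is an assumption shaped like the header form described above and was not captured from a real run, and split_by_object is a hypothetical helper name.

```python
import re

# Illustrative text shaped like pdf-parser output; not captured from a real run.
SAMPLE_OUTPUT = """\
obj 1 0
 Type: /Catalog
 Referencing: 2 0 R

obj 2 0
 Type: /Pages
 Referencing: 3 0 R
"""

HEADER_RE = re.compile(r"^obj (\d+) (\d+)$")


def split_by_object(text):
    """Map (object id, generation) to the non-empty lines after its header."""
    sections = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = (int(header.group(1)), int(header.group(2)))
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


sections = split_by_object(SAMPLE_OUTPUT)
print(sorted(sections))
```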
HISTORY
Created by Didier Stevens in 2008 amid rising PDF malware threats. Later versions improved stream filter handling and added features such as object stream (/ObjStm) parsing. It is actively maintained and published on blog.didierstevens.com.


