bulk_extractor
Extract features from disk images and files
SYNOPSIS
bulk_extractor -o <outdir> [options] <input_file>
PARAMETERS
-C <bytes>
Set the size of the context window recorded around each feature (default: 16)
-d
Enable debug output
-e <scanner>
Enable specific scanner
-E <scanner>
Run only specified scanner(s)
-h
Show help
-j <jobs>
Set number of extraction threads
-J
Do not start worker threads; read and process data in the primary thread
-m <minutes>
Maximum number of minutes to wait for scanners to finish after all data has been read
-o <outdir>
Output directory (required); created by bulk_extractor and normally must not already exist
-Z
Erase (zap) the output directory before running
-p <path>
Print the data at the given forensic path (optionally as hex or raw) instead of scanning
-q
Quiet mode, suppress non-error messages
-R
Treat the input as a directory and recursively scan the files in it
-t
Test/dry-run mode
-v
Verbose output
-V
Show version
-w <stop_list>
File containing a stop list of features to suppress from the main feature files
-x <scanner>
Disable specific scanner
-X <scanner>
Disable and unload scanner config
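As a minimal sketch of how the scanner-selection options combine, the Python helper below assembles an argument list without running anything. The function name build_command and the file names disk.raw and be_out are hypothetical; only the flags -o, -j, -E, -e, and -x are taken from the list above, and the input file is placed last on the assumption that options precede it.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    image: str,
    outdir: str,
    exclusive: Optional[str] = None,
    enable: Sequence[str] = (),
    disable: Sequence[str] = (),
    threads: Optional[int] = None,
) -> List[str]:
    """Assemble (but do not run) a bulk_extractor argument list."""
    argv = ["bulk_extractor", "-o", outdir]
    if threads is not None:
        argv += ["-j", str(threads)]
    # -E runs only the named scanner; -e and -x adjust the set further.
    if exclusive is not None:
        argv += ["-E", exclusive]
    for name in enable:
        argv += ["-e", name]
    for name in disable:
        argv += ["-x", name]
    # Assumption: the input image is the final positional argument.
    argv.append(image)
    return argv


cmd = build_command("disk.raw", "be_out", exclusive="email", threads=4)
print(shlex.join(cmd))  # bulk_extractor -o be_out -j 4 -E email disk.raw
```

Building a list rather than a single string keeps the arguments safe to pass to subprocess.run without shell quoting problems.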
DESCRIPTION
bulk_extractor is a fast, multithreaded digital forensics tool that scans disk images, files, and memory dumps and extracts features without parsing the filesystem. Its scanners recognize many kinds of data, including email addresses, credit card numbers, IP addresses, URLs, social security numbers, and phone numbers, mostly through pattern matching.
Because it works on the raw byte stream rather than on filesystem structures, it can also find data in unallocated space and on damaged or partially overwritten media, which makes it well suited to rapid triage of large (multi-terabyte) datasets. Scanners run concurrently across CPU cores and write one plain-text feature file per feature type into the output directory. These files can be searched with tools such as grep or imported into a database.
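For post-processing beyond grep, the feature files can be read programmatically. The sketch below assumes the usual feature-file layout: comment lines starting with "#", then one record per line with tab-separated offset, feature, and context fields. The sample lines are invented for illustration, and the names Feature and parse_features are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    offset: str   # kept as text: may be a nested path such as "2048-GZIP-17"
    feature: str
    context: str


def parse_features(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per record, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) < 2:
            continue  # not a feature record
        context = parts[2] if len(parts) == 3 else ""
        yield Feature(parts[0], parts[1], context)


# Invented sample data standing in for an email feature file.
SAMPLE = [
    "# Feature-Recorder: email\n",
    "1024\talice@example.com\tmail alice@example.com now\n",
    "2048-GZIP-17\tbob@example.org\tFrom: bob@example.org\n",
]
features = list(parse_features(SAMPLE))
for f in features:
    print(f.offset, f.feature)
```

When reading a real file, opening it with encoding="utf-8-sig" and errors="replace" is a cautious choice, since a byte-order mark or undecodable bytes may be present.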
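Key strengths are speed (work is spread across all available cores), plugin extensibility, and context preservation: each feature is recorded with its offset and the surrounding bytes. It is widely used in law enforcement, incident response, and research. It supports raw, EWF (E01), and AFF images; EWF and AFF support depends on the libraries the binary was built with.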
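The recorded offsets make it possible to return to the original data. The sketch below reads a window of bytes around a feature from a raw image. It only applies to raw images and to plain decimal offsets: for a nested path such as "2048-GZIP-17", only the leading number is a position in the image, and the rest refers to decoded data that cannot be read this way. The function names and the demo file are hypothetical.

```python
import os
import tempfile


def leading_offset(forensic_path: str) -> int:
    """Return the leading byte offset of a path like '2048' or '2048-GZIP-17'."""
    return int(forensic_path.split("-", 1)[0])


def read_window(image_path: str, offset: int, length: int, margin: int = 16) -> bytes:
    """Read `length` bytes at `offset` plus up to `margin` bytes on each side."""
    start = max(0, offset - margin)
    with open(image_path, "rb") as fh:
        fh.seek(start)
        return fh.read((offset - start) + length + margin)


# Demo on a throwaway file standing in for a raw image.
payload = b"alice@example.com"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 100 + payload + b"\x00" * 100)
    demo_path = tmp.name
try:
    window = read_window(demo_path, leading_offset("100"), len(payload), margin=4)
finally:
    os.remove(demo_path)
print(window)
```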
CAVEATS
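Memory usage can be high on large images. Because it works at the byte level and does no filesystem parsing, features are reported by offset rather than by file name. False positives are possible, and throughput depends heavily on the number of CPU cores available.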
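One way to cut down on noise after a run is to drop features that also appear on a known-clean reference system. The sketch below is a hypothetical downstream filter, not part of bulk_extractor; the function name suppress_known and the sample values are invented, and the case-insensitive comparison is an assumption that suits email addresses but may not suit other feature types.

```python
from typing import Iterable, List, Set, Tuple


def suppress_known(
    hits: Iterable[Tuple[str, str]], baseline: Set[str]
) -> List[Tuple[str, str]]:
    """Keep (offset, feature) pairs whose feature is not in the baseline set."""
    known = {item.lower() for item in baseline}
    return [(off, feat) for off, feat in hits if feat.lower() not in known]


# Invented example: features that ship with a stock OS install are suppressed.
baseline = {"support@vendor.example"}
hits = [
    ("1024", "Support@Vendor.example"),
    ("4096", "alice@example.com"),
]
remaining = suppress_known(hits, baseline)
print(remaining)
```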
INPUT FORMATS
Supports raw (.dd) and split raw images, EWF (.E01), AFF, and memory dumps (.mem)
COMMON SCANNERS
email, accounts (credit card and phone numbers), httplogs, wordlist, find, exif; list all scanners with bulk_extractor -H
HISTORY
Developed by Simson L. Garfinkel starting in 2010, initially at the Naval Postgraduate School; open source and hosted on GitHub; evolved for CFReDS/NIST benchmarks; version 2.x is a substantial rewrite with threading improvements.


