bulk_extractor

Extract features from disk images and files

SYNOPSIS

bulk_extractor [options] <input_source>
input_source can be a raw disk image file (e.g., /path/to/image.dd), a block device (e.g., /dev/sdb), a file, or a directory.

OPTIONS

-o
    Specify the output directory where all feature files and reports will be saved.

-e
    Enable a specific extractor. Multiple -e options can be used. Use -H to print information about the available extractors.

-x
    Disable a specific extractor. Useful for speeding up scans or reducing output noise.

-R, --recurse
    Treat the input as a directory and recursively process the files inside it.

--threads
    Set the number of CPU threads to use for processing. Defaults to the number of CPU cores.

--stop_on_error
    Halt processing immediately if an error is encountered.

--version
    Display the version information of bulk_extractor.

--help
    Show the extensive help message and list of all options.
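
As an illustration of how the options above combine, the following minimal Python sketch assembles an argument list from the -o, -e, -x, and -R options described in this page. The helper name build_command, the output directory name bulk_out, and the extractor name "email" are assumptions made for the example; the sketch only builds the command and does not run it.

```python
from typing import List, Sequence


def build_command(
    input_source: str,
    output_dir: str,
    enable: Sequence[str] = (),
    disable: Sequence[str] = (),
    recurse: bool = False,
) -> List[str]:
    """Assemble an argv list using only options listed in this page."""
    argv = ["bulk_extractor", "-o", output_dir]
    for name in enable:
        argv += ["-e", name]   # one -e per extractor to enable
    for name in disable:
        argv += ["-x", name]   # one -x per extractor to disable
    if recurse:
        argv.append("-R")
    argv.append(input_source)  # the input source comes last, as in the synopsis
    return argv


# "email" is a hypothetical extractor name used only for illustration.
cmd = build_command("/path/to/image.dd", "bulk_out", disable=["email"])
print(" ".join(cmd))
```

On a system where the tool is installed, the resulting list could be handed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems with unusual paths.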

DESCRIPTION

bulk_extractor is a high-performance digital forensics tool designed to recursively scan any input (e.g., disk image, file, or directory) and extract specific types of information, known as "features," without parsing the file system. It excels at "carving" data from unallocated space or damaged file systems. Its multi-threaded architecture enables rapid processing of large datasets, identifying crucial items such as email addresses, IP addresses, URLs, credit card numbers, geo-location data, and various file types.

The tool outputs these extracted features into separate, type-specific files (e.g., email.txt, url.txt), along with a report.xml, making analysis straightforward and data easily importable into other forensic platforms. It's particularly valuable in investigations where speed, comprehensive feature identification, and the ability to work with raw or compromised data are paramount.

CAVEATS

Limitations and Considerations:
1. A scan can generate a very large amount of output, so large inputs require significant free disk space in the output directory.
2. Some extractors may produce false positives. Credit card detection, for example, relies on checksum validation, and unrelated digit strings can pass the same check. Careful review of results is essential.
3. bulk_extractor does not reconstruct file systems; it focuses purely on pattern extraction from raw data.
4. Performance depends heavily on the read speed of the input device and the number of available CPU cores.
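
To make the false-positive caveat concrete, here is a small Python sketch of the Luhn checksum, the standard check digit scheme for payment card numbers. Whether bulk_extractor applies exactly this check, or additional filters on top of it, is not stated in this page; the function name luhn_valid is an assumption for the example.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if an ASCII digit string passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # a widely used test number: True
print(luhn_valid("4111111111111112"))  # last digit changed: False
print(luhn_valid("0000000000000000"))  # a run of zeros also passes: True
```

The last call shows why review matters: a run of zeros, which is common in raw disk data, satisfies the checksum even though it is not a card number. For any fixed prefix exactly one of the ten possible final digits passes, so roughly one in ten arbitrary digit strings will.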

OUTPUT STRUCTURE

All output from bulk_extractor is directed to a specified output directory (using the -o option). This directory typically contains a report.xml file summarizing the run, and numerous 'feature files' (e.g., email.txt, url.txt, ccard.txt), each containing a list of detected features of a specific type, often with offsets and context.
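
A feature file can be post-processed with a few lines of code. The Python sketch below assumes that each feature line holds tab-separated fields in the order offset, feature, context, and that lines beginning with '#' are comments; verify both assumptions against real output before relying on them. The names Feature and parse_feature_lines are hypothetical, and the sample lines are made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    offset: str   # kept as text; not assumed to be a plain integer
    value: str
    context: str


def parse_feature_lines(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per non-comment, non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        parts = line.split("\t", 2)
        offset = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        context = parts[2] if len(parts) > 2 else ""
        yield Feature(offset, value, context)


# Made-up sample lines in the assumed layout.
sample = [
    "# comment line\n",
    "1024\tuser@example.com\t...mail user@example.com now...\n",
    "2048\tadmin@example.org\n",
]
features = list(parse_feature_lines(sample))
for f in features:
    print(f.offset, f.value)
```

To read a real file, pass an open file object instead of the sample list, for example open(path, encoding="utf-8", errors="replace"), since context fields extracted from raw data may contain bytes that are not valid UTF-8.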

EXTRACTOR SYSTEM

bulk_extractor operates through a modular 'extractor' system. Each extractor is a specialized scanner designed to find a particular type of feature (e.g., email addresses, JPEG headers, URLs). Users can enable or disable specific extractors to tailor the scan to their investigative needs, optimizing both performance and the relevance of the output.
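
The idea of independently selectable extractors can be sketched as a small registry. The Python code below is a toy model, not bulk_extractor's implementation: the registry, the simplified regular expressions, and the function names are all assumptions chosen to show how enabling only some extractors limits both the work done and the output produced.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Each toy extractor maps raw bytes to (offset, feature) pairs.
Extractor = Callable[[bytes], Iterator[Tuple[int, bytes]]]


def regex_extractor(pattern: bytes) -> Extractor:
    """Build an extractor that reports every match of a byte pattern."""
    compiled = re.compile(pattern)

    def scan(data: bytes) -> Iterator[Tuple[int, bytes]]:
        for m in compiled.finditer(data):
            yield m.start(), m.group(0)

    return scan


# Simplified patterns for illustration only.
REGISTRY: Dict[str, Extractor] = {
    "email": regex_extractor(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": regex_extractor(rb"https?://[A-Za-z0-9._~:/?#@!$&*+,;=%-]+"),
}


def run_scan(data: bytes, enabled: Iterable[str]) -> Dict[str, List[Tuple[int, bytes]]]:
    """Run only the enabled extractors over the raw buffer."""
    return {name: list(REGISTRY[name](data)) for name in enabled}


# Raw bytes with no file system structure, as in unallocated space.
raw = b"\x00\x00contact bob@example.com or see http://example.com/a\x00"
results = run_scan(raw, enabled=["email"])
print(results)  # {'email': [(10, b'bob@example.com')]}
```

Because each extractor works directly on the byte buffer and reports offsets, no file system metadata is needed, and leaving "url" out of the enabled list means that pattern is never evaluated.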

HISTORY

bulk_extractor was initially developed by Simson L. Garfinkel and his team at the Naval Postgraduate School, starting around 2008-2009. The project aimed to create a high-speed, scalable tool for digital forensics capable of processing large disk images quickly, addressing the growing volume of digital evidence. It's an open-source tool that has since become a widely adopted standard in the digital forensics community for its efficiency in identifying and extracting patterns from raw data.