duperemove
Deduplicate files by finding and replacing common chunks
TLDR
Search for duplicate extents in a directory and show them
Deduplicate duplicate extents on a Btrfs or XFS (experimental) filesystem
Use a hash file to store extent hashes (less memory usage and can be reused on subsequent runs)
Limit I/O threads (for hashing and dedupe stage) and CPU threads (for duplicate extent finding stage)
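Example invocations for the entries above, in order (paths and thread counts are placeholders):
duperemove -r path/to/directory
duperemove -r -d path/to/directory
duperemove -r -d --hashfile=path/to/hashfile path/to/directory
duperemove -r -d --io-threads=N --cpu-threads=N path/to/directory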
SYNOPSIS
duperemove [options...] <pathspec>...
PARAMETERS
-r
Enable recursive directory traversal
-d
Deduplicate the duplicate extents found (Btrfs, or XFS as an experimental feature); without this flag duperemove only finds and prints them
-h
Print numbers in human-readable format
--help
Display help text and exit
-v
Be more verbose
-q, --quiet
Print only errors and warnings
--debug
Print debug messages (implies -v)
--hashfile=FILE
Store extent hashes in a SQLite database at FILE instead of in memory; the hashfile can be reused on subsequent runs
-b SIZE
Block size used when hashing file data (default: 128K)
--io-threads=NUM
Threads for I/O, used by the hashing and dedupe stages
--cpu-threads=NUM
Threads for CPU-bound work, used by the duplicate extent finding stage
--hash-threads=NUM
Deprecated synonym for --io-threads
--dedupe-options=LIST
Comma-separated list of dedupe behaviour options (see duperemove(8) for the full list)
--skip-zeroes
Skip data blocks that are entirely zeroes
-x
Do not cross filesystem boundaries when traversing directories
--fdupes
Run in fdupes mode: read a list of identical files from stdin and deduplicate them
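Putting several of these together, a typical recursive dedupe of one directory tree, with a hashfile and explicit thread counts, might look like the following sketch (the hashfile path, thread counts, and target directory are placeholders):
duperemove -r -d -h --hashfile=/var/tmp/data.db --io-threads=4 --cpu-threads=4 /srv/data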
DESCRIPTION
Duperemove is a command-line tool for deduplicating data on Btrfs and XFS filesystems that support copy-on-write (COW) reflinks. It scans the given files or directories, hashes their data in blocks (128 KiB by default), and identifies duplicate extents. Matching extents are then submitted to the kernel's deduplication ioctl, which verifies that the contents are identical before linking them, so multiple files end up sharing the same physical blocks and disk space is reclaimed without copying data.
Extent hashes and file metadata can be kept in a persistent SQLite database (the hashfile) and reused on later runs, so only new or changed data is rescanned. Duperemove also supports multi-threaded hashing and I/O, a scan-only mode (simply omit -d) that previews what would be deduplicated, verbose and debug logging, and options to tune block size and thread counts.
It is well suited to data sets with a lot of redundancy, such as virtual machine images, backups, container layers, or media libraries. An initial full scan builds the hashfile; subsequent runs process only new or changed data. Deduplicating a file requires write access to it (or root), and because the deduplication ioctl rechecks both ranges before linking them, running duperemove on files that are in use is safe, although the extra I/O can be noticeable on busy systems. Run time and memory use scale with the size of the data set and the available RAM and CPU cores.
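A minimal way to see the effect, assuming a Btrfs mount and GNU coreutils (file names here are illustrative): create an unshared copy of a file, deduplicate the pair, and inspect the extents afterwards.
cp --reflink=never bigfile bigfile.copy    # full copy, no shared extents yet
duperemove -d -h bigfile bigfile.copy      # submit the duplicate extents for dedupe
filefrag -v bigfile.copy                   # deduplicated extents now carry the "shared" flag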
CAVEATS
Requires Btrfs or XFS with reflink support.
Write access to the files being deduplicated (or root) is required.
Memory and CPU usage grow with the size of the data set; using a hashfile keeps memory usage bounded.
The dedupe ioctl locks only the ranges it is comparing and verifies their contents first, so running on in-use files is safe, but expect additional I/O load on busy systems.
The hashfile can grow large on big data sets; it can be deleted and rebuilt at any time.
Not for non-reflink filesystems like ext4.
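A quick way to check the filesystem prerequisites before a run (the mount point is a placeholder): findmnt reports the filesystem type, and on XFS the reflink feature flag appears in the xfs_info output.
findmnt -no FSTYPE /mnt/data            # should print btrfs or xfs
xfs_info /mnt/data | grep reflink       # on XFS, reflink=1 means reflink support is enabled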
HASHFILE USAGE
The hashfile is a SQLite database storing extent hashes and file metadata, enabling incremental scans.
Reuse the same hashfile across runs over the same data; unchanged files are detected and not rehashed.
Run duperemove -r -h --hashfile=FILE /path first (without -d) to build the hashfile and review the duplicates, then repeat with -d to deduplicate, as sketched below.
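A sketch of that two-step workflow (the hashfile path and directory are placeholders):
duperemove -r -h --hashfile=/var/tmp/photos.db /srv/photos       # scan only: build the hashfile and list duplicates
duperemove -r -d -h --hashfile=/var/tmp/photos.db /srv/photos    # dedupe, reusing the stored hashes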
PERFORMANCE TIPS
Keep the hashfile on fast storage such as an SSD.
Raise --io-threads and --cpu-threads on multi-core systems.
Split very large directories into several pathspecs to keep individual runs manageable.
On Btrfs, a balance after large dedupe runs can compact partially used block groups.
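For example, a run tuned for a many-core machine, limited to two subtrees instead of the whole volume (thread counts and paths are illustrative):
duperemove -r -d --hashfile=/var/tmp/srv.db --io-threads=8 --cpu-threads=8 /srv/vm-images /srv/backups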
HISTORY
Developed by Mark Fasheh, with early releases appearing around 2013-2014, for the Btrfs community.
Initial release focused on Btrfs COW dedup; v0.11+ added XFS reflink support.
Maintained on GitHub and packaged by major distributions such as Fedora and Debian.