duperemove

Find duplicated extents and submit them for deduplication

TLDR

Search for duplicate extents in a directory and show them

$ duperemove -r [path/to/directory]

Find duplicate extents and deduplicate them on a Btrfs or XFS (experimental) filesystem
$ duperemove -r -d [path/to/directory]

Use a hash file to store extent hashes (reduces memory usage and can be reused on subsequent runs)
$ duperemove -r -d --hashfile=[path/to/hashfile] [path/to/directory]

Limit I/O threads (for hashing and dedupe stage) and CPU threads (for duplicate extent finding stage)
$ duperemove -r -d --hashfile=[path/to/hashfile] --io-threads=[n] --cpu-threads=[n] [path/to/directory]

SYNOPSIS

duperemove [options...] <pathspec>...

PARAMETERS

-r
    Enable recursive traversal of directories given in the pathspec

-d
    Deduplicate the duplicate extents found; without this option duperemove only scans and reports. Works on Btrfs and, experimentally, XFS

-A
    Open files read-only when requesting a dedupe; intended for privileged users deduplicating read-only snapshots

-h
    Print numbers in human-readable units

-q, --quiet
    Reduce output to errors and a final summary

-v
    Increase output verbosity

--debug
    Print debug messages (implies -v)

--hashfile=FILE
    Store extent hashes in an on-disk SQLite database instead of in memory; the file can be reused for incremental runs

--io-threads=NUM
    Threads for I/O-bound stages (hashing and dedupe); defaults to the number of detected CPUs

--cpu-threads=NUM
    Threads for CPU-bound stages (duplicate extent finding); defaults to the number of detected CPUs

--hash-threads=NUM
    Deprecated alias for --io-threads

-b SIZE
    Block size used when reading file data (default: 128K)

--dedupe-options=OPTIONS
    Comma-separated list of dedupe options to enable or disable, e.g. [no]same, [no]partial

--skip-zeroes
    Skip reading and hashing blocks that contain only zeroes

--fdupes
    Run in fdupes mode: read an fdupes-style duplicate-file list from stdin and deduplicate whole files (see the example after this list)

-x
    Do not cross filesystem boundaries when recursing

--exclude=PATTERN
    Exclude files and directories matching the glob PATTERN from the scan

--help
    Display help message and exit
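
As an illustration, a typical invocation combines recursion, deduplication, human-readable output, and a hashfile (bracketed values are placeholders):

$ duperemove -r -d -h --hashfile=[path/to/hashfile] [path/to/directory]

Whole-file duplicates found by fdupes(1) can also be piped in via --fdupes, assuming fdupes is installed and its default output format is used:

$ fdupes -r [path/to/directory] | duperemove --fdupes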

DESCRIPTION

Duperemove is a command-line tool for deduplicating data on Btrfs and XFS filesystems that support copy-on-write (COW) reflinks. It scans files or directories, computes checksums of their data in fixed-size blocks (128 KiB by default), and identifies duplicate extents. Matching extents are then submitted to the kernel for deduplication, which converts them into shared, reflinked blocks so that multiple files reference the same physical data without copying, saving disk space.
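
One way to observe the effect on Btrfs is to compare shared space before and after a run with btrfs filesystem du (requires btrfs-progs with that subcommand; the path is a placeholder):

$ duperemove -r -d [path/to/directory]
$ btrfs filesystem du -s [path/to/directory]

The "Set shared" column reports data that is now referenced by more than one file.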

Key advantages include incremental operation via a persistent SQLite hash database (the hashfile), which stores extent hashes and metadata for reuse in later scans, minimizing reprocessing. Duperemove also supports multi-threaded hashing and deduplication, a preview mode (running without -d) to inspect duplicates before anything is changed, verbose and debug logging, and options to tune thread counts and scan behavior.
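
Because the hashfile persists between runs, incremental use looks roughly like this (paths are placeholders):

$ duperemove -r -d --hashfile=[path/to/hashfile] [path/to/directory]

Later, after files have been added or changed, rerun with the same hashfile; only new or modified files are rehashed:

$ duperemove -r -d --hashfile=[path/to/hashfile] [path/to/directory]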

It is well suited to data sets with heavy redundancy such as virtual machine images, backups, container layers, or media libraries. An initial full scan builds the hashfile; subsequent runs rehash only new or changed data. Deduplication needs write access to the target files (root, or the -A option, for read-only snapshots) and is safe on live data because the kernel verifies that extents are byte-for-byte identical before sharing them, briefly locking the affected ranges while it does so. Runtime and memory use grow with the size of the dataset and benefit from more RAM and CPU cores.
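
For read-only Btrfs snapshots, a sketch of a privileged run (the snapshot path is a placeholder, and this assumes your version supports the -A option):

$ sudo duperemove -r -d -A --hashfile=[path/to/hashfile] [path/to/snapshot]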

CAVEATS

Requires Btrfs or XFS with reflink support; non-reflink filesystems such as ext4 are not supported (see the check below).
Write access to the target files is required; root (or the -A option) is needed for read-only snapshots.
RAM and CPU usage can be high on very large datasets; a hashfile reduces memory use.
The kernel briefly locks the ranges being deduplicated, so throughput on heavily written files may drop during a run.
The hashfile can grow large on big datasets and can be deleted and rebuilt at any time.
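
A quick way to confirm that a path is on a supported filesystem (the path is a placeholder):

$ stat -f --format=%T [path/to/directory]

This prints the filesystem type name, e.g. btrfs or xfs.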

HASHFILE USAGE

The hashfile is an SQLite database holding extent hashes and file metadata, enabling incremental scans.
It is tied to the scanned filesystem (it records inode and extent information), so reuse it across runs on the same data rather than copying it between machines.
Run a scan without -d first to build the hashfile and preview results, then add -d to deduplicate (see the example below).
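
A preview-then-act sequence using one hashfile (paths are placeholders):

$ duperemove -r --hashfile=[path/to/hashfile] [path/to/directory]
$ duperemove -r -d --hashfile=[path/to/hashfile] [path/to/directory]

The first run only reports duplicates; the second reuses the stored hashes and submits them for deduplication.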

PERFORMANCE TIPS

Keep the hashfile on fast storage such as an SSD.
Tune --io-threads and --cpu-threads to match your storage and CPU (see the example below).
Split very large directory trees into several pathspecs or separate runs.
Consider running btrfs balance after a large dedupe to reclaim chunk space.
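
An example of explicit thread tuning (the counts are placeholders; recent versions choose defaults from the detected CPU count):

$ duperemove -r -d --hashfile=[path/to/hashfile] --io-threads=4 --cpu-threads=8 [path/to/directory]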

HISTORY

Developed by Mark Fasheh for the Btrfs community, with public releases beginning around 2013.
Early releases focused on Btrfs COW deduplication; later versions added experimental support for XFS reflinks.
Maintained on GitHub and packaged by several distributions, including Fedora.

SEE ALSO

btrfs(8), xfs_io(8), fdupes(1), jdupes(1), rdfind(1)
