duperemove
Find duplicated extents in files and submit them to the kernel for deduplication
TLDR
Search for duplicate extents in a directory and show them
Deduplicate duplicate extents on a Btrfs or XFS (experimental) filesystem
Use a hash file to store extent hashes (less memory usage and can be reused on subsequent runs)
Limit I/O threads (for hashing and dedupe stage) and CPU threads (for duplicate extent finding stage)
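For reference, the items above correspond to invocations along these lines (path/to/directory, path/to/hashfile, and N are placeholders):
duperemove -r path/to/directory
duperemove -r -d path/to/directory
duperemove -r -d --hashfile=path/to/hashfile path/to/directory
duperemove -r -d --hashfile=path/to/hashfile --io-threads=N --cpu-threads=N path/to/directory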
SYNOPSIS
duperemove [OPTIONS] <FILES|DIRECTORIES...>
PARAMETERS
-r
Recurse into subdirectories when scanning for files to hash and deduplicate.
--fdupes
Run in fdupes mode: read a list of duplicate files from stdin (in the format produced by fdupes) and deduplicate them, instead of scanning for duplicates itself.
-q, --quiet
Quiet mode: only print errors and a short summary of the space deduplicated.
-d
Submit the duplicate extents found for deduplication. Without this option, duperemove only scans and reports potential savings without making any changes.
--io-threads=N
Number of threads to use for I/O-bound work, i.e. the hashing and dedupe stages (--hash-threads is an older alias). Defaults to a value based on the number of CPUs detected.
--cpu-threads=N
Number of threads to use for CPU-bound work, i.e. the duplicate extent finding stage. Defaults to a value based on the number of CPUs detected.
-p, --print-dupes
Print the paths of identified duplicate files.
-v, --verbose
Increase verbosity, showing more detailed progress and information.
-z, --0bytes
Treat the input file list as NUL-separated (e.g., as produced by find -print0).
-o, --omit-zero-length
Exclude zero-length files from the scan.
-A
Open files read-only when deduplicating. Primarily useful for privileged users deduplicating read-only snapshots.
-H TYPE, --hash=TYPE
Specify the hashing algorithm (e.g., 'murmur3', 'sha256').
-f, --fiemap
Use FIEMAP ioctl to detect common extents on filesystems that support it (faster for block-level deduplication).
-P, --progress
Show a progress bar during scanning and deduplication.
-i, --inodes
Process files based on their inode numbers, useful for hardlinked files.
-m, --match-size
Only compare files of the exact same size (faster but can miss some duplicates).
-x PATTERN, --exclude=PATTERN
Exclude files or directories matching the specified glob pattern.
-b SIZE
Use the specified block size when hashing (default: 128K). Larger block sizes use less memory but may miss smaller duplicate regions.
--size=BYTES
Only process files larger than or equal to this size.
--dedupe-file=FILE
Deduplicate against a list of files from a specified file.
--hashfile=FILE
Store extent hashes in FILE instead of in memory. This reduces memory usage, and the hashfile can be reused on subsequent runs so unchanged files are not rehashed.
--skip-gaps
Skip zero-filled gaps during block-level deduplication.
--dedupe-gaps
Attempt to deduplicate zero-filled gaps as well.
--skip-zeroes
Read data blocks and skip any that are entirely zero-filled; this speeds up scanning but can prevent deduplication of zeroed files.
--sparse
Create sparse files when writing, if applicable.
--read-only
Do not attempt to write to or modify anything, including the hashfile.
--version
Display version information and exit.
--help
Display a help message and exit.
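As a sketch of how these options combine in practice (the paths and thread counts are placeholders), a cautious workflow is to scan first, review the report, and only then deduplicate while reusing the same hashfile so files do not need to be rehashed:
duperemove -r --hashfile=/var/tmp/dupehashes /data
duperemove -r -d --hashfile=/var/tmp/dupehashes --io-threads=4 --cpu-threads=4 /data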
DESCRIPTION
duperemove is a command-line utility for finding and deduplicating redundant data on a filesystem. It scans the specified files and directories, hashing their contents at extent/block granularity to identify duplicate data, and can store those hashes in a hashfile so that unchanged files need not be rehashed on later runs. Identified duplicate extents are then submitted to the kernel for deduplication, after which the affected files share the same physical blocks through copy-on-write (reflink) while remaining logically independent. This extent sharing is supported on filesystems such as Btrfs and XFS (experimental) and can reclaim significant disk space without creating extra copies or altering file paths. Hashing and I/O are multi-threaded to speed up large scans, and running without -d performs a scan-only pass that reports potential savings before any permanent change is made. It is an invaluable tool for managing disk space on systems with many duplicate files, such as development environments, backups, or virtual machine images.
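As a small illustration of the copy-on-write sharing described above (file names are placeholders, and the verification step assumes a Btrfs filesystem):
cp bigfile.img copy.img                    # second copy initially occupies its own space
sync                                       # make sure the data is on disk before hashing
duperemove -d bigfile.img copy.img
btrfs filesystem du bigfile.img copy.img   # the 'Set shared' column now reflects the shared extents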
CAVEATS
Running duperemove with -d modifies your filesystem. Run it without -d first to review what would be deduplicated and confirm that no critical data is affected. Deduplication relies on the kernel's copy-on-write extent-sharing support and is therefore only available on filesystems such as Btrfs and XFS (with reflink enabled); on other filesystems duperemove can only report duplicates. Because the kernel verifies that extents are identical before sharing them, deduplication is safe for data integrity, but as with any tool that rewrites extent mappings you should keep backups of important data. Also be mindful of the CPU and I/O overhead of hashing very large datasets; a hashfile and the thread options can help keep this manageable.
FILESYSTEM REQUIREMENTS FOR REFLINK
Reflink-based extent sharing, the mechanism duperemove uses to deduplicate, is only supported by filesystems that implement copy-on-write at the block level and expose the kernel deduplication ioctl. The most common Linux filesystems supporting this are Btrfs and XFS (on XFS, reflink support must be enabled when the filesystem is created, which is the default with recent xfsprogs). On traditional filesystems (e.g., ext4, FAT, NTFS via FUSE), duperemove can still find and report duplicate data, but it cannot deduplicate it.
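To check whether an existing XFS filesystem supports reflink (the mount point and device below are placeholders), xfs_info can be inspected, and new filesystems can enable it explicitly:
xfs_info /mnt/data | grep reflink       # reflink=1 means extent sharing is available
mkfs.xfs -m reflink=1 /dev/sdXN         # enable reflink when creating a new XFS filesystem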
DEDUPLICATION METHODS EXPLAINED
Several approaches exist for handling duplicate files; duperemove itself uses only the first, but the comparison is useful (a short demonstration follows this list):
Reflink (copy-on-write): The method duperemove uses, and the recommended one on compatible filesystems. Files share physical data blocks but remain separate, independent files; modifying one file writes the changed data to new blocks, leaving the shared blocks untouched. This is space-efficient, safe, and transparent to applications.
Symlinks: Some deduplication tools instead keep one file as the original and replace duplicates with symbolic links pointing to it. This saves space but can be inconvenient, since symlinks behave differently from regular files (e.g., when moving or backing up).
Hardlinks: Alternatively, duplicates can be replaced with hardlinks to a single inode. Space is saved and the files appear as regular files, but hardlinks cannot cross filesystems, a modification through one name is visible through all of them, and the data is only freed once the last link to the inode is removed.
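A quick way to see the difference between a reflink copy and a hardlink (file names are placeholders; the reflink copy assumes a filesystem with reflink support, such as Btrfs or XFS):
cp --reflink=always original.dat reflinked.dat   # new inode, shared data blocks
ln original.dat hardlinked.dat                   # same inode, same data
ls -li original.dat reflinked.dat hardlinked.dat # the hardlink shares the inode number; the reflink copy does not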
HISTORY
duperemove was written by Mark Fasheh and is maintained with the help of community contributors. Its development was driven by the adoption of modern Linux filesystems such as Btrfs and XFS, which offer native copy-on-write (CoW) extent sharing, and by the need for an efficient block-level deduplication tool for workloads that accumulate many copies of similar data (e.g., virtual machine images, incremental backups, software development environments). By leveraging CoW, it saves space without altering file paths or breaking applications that expect separate files, in contrast to older approaches based on symlinking or hardlinking duplicates.