duperemove

Find duplicate extents in files and optionally submit them to the filesystem for deduplication

TLDR

Search for duplicate extents in a directory and show them

$ duperemove -r [path/to/directory]

Deduplicate duplicate extents on a Btrfs or XFS (experimental) filesystem

$ duperemove -r -d [path/to/directory]

Use a hash file to store extent hashes (less memory usage and can be reused on subsequent runs)

$ duperemove -r -d --hashfile=[path/to/hashfile] [path/to/directory]

Limit I/O threads (for hashing and dedupe stage) and CPU threads (for duplicate extent finding stage)

$ duperemove -r -d --hashfile=[path/to/hashfile] --io-threads=[n] --cpu-threads=[n] [path/to/directory]

SYNOPSIS

duperemove [OPTIONS] <FILES OR DIRECTORIES...>

-r
    Enable recursive directory traversal.

-d
    Deduplicate the duplicate extents that were found. Only works on Btrfs and XFS (experimental). Without this option, duperemove only reports duplicates.

-A
    Open files read-only when deduplicating. Primarily for privileged users working on read-only snapshots.

-h
    Print numbers in human-readable format.

-q
    Quiet mode: print only errors and a short summary of any dedupe.

-v
    Be verbose.

-b SIZE
    Block size used when reading file extents (default: 128K).

--hashfile=FILE
    Store hashes in FILE instead of in memory. This lowers memory usage, and the file can be reused on subsequent runs.

-L
    Print all files in the hash file and exit. Requires --hashfile.

-R FILE
    Remove FILE from the hash file and exit. Requires --hashfile.

--fdupes
    Run in fdupes mode: read the output of fdupes from standard input and deduplicate the duplicate files it lists.

--skip-zeroes
    Skip zeroed blocks when reading data. This speeds up the scan but can prevent deduplication of zeroed files.

--io-threads=N
    Number of threads to use for I/O, which covers the hashing and dedupe stages (default: detected from the number of host CPUs).

--cpu-threads=N
    Number of threads to use for CPU-bound work, which covers the duplicate extent finding stage (default: detected from the number of host CPUs).

--dedupe-options=OPTIONS
    Comma-separated list of flags that alter how duplicates are matched and deduplicated, such as same, fiemap, and partial, each of which can be prefixed with no.

--exclude=PATTERN
    Exclude files or directories matching the specified glob pattern.

--version
    Display version information and exit.

--help
    Display a help message and exit.

DESCRIPTION

duperemove is a command-line utility that finds duplicate extents (runs of identical data) in files and, optionally, deduplicates them. It reads the files it is given (and, with -r, everything below the given directories), hashes their contents block by block, and compares the hashes to find identical ranges, whether they span whole files or only parts of files. Without -d it only reports what it found. With -d it submits the duplicate ranges to the kernel's dedupe interface, after which the affected files share the same physical blocks on disk while remaining separate, independent files. A later write to one of them goes to new blocks (copy-on-write) and leaves the others untouched. This only works on filesystems that support extent sharing, namely Btrfs and XFS (experimental). duperemove does not create hardlinks or symlinks. Hashing and deduplication are multi-threaded (--io-threads, --cpu-threads), and a hash file (--hashfile) lowers memory usage and lets later runs skip files that have not changed. Typical uses are trees with many near-identical files, such as development environments, backups, or virtual machine images.
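The "hash block by block, then group equal hashes" idea can be illustrated with standard tools. The following is a minimal sketch, not duperemove's implementation: the function names hash_blocks and duplicate_block_hashes are made up for this example, sha256sum stands in for whatever hash duperemove uses internally, and the 128 KiB block size is simply the documented default.

```shell
# Sketch only: hash a file in fixed-size blocks and report repeated hashes.
# hash_blocks and duplicate_block_hashes are hypothetical helper names.

# Print "<hash> <block-index> <file>" for every block of a file.
hash_blocks() {
    hb_file=$1
    hb_bs=${2:-131072}                       # 128 KiB blocks by default
    hb_size=$(wc -c < "$hb_file")
    hb_blocks=$(( (hb_size + hb_bs - 1) / hb_bs ))
    hb_i=0
    while [ "$hb_i" -lt "$hb_blocks" ]; do
        # Read exactly one block at offset hb_i and hash it.
        hb_hash=$(dd if="$hb_file" bs="$hb_bs" skip="$hb_i" count=1 2>/dev/null \
            | sha256sum | cut -d' ' -f1)
        printf '%s %s %s\n' "$hb_hash" "$hb_i" "$hb_file"
        hb_i=$((hb_i + 1))
    done
}

# Print every block hash that occurs more than once across the given files.
duplicate_block_hashes() {
    for dbh_file in "$@"; do
        hash_blocks "$dbh_file"
    done | cut -d' ' -f1 | sort | uniq -d
}
```

Calling duplicate_block_hashes on two files that share one 128 KiB block prints a single hash. A real deduplicator would then map each repeated hash back to its file offsets and ask the filesystem to share those ranges.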

CAVEATS

Running duperemove with -d changes how your files are stored on disk. Run it without -d first to review what would be deduplicated. Deduplication is only supported on Btrfs and XFS (experimental); on other filesystems duperemove can only report duplicates. The kernel compares candidate ranges byte by byte before sharing them, so a hash collision or a file modified during the run results in a skipped range rather than corrupted data. Hashing very large datasets is I/O- and CPU-intensive and, without --hashfile, keeps all hashes in memory. Use --hashfile, --io-threads, and --cpu-threads to limit the impact.
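A cautious workflow is to run a report-only pass and to deduplicate only on explicit request. The wrapper below is a sketch under assumptions: report_then_dedupe and the DUPEREMOVE_BIN variable are hypothetical names introduced here (the variable exists only so the wrapper can be pointed at a stub when testing); the only duperemove options used are -r and -d as shown in the TLDR examples.

```shell
# Hypothetical override so the wrapper can be exercised with a stub command.
DUPEREMOVE_BIN=${DUPEREMOVE_BIN:-duperemove}

# Usage: report_then_dedupe <directory> [apply]
report_then_dedupe() {
    rtd_dir=$1
    rtd_mode=${2:-report}

    # Pass 1: report only (no -d), so nothing on disk changes.
    "$DUPEREMOVE_BIN" -r "$rtd_dir" || return 1

    # Pass 2: deduplicate only when the caller explicitly asks for it.
    if [ "$rtd_mode" = "apply" ]; then
        "$DUPEREMOVE_BIN" -r -d "$rtd_dir"
    fi
}
```

Running report_then_dedupe path/to/directory prints the report; adding apply as the second argument performs the deduplication afterwards.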

FILESYSTEM REQUIREMENTS FOR REFLINK

Deduplication with -d relies on the filesystem being able to share blocks between files with copy-on-write semantics (often called reflink support). On Linux, the filesystems duperemove supports are Btrfs and XFS (experimental). An XFS filesystem must have been created with reflink support (mkfs.xfs -m reflink=1, the default in recent xfsprogs releases); it cannot be turned on with a mount option. On other filesystems (e.g., ext4, FAT, NTFS via FUSE), duperemove can still scan and report duplicates, but it cannot deduplicate them.
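Before deduplicating, a script can check which filesystem a path lives on. This is a rough sketch that assumes GNU coreutils stat; the function names are invented for the example. It only looks at the filesystem type, so on XFS it cannot tell whether reflink was enabled at format time (xfs_info on the mount point reports reflink=1 when it was).

```shell
# Hypothetical helpers: decide from the filesystem type name alone.

# Return 0 if the given filesystem type is one duperemove can dedupe on.
fstype_supports_dedupe() {
    case $1 in
        btrfs|xfs) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print the filesystem type of a path (GNU stat: -f queries the filesystem).
path_fstype() {
    stat -f -c %T "$1"
}
```

A caller might combine the two as: fstype_supports_dedupe "$(path_fstype path/to/directory)" && duperemove -r -d path/to/directory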

DEDUPLICATION METHODS EXPLAINED

duperemove handles duplicates in exactly one way: extent sharing (reflink). Symlinks and hardlinks are used by some other deduplication tools and are described here only for contrast.

Reflink (Copy-on-Write): This is what duperemove -d does. Files share physical data blocks but appear as separate, independent files. A write to one file goes to newly allocated blocks, so the other files that share the old blocks are unaffected. This method is space-efficient, safe, and transparent to applications.

Symlinks (not done by duperemove): One file is kept as the original, and all duplicate files are replaced by symbolic links pointing to it. This saves space but can be inconvenient, because symlinks behave differently from regular files (e.g., when moving or backing up).

Hardlinks (not done by duperemove): One file is kept, and the duplicates are replaced by additional directory entries for the same inode. This saves space and the files look like regular files, but they are no longer independent: a change made through one name is visible through all of them. Hardlinks are also restricted to a single filesystem, and the data is freed only when the last link to the inode is removed.
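The practical difference between a hardlink and an independent (possibly block-sharing) copy can be seen with coreutils alone. This sketch does not invoke duperemove; it assumes GNU cp, whose --reflink=auto shares blocks on filesystems that support it and falls back to an ordinary copy elsewhere. The demo_dir variable and the file names are arbitrary.

```shell
demo_dir=$(mktemp -d)
printf 'original\n' > "$demo_dir/a"

# A hardlink is a second name for the same inode.
ln "$demo_dir/a" "$demo_dir/hard"

# A reflink copy is a separate file that may share blocks with the source.
cp --reflink=auto "$demo_dir/a" "$demo_dir/copy"

# Writing to the copy leaves the source untouched.
printf 'changed copy\n' > "$demo_dir/copy"

# Writing through the hardlink changes the source as well.
printf 'changed via hardlink\n' > "$demo_dir/hard"

cat "$demo_dir/a"      # prints "changed via hardlink"
cat "$demo_dir/copy"   # prints "changed copy"
```

Files deduplicated by extent sharing behave like the copy here: each remains independently writable.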

HISTORY

duperemove was originally written by Mark Fasheh and has since been maintained by other contributors. Its development followed the adoption of copy-on-write Linux filesystems, first Btrfs and later XFS, which let the kernel share identical extents between files. The tool arose from the need for efficient block-level deduplication to manage disk space, especially where many copies of similar data accumulate (e.g., virtual machine images, incremental backups, software development environments). Because it relies on extent sharing rather than replacing files with links, it saves space without altering file paths or breaking applications that expect separate files.