LinuxCommandLibrary

rdfind

Find and remove duplicate files

TLDR

Identify all duplicates in a given directory and output a summary

$ rdfind -dryrun true [path/to/directory]

Replace all duplicates with hardlinks
$ rdfind -makehardlinks true [path/to/directory]

Replace all duplicates with symlinks/soft links
$ rdfind -makesymlinks true [path/to/directory]

Delete all duplicates and do not ignore empty files
$ rdfind -deleteduplicates true -ignoreempty false [path/to/directory]

SYNOPSIS

rdfind [options] directory1 [directory2] ...

PARAMETERS

-h, --help
    Display help message and exit.

-v, --version
    Display version information and exit.

-followsymlinks true|false
    Follow symbolic links (default: false).

-dryrun true|false
    Perform a dry run: report what would be done without changing anything.

-makesymlinks true|false
    Replace duplicates with symbolic links.

-makehardlinks true|false
    Replace duplicates with hard links.

-deleteduplicates true|false
    Delete duplicate files. Note: Use with caution!

-outputname name
    Specify the results file name (default: results.txt).

-ignoreempty true|false
    Ignore zero-length files (default: true).

-minsize N
    Ignore files smaller than N bytes.
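
For example, to preview a deduplication run that skips files smaller than 1 KiB:

$ rdfind -minsize 1024 -dryrun true [path/to/directory]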

DESCRIPTION

rdfind finds duplicate files across multiple directories.
It scans the given directories and compares files first by size, then by their first and last bytes, and finally by checksum (MD5 or SHA-1) to confirm that contents match. Directories are ranked in command-line order: files in the first directory listed are treated as the originals, and matching files found afterwards are classified as duplicates.
Rather than deleting files outright, rdfind lets you choose how duplicates are handled: replacing them with hard links, replacing them with symbolic links, or deleting them. This makes it a safer and more flexible alternative to blindly deleting files.
rdfind is designed to work efficiently on large datasets and offers various options to customize the comparison process, reduce memory usage, and handle special cases like symbolic links. It's especially useful for managing large media collections or cleaning up redundant data across multiple storage locations.
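
A cautious workflow is to preview first, inspect the results file, and only then let rdfind make changes. The paths below are placeholders; list the directory whose copies you want to keep first:

$ rdfind -dryrun true [path/to/primary] [path/to/backup]
$ less results.txt
$ rdfind -makehardlinks true [path/to/primary] [path/to/backup]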

CAVEATS

Using the `-deleteduplicates true` option can lead to data loss if not used carefully. Always double-check the output before deleting files.
Hard links will preserve the data but all copies will point to the same inode. Any changes to the data on one hard link will affect all the others.
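
Since hard-linked copies share an inode, you can verify the result by comparing inode numbers with ls -i (the file names and inode number below are illustrative):

$ ls -i photo.jpg photo_copy.jpg
393219 photo.jpg  393219 photo_copy.jpg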

OUTPUT FORMAT

The output file (results.txt by default) contains a list of duplicate file groups. The first file in each group is considered the original, and subsequent files are duplicates. You can then use the output file as input for other tools or scripts to process the duplicate files based on your chosen strategy.
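
As a sketch of what results.txt contains (the values below are made up, and the exact columns can vary between rdfind versions), each entry carries a duplicate-type tag, metadata such as size and inode, and the file name:

# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE  1 1 1048576 2049 393219 1 /data/a/photo.jpg
DUPTYPE_WITHIN_SAME_TREE -1 1 1048576 2049 393220 1 /data/b/photo.jpg

To extract only the duplicate paths for further processing (this assumes file names contain no spaces, since awk splits fields on whitespace):

$ awk '!/^#/ && $1 != "DUPTYPE_FIRST_OCCURRENCE" { print $NF }' results.txt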

ERROR HANDLING

rdfind attempts to handle errors gracefully, such as permission issues or inaccessible files.
It will print error messages to standard error (stderr) and continue processing other files. However, it's crucial to review the output for any error messages to ensure that all files were properly scanned.
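
To keep those messages for later review, redirect stderr to a file (the log file name here is arbitrary):

$ rdfind -dryrun true [path/to/directory] 2> rdfind-errors.log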

HISTORY

rdfind ("redundant data find") was written by Paul Dreik to efficiently find duplicate files and provide flexible options for handling them. It evolved to address the limitations of simpler duplicate finders, offering checksum-based verification and support for different linking strategies.
Its usage has grown in scenarios where large datasets require deduplication, such as managing media libraries, backups, or removing redundant data on servers. Development continues to optimize performance, improve error handling, and add new features based on user feedback.

SEE ALSO

fdupes(1), find(1)
