rdfind
Find and remove duplicate files
TLDR
Identify all duplicates in a given directory and output a summary:
    rdfind -dryrun true path/to/directory
Replace all duplicates with hardlinks:
    rdfind -makehardlinks true path/to/directory
Replace all duplicates with symlinks/soft links:
    rdfind -makesymlinks true path/to/directory
Delete all duplicates and do not ignore empty files:
    rdfind -deleteduplicates true -ignoreempty false path/to/directory
SYNOPSIS
rdfind [options] directory1 [directory2] ...
PARAMETERS
-h, --help
Display help message and exit.
-v, --version
Display version information and exit.
-followsymlinks true|false
Follow symbolic links when scanning (default: false).
-dryrun|-n true|false
Perform a dry run without making any changes; the planned actions are only printed.
-makesymlinks true|false
Replace duplicates with symbolic links.
-makehardlinks true|false
Replace duplicates with hard links.
-deleteduplicates true|false
Delete duplicate files. Note: Use with caution!
-outputname name
Specify the name of the results file (default: results.txt).
-ignoreempty true|false
Ignore files of zero length (default: true).
-minsize N
Ignore files smaller than N bytes.
DESCRIPTION
rdfind finds duplicate files across multiple directories.
It scans the specified directories and compares files first by size, then by content checksum (MD5, SHA-1, or SHA-256) to confirm duplicates. Instead of deleting files outright, it lets you choose how duplicates are handled: replace them with hard links, replace them with symbolic links, or delete only the duplicate copies.
This makes it a safer and more flexible alternative to directly deleting duplicates.
rdfind is designed to work efficiently on large datasets and offers various options to customize the comparison process, reduce memory usage, and handle special cases like symbolic links. It's especially useful for managing large media collections or cleaning up redundant data across multiple storage locations.
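The size-first, checksum-second strategy described above can be sketched with ordinary shell tools. This is a simplified illustration of the idea, not rdfind's actual implementation; the directory and file names below are invented for the demo.

```shell
# Self-contained demo of size-then-checksum duplicate detection.
# All file names here are made up for illustration.
demo=$(mktemp -d)
printf 'same content\n' > "$demo/a.txt"
printf 'same content\n' > "$demo/b.txt"      # duplicate of a.txt
printf 'different\n'    > "$demo/c.txt"

# Pass 1 (cheap): only files that share a size can be duplicates.
stat -c '%s %n' "$demo"/*.txt | sort -n

# Pass 2 (confirm): checksum the candidates and group identical hashes.
# md5sum is used here; rdfind can also use SHA-1 or SHA-256.
md5sum "$demo"/*.txt | sort | uniq -w32 --all-repeated=separate
```

The final command prints only the files whose checksums repeat, grouped together, which is conceptually what rdfind records in its results file.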
CAVEATS
Using the `-deleteduplicates true` option can lead to data loss if not used carefully. Always double-check the output before deleting files.
Hard links will preserve the data but all copies will point to the same inode. Any changes to the data on one hard link will affect all the others.
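The shared-inode behaviour is easy to verify with plain coreutils. The file names in this demo are hypothetical; `ln` performs the same hard-linking that `-makehardlinks true` applies to duplicates.

```shell
# Demonstrate that hard links share one inode: a write through
# either name is visible through the other.
demo=$(mktemp -d)
echo "original" > "$demo/keep.txt"
ln "$demo/keep.txt" "$demo/link.txt"   # hard link, as -makehardlinks would create

ls -i "$demo"                          # both names show the same inode number
echo "modified" > "$demo/keep.txt"
cat "$demo/link.txt"                   # prints "modified"
```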
OUTPUT FORMAT
The output file (results.txt by default) contains a list of duplicate file groups. The first file in each group is considered the original, and subsequent files are duplicates. You can then use the output file as input for other tools or scripts to process the duplicate files based on your chosen strategy.
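Assuming the results file uses rdfind's whitespace-separated column layout (duptype, id, depth, size, device, inode, priority, name — worth verifying against your own output before relying on it), the groups can be pulled apart with awk. The sample file below is fabricated for the demo.

```shell
# Fabricated sample in the shape of an rdfind results file;
# check a real results.txt before depending on the exact columns.
cat > results.txt <<'EOF'
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 1 1 13 2049 100 1 /data/a.txt
DUPTYPE_WITHIN_SAME_TREE -1 1 13 2049 101 1 /data/b.txt
EOF

# Print originals and their duplicates (column 8 is the file name).
awk '!/^#/ {
  if ($1 == "DUPTYPE_FIRST_OCCURRENCE") print "original: " $8
  else                                  print "  duplicate: " $8
}' results.txt
```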
ERROR HANDLING
rdfind attempts to handle errors gracefully, such as permission issues or inaccessible files.
It will print error messages to standard error (stderr) and continue processing other files. However, it's crucial to review the output for any error messages to ensure that all files were properly scanned.
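Capturing stderr separately makes that review step straightforward. The same redirection works for any command; in this runnable sketch, `ls` on a nonexistent path stands in for a scan that hits an inaccessible file.

```shell
# Send normal output and diagnostics to separate files so errors
# can be reviewed after the run. `ls` on a missing path stands in
# for a scanner encountering an unreadable file.
ls /nonexistent-demo-path > scan.log 2> errors.log || true

# The review step: anything in errors.log means some files were skipped.
if [ -s errors.log ]; then
  echo "errors occurred; review errors.log"
fi
```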
HISTORY
rdfind was developed to efficiently find duplicate files and provide flexible options for handling them. It evolved to address the limitations of simpler duplicate finders, offering checksum-based verification and support for different linking strategies.
Its usage has grown in scenarios where large datasets require deduplication, such as managing media libraries, backups, or removing redundant data on servers. Development continues to optimize performance, improve error handling, and add new features based on user feedback.