LinuxCommandLibrary

rdfind

Find and remove duplicate files

TLDR

Identify all duplicates in a given directory and output a summary

$ rdfind -dryrun true [path/to/directory]

Replace all duplicates with hardlinks
$ rdfind -makehardlinks true [path/to/directory]

Replace all duplicates with symlinks/soft links
$ rdfind -makesymlinks true [path/to/directory]

Delete all duplicates and do not ignore empty files
$ rdfind -deleteduplicates true -ignoreempty false [path/to/directory]

SYNOPSIS

rdfind [OPTIONS] DIRECTORY...
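A minimal invocation scans one or more directories in a single run and, by default, writes a summary of the duplicate sets to a file named results.txt in the current working directory. The paths below are placeholders:

$ rdfind path/to/dir1 path/to/dir2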

PARAMETERS

-dryrun true|false, -n
    Performs a dry run, printing what rdfind would do to standard output without actually modifying any files (default false). Highly recommended before any destructive operation.

-deleteduplicates true|false
    Deletes all but one instance of each set of duplicate files found (default false). This action permanently removes files and should be used with extreme care after a dry run.

-makehardlinks true|false
    Replaces all but one instance of each set of duplicate files with hard links to the remaining original file (default false). This saves disk space without deleting content, but works only within a single filesystem.

-makesymlinks true|false
    Replaces all but one instance of each set of duplicate files with symbolic links to the remaining original file (default false).

-outputname name
    Writes the list of identified duplicate files to the named results file instead of the default results.txt. Use -makeresultsfile false to suppress the results file entirely.

-checksum md5|sha1|sha256
    Specifies the checksum algorithm to use for content comparison. sha1 is the default in current releases; newer versions also accept sha512.

-ignoreempty true|false
    Controls whether zero-byte (empty) files are skipped during the search (default true). Empty files are often not considered 'true' duplicates; set this to false to include them.

-followsymlinks true|false
    Follows symbolic links while scanning (default false). Note that rdfind always recurses into the given directories; there is no separate recursion option.

-minsize N
    Ignores files smaller than N bytes (default 1, which also excludes empty files).

-removeidentinode true|false
    Excludes entries that share the same device and inode, i.e. names that are already hard links to the same file, from the duplicate sets (default true).

-h, -help, --help
    Displays a help message with available options and usage information.
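
A sketch combining several of these options; the directory path and report name are placeholders. With no action flags set, the run only scans and writes the results file:

$ rdfind -checksum sha256 -minsize 100 -outputname rdfind-report.txt path/to/directory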

DESCRIPTION

The rdfind command is a utility designed to locate and manage duplicate files across one or more specified directory trees. Unlike find, which primarily operates on file metadata, rdfind identifies identical files based on their actual content. It narrows the candidate set in stages: it first groups files by size, then compares the first and last bytes of size-identical files, and finally computes checksums (such as SHA1) to confirm equality. Once duplicates are identified, rdfind offers various actions, including listing the duplicates, replacing redundant copies with hard or symbolic links to a single original file, or deleting all but one instance of each duplicate set. Its efficiency makes it suitable for large filesystems, helping to reclaim disk space or ensure data consistency.
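
Disk usage compared before and after a hard-linking run gives a rough measure of the space reclaimed, since du counts hard-linked content only once. The directory path is a placeholder:

$ du -sh path/to/directory                       # usage before deduplication
$ rdfind -makehardlinks true path/to/directory   # replace duplicates with hard links
$ du -sh path/to/directory                       # usage after; linked content is counted once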

It is crucial to use rdfind's destructive options with caution, always running with -dryrun true first to preview changes.

CAVEATS

Using rdfind with options like -deleteduplicates, -makehardlinks, or -makesymlinks can lead to irreversible data loss or unexpected filesystem changes if used carelessly. Always perform a dry run first (-dryrun true) and review its output thoroughly. Remember that hard-linked copies share a single inode, so editing the content through any one name changes it for all of them. The process can be I/O intensive, especially on large datasets, and may impact system performance during execution. While checksum algorithms like MD5 or SHA1 are highly reliable, a theoretical (though extremely rare) collision risk exists, where two different files could produce the same checksum.

SAFE USAGE PRACTICES

Before executing any command that modifies files (e.g., with -deleteduplicates, -makehardlinks, or -makesymlinks set to true), always run rdfind with -dryrun true. This allows you to inspect the list of identified duplicates and the proposed actions without making any changes. Review the output carefully to ensure that the files rdfind targets are indeed duplicates you wish to manage. Only after you are confident in the proposed actions should you drop the -dryrun option (or set it to false) and execute the command, as in the sketch below.
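
A cautious workflow might look like the following; the directory path is a placeholder, and the final destructive step runs only after the results file has been reviewed:

$ rdfind -dryrun true -deleteduplicates true path/to/directory   # preview only; nothing is changed
$ rdfind path/to/directory                                       # no action flags: scan and write results.txt
$ less results.txt                                               # review each identified duplicate set
$ rdfind -deleteduplicates true path/to/directory                # apply the deletions once satisfied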

HOW IT IDENTIFIES DUPLICATES

rdfind narrows down duplicate candidates in stages:

1. Size comparison: It first groups files by their exact size. Files of different sizes cannot be duplicates, so this significantly reduces the number of files that need deeper analysis.
2. Byte sampling: Within each size group, it compares the first and then the last bytes of the candidate files, cheaply eliminating files that differ near either end without reading them in full.
3. Checksum verification: For the files that remain, rdfind calculates a checksum (e.g., MD5 or SHA1) over their full content. Only if the checksums match are the files considered duplicates. This ensures accuracy even when files have identical names or timestamps but different content. A standalone sketch of the core idea using common shell tools follows this list.
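
For intuition only, the size-then-checksum narrowing can be approximated with GNU findutils and coreutils. This sketch does not replicate rdfind's byte sampling or its ranking of which copy counts as the original:

$ # stage 1: list file sizes; only files sharing a size are duplicate candidates
$ find path/to/directory -type f -printf '%s %p\n' | sort -n
$ # final stage: checksum everything; lines sharing the 40-character SHA1 prefix are duplicates
$ find path/to/directory -type f -exec sha1sum {} + | sort | uniq -w40 --all-repeated=separate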

HISTORY

rdfind (short for "redundant data find") is a specialized utility written by Paul Dreik for the task of efficiently finding and managing duplicate files. It emerged as a practical solution to the common problem of disk space waste and data redundancy on filesystems, and is often used by system administrators and power users for maintenance tasks. Its focus on content-based comparison (checksums) distinguishes it from general-purpose file searching tools.

SEE ALSO

find(1), rm(1), ln(1), md5sum(1), sha1sum(1), fdupes(1)
