fdupes
Identify and optionally delete duplicate files
TLDR
Search a single directory:
    fdupes path/to/directory
Search multiple directories:
    fdupes path/to/directory1 path/to/directory2
Search a directory recursively:
    fdupes -r path/to/directory
Search multiple directories, one of them recursively:
    fdupes path/to/directory1 -R path/to/directory2
Search recursively, considering hardlinked files as duplicates:
    fdupes -rH path/to/directory
Search recursively and choose interactively which duplicates to keep, deleting the others:
    fdupes -rd path/to/directory
Search recursively and delete duplicates without prompting:
    fdupes -rdN path/to/directory
SYNOPSIS
fdupes [options] directory ...
PARAMETERS
-r, --recurse
Recursively scan specified directories.
-s, --symlinks
Follow symbolic links found in arguments and during recursive scans.
-d, --delete
Prompt user for files to preserve, delete all others in a set of duplicates.
-N, --noprompt
Used with --delete; preserve the first file, delete others without prompting. Use with extreme caution!
-f, --omitfirst
Omit the first file in each set of duplicates from output.
-H, --hardlinks
Normally, files that are hard links to the same data are treated as non-duplicates; this option treats them as duplicates instead.
-S, --size
Show the size of duplicate files.
-m, --summarize
Summarize duplicate file information, showing total duplicates and space saved.
-n, --noempty
Exclude zero-length files from consideration.
-1, --sameline
List each set of duplicate files on a single line, separated by spaces.
-q, --quiet
Do not show progress indicator.
-p, --permissions
Don't consider files with different owner/group or permission bits as duplicates.
-v, --version
Display fdupes version information and exit.
-h, --help
Display usage instructions and exit.
DESCRIPTION
fdupes is a command-line utility, written in C, for finding and managing duplicate files within specified directories. It identifies duplicates by first comparing file sizes, then MD5 signatures of file contents, and finally performing a byte-by-byte comparison to confirm each match. Filtering by size and signature first keeps the expensive byte-by-byte comparisons to a minimum, which makes the tool fast in practice.
Once duplicates are found, fdupes offers several actions: it can list them, summarize them, or delete them, either interactively (the user chooses which files to preserve in each set) or automatically without prompting. It supports recursive scanning, following symbolic links, and several output formats. This makes fdupes a handy tool for cleaning up cluttered storage, reclaiming disk space, and keeping a file system organized.
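The size-then-signature-then-byte-compare pipeline described above can be sketched in Python. This is an illustrative model of the technique, not fdupes's actual C implementation; the function name find_duplicates is ours:

```python
import hashlib
import os
from collections import defaultdict
from filecmp import cmp

def find_duplicates(paths):
    """Group files by size, then by MD5, then confirm matches byte-by-byte."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in candidates:
            with open(path, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)
        for group in by_hash.values():
            # Byte-by-byte confirmation guards against hash collisions.
            confirmed = [p for p in group[1:] if cmp(group[0], p, shallow=False)]
            if confirmed:
                duplicate_sets.append([group[0], *confirmed])
    return duplicate_sets
```

A production tool such as fdupes reads and hashes in chunks and can short-circuit on a partial signature, rather than reading whole files up front as this sketch does.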
CAVEATS
- Using options like --delete, especially in combination with --noprompt, can lead to irreversible data loss if used incorrectly. Always double-check commands before execution.
- fdupes does not scan directories recursively by default; pass --recurse (-r) explicitly when you want subdirectories included.
- Deleting one of a set of "duplicates" that are actually hard links to the same data frees no space, and replacing duplicates with hard links or symlinks (by hand or with other tools) changes the filesystem structure. Be aware of the implications, especially for backups or applications that expect distinct file paths.
- fdupes uses MD5 signatures as a fast filter, but it confirms each candidate match with a byte-by-byte comparison, so a hash collision alone will not cause two different files to be reported as duplicates.
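The hard-link caveat above is easy to see concretely: two hard-linked names share one inode, so the data is stored only once and removing one name frees nothing. A small Python illustration on a POSIX filesystem (the file names are arbitrary):

```python
import os
import tempfile

d = tempfile.mkdtemp()
original = os.path.join(d, "report.txt")
with open(original, "w") as f:
    f.write("same bytes")

# Create a second name for the same inode, as `ln report.txt copy.txt` would.
link = os.path.join(d, "copy.txt")
os.link(original, link)

# Both names resolve to the same inode; its link count is now 2.
same_inode = os.stat(original).st_ino == os.stat(link).st_ino
link_count = os.stat(original).st_nlink
print(same_inode, link_count)
```

This is why fdupes treats hard-linked files as non-duplicates unless -H is given: deleting one of them would not reclaim any space.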
PERFORMANCE CONSIDERATIONS
fdupes groups files by size, which quickly discards most non-duplicates before any file content is read. For large numbers of files, especially on slow or remote storage, performance is typically dominated by disk I/O; using faster storage or narrowing the scope of the scan improves speed.
CHECKSUM ALGORITHMS
fdupes uses MD5 signatures to compare file contents. Because every candidate match is additionally confirmed with a byte-by-byte comparison, the choice of hash affects speed rather than correctness. Check your version's man page for the exact comparison behavior.
INTERACTIVE DELETION
The interactive deletion mode (-d) presents each set of duplicate files and asks which ones to keep. This is the safest way to remove duplicates, since it gives granular control: the files are numbered, and the user enters the numbers of the files to preserve, separated by commas or spaces, or the word all to preserve every file in the current set.
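The selection step can be modeled with a small parser. This is a simplified sketch of the prompt semantics described above, assuming input is numbers separated by commas or spaces, or the word all; parse_preserve is a hypothetical helper, not part of fdupes:

```python
def parse_preserve(reply, count):
    """Return the 1-based indices of files to keep from a set of `count` files.

    Numbers may be separated by commas or spaces; the word "all" preserves
    every file in the set. Out-of-range tokens are ignored.
    """
    reply = reply.strip()
    if reply.lower() == "all":
        return set(range(1, count + 1))
    keep = set()
    for token in reply.replace(",", " ").split():
        if token.isdigit() and 1 <= int(token) <= count:
            keep.add(int(token))
    return keep
```

Every file whose number is not returned by the parser would then be deleted, which is why the prompt repeats for each set rather than acting in bulk.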
HISTORY
fdupes was originally written by Adrian Lopez and has been publicly available since the early 2000s. It quickly became a popular, lightweight utility for duplicate file detection on Unix-like systems thanks to its efficiency and straightforward approach. Over the years it has received contributions from various developers, adding refinements such as result ordering, summary output, and additional options for handling duplicates. Development has focused on keeping it a fast, reliable tool for a common system-administration task.