rmlint
Find and remove duplicate files
TLDR
Check directories for duplicated, empty and broken files
Check for duplicates bigger than a specific size, preferably keeping files in tagged directories (after the double slash)
Check for space wasters, keeping everything in the untagged directories
Delete duplicate files found by an execution of rmlint
Find duplicate directory trees based on data, ignoring names
Mark files at lower path [d]epth as originals, on tie choose shorter [l]ength
Find files with identical filename and contents, and link rather than delete the duplicates
Use data as master directory. Find only duplicates in backup that are also in data. Do not delete any files in data
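The commands below sketch how several of these tasks might be invoked. Treat them as illustrations rather than exact recipes: the paths are placeholders, and option spellings can differ slightly between rmlint versions (check `rmlint --help` or the man page).

    # Scan two directories for duplicates, empty files and broken links
    rmlint path/to/directory1 path/to/directory2
    # Run the cleanup script produced by a previous scan, without prompting
    ./rmlint.sh -d
    # Treat directories with identical contents as duplicates, ignoring names
    rmlint --merge-directories path/to/directory
    # Keep the copy at the lowest path depth; break ties by shortest basename
    rmlint -S dl path/to/directory
    # Use data as the untouchable master: only report files in backup that also exist in data
    rmlint --types "duplicates" --must-match-tagged --keep-all-tagged path/to/backup // path/to/data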
SYNOPSIS
rmlint [options] paths... [// tagged_paths...]
PARAMETERS
-o format[:file]
Specify the output format, optionally with a target file (e.g., `-o sh:rmlint.sh`). Common formats include `sh` (a shell script that performs the cleanup), `json`, and `csv`.
-T types
Define what kind of "lint" to find. Multiple types can be comma-separated, e.g., `duplicates,emptyfiles,badlinks`.
-c sh:symlink, -c sh:hardlink
Configure the generated shell script to replace duplicates with symbolic or hard links to the kept original instead of deleting them.
-D, --merge-directories
Treat directories whose contents are identical as duplicates of each other, regardless of file names.
-s, --size range
Only consider files within the given size range (e.g., `100K-1M`), which is useful for ignoring very small files.
-S, --rank-by criteria
Choose which copy in a duplicate group is kept as the original, e.g. `-S dl` keeps the file at the lowest path depth and breaks ties by shortest basename.
-k, --keep-all-tagged
Never remove files from tagged directories (those listed after `//`); combine with `-m` (`--must-match-tagged`) to only report duplicates that also have a copy in a tagged directory.
-t, --threads N
Use N threads for traversal and hashing, speeding up scans on multi-core systems.
--version
Display version information and exit.
-h, --help
Show the help message and exit.
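As a hedged sketch of how these options can combine (the directory and size range are placeholders, and `-c sh:hardlink` relies on the shell-script formatter described below):

    # Find duplicates between 1 MiB and 1 GiB, prefer the copy at the lowest path depth,
    # and have the generated script hardlink duplicates instead of deleting them
    rmlint -T duplicates -s 1M-1G -S d -c sh:hardlink ~/photos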
DESCRIPTION
rmlint is a fast command-line utility for finding and cleaning up various kinds of "lint" on a file system. Its primary strength is identifying and managing duplicate files, but it also detects empty files, empty directories, broken symbolic links, and non-stripped binaries. Rather than comparing every file against every other, rmlint narrows candidates progressively: first by size, then by incremental hashing, and optionally by full byte-by-byte comparison for maximum certainty. It provides flexible output formats, most notably a shell script, so users can review and execute cleanup operations themselves; this design emphasizes data safety by preventing accidental deletion. By detecting redundant data and facilitating its removal or replacement with links, rmlint helps reclaim disk space and keep a file system tidy, making it a valuable tool for routine maintenance.
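For example, to look only for some of the non-duplicate lint types mentioned above, a run along these lines should suffice (the directory is a placeholder):

    # Report only empty files, empty directories and broken symlinks
    rmlint -T "emptyfiles,emptydirs,badlinks" ~/projects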
CAVEATS
Always review the generated shell script (typically `rmlint.sh`) before executing it to ensure no critical files are inadvertently deleted.
Be particularly cautious when executing the generated script non-interactively or with its confirmation prompts disabled, as this bypasses the review step.
Replacing duplicates with hardlinks or symlinks (e.g., via `-c sh:hardlink` or `-c sh:symlink`) modifies file system metadata; understand the implications for backups and other tools.
Scanning large file systems can be resource-intensive, consuming significant CPU and disk I/O.
OUTPUT FORMATS
rmlint supports several output formats, specified via the `-o` option. The most commonly used is `sh`, which generates a shell script (by default `rmlint.sh`) containing the commands needed to remove or link the identified lint. This lets users inspect exactly what will be deleted or linked before anything happens. Other formats such as `json` and `csv` are available for programmatic integration or further analysis.
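Assuming the `format:file` spelling described above, a report-only run might look like this (file names are placeholders):

    # Write a CSV report of the findings instead of the default shell script
    rmlint -o csv:duplicates.csv path/to/directory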
SAFETY AND REVIEW
A core principle of rmlint's design is safety. By default, it does not directly delete files or modify the file system. Instead, it generates a shell script containing the necessary commands to perform the cleanup (e.g., `rm`, `ln`). Users are strongly encouraged to review this generated script (typically found in the current directory as `rmlint.sh`) to ensure that only intended files are affected. This crucial safety net helps prevent accidental data loss and provides full control over the cleanup process.
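A typical review-then-execute session might look like the following; the `-n` dry-run flag of the generated script is an assumption that may not be present in every version:

    rmlint ~/music      # scan; writes rmlint.sh (and a JSON report) to the current directory
    less rmlint.sh      # inspect exactly what would be removed or linked
    ./rmlint.sh -n      # dry run: print the planned actions without performing them (if supported)
    ./rmlint.sh         # perform the cleanup once satisfied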
HISTORY
rmlint was created by Christopher Pahl and has been actively developed as a modern, fast, and feature-rich alternative to older duplicate file finders. Its design prioritizes speed on large datasets through efficient hashing and parallel processing, and it gained popularity for its safety-first default of generating reviewable shell scripts rather than modifying the file system directly.