LinuxCommandLibrary

fpsync

Synchronize files between two file systems

TLDR

Recursively synchronize a directory to another location

$ fpsync -v /[path/to/source]/ /[path/to/destination]/

Recursively synchronize a directory, including a final pass that enables rsync's --delete option for each synchronization job (files missing from the source are deleted from the destination)
$ fpsync -v -E /[path/to/source]/ /[path/to/destination]/

Recursively synchronize a directory to a destination using 8 concurrent synchronization jobs
$ fpsync -v -n 8 -E /[path/to/source]/ /[path/to/destination]/

Recursively synchronize a directory to a destination using 8 concurrent synchronization jobs spread over two remote workers (machine1 and machine2)
$ fpsync -v -n 8 -E -w login@machine1 -w login@machine2 -d /[path/to/shared/directory] /[path/to/source]/ /[path/to/destination]/

Recursively synchronize a directory to a destination using 4 local workers, each one transferring at most 1000 files and 100 MB per synchronization job
$ fpsync -v -n 4 -f 1000 -s $((100 * 1024 * 1024)) /[path/to/source]/ /[path/to/destination]/
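The shell arithmetic in the -s argument above only converts mebibytes to bytes. A minimal sketch, using a hypothetical helper named mib (an illustration, not part of fpsync), makes the conversion explicit and prints the resulting command instead of running it:

```shell
# Hypothetical helper (not part of fpsync): convert mebibytes to bytes.
mib() {
    echo $(( $1 * 1024 * 1024 ))
}

size=$(mib 100)
echo "$size"    # 104857600

# Print the command line rather than executing it; the paths are placeholders.
printf 'fpsync -v -n 4 -f 1000 -s %s /path/to/source/ /path/to/destination/\n' "$size"
```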

Recursively synchronize a directory but exclude files matching .snapshot* (Note: within -O, options and their values must be separated by a pipe character)
$ fpsync -v -O "-x|.snapshot*" /[path/to/source]/ /[path/to/destination]/
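Writing the pipe-separated -O value by hand gets error-prone with several exclusions. The sketch below assumes a hypothetical helper named join_pipe (not part of fpsync or fpart) that joins its arguments with a pipe character. The .zfs pattern is only an illustrative second exclusion, and the final command is printed rather than executed:

```shell
# Hypothetical helper (not part of fpsync): join all arguments with '|'.
join_pipe() {
    out=""
    first=1
    for arg in "$@"; do
        if [ "$first" -eq 1 ]; then
            out=$arg
            first=0
        else
            out="$out|$arg"
        fi
    done
    printf '%s\n' "$out"
}

# Each fpart option and its value is passed as a separate argument.
fpart_opts=$(join_pipe -x '.snapshot*' -x '.zfs')
printf '%s\n' "$fpart_opts"

# Print the command line rather than executing it; the paths are placeholders.
printf 'fpsync -v -O "%s" /path/to/source/ /path/to/destination/\n' "$fpart_opts"
```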

SYNOPSIS

fpsync [OPTIONS] SOURCE_DIR DESTINATION_DIR

PARAMETERS

-n <jobs>
    Sets the number of concurrent synchronization jobs (default: 2).

-f <files>
    Transfers at most <files> files or directories per synchronization job.

-s <size>
    Transfers at most <size> bytes per synchronization job.

-w <worker>
    Runs synchronization jobs on a remote worker reached over SSH, e.g. login@machine1. May be given several times. Remote workers require -d.

-d <shdir>
    Sets a shared directory, reachable from every worker, where fpsync stores its temporary data.

-t <tmpdir>
    Sets the temporary directory used for partition lists, the job queue, and logs. Defaults to /tmp/fpsync.

-m <tool>
    Selects the copy tool: rsync (default), cpio, pax, tar, or tarify.

-o <toolopts>
    Overrides the default options passed to the copy tool.

-O <fpartopts>
    Overrides the default options passed to fpart. Options and values must be separated by a pipe character, e.g. "-x|.snapshot*".

-E
    Works on a per-directory basis and enables rsync's --delete option for each synchronization job (rsync only). Destructive: files absent from the source are removed from the destination.

-S
    Uses sudo for file system crawling and synchronization.

-p
    Prepares a run (crawl and partitioning) without starting any synchronization job.

-l
    Lists previous runs and their status.

-r <runid>
    Resumes a previous run.

-v
    Increases the verbosity of the output.

-h
    Displays a help message and exits.

DESCRIPTION

fpsync is a command-line utility for synchronizing directory trees in parallel, aimed at large file systems containing millions of files. It is a shell script from the fpart project that combines fpart, which crawls and partitions the source tree, with a copy tool such as rsync. Unlike a single rsync run, which walks the whole tree in one process, fpsync starts transfer jobs while the crawl is still in progress and runs several of them concurrently.

Instead of generating a single, potentially huge, file list, fpsync has fpart split the files into smaller partitions, each bounded by a maximum number of files (-f) and a maximum size (-s). Each partition is handed to a synchronization job, and up to -n jobs run in parallel, locally or on remote workers over SSH, using rsync (the default), cpio, pax, or tar.

This parallel approach can sharply reduce synchronization time for trees with many small files, which makes fpsync well suited to large data migrations and incremental backups. The actual gain depends on how well the source and destination storage handle concurrent I/O.

CAVEATS

fpsync relies on fpart and on the selected copy tool (rsync, cpio, pax, or tar), which must be installed on the system. Remote workers additionally require SSH access.
Parallelism can be resource-intensive, potentially consuming significant CPU, memory, and I/O. Tune the number of concurrent jobs (-n) based on system resources.
The temporary working directory (specified by -t, defaulting to /tmp/fpsync) can accumulate large partition lists and logs, especially with huge datasets. Ensure sufficient disk space.
Because each job only sees its own partition, features that need a view of the whole tree behave differently than in a single rsync run. In particular, hard links between files that land in different partitions cannot be preserved.
The -E option enables rsync's --delete and is therefore destructive: files in the destination that no longer exist in the source are removed. Check the source and destination paths carefully before using it.
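As a rough starting point for the tuning advice above, the sketch below derives a job count from the number of online CPUs and reports free space under /tmp. The one-job-per-CPU rule and the cap of 8 are arbitrary assumptions made for illustration, not fpsync recommendations; storage throughput is often the real limit.

```shell
# Hypothetical heuristic (not an fpsync recommendation):
# one job per online CPU, capped at an arbitrary 8.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cpus=2
case $cpus in
    ''|*[!0-9]*) cpus=2 ;;   # fall back if the value is missing or not numeric
esac

jobs=$cpus
if [ "$jobs" -gt 8 ]; then
    jobs=8
fi
if [ "$jobs" -lt 1 ]; then
    jobs=1
fi

# Free space, in KiB, on the file system that holds /tmp.
free_kib=$(df -Pk /tmp | awk 'NR == 2 { print $4 }')

echo "suggested -n value: $jobs"
echo "free space under /tmp: ${free_kib} KiB"
```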

HOW FPSYNC WORKS

fpsync operates in two overlapping phases:
1. File Listing and Partitioning: fpart crawls the source directory recursively. Instead of producing one large list, it splits the files into smaller partitions bounded by a maximum number of files (-f) and a maximum size (-s).
2. Parallel Transfer: fpsync queues one synchronization job per partition and runs up to -n jobs at a time. Each job runs the selected copy tool (rsync by default) on its own partition. Jobs start as soon as the first partitions are ready, so the transfer does not wait for the crawl to finish.
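The partitioning step can be pictured with a toy model. The sketch below is not fpart's actual algorithm: it assumes a hypothetical function named partition that reads lines of the form "size path" and starts a new partition whenever adding the next file would exceed the file-count or byte limit. Paths containing spaces are not handled.

```shell
# Toy model only, not fpart's implementation.
# $1 = max files per partition, $2 = max bytes per partition
# stdin: one "<size-in-bytes> <path>" pair per line (no spaces in paths)
partition() {
    awk -v maxf="$1" -v maxb="$2" '
        BEGIN { part = 0; files = 0; bytes = 0 }
        {
            # Close the current partition if this file would exceed a limit.
            if (files > 0 && (files + 1 > maxf || bytes + $1 > maxb)) {
                part++
                files = 0
                bytes = 0
            }
            files++
            bytes += $1
            print part, $2
        }'
}

# Six files, at most 3 files and 100 bytes per partition.
printf '%s\n' '40 a' '50 b' '30 c' '10 d' '10 e' '200 f' | partition 3 100
# Prints the partition index followed by the path:
# 0 a
# 0 b
# 1 c
# 1 d
# 1 e
# 2 f
```

A file larger than the byte limit, such as f here, still gets a partition of its own in this model.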

CHOOSING A TRANSFER TOOL (-m)

The -m option selects the underlying copy tool:

  • rsync: The default and most feature-rich choice. It preserves permissions and timestamps, handles incremental updates efficiently, and is the only tool that supports -E.
  • cpio: Copies the files in each partition without comparing them against the destination, so it suits an initial copy rather than repeated incremental runs.
  • pax: Plays a role similar to cpio.
  • tar: Streams each partition through tar and extracts it at the destination. Like cpio, it is not incremental and may handle some file attributes differently than rsync.
  • tarify: Writes each partition as a tar archive in the destination instead of extracting the files.

HISTORY

fpsync was developed by Ganaël Laplanche as a key component of the fpart project. Its primary motivation was to overcome the performance limitations of traditional synchronization tools, particularly rsync, when confronted with extremely large numbers of small files. Traditional tools often struggle with the overhead of building a single, monolithic file list, which can become a significant bottleneck.

fpsync was designed to parallelize this process: it partitions the file list and distributes the transfer workload across several concurrent jobs, on one machine or on several machines reached over SSH. This design can considerably shorten the synchronization of very large trees, which is why fpsync is commonly used for large-scale data migrations and backups.

SEE ALSO

fpart(1), rsync(1), cpio(1), pax(1), tar(1), ssh(1)
