LinuxCommandLibrary

fpart

Sort file trees and pack files into partitions (file lists) for parallel processing

SYNOPSIS

fpart [OPTIONS] -n NUM | -f FILES | -s SIZE [-i INFILE] [-o OUTFILE] [FILE_OR_DIR...]
fpart [OPTIONS] -L -f FILES | -s SIZE -o OUTFILE [-w COMMAND] [-W COMMAND] [FILE_OR_DIR...]

PARAMETERS

-n NUM
    Create exactly NUM partitions and spread file entries evenly across them. One of -n, -f, or -s is mandatory.

-f FILES
    Limit each partition to at most FILES file entries. Can be combined with -s; whichever limit is reached first closes the current partition.

-s SIZE
    Limit each partition to at most SIZE bytes. SIZE accepts multiplier suffixes (k, m, g, t).

-i INFILE
    Read the file list from INFILE instead of crawling the paths given as arguments ('-' reads from standard input). One path per line.

-a
    Input already contains size/path pairs; pack them as-is instead of crawling the file system.

-o OUTFILE
    Write each partition to a separate file named after the template OUTFILE: 'OUTFILE.0', 'OUTFILE.1', etc. Without -o, file lists are printed to standard output.

-0
    End output file names with a null character instead of a newline (useful with xargs -0).

-e
    Add a trailing slash to directory names in the output.

-b
    Do not cross file system boundaries while crawling.

-l
    Follow symbolic links.

-x PATTERN
    Exclude files and directories matching PATTERN from the crawl.

-L
    Live mode: write partitions while the crawl is still running, instead of after it has finished. Requires -f or -s (the total is unknown in advance, so -n cannot be used) and enables the -w and -W hooks.

-w COMMAND
    In live mode, execute COMMAND through the shell before each partition is started (pre-partition hook).

-W COMMAND
    In live mode, execute COMMAND through the shell after each partition has been written (post-partition hook). fpart exports FPART_PARTFILENAME, FPART_PARTNUMBER, and related variables to the hook's environment.

-v
    Verbose mode.

-h
    Display a help message and exit.

-V
    Output version information and exit.

DESCRIPTION

fpart is a utility that crawls directory trees and packs the files it finds into partitions: file lists of roughly even weight, built with a bin-packing heuristic. It is designed to stay fast and frugal even on trees holding millions of files or terabytes of data, where running a single command over the whole tree would be impractical. Partitions can be bounded by a fixed number of lists (-n), a maximum number of file entries per list (-f), or a maximum total size per list (-s).

The primary use case for fpart is preparing data for parallel processing by tools such as tar, rsync, cpio, cp, or mv: each partition is handed to a separate worker, spreading the load over multiple CPU cores and I/O channels. Lists can be printed to standard output, written to files with -o, or, in live mode (-L), produced while the crawl is still running, with the -w and -W hooks triggering a command for each completed partition. The companion wrapper fpsync builds on live mode to orchestrate parallel synchronization jobs, which makes the pair well suited to large-scale backups, migrations, and archival runs.

CAVEATS

fpart only builds lists; it never copies, archives, or deletes anything itself. Destination handling, overwrite behavior, and error handling therefore belong to the commands that consume the partitions, including the -w/-W hooks, which are passed to the shell and should be quoted carefully. Partition weights are computed from the file sizes observed at crawl time, so trees that change while being crawled can yield unbalanced or stale lists. Finally, while fpart itself is light on resources, feeding many partitions to parallel workers can saturate disk bandwidth.
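Because error handling for launched commands rests with the caller, it pays to make -w/-W hook scripts defensive. Below is a sketch of a post-partition hook for fpart's live mode (script name and layout hypothetical), exercised here without fpart itself by exporting the variables fpart sets for hooks:

```shell
#!/bin/sh
# A defensive post-partition hook for fpart -W (sketch). fpart exports
# FPART_PARTFILENAME, FPART_PARTNUMBER (among others) to -w/-W hooks.
set -eu
tmp=$(mktemp -d)

cat > "$tmp/hook.sh" <<'EOF'
#!/bin/sh
set -eu
# Refuse to run outside fpart rather than failing later with odd errors.
: "${FPART_PARTFILENAME:?must be invoked by fpart live mode}"
echo "partition ${FPART_PARTNUMBER}: $(wc -l < "$FPART_PARTFILENAME") files"
EOF
chmod +x "$tmp/hook.sh"

# Simulate fpart finishing partition 0 (fpart not required for this demo).
printf '/data/a\n/data/b\n' > "$tmp/part.0"
FPART_PARTFILENAME="$tmp/part.0" FPART_PARTNUMBER=0 "$tmp/hook.sh"
```

In a real run, fpart would invoke the script via -W "$PWD/hook.sh" and the hook would typically hand the partition file to rsync or tar instead of counting lines.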

EXIT STATUS

0: Success.
>0: An error occurred (e.g., invalid options or arguments, or an unreadable path).

EXAMPLES

1. Split /mnt/data into 10 partitions, then archive each list with tar, 10 jobs in parallel:
fpart -n 10 -o /tmp/part /mnt/data
ls /tmp/part.* | xargs -P 10 -I {} tar -czf {}.tar.gz --files-from={}

2. Create file lists of up to 10 GB each, written to /tmp/part.0, /tmp/part.1, etc.:
fpart -s 10g -o /tmp/part /var/log /home

3. Live mode: synchronize each partition with rsync as soon as it is complete (paths in the lists are absolute, hence the / source argument):
fpart -L -f 50000 -o /tmp/part -W 'rsync -a --files-from="$FPART_PARTFILENAME" / /data/dst/' /data/src

4. Partition a pre-computed list of old files produced by find:
find /path/to/clean -type f -mtime +30 | fpart -f 10000 -i - -o /tmp/part

5. Let the companion tool fpsync drive parallel synchronization jobs directly:
fpsync -n 4 /data/src/ /data/dst/
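Partition files are plain newline-separated path lists, so any --files-from-aware consumer can read them directly. A self-contained sketch of the consuming side (file names hypothetical; fpart itself is not needed here):

```shell
#!/bin/sh
# Consume a partition file with GNU tar, as in example 1 above.
set -eu
tmp=$(mktemp -d); cd "$tmp"
mkdir -p data
printf 'hello\n' > data/a.txt
printf 'world\n' > data/b.txt

# A partition file is just a newline-separated list of paths,
# exactly what fpart writes via -o.
printf '%s\n' data/a.txt data/b.txt > part.0

tar -czf part.0.tar.gz --files-from=part.0
tar -tzf part.0.tar.gz   # lists data/a.txt and data/b.txt
```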

HISTORY

fpart was written by Ganaël Laplanche and is distributed as open-source software under a BSD-style license. It grew out of the need to migrate very large file trees in reasonable time, where a single tar or rsync process fed by traditional tools like find leaves most of a machine's CPU cores and I/O bandwidth idle. Its focus on fast crawling and balanced partitions, together with the bundled fpsync wrapper, has made it a staple for system administrators and data engineers handling large-scale storage challenges.

SEE ALSO

fpsync(1), find(1), xargs(1), tar(1), rsync(1), cpio(1), cp(1), mv(1)
