parallel

Run jobs in parallel using multiple CPUs

TLDR

Gzip several files at once, using all cores

$ parallel gzip ::: [path/to/file1 path/to/file2 ...]

Read arguments from stdin, run 4 jobs at once

$ ls *.txt | parallel [[-j|--jobs]] 4 gzip

Convert JPEG images to PNG using replacement strings

$ parallel convert {} {.}.png ::: *.jpg

Parallel xargs, cram as many args as possible onto one command

$ [args] | parallel -X [command]

Break stdin into ~1M blocks, feed each block to stdin of new command

$ cat [big_file.txt] | parallel --pipe --block 1M [command]

Run on multiple machines via SSH

$ parallel [[-S|--sshlogin]] [machine1],[machine2] [command] ::: [arg1] [arg2]

Download the links listed in a text file, 4 at a time, showing progress

$ parallel [[-j|--jobs]] 4 --bar --eta wget [[-q|--quiet]] {} :::: [path/to/links.txt]

Print each command to stderr before running it

$ parallel [[-t|--verbose]] [command] ::: [args]

SYNOPSIS

parallel [options] [command [arguments...]] ::: [arguments...]
parallel [options] [command [arguments...]] :::+ [arguments...]
parallel [options] [command [arguments...]] :::: [argfiles...]
parallel [options] < input_file
some_command | parallel [options] [command [arguments...]]
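
The forms above differ in where arguments come from and in how several argument lists are combined. A small sketch, assuming GNU parallel is installed as parallel; the values a, b, 1, 2, x, and y are placeholders, and -k (--keep-order) is added only so the output order is predictable:

```shell
# Two ::: lists: one job is generated per combination of the lists.
combos=$(parallel -k echo '{1}-{2}' ::: a b ::: 1 2)
printf '%s\n' "$combos"       # a-1, a-2, b-1, b-2 (one per line)

# :::+ links the second list to the first, pairing values by position.
linked=$(parallel -k echo '{1}-{2}' ::: a b :::+ 1 2)
printf '%s\n' "$linked"       # a-1, b-2

# No ::: at all: arguments are read from stdin, one per line.
from_stdin=$(printf 'x\ny\n' | parallel -k echo got '{}')
printf '%s\n' "$from_stdin"   # got x, got y
```

{1} and {2} refer to the first and second input source; they are quoted here so the calling shell passes them to parallel untouched.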

PARAMETERS

-j N, --jobs N
    Run up to N jobs in parallel. 0 means run as many jobs as possible. The default is one job per CPU core.

-n N, --max-args N
    Use at most N arguments per command line. Similar to xargs -n.

--pipe
    Split stdin into blocks (see --block) and feed each block to the stdin of a job, instead of passing input lines as command-line arguments.

--pipe-part
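    Like --pipe, but the input is a seekable file given with -a or :::: rather than stdin. The file is split into parts, each part becoming stdin for a job, which is much faster than --pipe.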

--rsh shell, --ssh shell
    Use the given command for remote login in place of the default ssh. The remote hosts themselves are listed with -S (--sshlogin).

--workdir dir
    Change to this directory on the remote or local host before executing the command.

--results dir
    Save the stdout and stderr of each job in files under the specified directory. If the name ends in .csv or .tsv, a single table that also records each job's exit status is written instead.

--colsep regexp
    Split each input line into columns wherever regexp matches; the columns are available as {1}, {2}, and so on. Without --colsep, each input line is a single argument.

--group
    Group the output of each command together, printing it only after the command finishes.

--keep-order
    Output results in the same order as the input arguments, possibly delaying output.

--dry-run
    Print the commands that would be executed without actually running them.

--delay N
    Wait N seconds between starting new jobs. Useful for preventing resource exhaustion.

--eta
    Show the estimated time remaining until all jobs have finished.

--no-notice
    Suppress the initial 'To cite GNU Parallel...' notice. In recent releases the notice is silenced with --will-cite or by running parallel --citation once.

--timeout N
    Terminate a job if it runs longer than N seconds.
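
The effect of several of these options is easiest to see on small inputs. The following is a sketch rather than a reference, assuming GNU parallel is installed as parallel and that seq and awk are available; the file names a.txt and b.txt are hypothetical and need not exist, because --dry-run only prints the commands:

```shell
# --dry-run: print the commands that would be run, without running them.
planned=$(parallel --dry-run -k gzip '{}' ::: a.txt b.txt)
printf '%s\n' "$planned"    # gzip a.txt, gzip b.txt

# -n 2: put at most two arguments on each command line.
pairs=$(parallel -k -n 2 echo ::: a b c d e)
printf '%s\n' "$pairs"      # "a b", "c d", "e"

# --colsep: split every input line into columns {1}, {2}, ...
swapped=$(printf 'a,1\nb,2\n' | parallel -k --colsep ',' echo '{2}-{1}')
printf '%s\n' "$swapped"    # 1-a, 2-b

# --keep-order (-k): the first job finishes last, yet is printed first.
ordered=$(parallel -k 'sleep {}; echo {}' ::: 1 0)
printf '%s\n' "$ordered"    # 1, 0

# --pipe: each job gets a chunk of stdin (here 250 lines via -N) on its own stdin.
counts=$(seq 1 1000 | parallel --pipe -N 250 -k wc -l)
total=$(printf '%s\n' "$counts" | awk '{ s += $1 } END { print s }')
printf '%s\n' "$total"      # the per-job line counts add up to 1000
```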

DESCRIPTION

GNU parallel is a shell tool for executing jobs in parallel on one or more computers. It reads arguments from standard input or from the command line and runs a command once for each argument, so it can replace xargs and shell for loops, often with a large speedup because all available CPU cores are kept busy. Workloads can be spread across local cores or across remote machines reached via SSH. Options control the number of simultaneous jobs, how job output and errors are collected, and the order in which results are printed.

CAVEATS

Quoting and Escaping: Special characters and spaces in arguments often require careful quoting to ensure they are passed correctly to the executed command, especially when using shell interpretation.
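Shell Interpretation: parallel runs each command through a shell, so unquoted variables, globs, pipes, and redirections may be expanded by the calling shell before parallel sees them, or by the job's shell afterwards. The job's shell can be set with $PARALLEL_SHELL, and -q (--quote) makes parallel quote the command so that special characters are passed through literally.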
Resource Consumption: While efficient, running too many parallel jobs without sufficient system resources (CPU, RAM, I/O) can lead to performance degradation or instability. Careful tuning of the --jobs parameter is crucial.
Dependency: GNU parallel is written in Perl, so it needs a Perl interpreter. It is usually packaged separately and may not be installed by default on every Linux distribution.
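
The quoting and shell-interpretation caveats can be illustrated with two small cases. This is a sketch that assumes GNU parallel is installed as parallel and that jobs run under a POSIX-style shell; the arguments 'two words', foo, and bar are placeholders:

```shell
# Input arguments are quoted by parallel before the job's shell sees them,
# so an argument containing a space stays a single word.
one_word=$(parallel -k 'printf "[%s]\n" {}' ::: 'two words')
printf '%s\n' "$one_word"   # [two words]

# Quoted command: the pipe is part of every job, so grep runs once per argument.
per_job=$(parallel -k 'echo {} | grep -c .' ::: foo bar)
printf '%s\n' "$per_job"    # 1, 1

# Unquoted pipe: the calling shell takes it, so a single grep sees all output.
combined=$(parallel -k echo '{}' ::: foo bar | grep -c .)
printf '%s\n' "$combined"   # 2
```

The last two commands differ only in quoting, yet the first counts lines inside each job and the second counts the combined output of all jobs.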

INPUT SOURCES AND ARGUMENT HANDLING

parallel accepts input in several ways: arguments given on the command line after :::, arguments read from standard input (one per line), or arguments read from files named after ::::. Replacement strings in the command are substituted for each argument: {} (the full argument), {.} (the argument without its extension), {/} (basename), {//} (dirname), {/.} (basename without extension), {#} (job sequence number), and {%} (job slot number). Quote the command when it contains shell metacharacters so that they reach parallel instead of being interpreted by the calling shell.
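
The replacement strings can be compared side by side by echoing them for a couple of paths. A minimal sketch, assuming GNU parallel is installed as parallel; the paths dir/a.txt and dir/sub/b.tar.gz are hypothetical and do not need to exist, since echo only prints them:

```shell
# Print each replacement string for two example paths.
# -k keeps the output in input order, so {#} reads 1 and then 2.
expanded=$(parallel -k echo 'full={} noext={.} base={/} dir={//} base_noext={/.} seq={#}' \
  ::: dir/a.txt dir/sub/b.tar.gz)
printf '%s\n' "$expanded"
# full=dir/a.txt noext=dir/a base=a.txt dir=dir base_noext=a seq=1
# full=dir/sub/b.tar.gz noext=dir/sub/b.tar base=b.tar.gz dir=dir/sub base_noext=b.tar seq=2
```

Only the last extension is removed by {.} and {/.}, which is why b.tar.gz becomes b.tar. {%} is left out because the slot number depends on which job slot happens to be free.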

REMOTE EXECUTION

parallel can run jobs on remote hosts over ssh (or another remote login command given with --ssh), with the hosts selected by -S (--sshlogin). This spreads a workload across several machines using ordinary shell commands. It does require that ssh can log in to each host without prompting for a password.

PROGRESS AND ETA

For long-running tasks, parallel can report progress while it works: --bar draws a progress bar, --progress prints job counts per computer, and --eta estimates the time remaining. This makes large batches easier to monitor.

HISTORY

GNU parallel was created by Ole Tange and first released around 2007. It was designed to fill the gap between xargs and full job schedulers by offering a simple way to parallelize shell commands. It has been developed continuously since then and is widely used, from scientific computing to system administration.