
torchrun

TLDR

Run distributed training on a single node with 4 GPUs

$ torchrun --standalone --nproc_per_node=4 [train.py]
Run multi-node training with 2 nodes and 4 GPUs each
$ torchrun --nnodes=2 --nproc_per_node=4 --rdzv_endpoint=[master_ip:29500] [train.py]
Run with specific rendezvous backend
$ torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=[master_ip:29500] [train.py]
Run with fault tolerance allowing restarts
$ torchrun --nnodes=2 --nproc_per_node=4 --max_restarts=3 --rdzv_endpoint=[master_ip:29500] [train.py]
Run single-GPU training (one worker process, with the distributed environment still initialized)
$ torchrun --standalone --nproc_per_node=1 [train.py]

SYNOPSIS

torchrun [--nnodes N] [--nproc_per_node N] [--rdzv_backend backend] [--rdzv_endpoint host:port] [--standalone] script.py [script_args ...]

DESCRIPTION

torchrun is PyTorch's distributed training launcher that replaces the deprecated torch.distributed.launch. It spawns multiple processes across GPUs and nodes, setting up the distributed environment for training neural networks at scale.
The launcher supports various distributed strategies including Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), tensor parallelism, and hybrid approaches. It automatically sets environment variables such as RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT for distributed communication.
For single-node multi-GPU training, use --standalone mode. For multi-node training, all nodes must specify the same rendezvous endpoint where they coordinate. The launcher supports elastic training with dynamic node counts and fault tolerance with automatic restarts when workers fail.
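A minimal training script launched by torchrun might look like the sketch below. It is only an illustration (the model, batch size, and loop are placeholders, not taken from this page): the script reads the LOCAL_RANK environment variable set by the launcher, initializes the default process group, and wraps the model in DistributedDataParallel.

# train.py -- minimal DDP sketch; model and training loop are placeholders
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # init_process_group picks up MASTER_ADDR/MASTER_PORT from the environment
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(10, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                                 # placeholder training loop
        inputs = torch.randn(32, 10, device=local_rank)
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched as in the TLDR examples, e.g. torchrun --standalone --nproc_per_node=4 train.py, each of the four processes runs this script with its own LOCAL_RANK.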

PARAMETERS

--nnodes N or min:max
Number of nodes participating in training. Can be given as a min:max range for elastic training.
--nproc_per_node N
Number of processes to spawn per node. Typically equals the number of GPUs.
--standalone
Single-node mode without external rendezvous. Sets up local rendezvous automatically.
--rdzv_backend backend
Rendezvous backend: c10d (recommended), etcd, etcd-v2, or static.
--rdzv_endpoint host:port
Rendezvous endpoint address. For c10d, the master node's IP and port.
--rdzv_id id
User-defined ID for the rendezvous group. All nodes must use the same ID.
--max_restarts N
Maximum number of worker-group restarts after a failure. Default is 0 (see the checkpoint/resume sketch after this parameter list).
--node_rank N
Rank of this node (for static rendezvous).
--master_addr addr
Master node address (legacy, use --rdzv_endpoint instead).
--master_port port
Master node port (legacy, use --rdzv_endpoint instead).
--local-rank
Local process rank argument that the deprecated torch.distributed.launch injected into training scripts; with torchrun, scripts should read the LOCAL_RANK environment variable instead.
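When --max_restarts is set or --nnodes is an elastic range, the entire worker group may be restarted after a failure, so the training script is expected to checkpoint and resume on its own. A minimal sketch of that pattern is shown below; the checkpoint path, model, and helper names are placeholders chosen for illustration, not part of torchrun.

# Hypothetical checkpoint/resume helpers for restartable workers
import os
import torch

CKPT = "/tmp/checkpoint.pt"          # placeholder path, shared by all restarts on this node

def load_or_init(model, optimizer):
    # Resume from the last checkpoint if a previous run (or restart) wrote one.
    start_epoch = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1
    return start_epoch

def save_checkpoint(model, optimizer, epoch):
    # Only rank 0 writes, so all workers resume from the same state after a restart.
    if int(os.environ.get("RANK", "0")) == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, CKPT)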

CAVEATS

Training scripts must be written to handle distributed setup using torch.distributed. The number of processes per node should not exceed available GPUs for GPU training. All nodes in a distributed job must use compatible PyTorch versions and NCCL configurations. Network firewalls must allow communication on the rendezvous port.
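One way to guard against the GPU-count caveat from inside the script is a defensive check before initializing the process group; this is a sketch of a sanity check the script can perform itself, not something torchrun does for you. The gloo fallback is an assumption for CPU-only runs.

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
if torch.cuda.is_available():
    # Fail fast if --nproc_per_node exceeds the GPUs visible to this node.
    assert local_rank < torch.cuda.device_count(), \
        "nproc_per_node exceeds the number of visible GPUs"
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
else:
    dist.init_process_group(backend="gloo")   # CPU-only fallback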

HISTORY

torchrun was introduced in PyTorch 1.10 (October 2021) as part of the TorchElastic project, replacing the older torch.distributed.launch module. It was designed to provide elastic and fault-tolerant distributed training capabilities. From PyTorch 2.0 (March 2023), the command-line argument style changed from underscores to dashes (--local-rank instead of --local_rank).

SEE ALSO

python -m torch.distributed.run (the module that torchrun invokes), torch.distributed.launch (deprecated predecessor)