srun

Run parallel jobs on allocated cluster nodes

TLDR

Submit a basic interactive job

$ srun --pty /bin/bash

Submit an interactive job with specific resource requests

$ srun --ntasks-per-node [num_tasks] --mem-per-cpu [memory_MB] --pty /bin/bash

Connect to a compute node where one of your jobs is running

$ srun --jobid [job_id] --pty /bin/bash

SYNOPSIS

srun [OPTIONS...] <executable> [<arguments...>]

Common Usage Examples:
srun -N 1 -n 1 --pty bash
srun -n 16 --cpu-bind=cores ./my_parallel_app

PARAMETERS

-N, --nodes=<nodes>
    Specifies the minimum number of nodes required for the job.

-n, --ntasks=<tasks>
    Specifies the number of tasks (processes) to be launched by srun.

-c, --cpus-per-task=<cpus>
    Requests a specific number of CPUs per task. Useful for multi-threaded applications.

-t, --time=<time>
    Sets a time limit for the job or job step. Accepted formats: <minutes>, <minutes>:<seconds>, <hours>:<minutes>:<seconds>, <days>-<hours>, <days>-<hours>:<minutes>, and <days>-<hours>:<minutes>:<seconds>.

--mem=<memory>
    Specifies the maximum amount of real memory required per node. Values default to megabytes; suffixes K, M, G, and T can be used.

-p, --partition=<partition_name>
    Requests a specific partition (queue) for the job.

-A, --account=<account_name>
    Charges the job to the specified account.

-J, --job-name=<name>
    Assigns a descriptive name to the job. This name appears in squeue output.

-o, --output=<file>
    Redirects standard output of the job to the specified file. Special characters like %j (job ID) are supported.

-e, --error=<file>
    Redirects standard error of the job to the specified file.

--pty
    Allocates a pseudo-terminal for the job, allowing interactive sessions (e.g., launching a shell).

--cpu-bind=<type>
    Binds tasks to specific CPUs or cores. Common types include none, cores, sockets, threads.

--exclusive
    Requests exclusive use of allocated nodes, meaning no other jobs will run on them.
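
Many of these options combine on a single command line. A hedged sketch (the partition name and the ./my_app executable are placeholders for site-specific values):

$ srun -N 2 -n 8 -c 4 -t 1-00:00:00 --mem=16G -p [partition] -J test_run -o test_run.%j.out ./my_app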

DESCRIPTION

srun is the primary command in the Slurm Workload Manager for submitting and executing jobs, typically interactively or as a job step within an existing resource allocation. It requests resources from the Slurm controller, launches tasks on the allocated nodes, and manages standard input/output redirection.

Unlike sbatch, which submits scripts for batch execution, srun is ideal for initiating interactive sessions, debugging parallel applications, or running a specific command across multiple allocated compute nodes. It supports a comprehensive set of options to define resource requirements (e.g., number of tasks, CPUs per task, memory, time limit), job properties, and task distribution, making it an indispensable tool for High-Performance Computing (HPC) users.
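
For example, a single command can be launched across a fresh allocation directly from the shell; this sketch runs eight tasks over two nodes, each printing the node it landed on:

$ srun -N 2 -n 8 hostname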

CAVEATS

srun typically blocks and waits for resources to become available, unlike sbatch which returns immediately.
It is commonly used within an existing allocation obtained via salloc or from within an sbatch script to launch job steps.
Users must have appropriate permissions and the Slurm environment must be properly configured on the cluster to use srun.

JOB STEPS AND ALLOCATIONS

While srun can initiate a new job allocation, it is also frequently used to launch "job steps" within a pre-existing resource allocation. An allocation can be obtained via salloc (for interactive use) or sbatch (for batch scripts). Running multiple srun commands within a single allocation allows for efficient execution of different phases of a computation or multiple distinct executables without incurring the overhead of new resource requests for each step.
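
A minimal sketch of this pattern inside a batch script (both executables are placeholders; the same steps could equally be run by hand after salloc):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 8
#SBATCH -t 01:00:00

# Each srun below launches a job step inside the allocation granted
# by sbatch; no new resource request is made for either step.
srun -n 8 ./preprocess    # step 0: runs across all 8 tasks
srun -n 1 ./summarize     # step 1: single-task follow-up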

INTERACTIVE DEBUGGING

One of srun's most powerful features is its ability to launch interactive sessions. Using options like --pty bash, users can obtain a shell prompt directly on an allocated compute node, enabling real-time debugging, environment inspection, and interactive execution of commands and scripts within the cluster's high-performance environment. This is crucial for developing and troubleshooting complex parallel applications.
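
A typical session might look like the following sketch (the resource values and ./my_app are placeholders; debugger availability depends on the site):

$ srun -n 1 -c 4 --mem=8G -t 01:00:00 --pty /bin/bash
$ gdb ./my_app    # now running directly on the compute node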

HISTORY

srun is a core component of the Slurm Workload Manager, which was originally developed at Lawrence Livermore National Laboratory (LLNL) starting around 2002-2003. It emerged as an open-source, scalable, and fault-tolerant alternative to other job schedulers, designed to efficiently manage resources on large-scale High-Performance Computing (HPC) clusters. srun has been central to Slurm's functionality from its inception, providing the essential capability to launch and manage parallel applications and interactive sessions on allocated compute resources.

SEE ALSO

sbatch(1), salloc(1), scancel(1), squeue(1), sinfo(1), sacct(1)
