slurmstepd

Manage and run the tasks of a single Slurm job step

TLDR

Start the daemon (normally done by slurmd, not by hand)

$ slurmstepd

SYNOPSIS

slurmstepd [-D] [-J job_id] [-s step_id] [-c command] [-N node_name] [-W work_dir] [...]

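Note: slurmstepd is an internal Slurm daemon that slurmd starts on the user's behalf; it is not intended for direct invocation. Its parameters exist for internal Slurm communication and may not match every Slurm release, so check the documentation shipped with your installation before relying on them.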

PARAMETERS

-J job_id
    Specifies the Slurm job ID that this step belongs to.

-s step_id
    Indicates the unique job step ID within the specified job.

-c command
    The command line or executable path that slurmstepd is to execute for this job step.

-N node_name
    The name of the compute node where this slurmstepd instance is running.

-W work_dir
    Sets the working directory for the job step processes.

-u user
    The user ID under which the job step processes should be run.

-D
    Detaches slurmstepd from the controlling terminal, running it as a daemon in the background.

-V
    Prints the version number of slurmstepd and exits.

DESCRIPTION

slurmstepd is the component of the Slurm Workload Manager that executes and manages an individual job step on a compute node. Once a job has been granted an allocation, the slurmd daemon launches one slurmstepd for each job step on the node, typically in response to srun or to the start of a batch script. Its main responsibilities are launching the user tasks (e.g., the processes of an MPI program), enforcing resource limits (such as CPU, memory, and I/O), handling standard input/output redirection, and tracking the execution status of the step. It sits between the Slurm daemons and the user processes, so that each job step stays within its allocated resources and the site's policies. When the step completes or fails, slurmstepd cleans up and reports the final status back to the Slurm daemons.

CAVEATS

slurmstepd is an internal component of the Slurm Workload Manager and is not designed for direct user interaction or manual execution. Running it by hand, or interfering with a running instance, can cause job failures or other unpredictable behavior. It depends on the overall Slurm configuration (slurm.conf) and on the parameters passed to it by other Slurm daemons and client commands.

INTERNAL OPERATION

slurmstepd is typically spawned by the slurmd daemon on a compute node in response to a request from srun or the start of a batch job. It then executes the specified user command within the allocated cgroups or process groups, applies the resource limits, and tracks progress. It reports the step's status, resource usage, and completion back to slurmd.
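One way to observe this on a compute node is to list the running slurmstepd processes. The sketch below assumes the process title has the form "slurmstepd: [<job_id>.<step_id>]", which is how it commonly appears in ps output but may differ between Slurm versions. The function names parse_stepd_titles and list_step_daemons are hypothetical helpers, not Slurm commands.

```shell
#!/bin/sh
# Read "pid args" lines on stdin and print "pid job_id step_id"
# for every line that looks like a slurmstepd process.
# Assumption: the title looks like "slurmstepd: [<job_id>.<step_id>]".
parse_stepd_titles() {
    sed -n 's/^ *\([0-9][0-9]*\) .*slurmstepd: \[\([0-9][0-9]*\)\.\([A-Za-z0-9_+]*\)].*$/\1 \2 \3/p'
}

# Hypothetical wrapper: list the step daemons on the local node.
# Separate -o options are used because "pid=,args=" is parsed
# differently by different ps implementations.
list_step_daemons() {
    ps -e -o pid= -o args= | parse_stepd_titles
}
```

Calling list_step_daemons on a node with an active job should print one line per step daemon; on a node without Slurm jobs it prints nothing.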

LOGGING AND DEBUGGING

Information and errors from slurmstepd are generally written to the slurmd log file (configured via SlurmdLogFile in slurm.conf). When debugging job execution issues, raising the SlurmdDebug level or enabling the relevant DebugFlags in slurm.conf produces more verbose slurmstepd output in that log.
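As a rough illustration of reading that log, the sketch below extracts the slurmd log path from the output of scontrol show config and filters the lines that belong to one job. It assumes the setting is reported as "SlurmdLogFile = <path>" and that step-related log lines carry a "[<job_id>.<step_id>]" tag; both helper names are hypothetical, and 1234 is a placeholder job ID.

```shell
#!/bin/sh
# Extract the slurmd log path from `scontrol show config` output on stdin.
# Assumption: the setting appears as "SlurmdLogFile = <path>".
parse_slurmd_log_path() {
    awk -F' *= *' '$1 == "SlurmdLogFile" { print $2 }'
}

# Keep only the log lines tagged with the given job ID (log on stdin).
# Assumption: step lines contain a "[<job_id>.<step_id>]" tag.
filter_job_log_lines() {
    grep -F "[$1."
}

# Example use on a compute node (not run here):
#   log=$(scontrol show config | parse_slurmd_log_path)
#   filter_job_log_lines 1234 < "$log"
```

Both helpers read from standard input, so they can be tried on a saved copy of the configuration dump or log without access to a live cluster.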

HISTORY

slurmstepd has long been a fundamental part of the Slurm Workload Manager's architecture. As Slurm evolved from its roots at Lawrence Livermore National Laboratory into a widely adopted open-source scheduler, slurmstepd remained the dedicated executor for job steps on compute nodes. Its development has followed Slurm's own, adding support for features such as advanced resource allocation, task plugins, process tracking, and integration with container technologies, while keeping its core function of executing job steps within their allocated limits.

SEE ALSO

srun(1), sbatch(1), salloc(1), slurmctld(8), slurmd(8), slurm.conf(5)
