LinuxCommandLibrary

slurmctld

Central management daemon of the Slurm workload manager

TLDR

Clear all previous slurmctld state from its last checkpoint
$ slurmctld -c

Set the daemon's nice value to the specified value, typically a negative number
$ slurmctld -n [value]

Write log messages to the specified file
$ slurmctld -L [path/to/output_file]

Display help
$ slurmctld -h

Display version
$ slurmctld -V

SYNOPSIS

slurmctld [OPTIONS]

Usually started by a service manager (e.g., systemd) without arguments. Options are mainly used for debugging in the foreground or for controlling how much saved state is recovered at startup.

PARAMETERS

-b
    Not listed among the options in the slurmctld(8) manual page (slurmd(8) has a -b option with a different meaning). Check slurmctld -h on the installed version before relying on it.

-c
    Clears all previous slurmctld state from its last checkpoint. All jobs, both running and pending, and all node states are discarded, so this is rarely appropriate in production.

-D
    Runs slurmctld in the foreground with log messages copied to stdout. Use -v to raise verbosity.

-d
    Runs slurmctld in the background as a daemon. Takes no argument.

-f <file>
    Reads configuration from the specified file instead of the default slurm.conf.

-h
    Displays a brief help message with a list of available command-line options.

-i
    Ignores errors found while reading state files at startup. State information may be lost as a result.

-L <file>
    Specifies an alternative log file path for the daemon's output, overriding the default or slurm.conf setting.

-n <value>
    Sets the daemon's nice value to the specified value, typically a negative number.

-R
    Recovers full state from the last checkpoint: jobs, nodes, and partitions. Without it, slurmctld still preserves jobs, along with the DOWN, DRAINED, and DRAINING state of nodes and their Reason field.

-s
    Changes the daemon's working directory to the directory containing SlurmctldLogFile when possible, falling back to another directory otherwise. To see log output on the terminal, use -D instead.

-v
    Increases the verbosity level of log messages. Can be repeated for higher verbosity (e.g., -vvv).
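
As an illustration of combining the options above for interactive troubleshooting, the following sketch builds a foreground command line with a chosen number of -v flags. The helper name slurmctld_debug_cmd is hypothetical; it only prints the command and does not start the daemon.

```shell
#!/bin/sh
# Hypothetical helper: print a slurmctld command line that stays in the
# foreground (-D) with the requested number of -v flags.
slurmctld_debug_cmd() {
    level="${1:-2}"   # number of -v flags; the default of 2 is an assumption
    flags=""
    i=0
    while [ "$i" -lt "$level" ]; do
        flags="${flags}v"
        i=$((i + 1))
    done
    if [ -n "$flags" ]; then
        printf 'slurmctld -D -%s\n' "$flags"
    else
        printf 'slurmctld -D\n'
    fi
}

# Print the command; run it manually on the controller host, with no other
# slurmctld instance active, to start the daemon in the foreground.
slurmctld_debug_cmd 3
```

Stop the service-managed instance first, since only one slurmctld should be active on a controller at a time.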

DESCRIPTION

The slurmctld daemon is the central management component of the Slurm Workload Manager. It is responsible for monitoring the state of all nodes and partitions within a Slurm cluster, dynamically scheduling jobs based on available resources and policy, and allocating those resources to user tasks. It serves as the primary interface for all user commands, processing job submissions, status queries, and requests for cluster information. The daemon reads its operational configuration from the slurm.conf file and maintains the authoritative state of the entire cluster. As the core of the Slurm system, slurmctld is critical for the proper functioning of any Slurm-managed cluster; without it, no jobs can be scheduled, managed, or executed on the cluster resources.

CAVEATS

Only one slurmctld instance actively manages a cluster at a time, although High Availability (HA) configurations allow backup daemons on standby. The daemon depends on a correct slurm.conf file: misconfiguration can leave jobs unscheduled, nodes marked down, or cause other cluster-wide problems. When troubleshooting, start with the slurmctld log.

CONFIGURATION FILES

slurmctld reads its operational parameters from the slurm.conf file, including node definitions, partitions, security settings, and logging locations. It saves its state to files in the directory named by StateSaveLocation and logs to the file named by SlurmctldLogFile (or to syslog if none is set).
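
To see where a running controller keeps its state and log, the active configuration can be queried with scontrol show config. The sketch below is one way to pull single values out of that output; the helper name conf_value is hypothetical, and it assumes the usual "Key = value" line layout.

```shell
#!/bin/sh
# Hypothetical helper: read "Key = value" lines on stdin and print the value
# of the requested key. Values that themselves contain '=' are truncated at
# the first one, which is acceptable for plain paths.
conf_value() {
    awk -v key="$1" -F' *= *' '$1 == key { print $2; exit }'
}

# Only query the controller when scontrol is installed.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | conf_value StateSaveLocation
    scontrol show config | conf_value SlurmctldLogFile
fi
```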

SIGNAL HANDLING

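slurmctld responds to several signals. SIGHUP reloads the configuration, similar to scontrol reconfigure. SIGTERM and SIGINT initiate a clean shutdown that saves the current state to StateSaveLocation; SIGABRT does the same and also produces a core dump. SIGUSR2 rereads the log level from the configuration and reopens the log file, which makes it suitable for use with logrotate. SIGUSR1 is ignored. To change the debug level without restarting the daemon, use scontrol setdebug.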
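
A minimal sketch of sending a signal to the running daemon through its PID file follows. The function name signal_slurmctld is hypothetical, and the PIDFILE default is an assumption; the real path is whatever SlurmctldPidFile is set to in slurm.conf. The USR2 usage reflects the log-reopen behavior documented in slurmctld(8).

```shell
#!/bin/sh
# Assumed PID file location; override PIDFILE to match SlurmctldPidFile.
PIDFILE="${PIDFILE:-/var/run/slurmctld.pid}"

# Hypothetical helper: send the named signal to the PID stored in PIDFILE.
signal_slurmctld() {
    sig="$1"
    if [ ! -r "$PIDFILE" ]; then
        echo "PID file $PIDFILE not readable" >&2
        return 1
    fi
    pid="$(cat "$PIDFILE")"
    kill -s "$sig" "$pid"
}

# Typical calls, left commented out so the sketch has no side effects:
# signal_slurmctld HUP    # reload the configuration
# signal_slurmctld USR2   # reopen the log file, e.g. after log rotation
```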

HIGH AVAILABILITY (HA)

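For critical environments, slurmctld supports a primary/backup configuration. Backup controllers are listed as additional SlurmctldHost entries in slurm.conf and must share the primary's StateSaveLocation. A backup takes over management of the cluster automatically if the primary slurmctld stops responding, ensuring continuous operation.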
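
Whether the primary and backup controllers are responding can be checked with scontrol ping. The sketch below counts how many controllers report UP; the helper name count_up is hypothetical, and the assumed line format ("Slurmctld(primary) at <host> is UP") may differ between Slurm versions.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that end in " is UP".
# Assumed input format: "Slurmctld(primary) at <host> is UP"
count_up() {
    awk '/ is UP$/ { n++ } END { print n + 0 }'
}

# Only contact the controllers when scontrol is installed.
if command -v scontrol >/dev/null 2>&1; then
    up="$(scontrol ping | count_up)"
    echo "controllers responding: $up"
fi
```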

HISTORY

Slurm (originally an acronym for Simple Linux Utility for Resource Management) was developed at Lawrence Livermore National Laboratory (LLNL) as an open-source workload manager. The slurmctld daemon has been a core component since the project's inception in the early 2000s, evolving to meet the demands of large-scale HPC clusters. Many sites adopted Slurm in place of earlier proprietary and open-source systems such as Moab, Maui, and PBS/Torque, citing its scalability and flexibility.

SEE ALSO

slurm.conf(5), sbatch(1), srun(1), squeue(1), sinfo(1), sacct(1), scontrol(1), sdiag(1), slurmdbd(8), slurmd(8)
