LinuxCommandLibrary

sdiag

Scheduling diagnostic tool for Slurm

TLDR

Show all performance counters related to the execution of slurmctld
$ sdiag [[-a|--all]]

Reset performance counters related to the execution of slurmctld
$ sdiag [[-r|--reset]]

Print the counters as JSON or YAML instead of plain text
$ sdiag [[-a|--all]] [--json|--yaml]

Specify the cluster to send commands to
$ sdiag [[-a|--all]] [[-M|--cluster]] [cluster_name]
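The plain-text report prints most counters as "label: value" lines, so a single value can be pulled out in a script. The following is a minimal sketch under that assumption; the helper name sdiag_counter is hypothetical (it is not part of Slurm), and the label passed to it must match what the installed Slurm version actually prints.

```shell
#!/bin/sh
# sdiag_counter LABEL: read sdiag text output on stdin and print the value
# of the first line whose label (the text before the colon) equals LABEL.
# Hypothetical helper, not part of Slurm.
sdiag_counter() {
    awk -F':' -v key="$1" '
        {
            label = $1
            gsub(/^[ \t]+|[ \t]+$/, "", label)   # trim whitespace around the label
        }
        label == key {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim whitespace around the value
            print value
            exit                                  # first match wins
        }
    '
}

# Query the live controller only if sdiag is installed.
if command -v sdiag >/dev/null 2>&1; then
    sdiag --all | sdiag_counter "Server thread count"
fi
```

Some labels may appear in more than one block of the report; this sketch returns only the first occurrence, so a stricter parser would also need to track which block it is in.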

SYNOPSIS

sdiag [-a] [-i | -t | -T] [-r] [-M cluster_name] [--json | --yaml] [-h] [--usage] [-V]

PARAMETERS

-a, --all
    Reports all available statistics. This is the default mode of operation.

-i, --sort-by-id
    Sorts the Remote Procedure Call (RPC) data by message type ID and user ID.

-t, --sort-by-time
    Sorts the RPC data by total run time.

-T, --sort-by-time2
    Sorts the RPC data by average run time.

-r, --reset
    Resets the scheduler and RPC counters to 0. Only Slurm operators and administrators may use this option.

-M, --cluster=cluster_name
    Sends the request to the named cluster instead of the local one. Only one cluster name may be specified.

--json, --yaml
    Prints the statistics as JSON or YAML instead of plain text.

-h, --help
    Prints a help message and exits.

--usage
    Prints a brief usage summary and exits.

-V, --version
    Displays the version information of the sdiag command and exits.

DESCRIPTION

The sdiag command, part of the Slurm workload manager, reports statistics about the execution of slurmctld, the Slurm controller daemon. The report covers server threads, the agent queue, job counts (submitted, started, completed, canceled, failed), the main scheduling cycle, the backfill scheduler, and Remote Procedure Call (RPC) activity broken down by message type and by user. Its purpose is to show how the controller behaves under load, particularly on high-throughput systems, so that administrators can adjust scheduler configuration parameters or queue policies. Counters accumulate from the last reset, and operators or administrators can reset them on demand with -r.

CAVEATS

sdiag is not a standard Linux command. It is available only where Slurm is installed, and it must be able to reach a running slurmctld. Resetting the counters with -r is restricted to Slurm operators and administrators, and it discards the accumulated statistics for every user of the cluster, not only for the caller. Each invocation is itself a request to the controller, so polling at very short intervals adds load to the daemon being measured. The --json and --yaml options may not be available in older Slurm releases. The per-user RPC statistics show which accounts generate controller traffic, so review the output before sharing it externally.

SHARING DIAGNOSTIC OUTPUT

When reporting a scheduling problem to a site administrator or a support provider, the sdiag report can be redirected to a file and attached to the report. Where available, --json or --yaml produce structured output that is easier for monitoring tools and scripts to consume than the plain-text layout. With -M, the report can be taken from a specific cluster in a multi-cluster setup.

LOCAL DATA ANALYSIS

The report is aimed primarily at local administrators. The main scheduler and backfill blocks list cycle counts, cycle times (in microseconds), and queue lengths, which indicate whether scheduling is keeping up with the workload. The RPC blocks list the number of requests and the time spent on them per message type and per user; -t and -T sort these by total and average run time, which brings the most expensive request types and the heaviest users to the top.
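For the slurmctld counters shown in the TLDR examples, keeping local snapshots makes it possible to compare reports over time or to preserve the values before a reset. The following is a minimal sketch, assuming sdiag is on the PATH; the function name sdiag_snapshot and the file naming scheme are hypothetical and not part of Slurm.

```shell
#!/bin/sh
# sdiag_snapshot DIR: save the current sdiag report to a timestamped file
# under DIR and print the path of that file.
# Hypothetical helper, not part of Slurm.
sdiag_snapshot() {
    dir=$1
    mkdir -p "$dir" || return 1
    out="$dir/sdiag-$(date -u +%Y%m%dT%H%M%SZ).txt"
    # Write the report; remove the file again if sdiag fails.
    if sdiag --all > "$out"; then
        printf '%s\n' "$out"
    else
        rm -f "$out"
        return 1
    fi
}

# Example (commented out): keep a copy of the counters before resetting them.
# sdiag_snapshot "$HOME/sdiag-history" && sdiag --reset
```

The timestamp is taken in UTC so that file names sort chronologically regardless of the local time zone.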

HISTORY

The sdiag command is distributed with Slurm, the open-source workload manager that originated at Lawrence Livermore National Laboratory and is now developed principally by SchedMD. The original implementation was contributed by the Barcelona Supercomputing Center to give administrators visibility into the scheduling behavior of slurmctld. It has since gained RPC statistics, sorting options, and structured output, and it remains a standard tool for tuning scheduler parameters on busy clusters.

SEE ALSO

sinfo(1), squeue(1), scontrol(1), slurm.conf(5)
