
nvidia-smi

Monitor NVIDIA GPU status and usage

TLDR

Display information on all available GPUs and processes using them
$ nvidia-smi

Display more detailed GPU information
$ nvidia-smi --query

Monitor overall GPU usage with 1-second update interval
$ nvidia-smi dmon

SYNOPSIS

nvidia-smi [OPTIONS]

PARAMETERS

-h, --help
    Displays comprehensive help information about command usage and available options.

-L, --list-gpus
    Lists all detected NVIDIA GPUs in the system, along with their unique identifiers (UUIDs).

-q, --query
    Queries and displays detailed GPU attributes. Often combined with --display to restrict the report to specific sections, or with -x, --xml-format to produce XML output.
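    For example, a query limited to the memory and utilization sections of a single GPU (the index 0 is illustrative) might look like:
    $ nvidia-smi -q -i 0 -d MEMORY,UTILIZATION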

-l, --loop=
    Continuously displays GPU information, refreshing the output at the specified interval (in seconds) until interrupted.
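    For example, to refresh the standard summary every 5 seconds (the interval shown is illustrative):
    $ nvidia-smi -l 5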

-i, --id=
    Selects a specific GPU for operations or queries by its numerical index or UUID.
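    For example, to restrict the standard summary to the first GPU (the index 0 is illustrative):
    $ nvidia-smi -i 0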

-a, --all
    Displays all available information for all GPUs in the system. Deprecated in current drivers; use -q instead.

--format=
    Specifies the output format for the scripted query options (--query-gpu= and related). Currently only csv is supported, optionally combined with the noheader and nounits modifiers.

-d, --display=
    Selects which sections of the -q report to display, e.g. MEMORY, UTILIZATION, TEMPERATURE, POWER or CLOCK. Individual fields such as utilization.gpu or memory.used are instead selected with --query-gpu= (see the example below).
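    For example, selected per-GPU fields can be emitted as CSV for scripting (valid field names are listed by --help-query-gpu):
    $ nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits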

-r, --gpu-reset
    Attempts to reset the GPU to a healthy state, which can resolve certain errors. The target GPU must not be in use by any process, and the operation requires root privileges.
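    For example, to reset the first GPU once it is idle (the index 0 is illustrative):
    $ sudo nvidia-smi -i 0 --gpu-reset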

-pm, --persistence-mode=<0|1>
    Sets the GPU persistence mode. '1' keeps the GPU driver loaded even when no clients are connected, reducing application startup latency. Requires root privileges to change.
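    For example, to enable persistence mode on all GPUs:
    $ sudo nvidia-smi -pm 1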

-pl, --power-limit=
    Sets the maximum power consumption limit (in watts) for the specified GPU. The value must fall within the minimum and maximum limits supported by the board. Requires root privileges.
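    For example, to cap the first GPU at 200 W (the index and wattage are illustrative; supported limits are reported by nvidia-smi -q -d POWER):
    $ sudo nvidia-smi -i 0 -pl 200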

--driver-version
    Displays the installed NVIDIA display driver version.

--cuda-version
    Displays the CUDA driver version supported by the current NVIDIA display driver.

DESCRIPTION

nvidia-smi (NVIDIA System Management Interface) is a powerful command-line utility provided by NVIDIA for monitoring and managing NVIDIA GPUs. It offers real-time insights into various GPU metrics, including utilization percentages, memory usage, temperature, power consumption, and clock speeds. Furthermore, it allows users to inspect processes running on the GPU, set various operational parameters like power limits, and manage GPU persistence modes. This tool is indispensable for system administrators, deep learning practitioners, and anyone working with NVIDIA hardware, providing critical information for performance optimization, troubleshooting, and resource management within Linux environments.
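
For example, the compute processes currently using a GPU can be listed in a script-friendly form (the full set of queryable fields is documented by --help-query-compute-apps):
$ nvidia-smi --query-compute-apps=pid,process_name --format=csv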

CAVEATS

nvidia-smi requires an installed NVIDIA graphics driver and a supported NVIDIA GPU. Many management operations, such as GPU resets or setting power limits, require root (superuser) privileges. The availability of certain features and metrics may vary across different NVIDIA driver versions and GPU architectures.

UNDERLYING TECHNOLOGY

nvidia-smi serves as a command-line interface to the NVIDIA Management Library (NVML). NVML is a C-based API that provides a robust and comprehensive set of functions for programmatically managing and monitoring NVIDIA GPUs. All data reported and actions performed by nvidia-smi are facilitated through NVML.

TYPICAL USE CASES

Commonly used by data scientists and machine learning engineers to observe GPU utilization during model training and inference. System administrators employ it for resource allocation, troubleshooting, and ensuring optimal performance in GPU-accelerated server farms. Enthusiasts and gamers use it to monitor GPU health, temperature, and performance metrics.

HISTORY

Developed by NVIDIA Corporation, nvidia-smi has been an integral part of the NVIDIA driver package for many years. It evolved from a basic monitoring tool into a comprehensive management interface, continually updated to support new GPU features, architectures, and the demands of high-performance computing and AI workloads. Its ubiquitous presence makes it the de facto standard for GPU monitoring and management on Linux systems with NVIDIA hardware.

SEE ALSO

lspci(8)
