dcgmi
Monitor NVIDIA Data Center GPUs
TLDR
Display information on all available GPUs and processes using them:
dcgmi discovery -l
List created groups:
dcgmi group -l
Display current usage statistics for device 0:
dcgmi dmon -i 0
Display help:
dcgmi -h
Display help for a subcommand:
dcgmi <subcommand> -h
SYNOPSIS
dcgmi <subcommand> [<subcommand_options>]
Example: dcgmi dmon -e 100,101 -d 500
Example: dcgmi health -g 1 -c
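In the dmon example above, -e takes a comma-separated list of DCGM field IDs and -d sets the sampling interval in milliseconds. As an illustrative sketch (verify field IDs on your installation with dcgmi dmon -l; 150 and 155 are commonly GPU temperature and power usage), the available fields can be listed and then streamed for a fixed number of samples:
Example: dcgmi dmon -l
Example: dcgmi dmon -e 150,155 -d 1000 -c 10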
PARAMETERS
discovery
Discover and list GPUs and their properties.
dmon
Stream device monitoring metrics and statistics.
health
Set health watches and check the health status of GPUs in a group (see the example workflow after this list).
diag
Run diagnostic tests on GPUs to identify issues.
modules
List DCGM modules and manage their denylist (blacklist) status.
policy
Configure and view GPU violation policies and actions (e.g., for ECC errors, XID errors, thermal and power violations).
config
View and set GPU configuration, such as ECC mode, compute mode, power limit, and application clocks.
stats
View process and job statistics for GPUs (requires stats watches to be enabled).
nvlink
Display NVLink link statuses and error counts.
topo
Display GPU and system topology information.
group
Create, manage, and list GPU groups for targeted operations.
profile
List supported profiling metric groups and pause or resume profiling data collection.
introspect
Display introspection information about the DCGM host engine, such as its own memory and CPU usage.
fieldgroup
Create, list, and delete groups of DCGM fields used for monitoring.
version
Display the DCGM library and dcgmi utility versions.
-h, --help
Display help information for the command or a specific subcommand.
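As an illustrative workflow (the group ID printed on creation is assumed to be 2 here; substitute the ID that dcgmi actually reports), a GPU group can be created, health watches enabled on it, and a health check run:
Example: dcgmi group -c mygpus
Example: dcgmi group -g 2 -a 0,1
Example: dcgmi health -g 2 -s a
Example: dcgmi health -g 2 -c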
DESCRIPTION
dcgmi is a command-line interface (CLI) tool that interacts with the NVIDIA Data Center GPU Manager (DCGM) service. It provides robust capabilities for monitoring, diagnostics, and management of NVIDIA GPUs in data center and HPC environments.
Users can leverage dcgmi to retrieve real-time telemetry data (like GPU utilization, memory usage, temperature, power consumption), perform health checks, run diagnostic tests, configure GPU policies, and manage GPU groupings. It is designed for system administrators, cluster managers, and developers who need programmatic access to GPU information and control over large deployments of NVIDIA GPUs, enabling proactive maintenance, performance tuning, and fault detection.
Unlike nvidia-smi, which typically operates on a single node's GPUs, dcgmi is built for scale, supporting a client-server architecture for remote monitoring and management across multiple nodes. It is an essential tool for maintaining the health and efficiency of GPU-accelerated clusters.
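For example, most subcommands accept a --host option for connecting to a remote host engine (which typically listens on TCP port 5555). Assuming an illustrative address of 10.0.0.5, remote discovery might look like:
Example: dcgmi discovery -l --host 10.0.0.5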
CAVEATS
Requires NVIDIA GPUs and the NVIDIA Data Center GPU Manager (DCGM) software package to be installed and the dcgm service running.
Often requires root or sudo privileges to execute commands, especially for configuration changes or health checks.
The dcgmi client communicates with the dcgm service; ensure the service is active and accessible.
The tool's comprehensive set of subcommands and options involves a learning curve; consult the per-subcommand help (dcgmi <subcommand> -h) when in doubt.
DCGM SERVICE
dcgmi is a client-side utility that communicates with the dcgm background service (daemon). The service must be running for dcgmi commands to function correctly. This client-server architecture allows for remote monitoring and management of GPUs across multiple nodes.
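As a sketch (the systemd unit name varies by packaging; nvidia-dcgm is common), the host engine can be started either through systemd or by launching nv-hostengine directly, after which dcgmi commands become usable:
Example: sudo systemctl start nvidia-dcgm
Example: sudo nv-hostengine
Example: dcgmi discovery -l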
PROGRAMMATIC ACCESS
Beyond the CLI, DCGM also offers a C API and Python bindings, allowing developers to integrate GPU monitoring and management capabilities directly into their applications and scripts for automated workflows and custom tooling.
HEALTH CHECKS AND DIAGNOSTICS
One of dcgmi's key strengths is its ability to perform comprehensive health checks and run extensive diagnostic tests (e.g., memory, compute, interconnect tests) to identify potential hardware issues proactively, minimizing downtime in critical GPU-accelerated environments.
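For example, dcgmi diag takes a -r option selecting the test level (1 = short, 2 = medium, 3 = long; newer releases may offer additional levels), so a quick sanity check versus a thorough burn-in might be run as:
Example: dcgmi diag -r 1
Example: dcgmi diag -r 3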
HISTORY
dcgmi is an integral part of the NVIDIA Data Center GPU Manager (DCGM) suite, which was developed by NVIDIA to address the growing needs of managing large-scale GPU deployments in data centers, high-performance computing (HPC) clusters, and AI/ML infrastructure. It was designed to provide a more robust, scalable, and programmatic interface for GPU management compared to individual node tools. Its development has focused on enabling comprehensive monitoring, diagnostic capabilities, and policy enforcement across a fleet of GPUs, facilitating automated operations and ensuring optimal cluster health and performance.
SEE ALSO
nvidia-smi(1), lspci(8), systemctl(1)