dcgmi
Monitor NVIDIA Data Center GPUs
TLDR
Display information on all available GPUs and processes using them:
dcgmi discovery -l
List created groups:
dcgmi group -l
Display current usage statistics for device 0:
dcgmi dmon -i 0
Display help:
dcgmi -h
Display help for a subcommand:
dcgmi <subcommand> -h
SYNOPSIS
dcgmi <subcommand> [<subcommand_options>]
Example: dcgmi dmon -e 100,101 -d 500
Example: dcgmi health -g 1 -c
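In the dmon example above, -e takes a comma-separated list of DCGM field IDs and -d sets the sampling interval in milliseconds. As an illustrative sketch (verify field IDs on your installation with dcgmi dmon -l; 150 and 155 are commonly GPU temperature and power usage), the available fields can be listed and then streamed for a fixed number of samples:
Example: dcgmi dmon -l
Example: dcgmi dmon -e 150,155 -d 1000 -c 10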
PARAMETERS
discovery
Discover and list GPUs and their properties.
dmon
Stream device monitoring metrics and statistics.
health
Set health watches and check the health status of GPUs in a group (see the example workflow after this list).
diag
Run diagnostic tests on GPUs to identify issues.
modules
List DCGM modules and manage their denylist (blacklist) status.
policy
Configure and view GPU violation policies and actions (e.g., for ECC errors, XID errors, thermal and power violations).
config
View and set GPU configuration, such as ECC mode, compute mode, power limit, and application clocks.
stats
View process and job statistics for GPUs (requires stats watches to be enabled).
nvlink
Display NVLink link statuses and error counts.
topo
Display GPU and system topology information.
group
Create, manage, and list GPU groups for targeted operations.
profile
List supported profiling metric groups and pause or resume profiling data collection.
introspect
Display introspection information about the DCGM host engine, such as its own memory and CPU usage.
fieldgroup
Create, list, and delete groups of DCGM fields used for monitoring.
version
Display the DCGM library and dcgmi utility versions.
-h, --help
Display help information for the command or a specific subcommand.
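As an illustrative workflow (the group ID printed on creation is assumed to be 2 here; substitute the ID that dcgmi actually reports), a GPU group can be created, health watches enabled on it, and a health check run:
Example: dcgmi group -c mygpus
Example: dcgmi group -g 2 -a 0,1
Example: dcgmi health -g 2 -s a
Example: dcgmi health -g 2 -c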
DESCRIPTION
dcgmi is a command-line interface (CLI) tool that interacts with the NVIDIA Data Center GPU Manager (DCGM) service. It provides robust capabilities for monitoring, diagnostics, and management of NVIDIA GPUs in data center and HPC environments.
Users can leverage dcgmi to retrieve real-time telemetry data (like GPU utilization, memory usage, temperature, power consumption), perform health checks, run diagnostic tests, configure GPU policies, and manage GPU groupings. It is designed for system administrators, cluster managers, and developers who need programmatic access to GPU information and control over large deployments of NVIDIA GPUs, enabling proactive maintenance, performance tuning, and fault detection.
Unlike nvidia-smi, which typically operates on a single node's GPUs, dcgmi is built for scale, supporting a client-server architecture for remote monitoring and management across multiple nodes. It is an essential tool for maintaining the health and efficiency of GPU-accelerated clusters.
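For example, most subcommands accept a --host option for connecting to a remote host engine (which typically listens on TCP port 5555). Assuming an illustrative address of 10.0.0.5, remote discovery might look like:
Example: dcgmi discovery -l --host 10.0.0.5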
CAVEATS
Requires NVIDIA GPUs and the NVIDIA Data Center GPU Manager (DCGM) software package to be installed and the dcgm service running.
Often requires root or sudo privileges to execute commands, especially for configuration changes or health checks.
The dcgmi client communicates with the dcgm service; ensure the service is active and accessible.
The tool's comprehensive set of subcommands and options involves a learning curve; consult the per-subcommand help (dcgmi <subcommand> -h) when in doubt.
DCGM SERVICE
dcgmi is a client-side utility that communicates with the dcgm background service (daemon). The service must be running for dcgmi commands to function correctly. This client-server architecture allows for remote monitoring and management of GPUs across multiple nodes.
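As a sketch (the systemd unit name varies by packaging; nvidia-dcgm is common), the host engine can be started either through systemd or by launching nv-hostengine directly, after which dcgmi commands become usable:
Example: sudo systemctl start nvidia-dcgm
Example: sudo nv-hostengine
Example: dcgmi discovery -l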
PROGRAMMATIC ACCESS
Beyond the CLI, DCGM also offers a C API and Python bindings, allowing developers to integrate GPU monitoring and management capabilities directly into their applications and scripts for automated workflows and custom tooling.
HEALTH CHECKS AND DIAGNOSTICS
One of dcgmi's key strengths is its ability to perform comprehensive health checks and run extensive diagnostic tests (e.g., memory, compute, interconnect tests) to identify potential hardware issues proactively, minimizing downtime in critical GPU-accelerated environments.
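For example, dcgmi diag takes a -r option selecting the test level (1 = short, 2 = medium, 3 = long; newer releases may offer additional levels), so a quick sanity check versus a thorough burn-in might be run as:
Example: dcgmi diag -r 1
Example: dcgmi diag -r 3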
HISTORY
dcgmi is an integral part of the NVIDIA Data Center GPU Manager (DCGM) suite, which was developed by NVIDIA to address the growing needs of managing large-scale GPU deployments in data centers, high-performance computing (HPC) clusters, and AI/ML infrastructure. It was designed to provide a more robust, scalable, and programmatic interface for GPU management compared to individual node tools. Its development has focused on enabling comprehensive monitoring, diagnostic capabilities, and policy enforcement across a fleet of GPUs, facilitating automated operations and ensuring optimal cluster health and performance.
SEE ALSO
nvidia-smi(1), lspci(8), systemctl(1)