dcgmi
NVIDIA data center GPU management interface
TLDR
Show GPU discovery information
dcgmi discovery -l
SYNOPSIS
dcgmi subsystem [options]
DESCRIPTION
dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). It provides monitoring, management, and diagnostic capabilities for NVIDIA GPUs in data center and HPC environments.
The tool enables administrators to monitor GPU health, run diagnostics, track performance metrics, and manage GPU groups for policy enforcement. It integrates with job schedulers and cluster management systems for automated GPU management.
DCGM tracks hundreds of GPU metrics including temperature, power, memory usage, and error counts. The diagnostic subsystem can detect hardware issues before they cause failures, supporting proactive maintenance.
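As a rough illustration of metric tracking, the sketch below samples two metrics with the dmon subsystem. It assumes that DCGM field IDs 150 and 155 correspond to GPU temperature and power usage, and that dmon accepts -e (field IDs), -c (sample count), and -d (delay in milliseconds); check these against `dcgmi dmon --help` on your installation. The `run` helper, the `sample_metrics` function, and the `DRY_RUN` variable are hypothetical names introduced here, not part of DCGM.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host with DCGM installed to run them for real.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around dmon.
# Assumed field IDs: 150 = GPU temperature, 155 = power usage.
sample_metrics() {
    fields="${1:-150,155}"   # comma-separated DCGM field IDs
    count="${2:-5}"          # number of samples to take
    delay_ms="${3:-1000}"    # delay between samples, in milliseconds
    run dcgmi dmon -e "$fields" -c "$count" -d "$delay_ms"
}

sample_metrics
```

Keeping the field list in one argument makes it easy to swap in other metrics once the field IDs have been looked up for the installed DCGM version.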
PARAMETERS
SUBSYSTEM
Management subsystem: discovery, health, diag, dmon, group, topo, etc.
discovery -l
List discovered GPUs.
health -g GROUP
Check health of GPU group.
diag -r LEVEL
Run diagnostics (level 1-4).
dmon
Real-time monitoring dashboard.
group -c NAME
Create named GPU group.
topo -g GROUP
Show interconnect topology.
--help
Display help information.
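The subsystems above are often chained: discover GPUs, create a group, then run health, diagnostic, and topology queries against that group. The sketch below prints such a sequence. It relies on a few flags not listed above, so treat them as assumptions to verify with `--help`: `group -g ID -a 0,1` to add GPUs 0 and 1 to a group, `health -c` to run the check, and `diag -g ID` to target a group. The group ID is printed by `group -c` when the group is created; here it is passed in as a placeholder argument. `run`, `gpu_workflow`, and `DRY_RUN` are hypothetical names.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host with DCGM installed to run them for real.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical workflow. $1 is the group ID (placeholder, normally taken
# from the output of `group -c`), $2 is the group name.
gpu_workflow() {
    group_id="${1:-1}"
    group_name="${2:-demo_group}"

    run dcgmi discovery -l                 # list discovered GPUs
    run dcgmi group -c "$group_name"       # create a named group
    run dcgmi group -g "$group_id" -a 0,1  # assumed flag: add GPUs 0 and 1
    run dcgmi health -g "$group_id" -c     # assumed flag: run the health check
    run dcgmi diag -g "$group_id" -r 1     # level 1 (quick) diagnostic
    run dcgmi topo -g "$group_id"          # interconnect topology for the group
}

gpu_workflow 1 demo_group
```

Starting at diagnostic level 1 keeps the run short; higher levels take longer and are the ones most likely to need idle GPUs.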
CAVEATS
Requires the NVIDIA DCGM host engine (nv-hostengine) to be running on the host being queried. Works only with NVIDIA GPUs that DCGM supports, primarily data center models. Some diagnostics require the GPUs to be idle, and certain operations need elevated privileges.
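Because of these requirements, scripts may want a preflight check before calling dcgmi. The sketch below is one possible approach, assuming that `dcgmi discovery -l` exits with a non-zero status when it cannot reach the DCGM service. The function name `dcgm_preflight` and the `DCGMI_BIN` override are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Hypothetical override so the check can point at a different binary.
DCGMI_BIN="${DCGMI_BIN:-dcgmi}"

# Returns 0 if dcgmi is installed and can list GPUs,
# 1 if the binary is missing, 2 if the listing fails.
dcgm_preflight() {
    if ! command -v "$DCGMI_BIN" >/dev/null 2>&1; then
        echo "preflight: $DCGMI_BIN not found on PATH" >&2
        return 1
    fi
    # Assumption: a failed connection to the DCGM service yields a non-zero exit.
    if ! "$DCGMI_BIN" discovery -l >/dev/null 2>&1; then
        echo "preflight: $DCGMI_BIN could not list GPUs (is the DCGM service running?)" >&2
        return 2
    fi
    return 0
}

if dcgm_preflight; then
    echo "DCGM is reachable"
else
    echo "DCGM preflight failed"
fi
```

Distinct return codes let a calling script tell a missing installation apart from a stopped service.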
HISTORY
DCGM was developed by NVIDIA and released around 2016 for enterprise GPU deployments. dcgmi provides CLI access to DCGM functionality, complementing the API and GUI interfaces for data center GPU fleet management.
SEE ALSO
nvidia-smi(1), nvtop(1), gpustat(1)
