dcgmi
Monitor NVIDIA Data Center GPUs
TLDR
Display information on all available GPUs and processes using them
List created groups
Display current usage statistics for device 0
Display help
Display help for a subcommand
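The entries above are listed without their command lines. The sketch below gives one candidate command per entry. It assumes dcgmi is on PATH, a DCGM host engine is running, and that field IDs 150, 155, and 203 correspond to GPU temperature, power usage, and GPU utilization. The function name dcgmi_tldr is hypothetical and exists only to keep anything from running until it is called. Note that discovery -l lists GPUs; it may not show the processes using them.

```shell
# Sketch only: candidate commands for the TLDR entries, in the same order.
# Assumes dcgmi is installed and the DCGM host engine is reachable.
dcgmi_tldr() {
    # Display information on all available GPUs
    dcgmi discovery -l

    # List created groups
    dcgmi group -l

    # Sample device 0 five times
    # (field IDs 150, 155, 203 assumed: temperature, power, utilization)
    dcgmi dmon -i 0 -e 150,155,203 -c 5

    # Display help
    dcgmi -h

    # Display help for a subcommand (diag is used as the example here)
    dcgmi diag -h
}
```

Calling dcgmi_tldr runs the five commands in sequence; in practice each line would be run on its own.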
SYNOPSIS
dcgmi [<global-options>] <group> <command> [<args>]
PARAMETERS
-h, --help
Show help message and exit
-v, --version
Print version information
-d, --debug
Enable debug output
-V NUM, --verbosity=NUM
Set verbosity level (0-4)
-j, --json
Output in JSON format
--no-header
Suppress column headers
-i IDLIST, --id=IDLIST
Comma-separated GPU IDs (UUIDs, bus-ids, indices); defaults to all GPUs
-g IDLIST, --gpu=IDLIST
Alias for --id
-G GIDLIST, --group-id=GIDLIST
Comma-separated DCGM group IDs
--csv
Output in CSV format
--keys-only
Output only keys for policies/fields
DESCRIPTION
The dcgmi command is the primary command-line interface to NVIDIA's Data Center GPU Manager (DCGM), a framework for monitoring, managing, and diagnosing NVIDIA GPUs in enterprise and data center environments.
DCGM collects GPU telemetry at cluster scale, covering metrics such as utilization, temperature, power draw, memory usage, ECC errors, and PCIe bandwidth. dcgmi provides subcommands for GPU discovery, diagnostics, field queries, group and policy management, process statistics, NVLink status, sensor readings, and more.
Output is available in human-readable, CSV, or JSON form for scripting and automation. Aimed at multi-GPU and multi-node setups, dcgmi complements nvidia-smi with cluster-oriented features such as fabric management and policy enforcement.
Using dcgmi requires the NVIDIA driver and the DCGM libraries (from the datacenter-gpu-manager package), and typically root privileges or membership in the nvidia-dcgm group. It is well suited to HPC sites, AI training clusters, and cloud providers that need to track GPU health and performance.
CAVEATS
Requires DCGM to be installed (datacenter-gpu-manager package) and root or 'nvidia-dcgm' group access. Not intended for consumer GPUs. Some features need the DCGM host engine (nv-hostengine) to be running. High verbosity or frequent queries may affect performance.
KEY GROUPS
config, diag, discovery, dmon, fieldgroup, group, health, introspect, modules, nvlink, policy, profile, stats, topo. The exact set depends on the DCGM version; dcgmi -h lists what is available.
Example: dcgmi discovery -l lists GPUs; dcgmi diag -r 3 runs level-3 diagnostics.
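As an illustration of the diag example, the following sketch wraps dcgmi diag -r in a small function that validates the run level and reports the result. The function name run_diag is hypothetical. The accepted range of 1 to 4 and the use of a nonzero exit status to signal a failed or incomplete run are assumptions; check them against the installed DCGM version.

```shell
# Sketch only: run a DCGM diagnostic at a chosen level and summarize the result.
# Assumes levels 1-4 are valid and that dcgmi exits nonzero when a run fails.
run_diag() {
    level="${1:-1}"    # default to the quickest level

    case "$level" in
        1|2|3|4) ;;
        *)
            echo "run_diag: level must be between 1 and 4" >&2
            return 2
            ;;
    esac

    if dcgmi diag -r "$level"; then
        echo "diag level $level: passed"
    else
        echo "diag level $level: reported a failure" >&2
        return 1
    fi
}
```

For example, run_diag 3 would run the level-3 diagnostics mentioned above. Higher levels take longer and place load on the GPUs, so they are usually run on idle nodes.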
INSTALLATION
On Ubuntu/RHEL: apt/yum install datacenter-gpu-manager. Start with systemctl start nvidia-dcgm. Verify: dcgmi discovery -l.
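The verification step can be scripted. The sketch below checks that dcgmi is on PATH and that dcgmi discovery -l succeeds. The function name dcgm_ready is hypothetical, and the sketch assumes that dcgmi exits nonzero when it cannot reach the host engine.

```shell
# Sketch only: post-install check for DCGM.
# Returns 0 if dcgmi is present and can list GPUs, 1 or 2 otherwise.
dcgm_ready() {
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "dcgmi not found on PATH; is datacenter-gpu-manager installed?" >&2
        return 1
    fi

    # Assumption: a failed connection to the host engine yields a nonzero status.
    if ! dcgmi discovery -l >/dev/null 2>&1; then
        echo "dcgmi could not list GPUs; is the nvidia-dcgm service running?" >&2
        return 2
    fi

    echo "DCGM is installed and responding"
}
```

A typical use is dcgm_ready || exit 1 at the top of a monitoring script, so that later dcgmi calls are not attempted against a missing or stopped service.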
HISTORY
Introduced in 2018 with DCGM 1.0, alongside Volta GPUs. Support for Ampere, Ada, and Hopper followed in the CUDA 11+ era. DCGM releases track NVIDIA driver releases; the DCGM 3.x series emphasizes AI/HPC observability and fabric management.
SEE ALSO
nvidia-smi(1), dcgm(8), nvtop(1)


