LinuxCommandLibrary

dcgmi

Monitor NVIDIA Data Center GPUs

TLDR

Display information on all available GPUs and processes using them

$ dcgmi discovery [[-l|--list]]

List created groups
$ dcgmi group [[-l|--list]]

Display current usage statistics for device 0
$ dcgmi dmon [[-i|--entity-id]] 0 [[-e|--field-id]] 1001,1002,1003,1004,1005

Display help
$ dcgmi [[-h|--help]]

Display help for a subcommand
$ dcgmi [subcommand] [[-h|--help]]

SYNOPSIS

dcgmi [<global-options>] <group> <command> [<args>]

PARAMETERS

-h, --help
    Show help message and exit

-v, --version
    Print version information

-d, --debug
    Enable debug output

-V NUM, --verbosity=NUM
    Set verbosity level (0-4)

-j, --json
    Output in JSON format

--no-header
    Suppress column headers

-i IDLIST, --id=IDLIST
    Comma-separated GPU IDs (UUIDs, bus-ids, indices); defaults to all GPUs

-g IDLIST, --gpu=IDLIST
    Alias for --id

-G GIDLIST, --group-id=GIDLIST
    Comma-separated DCGM group IDs

--csv
    Output in CSV format

--keys-only
    Output only keys for policies/fields
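
Option support varies by subcommand. For example, the dmon subcommand accepts entity and field selectors together with a polling delay (in milliseconds) and a sample count; the entity IDs (0,1) and field IDs (150 = GPU temperature, 155 = power usage in the DCGM field list) below are illustrative:
$ dcgmi dmon -i 0,1 -e 150,155 -d 1000 -c 5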

DESCRIPTION

dcgmi is the command-line interface to NVIDIA's Data Center GPU Manager (DCGM), a framework for monitoring, managing, and diagnosing NVIDIA GPUs in enterprise and data center environments.

DCGM enables large-scale GPU telemetry across clusters, covering metrics such as utilization, temperature, power draw, memory usage, ECC errors, and PCIe throughput. dcgmi exposes subcommands for GPU discovery, diagnostics, field queries, and management of groups, policies, processes, NVLink, and health watches.
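
For instance, background health watches (here memory, PCIe, and infoROM) can be enabled on a GPU group and then checked on demand; the group ID 1 is illustrative (list real IDs with dcgmi group -l):
$ dcgmi health -g 1 -s mpi
$ dcgmi health -g 1 -c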

Output can be human-readable, CSV, or JSON, which makes dcgmi well suited to scripting and automation. Aimed at multi-GPU, multi-node deployments, it complements nvidia-smi with cluster-aware features such as group management and policy enforcement.
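
For scripted checks, some subcommands take a JSON flag directly; for example, a short diagnostic run with JSON output (flag support can vary by subcommand and DCGM version):
$ dcgmi diag -r 1 -j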

Using dcgmi requires the NVIDIA driver, the DCGM libraries (datacenter-gpu-manager package), and typically root or membership in an NVIDIA-privileged group. It is widely used by HPC sites, AI training clusters, and cloud providers to track GPU health and performance.

CAVEATS

Requires NVIDIA DCGM to be installed (datacenter-gpu-manager package) and root or 'nvidia-dcgm' group access. Not intended for consumer GPUs. Some features require the DCGM host engine (nv-hostengine) to be running. High verbosity or very frequent queries may add measurable overhead.
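
When the nvidia-dcgm service is not used, the host engine can be started manually before running dcgmi commands (defaults such as the listening socket depend on the installation):
$ sudo nv-hostengine
$ dcgmi discovery -l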

KEY GROUPS

config, diag, discovery, dmon, fieldgroup, group, health, modules, nvlink, policy, profile, stats, topo, among others.
Example: dcgmi discovery -l lists GPUs; dcgmi diag -r 3 runs level-3 diagnostics.
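
A typical workflow creates a group, adds GPUs to it, and then targets that group from other subcommands; the group name and the group ID 2 below are illustrative (the create step prints the actual ID to reuse):
$ dcgmi group -c my_gpus
$ dcgmi group -g 2 -a 0,1
$ dcgmi diag -g 2 -r 1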

INSTALLATION

On Ubuntu/RHEL: apt/yum install datacenter-gpu-manager. Start with systemctl start nvidia-dcgm. Verify: dcgmi discovery -l.
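
A minimal setup on a systemd-based distribution, assuming the NVIDIA driver is already present, might look like this (package and service names can vary by DCGM release):
$ sudo apt install datacenter-gpu-manager
$ sudo systemctl --now enable nvidia-dcgm
$ dcgmi discovery -l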

HISTORY

Introduced in 2018 with DCGM 1.0 alongside Volta GPUs. Evolved with Ampere/Ada/Hopper support in CUDA 11+. Tracks NVIDIA driver releases; current versions in DCGM 3.x emphasize AI/HPC observability and fabric management.

SEE ALSO

nvidia-smi(1), dcgm(8), nvtop(1)
