LinuxCommandLibrary

dvc

Manage machine learning experiments and data

TLDR

Initialize a new DVC project

$ dvc init

Configure a remote storage location
$ dvc remote add [storage_name] [url]

Add one or more data files or directories to tracking
$ dvc add [path/to/file_or_directory]

Show project status
$ dvc status

Upload tracked files to remote storage
$ dvc push

Download tracked files from remote storage
$ dvc pull

Display help
$ dvc [-h|--help]

Display version
$ dvc --version

SYNOPSIS

dvc [-h] [-q | -v] [-V] [--cd <DIR>] <SUBCOMMAND> [<ARGS>]

PARAMETERS

-h, --help
    Show help message and exit

-V, --version
    Show program's version and exit

--cd <DIR>
    Change to directory DIR before command execution

-q, --quiet
    Suppress standard output; success or failure is reported through the exit code

-v, --verbose
    Print detailed tracing information; repeat the flag (-vv) for more detail

DESCRIPTION

DVC (Data Version Control) is an open-source command-line tool that lets data scientists and ML engineers version data, models, and experiments much as Git versions code.

It addresses key challenges in ML workflows: tracking large datasets and models without bloating Git repositories, reproducing pipelines through dependency graphs, and managing experiments efficiently. DVC stores small pointer files in Git and keeps the actual data in a local cache or in remote storage (S3, GCS, Azure, SSH, etc.).
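
To make the pointer idea concrete, the sketch below imitates it with standard tools only: it hashes a file and writes a small text file that records the hash. It does not call DVC. The temporary directory, the sample file, and the simplified pointer layout are assumptions for illustration, and real .dvc files carry more fields.

```shell
# Illustration only: imitate "small pointer in Git, large file elsewhere".
# This does not invoke DVC; the pointer layout is a simplified assumption.
workdir=$(mktemp -d)
printf 'id,label\n1,cat\n2,dog\n' > "$workdir/data.csv"

# Content hash of the data file (DVC likewise identifies data by its hash).
hash=$(md5sum "$workdir/data.csv" | cut -d ' ' -f 1)

# A tiny text file like this is what would be committed to Git
# in place of the data itself.
cat > "$workdir/data.csv.pointer" <<EOF
outs:
- md5: $hash
  path: data.csv
EOF

cat "$workdir/data.csv.pointer"
```

Because the pointer depends only on the file contents, any change to data.csv produces a different hash, which is how a changed dataset becomes visible in Git history without the data being stored there.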

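Core workflow: dvc init sets up the .dvc directory; dvc add data.csv hashes the file and starts tracking it; dvc push uploads it to the remote; dvc stage add -n train -d script.py -o model.pkl -m metrics.json python script.py (dvc run in releases before 3.0) defines a pipeline stage with its dependencies and outputs, where train is an example stage name; dvc repro reruns only the stages whose inputs changed and reuses cached results for the rest.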
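
In recent DVC versions a pipeline stage is recorded in a dvc.yaml file at the project root. The sketch below writes by hand roughly what such a file might contain for the stage described above. The stage name train and the python interpreter in the command are assumptions, and in practice DVC normally generates this file rather than the user typing it.

```shell
# Sketch: hand-write a dvc.yaml resembling the stage described above.
# The stage name "train" and the "python" interpreter are assumptions.
workdir=$(mktemp -d)
cat > "$workdir/dvc.yaml" <<'EOF'
stages:
  train:
    cmd: python script.py
    deps:
      - script.py
    outs:
      - model.pkl
    metrics:
      - metrics.json
EOF

cat "$workdir/dvc.yaml"
# With DVC installed and this file placed in an initialized project,
# `dvc repro` would run the stage, then skip it on later runs
# as long as script.py is unchanged.
```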

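DVC integrates with Git: the small DVC files are committed to Git while the data itself stays outside the repository. It also supports viewing metrics and plots (dvc metrics show, dvc plots show), running experiments (dvc exp), and comparing parameters (dvc params). It is free and open source, and it does not tie a project to a particular storage vendor.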

CAVEATS

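dvc init expects to run inside a Git repository unless --no-scm is given. Install with pip install dvc or a package manager. Pushing or sharing data requires a remote to be configured first (dvc remote add). DVC is a third-party tool, not a core Linux utility.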

COMMON SUBCOMMANDS

dvc init: Initialize DVC repo.
dvc add FILE: Track data/model.
dvc push: Upload cache to remote.
dvc pull: Download from remote.
dvc repro: Reproduce pipelines.
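
These subcommands are usually chained with Git. The function below is one possible way to wrap that sequence; dvc_track_and_push is a hypothetical helper, not part of DVC. It assumes an initialized Git and DVC project with a default remote already configured, and a target path given without a trailing slash.

```shell
# Hypothetical helper (not part of DVC): track one path, commit its
# pointer file to Git, then upload the data to the default remote.
# Assumes an initialized Git + DVC project and no trailing slash on the path.
dvc_track_and_push() {
    target=$1
    dvc add "$target" || return 1
    # dvc add writes <target>.dvc and updates the .gitignore next to the target.
    git add "$target.dvc" "$(dirname "$target")/.gitignore" || return 1
    git commit -m "Track $target with DVC" || return 1
    dvc push
}

# Example call, commented out so that sourcing this snippet has no side effects:
# dvc_track_and_push data/train.csv
```

Stopping at the first failing step keeps Git from recording a pointer to data that was never added or uploaded.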

INSTALLATION

pip install dvc or conda install -c conda-forge dvc. Remote storage backends are optional extras, e.g. pip install "dvc[s3]" (the quotes keep the shell from interpreting the brackets).

HISTORY

Developed by Iterative.ai; first release July 2017 (v0.1). Now at v3.x, with Studio integration. Widely adopted in MLOps communities.

SEE ALSO

git(1)
