LinuxCommandLibrary

dvc

version control system for machine learning projects

TLDR

Initialize DVC repository

$ dvc init
Track data file
$ dvc add [data/dataset.csv]
Push data to remote
$ dvc push
Pull data from remote
$ dvc pull
Run pipeline
$ dvc repro
Show pipeline DAG
$ dvc dag
Configure remote storage
$ dvc remote add -d [myremote] [s3://bucket/path]

SYNOPSIS

dvc command [options]

DESCRIPTION

DVC (Data Version Control) is a version control system for machine learning projects. It tracks large files, datasets, and models alongside Git, without storing them in the Git repository.
DVC stores file metadata (.dvc files) in Git while the actual data goes to configurable remote storage (S3, GCS, Azure, SSH, etc.). This enables versioning large files and sharing datasets across teams.
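As a rough sketch of how the metadata/data split plays out in practice, the helper below versions one file: the data itself goes to the DVC cache and remote, while the small .dvc pointer file and the updated .gitignore go into Git. The function name track_dataset and the commit message are illustrative assumptions, and the sketch assumes a default remote has already been configured.

```shell
# Hypothetical helper (not part of DVC): version one data file with DVC and Git.
# Assumes it runs inside an initialized DVC + Git repository with a default remote.
track_dataset() {
    data_file=$1

    # Moves the file into the DVC cache and writes the pointer file <file>.dvc,
    # adding the data path to .gitignore in the same directory.
    dvc add "$data_file" || return 1

    # Only the small pointer file and the .gitignore entry are committed to Git.
    git add "$data_file.dvc" "$(dirname "$data_file")/.gitignore" || return 1
    git commit -m "Track $data_file with DVC" || return 1

    # Upload the actual data to the configured remote storage.
    dvc push
}
```

Usage would look like track_dataset data/dataset.csv. A teammate who then clones the Git repository can fetch the matching data with dvc pull.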
The pipeline feature defines reproducible ML workflows, tracking dependencies and outputs for experiment management.
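To make the dependency and output tracking concrete, here is a minimal sketch of a one-stage pipeline. The stage name train, the script train.py, and the output model.pkl are hypothetical placeholders, not names from this page; the sketch assumes a DVC version that provides dvc stage add (2.0 or later).

```shell
# Hypothetical stage: train.py reads data/dataset.csv and writes model.pkl.
# dvc stage add records the stage in dvc.yaml without running it.
define_train_stage() {
    dvc stage add -n train \
        -d train.py -d data/dataset.csv \
        -o model.pkl \
        python train.py
}

# Re-run the stages whose dependencies changed (results are recorded in
# dvc.lock), then print the stage graph.
rerun_pipeline() {
    dvc repro && dvc dag
}
```

After define_train_stage, committing dvc.yaml and dvc.lock to Git is what lets others reproduce the same run with dvc repro.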

PARAMETERS

COMMAND

DVC operation to perform.
init
Initialize DVC in repository.
add FILE
Track file or directory.
push
Upload tracked data to remote.
pull
Download tracked data from remote.
repro
Reproduce the pipeline defined in dvc.yaml, re-running only the stages whose dependencies changed.
remote add NAME URL
Add remote storage.
--help
Display help information.

CONFIGURATION

.dvc/config

Repository-level DVC configuration including remote storage settings.
~/.config/dvc/config
Global user configuration for DVC defaults and preferences.
.dvc/config.local
Local repository config for machine-specific settings not committed to Git.
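The three config files map onto command-line flags, which the following sketch tries to illustrate. The helper name configure_remote and its arguments are assumptions for illustration; profile is used here as an example of an S3 remote option that usually differs per machine, and core.analytics as an example of a user-wide preference.

```shell
# Hypothetical helper (not part of DVC): write one setting at each config level.
configure_remote() {
    name=$1
    url=$2
    profile=$3

    # Shared setting: stored in .dvc/config, which is committed to Git.
    dvc remote add -d "$name" "$url" || return 1

    # Machine-specific setting: --local stores it in .dvc/config.local,
    # which stays out of Git.
    dvc remote modify --local "$name" profile "$profile" || return 1

    # User-wide preference: --global stores it in ~/.config/dvc/config.
    dvc config --global core.analytics false
}
```

A call such as configure_remote myremote s3://bucket/path default would leave the remote URL visible to the whole team while keeping the credentials profile on the local machine only.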

CAVEATS

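DVC is normally used inside a Git repository; dvc init --no-scm initializes it without one. Large data transfers depend on network speed, and remote storage may incur costs. Reproducing a pipeline requires an environment that matches the one it was built in.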
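One way to handle the Git requirement in a setup script is sketched below. The function name init_dvc is a hypothetical choice, and the sketch assumes it is run from the top-level directory of the project.

```shell
# Hypothetical helper (not part of DVC): initialize DVC with or without Git.
# Assumes the current directory is the project root.
init_dvc() {
    if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
        # Inside a Git work tree: regular initialization.
        dvc init
    else
        # No Git repository: --no-scm skips the Git integration.
        dvc init --no-scm
    fi
}
```

Without Git, DVC still tracks and transfers data, but the history of the .dvc files has to be managed some other way.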

HISTORY

DVC was created by iterative.ai and released in 2017. It addresses the challenge of versioning large datasets and ML models that don't fit well in Git, enabling reproducible machine learning workflows.

SEE ALSO

git(1), mlflow(1)

> TERMINAL_GEAR

Curated for the Linux community
