dvc
Manage machine learning experiments and data
TLDR
Execute a DVC subcommand: dvc <subcommand>
Display general help: dvc --help
Display help about a specific subcommand: dvc <subcommand> --help
Display version: dvc --version
SYNOPSIS
dvc [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS] [ARGUMENTS]
Examples of common usage:
dvc init
dvc add data/raw_data.csv
dvc repro
dvc push
dvc pull
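The commands above can be chained into a typical first session. This is a sketch: the dataset path and the S3 remote URL are hypothetical, and `dvc remote add -d` assumes S3-compatible storage is available.

```shell
# Initialize DVC inside an existing Git repository
git init && dvc init

# Start tracking a dataset (path is hypothetical); DVC writes
# data/raw_data.csv.dvc and adds the real file to .gitignore
dvc add data/raw_data.csv
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"

# Configure a default remote (URL is hypothetical) and sync data
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push   # upload cached data to the remote
dvc pull   # download data referenced by .dvc pointer files
```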
PARAMETERS
--help, -h
Show the top-level help message or help for a specific DVC command.
--version
Print the DVC version.
--cd <path>
Change to <path> before executing the command. Useful for running DVC from a script or an arbitrary directory.
--config <path>
Specify a path to a custom DVC configuration file to use instead of the default .dvc/config.
--global
Operate on the global DVC configuration file (typically in the user's home directory).
--local
Operate on the local (project-specific) DVC configuration file within the current DVC repository.
--no-scm
Do not look for an SCM (Source Code Management) repository like Git. Useful for DVC operations outside of a Git repo.
--quiet, -q
Suppress all output from DVC, showing only errors.
--verbose, -v
Be more verbose during command execution, providing detailed information about the process.
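As a sketch of how these global options compose with subcommands (the project path and config key shown are illustrative):

```shell
# Print the installed DVC version
dvc --version

# Run `dvc status` in another directory without changing into it first
dvc --cd ~/projects/my-ml-repo status   # path is hypothetical

# Write a setting to the global config file instead of .dvc/config
dvc config --global core.analytics false

# Suppress normal output; only errors are printed
dvc -q pull
```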
DESCRIPTION
DVC, short for Data Version Control, is an open-source command-line tool designed to make machine learning projects reproducible, shareable, and scalable. It extends traditional Git-based version control to handle large files, datasets, and machine learning models, which Git is not optimized for.
DVC works by creating small .dvc pointer files that are versioned by Git, while the actual large data and model files are stored in a DVC-managed cache and can be synced with various remote storage options (e.g., S3, Google Cloud Storage, Azure Blob Storage, SSH, local storage).
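For illustration, a .dvc pointer file is a small YAML snippet along these lines (the checksum and size values here are made up):

```yaml
# data/raw_data.csv.dvc -- versioned by Git; the real file lives in the DVC cache
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # checksum used to locate the cached copy
  size: 1048576                            # size in bytes (illustrative)
  path: raw_data.csv
```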
Key features include data and model versioning, management of reproducible ML pipelines via a dvc.yaml file, and experiment tracking. It helps in maintaining data provenance, facilitating collaboration among data scientists, and enabling MLOps workflows by ensuring that experiments and models can be easily recreated and deployed.
CAVEATS
DVC is designed to complement Git, not replace it. It relies heavily on Git for versioning the project's codebase and the small .dvc pointer files. Users must be familiar with Git workflows.
Proper configuration of remote storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage) is essential for sharing data and models effectively across teams or different environments.
While powerful, DVC can have a learning curve, especially for users new to data versioning concepts and distributed storage systems.
INTEGRATION WITH GIT
DVC's fundamental design principle is its deep integration with Git. While Git efficiently versions source code, DVC handles large binary files (datasets, models) by storing them in a dedicated cache and in remote storage. It then places small .dvc files within the Git repository. These .dvc files serve as pointers, containing metadata and checksums of the actual large files. This hybrid approach allows Git to track the version of the data without storing the data itself, enabling efficient branching, merging, and cloning of ML projects with large datasets.
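In practice the division of labor looks like this: Git commits the pointer, DVC moves the bytes. File names below are hypothetical, and `<repo-url>` is a placeholder.

```shell
# DVC caches the large file and writes a small pointer
dvc add models/model.pkl        # creates models/model.pkl.dvc

# Git versions only the pointer and the updated .gitignore
git add models/model.pkl.dvc models/.gitignore
git commit -m "Version trained model via DVC pointer"

# After cloning the repository elsewhere, restore the actual file from the remote
git clone <repo-url> && cd <repo>   # placeholders, not real values
dvc pull
```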
REPRODUCIBLE ML PIPELINES
A core feature of DVC is its ability to define and execute reproducible machine learning pipelines. This is achieved through a dvc.yaml file, which describes the various stages of a data science workflow, such as data preprocessing, feature engineering, model training, and evaluation. DVC tracks the inputs, outputs, and dependencies for each stage. When inputs change, DVC intelligently re-runs only the necessary stages, ensuring that the entire workflow can be reproduced consistently. This promotes transparency, reduces errors, and significantly aids in experiment management and result validation.
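A minimal dvc.yaml sketch with two stages might look like the following; the script names and data paths are hypothetical:

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw_data.csv data/prepared.csv
    deps:
      - prepare.py
      - data/raw_data.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py data/prepared.csv models/model.pkl
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - models/model.pkl
```

Running `dvc repro` executes only stages whose dependencies changed; for example, if only train.py was edited, the prepare stage is skipped and its cached output is reused.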
HISTORY
DVC was created by Iterative.ai and first publicly released in 2017. It emerged from the growing need to apply software engineering best practices, particularly version control and continuous integration/delivery (CI/CD), to data-intensive machine learning projects. Its development has been consistently driven by an active open-source community, focusing on making ML development more reproducible, collaborative, and manageable at scale.
SEE ALSO
git(1): Version control system for source code.
mlflow(1): An open-source platform for managing the end-to-end machine learning lifecycle.
lakefs(1): An open-source project that delivers Git-like branching and versioning for data lakes.
pachctl(1): The command-line tool for Pachyderm, which provides data versioning and data pipelines for ML.