dvc
Manage machine learning experiments and data
TLDR
Execute a DVC subcommand: dvc <subcommand>
Display general help: dvc --help
Display help about a specific subcommand: dvc <subcommand> --help
Display version: dvc --version
SYNOPSIS
dvc [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS] [ARGUMENTS]
Examples of common usage:
dvc init
dvc add data/raw_data.csv
dvc repro
dvc push
dvc pull
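The commands above can be chained into a typical first session. This is a sketch: the dataset path and the S3 remote URL are hypothetical, and `dvc remote add -d` assumes S3-compatible storage is available.

```shell
# Initialize DVC inside an existing Git repository
git init && dvc init

# Start tracking a dataset (path is hypothetical); DVC writes
# data/raw_data.csv.dvc and adds the real file to .gitignore
dvc add data/raw_data.csv
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"

# Configure a default remote (URL is hypothetical) and sync data
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push   # upload cached data to the remote
dvc pull   # download data referenced by .dvc pointer files
```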
PARAMETERS
--help, -h
Show the top-level help message or help for a specific DVC command.
--version
Print the DVC version.
--cd <path>
Change to <path> before executing the command. Useful for running DVC from a script or an arbitrary directory.
--config <path>
Specify a path to a custom DVC configuration file to use instead of the default .dvc/config.
--global
Operate on the global DVC configuration file (typically in the user's home directory).
--local
Operate on the local (project-specific) DVC configuration file within the current DVC repository.
--no-scm
Do not look for an SCM (Source Code Management) repository like Git. Useful for DVC operations outside of a Git repo.
--quiet, -q
Suppress all output from DVC, showing only errors.
--verbose, -v
Be more verbose during command execution, providing detailed information about the process.
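As a sketch of how these global options compose with subcommands (the project path and config key shown are illustrative):

```shell
# Print the installed DVC version
dvc --version

# Run `dvc status` in another directory without changing into it first
dvc --cd ~/projects/my-ml-repo status   # path is hypothetical

# Write a setting to the global config file instead of .dvc/config
dvc config --global core.analytics false

# Suppress normal output; only errors are printed
dvc -q pull
```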
DESCRIPTION
DVC, short for Data Version Control, is an open-source command-line tool designed to make machine learning projects reproducible, shareable, and scalable. It extends traditional Git-based version control to handle large files, datasets, and machine learning models, which Git is not optimized for.
DVC works by creating small .dvc pointer files that are versioned by Git, while the actual large data and model files are stored in a DVC-managed cache and can be synced with various remote storage options (e.g., S3, Google Cloud Storage, Azure Blob Storage, SSH, local storage).
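For illustration, a .dvc pointer file is a small YAML snippet along these lines (the checksum and size values here are made up):

```yaml
# data/raw_data.csv.dvc -- versioned by Git; the real file lives in the DVC cache
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # checksum used to locate the cached copy
  size: 1048576                            # size in bytes (illustrative)
  path: raw_data.csv
```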
Key features include data and model versioning, management of reproducible ML pipelines via a dvc.yaml file, and experiment tracking. It helps in maintaining data provenance, facilitating collaboration among data scientists, and enabling MLOps workflows by ensuring that experiments and models can be easily recreated and deployed.
CAVEATS
DVC is designed to complement Git, not replace it. It relies heavily on Git for versioning the project's codebase and the small .dvc pointer files. Users must be familiar with Git workflows.
Proper configuration of remote storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage) is essential for sharing data and models effectively across teams or different environments.
While powerful, DVC can have a learning curve, especially for users new to data versioning concepts and distributed storage systems.
INTEGRATION WITH GIT
DVC's fundamental design principle is its deep integration with Git. While Git efficiently versions source code, DVC handles large binary files (datasets, models) by storing them in a dedicated cache and in remote storage. It then places small .dvc files within the Git repository. These .dvc files serve as pointers, containing metadata and checksums of the actual large files. This hybrid approach allows Git to track the version of the data without storing the data itself, enabling efficient branching, merging, and cloning of ML projects with large datasets.
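In practice the division of labor looks like this: Git commits the pointer, DVC moves the bytes. File names below are hypothetical, and `<repo-url>` is a placeholder.

```shell
# DVC caches the large file and writes a small pointer
dvc add models/model.pkl        # creates models/model.pkl.dvc

# Git versions only the pointer and the updated .gitignore
git add models/model.pkl.dvc models/.gitignore
git commit -m "Version trained model via DVC pointer"

# After cloning the repository elsewhere, restore the actual file from the remote
git clone <repo-url> && cd <repo>   # placeholders, not real values
dvc pull
```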
REPRODUCIBLE ML PIPELINES
A core feature of DVC is its ability to define and execute reproducible machine learning pipelines. This is achieved through a dvc.yaml file, which describes the various stages of a data science workflow, such as data preprocessing, feature engineering, model training, and evaluation. DVC tracks the inputs, outputs, and dependencies for each stage. When inputs change, DVC intelligently re-runs only the necessary stages, ensuring that the entire workflow can be reproduced consistently. This promotes transparency, reduces errors, and significantly aids in experiment management and result validation.
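A minimal dvc.yaml sketch with two stages might look like the following; the script names and data paths are hypothetical:

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw_data.csv data/prepared.csv
    deps:
      - prepare.py
      - data/raw_data.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py data/prepared.csv models/model.pkl
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - models/model.pkl
```

Running `dvc repro` executes only stages whose dependencies changed; for example, if only train.py was edited, the prepare stage is skipped and its cached output is reused.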
HISTORY
DVC was created by Iterative.ai and first publicly released in 2017. It emerged from the growing need to apply software engineering best practices, particularly version control and continuous integration/delivery (CI/CD), to data-intensive machine learning projects. Its development has been consistently driven by an active open-source community, focusing on making ML development more reproducible, collaborative, and manageable at scale.
SEE ALSO
git(1): Version control system for source code.
mlflow(1): An open-source platform for managing the end-to-end machine learning lifecycle.
lakefs(1): An open-source project that delivers Git-like branching and versioning for data lakes.
pachctl(1): The command-line tool for Pachyderm, which provides data versioning and data pipelines for ML.