dvc
Manage machine learning experiments and data
TLDR
Execute a DVC subcommand
    dvc subcommand
Display general help
    dvc --help
Display help about a specific subcommand
    dvc subcommand --help
Display version
    dvc --version
SYNOPSIS
dvc [options] command [arguments]
PARAMETERS
init
Initialize a DVC repository.
add
Track a data file or directory.
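run [options]
Define a pipeline stage to run a command and track its dependencies and outputs. This subcommand was removed in DVC 3.0; newer versions define a stage with `dvc stage add` and execute it with `dvc repro`.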
dag
Visualize the DVC pipeline as a directed acyclic graph.
commit
Record changes to tracked data in the DVC cache and update the corresponding metadata files. This is separate from `git commit`.
push
Upload tracked data to a remote storage location.
pull
Download tracked data from a remote storage location.
status
Show which tracked data and pipeline stages have changed, or how the local cache differs from remote storage.
exp
Run and manage experiments. Subcommands include `dvc exp run` and `dvc exp show`.
--version
Show the DVC version.
--help
Show help information.
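The `dag` entry above refers to a directed acyclic graph of stages. As a rough illustration of why that structure matters, the sketch below orders a hypothetical four-stage pipeline so that every stage runs after the stages it depends on. The stage names are invented for the example, and Python's standard `graphlib` stands in for whatever DVC does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
stages = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}

# A topological order guarantees dependencies come before their dependents.
order = list(TopologicalSorter(stages).static_order())
print(order)
```

A graph with a cycle has no such order, which is why pipeline definitions must stay acyclic.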
DESCRIPTION
DVC (Data Version Control) is an open-source version control system for machine learning projects. It extends Git to handle large datasets, machine learning models, and experimental workflows.
Git is designed for source code, whereas DVC manages the data and models that ML projects depend on. It tracks data lineage, dependencies, and metrics, making it easier to reproduce experiments, collaborate with others, and deploy models.
DVC keeps its configuration and local cache in a .dvc directory, and records metadata about tracked files and directories in small .dvc files (plus dvc.yaml and dvc.lock for pipelines). It supports various storage backends, including cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage) and local or remote file systems. This lets users store large datasets externally and version them without committing the data itself to Git.
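To make the idea of tracking data through small metadata records concrete, here is a minimal sketch that describes a file by its name, hash, and size. The `make_pointer` helper and the JSON layout are assumptions made for illustration; real .dvc files are YAML and carry additional fields.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def make_pointer(data_path: Path) -> dict:
    """Build a small record describing a data file (hypothetical layout)."""
    content = data_path.read_bytes()
    return {
        "path": data_path.name,
        "md5": hashlib.md5(content).hexdigest(),
        "size": len(content),
    }


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_bytes(b"id,label\n1,cat\n2,dog\n")
    pointer = make_pointer(data)
    # Only this small record would be committed to Git, not the data itself.
    pointer_text = json.dumps(pointer, indent=2)
    print(pointer_text)
```

Because the record is a few lines of text regardless of how large the data is, Git can version it cheaply while the data lives elsewhere.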
CAVEATS
DVC relies on Git to version its metadata files (.dvc files, dvc.yaml, and dvc.lock), so a working knowledge of Git is essential for using DVC effectively. Large datasets may require significant storage space in the configured remote.
DATA CACHING
DVC uses content-addressable storage (CAS) to cache data. This allows DVC to efficiently share data between different experiments and projects, avoiding redundant data storage and transfer.
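The following sketch shows the general idea of content-addressable storage: a file's location in the cache is derived from a hash of its bytes, so identical content is stored only once. The `store` helper, the MD5 choice, and the two-character prefix directory are assumptions made for the example and are not a description of DVC's actual cache layout.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache_dir: Path, content: bytes) -> Path:
    """Store content under a path derived from its hash and return that path."""
    digest = hashlib.md5(content).hexdigest()
    # Assumed layout: first two hex characters form a subdirectory.
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return target


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    first = store(cache, b"same bytes")
    second = store(cache, b"same bytes")  # identical content, nothing new written
    third = store(cache, b"other bytes")
    file_count = len([p for p in cache.rglob("*") if p.is_file()])
    print(first == second, file_count)
```

Two experiments that use the same dataset therefore resolve to the same cache entry, which is what avoids redundant storage and transfer.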
REPRODUCIBILITY
DVC makes it easier to reproduce machine learning experiments by tracking the data, code, and dependencies used to generate results.
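One way to picture this tracking is as a comparison between the current hashes of a stage's inputs and the hashes recorded at the last run. The sketch below is a simplified illustration with hypothetical helper names (`fingerprint`, `needs_rerun`) and in-memory inputs; it does not reproduce DVC's lock file format.

```python
import hashlib
from typing import Dict


def fingerprint(inputs: Dict[str, bytes]) -> Dict[str, str]:
    """Map each input name to the MD5 hash of its content."""
    return {name: hashlib.md5(data).hexdigest() for name, data in inputs.items()}


def needs_rerun(inputs: Dict[str, bytes], recorded: Dict[str, str]) -> bool:
    """A stage is stale when any input hash differs from the recorded one."""
    return fingerprint(inputs) != recorded


inputs = {"train.py": b"print('train')", "data.csv": b"1,2,3\n"}
recorded = fingerprint(inputs)  # hashes saved after the last successful run

unchanged = needs_rerun(inputs, recorded)
inputs["data.csv"] = b"1,2,3,4\n"  # the data changes
changed = needs_rerun(inputs, recorded)
print(unchanged, changed)
```

If nothing has changed, the recorded results can be reused; if code or data has changed, the stage is run again, so results always correspond to known inputs.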
COLLABORATION
DVC enables effective collaboration on ML projects by providing a standardized way to manage data and models, ensuring that team members can easily share and reproduce each other's work.
HISTORY
DVC was created to address the challenges of versioning data and models in machine learning projects. It was initially designed to work alongside Git, providing a familiar interface for managing data assets. Over time, DVC has evolved to include features such as pipeline management, experiment tracking, and collaboration tools, becoming a comprehensive platform for ML project management.
SEE ALSO
git(1)