LinuxCommandLibrary

dvc-add

Track files or directories with DVC

TLDR

Track a single target file
$ dvc add [path/to/file]

Track a target directory
$ dvc add [path/to/directory]

Recursively add all the files in a given target directory
$ dvc add --recursive [path/to/directory]

Add a target file with a custom .dvc filename
$ dvc add --file [custom_name.dvc] [path/to/file]

SYNOPSIS

dvc add [-h] [-q | -v] [--no-commit] [--file <path>] [--desc <text>] [--to-remote] [--external] [--glob] [--name <name>] [--hash {md5,etag}] [--no-size] [--no-hash] [--force] <path> [<path> ...]

PARAMETERS

<path> [<path> ...]
    One or more paths to files or directories to track.

-h, --help
    Show the help message and exit.

-q, --quiet
    Suppress output.

-v, --verbose
    Be more verbose.

--no-commit
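    Do not store the target data in the DVC cache; the .dvc file is still created. Run dvc commit later to finish the operation. This option has nothing to do with Git commits.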

--file <path>
    Specify the path and name for the generated .dvc file. Defaults to <target_path>.dvc.

--desc <text>
    Attach a human-readable description to the .dvc file. The text is informational and does not affect DVC operations.

--to-remote
    Transfer the target directly to remote storage without keeping a copy in the workspace or the local cache. A .dvc file is still created, so the data can be fetched later with dvc pull.

--external
    Add files or directories located outside of the current DVC repository.

--glob
    Treat the provided <path> as a glob pattern (e.g., data/*.csv).

--name <name>
    Assign a specific name to the added data item in the .dvc file, useful for directory outputs.

--hash {md5,etag}
    Algorithm to use for hashing the data. Defaults to MD5.

--no-size
    Do not include size information in the .dvc file.

--no-hash
    Do not include hash information in the .dvc file. Use with caution as it impacts data integrity checks.

--force
    Overwrite an existing .dvc file without prompting.

DESCRIPTION

dvc add tells DVC to start tracking files or directories. Instead of storing large data directly in Git, DVC keeps it in a local cache and, optionally, in remote storage such as S3 or GCS.

When you run dvc add <path>, DVC computes a content hash of the data, stores the data in its cache, and creates a small .dvc text file (e.g., data.dvc), by default next to the target. This .dvc file records the data's hash, size, and path, which together act as a pointer to the cached copy. DVC also adds the target to .gitignore so that Git does not track the data itself.
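The hash-then-cache step can be illustrated with a short Python sketch. This is a simplified model, not DVC's implementation: `toy_add`, the `cache` directory name, and the two-character fan-out layout are assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(target: Path, cache_dir: Path) -> dict:
    """Copy `target` into a content-addressed cache and return pointer metadata.

    Hypothetical helper: the fan-out directory layout is assumed for
    illustration and is not a statement about DVC's on-disk format.
    """
    file_hash = md5_of_file(target)
    cached = cache_dir / file_hash[:2] / file_hash[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # content already cached is not copied again
        shutil.copyfile(target, cached)
    return {"md5": file_hash, "size": target.stat().st_size, "path": target.name}


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data = workspace / "data.csv"
    data.write_bytes(b"id,value\n1,42\n")
    pointer = toy_add(data, workspace / "cache")
    print(pointer)
```

The returned dictionary holds the same kind of information the text describes for a .dvc file: a hash, a size, and a path.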

This .dvc file is designed to be versioned with Git. By versioning these small metadata files in Git, DVC enables efficient tracking of large datasets and machine learning models, ensuring reproducibility and collaboration without bloating your Git repository. It's the essential first step to bring your data under DVC's control.

CAVEATS

Adding very large files or directories can be slow, because DVC must hash the data and transfer it to the cache. The tracked data stays in the workspace: depending on the cache.type setting, the workspace copy is a reflink, hardlink, symlink, or plain copy of the cached object, and a plain copy doubles disk usage. dvc add never creates Git commits. After running it, stage the generated .dvc file and the updated .gitignore with git add and commit them yourself. Do not delete .dvc files, since they are the only record linking a Git revision to the cached data. Option availability varies between DVC releases (for example, --recursive, --file, and --external belong to older versions), so confirm the flags you need with dvc add --help.

THE .DVC FILE

The primary output of dvc add is a small YAML file (e.g., data.dvc) that serves as a lightweight pointer to the actual data. This file contains crucial metadata such as the data's content hash (MD5 by default), its size, and the original path. These .dvc files are designed to be committed to Git. This allows Git to version the metadata of large files, which in turn enables DVC to retrieve the correct data from its cache or remote storage corresponding to any Git commit.
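As a rough illustration of what such a pointer file holds, the sketch below renders and re-reads a minimal YAML document with an `outs` entry. The helper names `render_pointer` and `read_pointer` are hypothetical, the hash is a made-up placeholder, and real .dvc files may carry additional fields depending on the DVC version.

```python
def render_pointer(md5: str, size: int, path: str) -> str:
    """Render a minimal .dvc-style YAML document for one tracked output."""
    return (
        "outs:\n"
        f"- md5: {md5}\n"
        f"  size: {size}\n"
        f"  path: {path}\n"
    )


def read_pointer(text: str) -> dict:
    """Parse the flat key/value pairs produced by render_pointer.

    This is a deliberately tiny parser for the example, not a YAML library.
    """
    fields = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            stripped = stripped[2:]
        if ":" in stripped and stripped != "outs:":
            key, value = stripped.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


# Placeholder hash, not the digest of any real file.
example = render_pointer("0123456789abcdef0123456789abcdef", 14, "data.csv")
print(example)
print(read_pointer(example))
```

Because the file is a few lines of text, Git diffs of it stay small even when the underlying data is large.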

DVC CACHE

When data is added with dvc add, it is typically moved or copied into a DVC-managed cache directory (usually .dvc/cache within the repository, or a shared global cache configured by dvc config). This cache stores data content-addressed, meaning identical files or directories added to DVC only occupy storage space once, thanks to deduplication. The .dvc file created by dvc add then points to this deduplicated, cached data.
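Content addressing and the resulting deduplication can be shown with a toy in-memory store. `ToyCache` is a hypothetical class written for this example; it only mirrors the idea of keying objects by their hash and says nothing about how DVC lays out its cache on disk.

```python
import hashlib


class ToyCache:
    """In-memory content-addressed store keyed by MD5 digest (illustrative only)."""

    def __init__(self) -> None:
        self.objects = {}  # maps hex digest -> stored bytes

    def put(self, content: bytes) -> str:
        key = hashlib.md5(content).hexdigest()
        # setdefault keeps the existing object if this content was seen before
        self.objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


cache = ToyCache()
first = cache.put(b"same bytes")
second = cache.put(b"same bytes")       # duplicate content, e.g. a copied file
third = cache.put(b"different bytes")

print(first == second)      # True
print(len(cache.objects))   # 2
```

Two additions of identical content yield the same key, so the store holds two objects rather than three.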

HISTORY

DVC (Data Version Control), launched by Iterative.ai in 2017, was developed to address the significant challenge of versioning large datasets and machine learning models, which traditional Git alone cannot handle efficiently. dvc add has been a core command since DVC's inception, fundamental to its operation. It embodies DVC's paradigm of versioning data by creating lightweight pointer files (.dvc files) that Git can manage, while DVC handles the actual data in its cache or remote storage. Its design has continuously evolved to support diverse storage backends, external data sources, and performance optimizations, solidifying its role as a cornerstone in modern MLOps workflows.

SEE ALSO

dvc run(1), dvc push(1), dvc pull(1), dvc remove(1), dvc status(1), git add(1)
