LinuxCommandLibrary

dvc-add

Track files or directories with DVC

TLDR

Add a single target file to the index

$ dvc add [path/to/file]

Add a target directory to the index
$ dvc add [path/to/directory]

Recursively add all the files in a given target directory
$ dvc add --recursive [path/to/directory]

Add a target file with a custom .dvc filename
$ dvc add --file [custom_name.dvc] [path/to/file]

SYNOPSIS

dvc add [-h] [-q | -v] [-R] [--no-commit] [--external] [--glob] [--file <filename>] [-o <path>] [--to-remote] [-r <name>] [-j <number>] targets [targets ...]

PARAMETERS

-h, --help
    Show the help message and exit.

-q, --quiet
    Suppress printing of output to the console.

-v, --verbose
    Increase verbosity. Print debug logs.

-R, --recursive
    Search each target directory and its subdirectories for files, and create a separate `.dvc` file for each file found.

--no-commit
    Create the `.dvc` file but do not store the data in the cache. Run `dvc commit` later to finish the operation.

--external
    Allow targets that are located outside of the DVC repository.

--glob
    Treat targets as glob-style patterns and add every file or directory that matches.

--file <filename>
    Name of the `.dvc` file to generate. Can only be used with a single target.

-o <path>, --out <path>
    Destination path for a copy of the target inside the workspace (instead of tracking it in place). For advanced use cases.

--to-remote
    Transfer the target directly to remote storage instead of storing it in the local cache. Requires a configured remote.

-r <name>, --remote <name>
    Name of the remote to transfer the data to. Defaults to the remote set with `dvc remote default`. Requires `--to-remote`.

-j <number>, --jobs <number>
    Number of jobs to run in parallel when transferring data with `--to-remote`.

targets
    Paths to the data files or directories to add to DVC tracking. Multiple paths may be given.

DESCRIPTION

The `dvc add` command tracks data files or directories with DVC (Data Version Control). It stores the specified data in the DVC cache, generates a `.dvc` file that records the data's path and MD5 hash, and links the cached data back into the workspace. This allows DVC to detect changes to the data and manage its versions.

Data tracked this way can serve as dependencies of DVC pipeline stages, which makes it possible to version, manage, and reproduce machine learning workflows. DVC handles large datasets and model files efficiently by keeping them out of Git: only the small `.dvc` file is committed, while the data itself lives in the cache. To track every file under a directory individually, use `--recursive`.

Usage Example: To add a directory named `data` to DVC tracking, run `dvc add data`. This generates a `data.dvc` file next to the directory, moves the directory's contents into the DVC cache, and links them back into the workspace.
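The usage example can be extended into a small workflow that also records the generated files in Git. This is a minimal sketch, not an official recipe: the function name `track_with_dvc` is hypothetical, and it assumes the current directory is a Git repository where `dvc init` has already been run.

```shell
#!/bin/sh
# Hypothetical helper: track one target with DVC and stage the pointer file in Git.
# Assumes `dvc` and `git` are installed and the project is already initialised.
track_with_dvc() {
    target=$1
    # Store the data in the cache and write <target>.dvc next to the target.
    dvc add "$target" || return 1
    # dvc add also lists the target in a .gitignore in the same directory;
    # stage both files so Git versions the pointer rather than the data.
    git add "$target.dvc" "$(dirname "$target")/.gitignore" || return 1
    # Show the generated pointer file (it records the path and its hash).
    cat "$target.dvc"
}
```

Running `track_with_dvc data` followed by `git commit -m "Track data with DVC"` would leave `data.dvc` under Git control while the data itself stays in the DVC cache.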

CAVEATS

Adding a target that is already tracked rewrites its existing `.dvc` file. Ensure a DVC remote is properly configured before using any option that transfers data to remote storage.

HOW DVC STORES DATA

DVC keeps data out of Git in a special cache directory, by default `.dvc/cache`. When you use `dvc add`, DVC calculates the MD5 hash of the data file or directory and uses it as the content address in the cache. DVC then links the cached data to the workspace location as a reflink, hardlink, symlink, or copy, depending on the `cache.type` configuration. The data is therefore versioned independently of Git, which remains responsible for code.

CONCURRENT ACCESS MANAGEMENT

DVC uses a lock file inside the project's `.dvc` directory to manage concurrent accesses, so that simultaneous DVC commands do not modify the same project at once.

SEE ALSO

dvc init(1), dvc run(1), dvc commit(1), dvc push(1)
