dvc-add
Track files or directories with DVC
TLDR
Add a single target file to the index: `dvc add path/to/file`
Add a target directory to the index: `dvc add path/to/directory`
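Recursively add all the files in a given target directory (in DVC releases that provide the option): `dvc add --recursive path/to/directory`
Add a target file with a custom .dvc filename (in DVC releases that provide the option): `dvc add --file custom_name.dvc path/to/file`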
SYNOPSIS
dvc add [-h] [-d] [-o] [-M] [-f] [-v] [-q] [-p] [--external] [--no-exec] [--glob
PARAMETERS
-h, --help
Show help message and exit.
-d, --no-exec
Create the `.dvc` file without copying the data into the cache.
--glob
Treat targets as glob patterns, so that shell-style wildcards such as `*` and `?` are expanded by DVC.
-o, --out
Specify the output path explicitly instead of letting DVC derive it. Intended for advanced use cases where the default behavior is not desired, such as preserving symlinks.
--external
Treat targets as external data, allowing paths located outside the DVC repository.
-M, --move
Move the original data file or directory to the DVC cache after adding it. Implies `--no-exec`.
-f, --force
Force the operation even if the target `.dvc` file already exists; the existing file is overwritten.
-v, --verbose
Increase verbosity. Print debug logs.
-q, --quiet
Suppress printing of output to the console.
-p, --progress
Show a progress bar (enabled by default). Useful for following the copying of large files.
--cloud
Upload data to the remote storage after adding it (if a remote is configured).
--remote
Name of the remote to upload the data to. Defaults to the current remote (configured by `dvc remote default`). Requires `--cloud`.
--jobs
Number of jobs to run in parallel. Defaults to the number of CPU cores.
targets
Path to the data file or directory to add to DVC tracking. Multiple paths may be given.
DESCRIPTION
The `dvc add` command tracks data files or directories with DVC (Data Version Control). It adds the specified data to the DVC cache, generates a `.dvc` file that records the data's location and MD5 checksum, and links the cached data back into the workspace. This lets DVC detect changes to the data and manage its versions.
Data tracked this way can serve as dependencies, that is, inputs, of DVC pipeline stages. This makes it possible to version, manage, and reproduce machine learning workflows. DVC handles large datasets and model files efficiently by caching them outside of Git's history. Files in a directory can be added recursively using the appropriate flag.
Usage example: to add a directory named `data` to DVC tracking, run `dvc add data`. This generates a `data.dvc` file alongside the directory, stores the directory's contents in the DVC cache, and links them back into the workspace.
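The usage example above is normally followed by staging the generated files in Git. A minimal sketch of that workflow as a shell helper follows. It is an illustration, not part of DVC: `track_with_dvc` is a hypothetical name, and the sketch assumes `dvc` and `git` are installed, that the current directory is inside an initialized DVC repository, and that `dvc add` writes the `.dvc` file and a `.gitignore` entry next to the target.

```shell
# Hypothetical helper (not part of DVC): track a path with DVC, then stage
# the generated .dvc file and .gitignore entry in Git.
track_with_dvc() {
    target=${1%/}                     # strip a trailing slash: "data/" -> "data"
    dir=$(dirname -- "$target")       # directory that contains the target

    # Cache the data and write "<target>.dvc" next to it.
    dvc add "$target" || return 1

    # Git versions only the small .dvc file; the data itself stays ignored.
    git add "${target}.dvc" "${dir}/.gitignore"
}
```

Calling `track_with_dvc data` and then `git commit -m "Track data with DVC"` records the current state of `data` in the Git history without committing the data itself.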
CAVEATS
Existing `.dvc` files will be overwritten if using `-f`. Ensure the DVC remote is properly configured before using `--cloud`.
HOW DVC STORES DATA
DVC keeps data out of Git's history by storing it in a dedicated cache directory, by default `.dvc/cache`, which Git ignores. When you run `dvc add`, DVC calculates the MD5 hash of the data file or directory and uses it as the content address in the cache. DVC then links the workspace location to the cached data (by copy, symlink, or hardlink, depending on the configuration). The data is thus versioned independently of Git, which is better suited to code than to large data files.
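The content-addressing idea described above can be sketched with standard tools. This is a toy illustration, not DVC's actual code: the function name `cache_put`, the directory name `toy-cache`, and the two-character fan-out are assumptions made for the sketch, and it assumes `md5sum` from GNU coreutils is available.

```shell
# Toy content-addressed store (illustration only, not DVC's implementation).
cache_put() {
    src=$1
    cache_dir=${2:-toy-cache}         # assumed cache location for this sketch

    # The MD5 digest of the file content becomes the object's address.
    hash=$(md5sum < "$src" | cut -d ' ' -f 1)
    prefix=$(printf '%s' "$hash" | cut -c 1-2)
    rest=$(printf '%s' "$hash" | cut -c 3-)
    obj_dir="$cache_dir/$prefix"

    # Identical content maps to the same path, so it is stored only once.
    if [ ! -e "$obj_dir/$rest" ]; then
        mkdir -p -- "$obj_dir"
        cp -- "$src" "$obj_dir/$rest"
    fi

    # Replace the workspace file with a link to the cached object.
    # A symlink is used here; DVC picks the link type from its configuration.
    abs_dir=$(cd "$obj_dir" && pwd)
    ln -sf -- "$abs_dir/$rest" "$src"

    printf '%s\n' "$hash"
}
```

Running `cache_put` on two files with the same content prints the same hash for both and leaves a single object in the cache, which is why a content-addressed layout avoids storing duplicates.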
CONCURRENT ACCESS MANAGEMENT
DVC creates a lock file, named after the original data source, to coordinate concurrent access.