dvc-commit
Save changes to tracked data and pipelines
TLDR
Commit changes to all DVC-tracked files and directories
Commit changes to a specified DVC-tracked target
Recursively commit all DVC-tracked files in a directory
SYNOPSIS
dvc commit [paths...] [--json]
PARAMETERS
paths...
Optional. One or more paths to DVC-tracked files, directories, or .dvc files. If specified, only changes to these specific targets will be committed. If omitted, DVC commits changes for all modified tracked data within the project.
--json
Output results in JSON format, which can be useful for scripting and programmatic integration.
DESCRIPTION
dvc commit is a core DVC (Data Version Control) command used to finalize changes made to DVC-tracked files or directories. When you modify data that DVC tracks, these changes are not automatically reflected in its metadata files (.dvc files).
The dvc commit command scans the specified targets (or all modified tracked data if no targets are given), calculates their hashes, copies or links the new data to DVC's local cache, and updates the corresponding .dvc file to point to this new cache entry. This makes the new state of the data versionable.
It's often used after modifying DVC-tracked data and before git add and git commit to version the .dvc files themselves. It ensures that the .dvc file accurately reflects the current state of the data it represents, allowing for proper data versioning and reproducibility.
CAVEATS
dvc commit only stages changes to DVC's internal cache and updates .dvc files. You must follow up with git add .dvc_file and git commit to version these metadata changes in your Git repository.
Does not interact with Git's staging area for other files; it exclusively deals with DVC-tracked data.
For large datasets, the hashing and caching process can take a significant amount of time.
This command does not upload data to remote storage; dvc push is required for that.
WORKFLOW INTEGRATION
Integrated into the Git workflow, dvc commit is typically followed by git add .dvc_file and git commit to version the metadata itself within your Git repository.
HASH-BASED VERSIONING
DVC uses content-addressable storage. When dvc commit is run, it computes a hash (e.g., MD5) of the data's content, and this hash is used to name the data in the DVC cache. This hash, along with file size, is recorded in the .dvc file. This ensures data integrity, enables efficient deduplication, and allows DVC to track data changes precisely.
HISTORY
DVC (Data Version Control) was created by Iterative.ai, with its first public release around 2017. dvc commit has been a fundamental command since its early versions, forming the core mechanism for versioning data similar to how git commit versions code. Its design philosophy aligns with Git's, making it familiar to developers for managing data alongside code efficiently.
SEE ALSO
dvc add(1), dvc push(1), dvc pull(1), dvc status(1), git commit(1)