LinuxCommandLibrary

dvc-gc

Clean up unused data in DVC repository

TLDR

Garbage collect from the cache, keeping only versions referenced by the current workspace

$ dvc gc [[-w|--workspace]]

Garbage collect from the cache, keeping only versions referenced by branches, tags, and commits

$ dvc gc [[-a|--all-branches]] [[-T|--all-tags]] [[-A|--all-commits]]

Garbage collect from the cache, including the default cloud remote storage (if set)

$ dvc gc [[-A|--all-commits]] [[-c|--cloud]]

Garbage collect from the cache, including a specific cloud remote storage

$ dvc gc [[-A|--all-commits]] [[-c|--cloud]] [[-r|--remote]] [remote_name]

SYNOPSIS

dvc gc [<options>]

Common usage:
dvc gc [--workspace] [--all-branches] [--all-tags] [--all-commits] [--dry] [--cloud] [-r <remote>] [--force]

PARAMETERS

--workspace, -w
    Keep only data referenced by the current workspace and remove everything else from the cache. dvc gc requires at least one scope option; --workspace is enabled automatically when any of the wider scope options below is used.

--all-branches, -a
    Keep data referenced in all Git branches of the repository, as well as in the workspace.

--all-tags, -T
    Keep data referenced in all Git tags of the repository, as well as in the workspace.

--all-commits, -A
    Keep data referenced in all Git commits of the repository, as well as in the workspace. This is the widest scope and covers everything that --all-branches and --all-tags cover.

--dry
    Show what data would be removed without deleting anything. Recommended before any real run of dvc gc.

--force, -f
    Force deletion of data without interactive confirmation. Use with caution, especially with --cloud.

--cloud, -c
    Also remove unreferenced data from the DVC remote storage, in addition to the local cache. The default remote is used unless --remote names a different one.

--remote <name>, -r <name>
    Specify the DVC remote to clean up when using the --cloud option.

--jobs <jobs>, -j <jobs>
    Number of jobs to run in parallel when accessing remote storage. It only has an effect together with --cloud.

--not-in-remote
    Keep cached data that is not present in the remote, and remove local cache entries that the remote already holds (they can be downloaded again). Available in recent DVC releases.

--projects <paths>, -p <paths>
    Optional: Paths to other DVC projects that share the same cache or remote. Data referenced by those projects is kept as well. If omitted, only the current project's references are considered.

DESCRIPTION

dvc gc is a Data Version Control (DVC) command that performs garbage collection on the DVC cache. It removes files and directories from the cache that are no longer referenced by the .dvc and dvc.lock files within the selected scope (the workspace and, optionally, Git branches, tags, or commits). This reclaims disk space occupied by old, unused, or deleted data versions.

It operates on the local cache and, with --cloud, on remote storage as well. Options select which references to keep (e.g., current workspace, all branches, all tags, all commits) and whether to preview the result before anything is deleted. It is an essential maintenance tool for managing storage in DVC repositories.
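
To see how much space a run reclaims, the cache size can be measured before and after. The sketch below is only an illustration: cache_size_kb and reclaimed_kb are hypothetical helper names, and the default cache location .dvc/cache is assumed unless another path is passed.

```shell
#!/bin/sh
# Hypothetical helper: size of a DVC cache directory in kilobytes.
# Assumes the default cache location .dvc/cache unless a path is given.
cache_size_kb() {
    du -sk "${1:-.dvc/cache}" | cut -f1
}

# Hypothetical helper: kilobytes freed between two measurements.
reclaimed_kb() {
    echo $(( $1 - $2 ))
}

# Intended use inside a DVC project (left commented out on purpose):
#   before=$(cache_size_kb)
#   dvc gc --workspace
#   after=$(cache_size_kb)
#   echo "Reclaimed $(reclaimed_kb "$before" "$after") KB"
```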

CAVEATS

dvc gc is a destructive operation. Once data is removed from the cache or remote, it is generally not recoverable without external backups.

Always use --dry first to preview what will be deleted.
Carefully consider the scope of cleaning (e.g., --workspace, --all-branches, --all-commits) to avoid accidentally deleting data referenced by other branches or tags. --all-commits is often the safest choice for the local cache, because it keeps everything that any commit still references.
When cleaning cloud remotes (with --cloud), ensure you have appropriate permissions and understand the impact on shared storage.
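
The precautions above can be combined into a small wrapper that previews first and deletes only after an explicit confirmation. This is a minimal sketch, not part of DVC: safe_gc and the DVC_BIN variable are hypothetical names, and it assumes a DVC version whose gc command supports --dry. Depending on the version, the preview step itself may also ask for confirmation.

```shell
#!/bin/sh
# DVC_BIN is a hypothetical override so the command can be swapped
# (for example with `echo`) to see what would run without touching data.
DVC_BIN="${DVC_BIN:-dvc}"

# Hypothetical wrapper: preview, ask, then delete.
# "$@" carries the scope flags, e.g. --all-commits.
safe_gc() {
    # Step 1: list what would be removed, without deleting anything.
    "$DVC_BIN" gc --dry "$@" || return 1

    # Step 2: require an explicit "y" before the destructive run.
    printf 'Delete the objects listed above? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) "$DVC_BIN" gc --force "$@" ;;
        *)   echo "Aborted; nothing was removed." ;;
    esac
}
```

Called as safe_gc --all-commits, the wrapper keeps the scope identical between the preview and the real run, so the list that was reviewed is the list that gets deleted.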

HOW IT WORKS

dvc gc compares the objects stored in the DVC cache with the objects referenced by .dvc and dvc.lock files within the specified scope (e.g., current workspace, all branches). The cache holds the actual file contents, stored under their content hashes; files in the workspace may be reflinks, hardlinks, symlinks, or copies of those objects. Any cached object that nothing in the scope references is removed, freeing disk space. With --cloud, unreferenced objects are removed from the remote storage as well.
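
Conceptually this is a set difference: everything in the cache minus everything still referenced. The sketch below illustrates the idea with plain text files and comm; it is not DVC's actual implementation, the function name unreferenced is hypothetical, and the hashes are made up.

```shell
#!/bin/sh
# Conceptual sketch only: not DVC's actual code.
# $1: file listing every object hash found in the cache
# $2: file listing hashes referenced within the chosen scope
unreferenced() {
    sort -u "$1" > "$1.sorted"
    sort -u "$2" > "$2.sorted"
    # comm -23 prints lines that appear only in the first file,
    # i.e. cached objects that nothing references.
    comm -23 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}

workdir=$(mktemp -d)
printf '%s\n' aaa111 bbb222 ccc333 > "$workdir/cache.txt"       # made-up hashes
printf '%s\n' aaa111 ccc333        > "$workdir/referenced.txt"  # still in use

garbage=$(unreferenced "$workdir/cache.txt" "$workdir/referenced.txt")
echo "$garbage"   # prints bbb222
rm -rf "$workdir"
```

Widening the scope only ever adds entries to the referenced list, so it can only shrink the set of objects that get removed.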

SCOPE OF CLEANING

dvc gc needs an explicit scope. --workspace is the narrowest one and keeps only what the current workspace references. To avoid deleting data that other branches, tags, or older commits still need, widen the scope with --all-branches, --all-tags, or --all-commits so that references across the repository's history are considered.
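
One way to choose a scope is to preview each of them and compare what would be removed. The following is a hedged sketch: preview_scopes and DVC_BIN are hypothetical names, and it assumes a DVC version whose gc command supports --dry.

```shell
#!/bin/sh
# DVC_BIN is a hypothetical override so the command can be replaced
# (for example with `echo`) when trying the loop outside a DVC project.
DVC_BIN="${DVC_BIN:-dvc}"

# Hypothetical helper: dry-run every scope, from narrowest to widest.
preview_scopes() {
    for scope in --workspace --all-branches --all-tags --all-commits; do
        echo "== would be removed with $scope =="
        "$DVC_BIN" gc --dry "$scope"
    done
}
```

The list of removed objects should get shorter as the scope widens; anything that disappears from the list between two scopes is data that only the wider scope still references.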

HISTORY

DVC (Data Version Control) was created by Iterative.ai, with its initial public release in 2017. The dvc gc command was introduced as a core utility for managing disk space by cleaning the DVC cache, a fundamental aspect of DVC's design to separate data storage from Git. Its functionality has evolved to include cleaning cloud remotes, reflecting DVC's expanded capabilities for distributed data management and collaboration.

SEE ALSO

dvc repro(1), dvc cache dir(1), dvc push(1), git gc(1)
