LinuxCommandLibrary

dvc-fetch

Download data or models tracked by DVC

TLDR

Fetch the latest changes from the default remote upstream repository (if set)

$ dvc fetch

Fetch changes from a specific remote upstream repository
$ dvc fetch [[-r|--remote]] [remote_name]

Fetch the latest changes for one or more specific targets
$ dvc fetch [target/s]

Fetch changes for all branches and tags
$ dvc fetch [[-a|--all-branches]] [[-T|--all-tags]]

Fetch changes for all commits
$ dvc fetch [[-A|--all-commits]]

SYNOPSIS

dvc fetch [options] [targets...]

PARAMETERS

--remote
    Specifies the DVC remote (e.g., S3, GCS, local path) to fetch data from. If omitted, the default remote is used.

--jobs
    Sets the number of parallel download jobs. Defaults to 4 * cpu_count().

--all-branches
    Fetches data referenced across all Git branches in the repository.

--all-tags
    Fetches data referenced by all Git tags in the repository.

--all-experiments
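    Fetches data associated with all DVC experiments, including queued and finished ones. This flag is not part of the documented dvc fetch usage in current DVC releases; verify with dvc fetch --help.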

--recursive
    Searches each target directory and its subdirectories for .dvc and dvc.yaml files, and fetches the data they reference.

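--force
    Forces fetching, even if the data already exists in the cache, potentially overwriting corrupted entries. Not part of the documented dvc fetch usage in current DVC releases; verify with dvc fetch --help.

--dry-run
    Performs a trial run with no changes made, showing which data would be fetched. Not part of the documented dvc fetch usage in current DVC releases; verify with dvc fetch --help.

--show-url
    Displays the URL of the data being fetched from the remote storage. Not part of the documented dvc fetch usage in current DVC releases; verify with dvc fetch --help.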

--no-run-cache
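    Skips fetching the run cache (the history of stage runs) from the remote, so only outputs are fetched. This is the default behavior; pass --run-cache to fetch the run cache as well.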

[targets...]
    Optional. Specifies particular DVC-tracked files or directories to fetch. If omitted, all DVC-tracked data in the current repository (as defined in .dvc files and dvc.yaml) is fetched.
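
For scripted use (for example in a CI job), the options above can be assembled programmatically. The following is a minimal sketch, not part of DVC: build_fetch_command and run_fetch are hypothetical helper names, and the remote name "storage" and the target "data/raw.dvc" are placeholders. It uses only the --remote, --jobs, --all-branches, and --all-tags flags described above.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_fetch_command(
    targets: Sequence[str] = (),
    remote: Optional[str] = None,
    jobs: Optional[int] = None,
    all_branches: bool = False,
    all_tags: bool = False,
) -> List[str]:
    """Assemble the argument list for `dvc fetch` (hypothetical helper)."""
    cmd = ["dvc", "fetch"]
    if remote is not None:
        cmd += ["--remote", remote]
    if jobs is not None:
        if jobs < 1:
            raise ValueError("jobs must be a positive integer")
        cmd += ["--jobs", str(jobs)]
    if all_branches:
        cmd.append("--all-branches")
    if all_tags:
        cmd.append("--all-tags")
    # Targets come last, matching the synopsis: dvc fetch [options] [targets...]
    cmd += list(targets)
    return cmd


def run_fetch(cmd: List[str]) -> int:
    """Run the command if the dvc executable is installed; return its exit code."""
    if shutil.which("dvc") is None:
        raise FileNotFoundError("dvc executable not found on PATH")
    return subprocess.run(cmd, check=False).returncode


# Placeholder remote name and target, for illustration only.
example = build_fetch_command(["data/raw.dvc"], remote="storage", jobs=4)
print(" ".join(example))
```

Building the argument list separately from running it keeps the command easy to log and to test without a DVC repository. Because a fetch only fills the cache, a script would typically follow run_fetch with dvc checkout.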

DESCRIPTION

dvc-fetch is a core command in the Data Version Control (DVC) system, primarily used to download data and models from a configured remote storage location into the local DVC cache. Unlike dvc pull, which both fetches data and checks it out to the workspace, dvc-fetch only populates the local DVC cache with the data referenced in .dvc files or dvc.yaml files. This means it retrieves the data content but does not modify the user's working directory.

This separation is beneficial for scenarios where users need to pre-populate their cache without altering the current state of their working directory, such as in CI/CD pipelines, preparing data for offline use, or debugging.

When dvc-fetch runs, it uses the metadata (hashes, paths) stored in .dvc files to identify the required data artifacts. It then efficiently downloads only the missing or outdated parts from the remote, ensuring data integrity through hash verification and optimizing network bandwidth usage.
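
The selection step described above amounts to a set difference: the hashes referenced by metadata minus the hashes already in the local cache. The sketch below is a simplified model, not DVC's implementation. The helper names are hypothetical, the regular expression handles only plain "md5:" entries as they appear in a typical .dvc file, and the sample file content is illustrative.

```python
import re
from typing import Iterable, List, Set

# Matches lines such as "- md5: a304afb96060aad90176268345e10355", including
# directory entries whose hash carries a ".dir" suffix.
_MD5_LINE = re.compile(
    r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE
)

# Illustrative .dvc file content (one tracked output).
EXAMPLE_DVC_FILE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  size: 14445097
  path: data.xml
"""


def referenced_hashes(dvc_file_text: str) -> Set[str]:
    """Collect every MD5 hash recorded in one metadata file."""
    return set(_MD5_LINE.findall(dvc_file_text))


def plan_fetch(metadata_texts: Iterable[str], cached: Iterable[str]) -> List[str]:
    """Return the hashes that are referenced but absent from the local cache."""
    needed: Set[str] = set()
    for text in metadata_texts:
        needed |= referenced_hashes(text)
    return sorted(needed - set(cached))


# With an empty cache the object must be downloaded; once cached, nothing is.
to_download = plan_fetch([EXAMPLE_DVC_FILE], cached=[])
nothing_left = plan_fetch([EXAMPLE_DVC_FILE], cached=to_download)
print(to_download, nothing_left)
```

Working on hashes rather than file names is what lets a fetch skip content that is already cached, even if it was first downloaded for a different path or revision.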

CAVEATS

  • DVC Repository Required: dvc-fetch must be executed within an initialized DVC repository (after running dvc init).
  • Remote Configuration: A DVC remote storage must be properly configured (e.g., via dvc remote add) and accessible.
  • Disk Space: Sufficient local disk space is required to store the fetched data in the DVC cache.
  • No Workspace Modification: Remember that dvc-fetch only populates the cache; you'll need dvc checkout or dvc pull to link the data into your workspace.
  • Permissions: Appropriate network and access permissions are essential for connecting to and downloading from the remote storage.

DVC CACHE

dvc-fetch downloads data into the DVC cache, a content-addressable storage area. Data is stored by its hash, which deduplicates identical content across versions of a dataset and, when a cache is shared, across multiple DVC projects. By default the cache lives in .dvc/cache inside the project and is excluded from Git by .dvc/.gitignore, so cached data never enters the Git history; it can also be moved to a user-defined location.
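
To illustrate content-addressable storage, the sketch below stores byte strings under a path derived from their MD5 digest, so identical content is written only once. It is a toy model under stated assumptions: object_path and store are hypothetical helpers, and the two-character prefix directory is a simplification, since the actual cache layout varies between DVC versions.

```python
import hashlib
import tempfile
from pathlib import Path


def object_path(cache_root: Path, digest: str) -> Path:
    """Map a hex digest to a path; the first two characters form a subdirectory."""
    return cache_root / digest[:2] / digest[2:]


def store(cache_root: Path, content: bytes) -> Path:
    """Write content under its digest, skipping the write if it already exists."""
    digest = hashlib.md5(content).hexdigest()
    path = object_path(cache_root, digest)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = store(root, b"same bytes")
    second = store(root, b"same bytes")       # duplicate content, same path
    other = store(root, b"different bytes")   # new content, new path
    stored_files = [p for p in root.rglob("*") if p.is_file()]
    print(first == second, len(stored_files))  # prints: True 2
```

Three store calls produce two files because the path depends only on the content, which is the property that makes deduplication automatic.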

DATA INTEGRITY

A key feature of DVC is its focus on data integrity. Every tracked file is identified by a content hash (MD5 by default) recorded in the corresponding .dvc or dvc.lock metadata, so fetched data can be checked against that hash to confirm it is exactly what was intended and was not corrupted during transfer. Whether dvc-fetch rehashes files after download depends on the remote's configuration (the verify option of dvc remote modify), since rehashing large datasets slows the transfer.
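
The hash comparison can be modeled in a few lines. This is a minimal sketch of the idea, not DVC's own code: file_md5 and matches_recorded_hash are hypothetical helpers, and the sketch covers single files only (directory entries are hashed differently by DVC).

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path: Path, recorded_md5: str) -> bool:
    """Compare the file's current hash with the one recorded in metadata."""
    return file_md5(path) == recorded_md5.lower()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    recorded = hashlib.md5(b"hello").hexdigest()  # stands in for the .dvc entry
    intact = matches_recorded_hash(sample, recorded)

    sample.write_bytes(b"hellO")  # simulate corruption of a single byte
    corrupted = matches_recorded_hash(sample, recorded)

print(intact, corrupted)  # prints: True False
```

Reading in chunks keeps memory use constant regardless of file size, which matters for the multi-gigabyte datasets DVC is typically used with.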

HISTORY

DVC (Data Version Control) was open-sourced by Iterative.ai in 2017 with the ambitious goal of bringing Git-like version control capabilities to large datasets and machine learning models. dvc-fetch has been a fundamental command since DVC's early versions, providing the core functionality for retrieving data artifacts efficiently.

Its design reflects the need for robust data management in MLOps, allowing users to pre-download data for reproducibility, CI/CD pipelines, and collaborative machine learning projects without directly altering the working directory's state. The command has evolved with DVC's features, incorporating support for advanced concepts like experiments, branches, and tags.

SEE ALSO

dvc(1), dvc-pull(1), dvc-push(1), dvc-add(1), dvc-remote(1), dvc-checkout(1), git(1)
