LinuxCommandLibrary

dvc-checkout

Restore data tracked by DVC

TLDR

Checkout the latest version of all target files and directories

$ dvc checkout

Checkout the latest version of a specified target
$ dvc checkout [target]

Checkout a specific version of a target from a different Git commit/tag/branch
$ git checkout [commit_hash|tag|branch] [target] && dvc checkout [target]

SYNOPSIS

dvc checkout [targets...] [options]

Commonly used forms:
dvc checkout
dvc checkout my_data.dvc
dvc checkout data/features/
dvc checkout data.csv model.pkl

PARAMETERS

[targets...]
    Optional: one or more paths to DVC-tracked files or directories, or to .dvc files. If omitted, the whole workspace is checked out, that is, every output recorded in the repository's .dvc and dvc.lock files.

--relink
    Forces recreation of links or copies even if the target is already present and correct. Useful for fixing broken links or changing link types.

-q, --quiet
    Suppresses all output messages, showing only errors.

-v, --verbose
    Enables verbose output, showing more detailed information about the operation.

-f, --force
    Does not prompt before removing workspace files. Without it, DVC asks for confirmation when a file that would be removed or replaced has no copy in the cache.

-R, --recursive
    Searches each target directory and its subdirectories for .dvc files and stages to check out.

--summary
    Prints a short summary of the changes made to the workspace instead of listing every affected file.

--with-deps
    Only meaningful when targets are given. Checks out not only the specified targets but also their DVC-tracked dependencies, found by searching backward from the targets through the pipelines that contain them.
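
As an illustration of --relink, the sketch below changes the link type DVC uses and then recreates the links for data that is already checked out. The function name relink_workspace is hypothetical, and "symlink" is only an example value for the cache.type setting; choose a link type your file system supports.

```shell
# Sketch: change how DVC links cached data into the workspace, then relink.
# relink_workspace is a hypothetical helper, not part of DVC.
relink_workspace() {
    link_type="${1:-symlink}"   # example default; an assumption, not a DVC default

    # Record the preferred link type in the repository's DVC config.
    dvc config cache.type "$link_type" || return 1

    # Recreate links or copies for data already present in the workspace.
    dvc checkout --relink
}

# Usage (inside a DVC repository):
#   relink_workspace symlink
```

Wrapping the two commands in one function keeps the config change and the relink together, since changing cache.type alone does not touch files that are already checked out.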

DESCRIPTION

dvc checkout is a fundamental command in Data Version Control (DVC) used to restore data files and directories tracked by DVC to a specific state or version. It materializes the data from the DVC cache into your workspace.

Similar to how Git restores source code files, dvc checkout supports data reproducibility by linking or copying into the workspace the data versions recorded in the corresponding .dvc and dvc.lock files. When executed without arguments, it checks out all DVC-tracked files and directories in the current repository. If paths are specified, it operates only on those targets.

This command is used to switch between data versions, revert unwanted changes, or set up a clean working environment based on a specific Git commit that includes DVC-tracked data. It reads from the DVC cache, which stores immutable versions of your data. If the required data is not in the local cache, run dvc pull to retrieve it from remote storage.
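
The workflow described above can be sketched as a small shell function. This is a minimal illustration, not an official recipe: the name restore_target_at_rev is hypothetical, the target is assumed to be a .dvc file tracked in Git, and the fallback assumes dvc checkout exits with a non-zero status when the data is missing from the local cache.

```shell
# Sketch: restore one DVC-tracked target as recorded at a given Git revision.
# restore_target_at_rev is a hypothetical helper, not part of DVC or Git.
restore_target_at_rev() {
    rev="$1"      # commit hash, tag, or branch
    target="$2"   # path to a .dvc file, e.g. my_data.dvc

    # Take the .dvc file from the requested revision.
    git checkout "$rev" -- "$target" || return 1

    # Materialize the data from the local cache; if that fails
    # (assumed: non-zero exit when the cache lacks the data),
    # fall back to dvc pull, which downloads and checks out.
    dvc checkout "$target" || dvc pull "$target"
}

# Usage (inside a Git repository that uses DVC):
#   restore_target_at_rev v1.0 my_data.dvc
```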

CAVEATS

Data Availability: dvc checkout reads only from the local DVC cache. If the required data is not in the cache, run dvc pull first to download it from a configured remote.
Overwriting Changes: dvc checkout replaces DVC-tracked files in the workspace that differ from the version recorded in the .dvc or dvc.lock files. Save any changes you want to keep (for example with dvc add or dvc commit) before running it.
Git Integration: dvc checkout works best when your DVC repository is integrated with a Git repository. It often follows Git checkouts to align data versions with code versions.
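
One way to guard against the overwrite caveat is to check the workspace before switching revisions. The sketch below is an assumption-laden example: switch_revision_safely is a hypothetical name, and it relies on dvc status --quiet exiting with a non-zero status when data or pipelines are not up to date, so it may also refuse to proceed when only a pipeline stage has changed.

```shell
# Sketch: refuse to switch revisions while DVC reports pending changes.
# switch_revision_safely is a hypothetical helper, not part of DVC or Git.
switch_revision_safely() {
    rev="$1"   # commit hash, tag, or branch

    # Assumed behaviour: --quiet prints nothing and the exit status is
    # non-zero when something in the workspace is not up to date.
    if ! dvc status --quiet; then
        echo "DVC reports changes; review them with 'dvc status' first." >&2
        return 1
    fi

    # Align code first, then data.
    git checkout "$rev" && dvc checkout
}

# Usage (inside a Git repository that uses DVC):
#   switch_revision_safely my-branch
```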

HISTORY

DVC (Data Version Control) was first open-sourced by Iterative.ai in 2017. The dvc checkout command has been a core component of DVC since its early versions, designed to mirror Git's familiar checkout mechanism but applied to machine learning data and models. Its purpose is to support the reproducibility and versioning of large datasets and model files, which Git is not designed to handle efficiently. The command also offers options such as --relink for recreating links and --with-deps for checking out dependencies.

SEE ALSO

dvc add(1) - Adds files and directories to DVC for version control.
dvc pull(1) - Downloads data from DVC remote storage to the local cache and checks it out.
dvc push(1) - Uploads data from the local DVC cache to remote storage.
dvc status(1) - Shows the status of DVC-tracked files and stages.
dvc repro(1) - Reproduces a DVC pipeline by running stages.
git checkout(1) - The analogous Git command for source code, often used in conjunction with dvc checkout.
