git-annex
Manage large files with Git
TLDR
Initialize a repo with git-annex
Add a file
Show the current status of a file or directory
Synchronize a local repository with a remote
Get a file or directory
Display help
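The tasks listed above can be sketched as the following commands. This is a minimal illustration run in a throwaway repository, not an official example: the description "demo", the file name notes.bin, and the user identity are placeholders, and the script skips itself when git-annex is not installed.

```shell
#!/bin/sh
# Scratch repository for illustration; "demo" and notes.bin are placeholder names.
set -e
if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex is not installed; skipping"
    exit 0
fi

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Initialize a repo with git-annex
git annex init "demo"

# Add a file
printf 'example content\n' > notes.bin
git annex add notes.bin
git commit -q -m "Add notes.bin to the annex"

# Show the current status of a file or directory
git annex status .

# Synchronize a local repository with its remotes (none are configured here)
git annex sync

# Get a file or directory (nothing to fetch here, the content is already present)
git annex get notes.bin

# Display help
git annex help
```

In an existing repository only the git annex lines are needed; the surrounding setup exists to keep the sketch self-contained.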
SYNOPSIS
git-annex command [options] [arguments]
Common commands:
git-annex init
git-annex add file...
git-annex get file...
git-annex drop file...
git-annex sync
PARAMETERS
init
Initializes a git-annex repository. This converts a Git repository into a git-annex enabled one, setting up the necessary internal structures.
add
Adds specified files to the annex. Instead of adding content to Git, it moves the content to the annex and places a symlink (or pointer file) in the Git repository.
get
Retrieves the content of annexed files from one of the available locations (e.g., another repository, a configured remote, or a cloud backend) to the local repository.
drop
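Removes the local content of annexed files, freeing up local disk space. Before dropping anything, git-annex verifies that enough other copies exist (as required by the numcopies setting) and refuses if they do not, unless forced. The content therefore remains available in other repositories or remotes.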
sync
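Synchronizes the local git-annex repository with its remotes. It commits local changes, then pulls and pushes the Git branches together with the location-tracking information that records where annexed content is stored. Depending on version and configuration, file content itself is transferred only when the --content option is given.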
whereis
Shows which repositories or locations store the content for specified annexed files, helping to understand data distribution.
fsck
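Checks the annexed content for consistency: it verifies that content recorded as present actually exists and matches its checksum, warns when fewer copies exist than numcopies requires, and moves corrupted content aside so that a good copy can be retrieved from elsewhere.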
info
Displays detailed information about the git-annex repository, including statistics about annexed files, storage sizes, and configured remotes.
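The commands described above can be combined into a short walkthrough. The sketch below is an illustration under stated assumptions: it creates two throwaway repositories named "origin" and "laptop" (hypothetical names, as are data.bin and the user identity), and it skips itself when git-annex is not installed.

```shell
#!/bin/sh
# Two scratch repositories, "origin" and "laptop"; all names are placeholders.
set -e
if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex is not installed; skipping"
    exit 0
fi

base=$(mktemp -d)

# First repository: holds the original copy of the content.
git init -q "$base/origin"
cd "$base/origin"
git config user.name "Example User"
git config user.email "user@example.com"
git annex init "origin"
printf 'sample payload\n' > data.bin
git annex add data.bin
git commit -q -m "Add data.bin"

# Second repository: a clone that starts without the file's content.
git clone -q "$base/origin" "$base/laptop"
cd "$base/laptop"
git config user.name "Example User"
git config user.email "user@example.com"
git annex init "laptop"

git annex get data.bin       # fetch the content from origin
git annex whereis data.bin   # list the repositories that hold a copy
git annex fsck data.bin      # verify the local copy against its checksum
git annex drop data.bin      # remove the local copy; origin still has one
git annex info               # print a summary of the repository
```

The drop at the end succeeds only because git-annex can confirm that "origin" still holds a copy; with no other copy available it would refuse.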
DESCRIPTION
git-annex extends Git to manage large files and data without storing their content directly within the Git repository. It achieves this by storing the actual file content in a separate 'annex' and placing symlinks (or special pointers on platforms without robust symlink support) in the Git repository. This allows Git to track the metadata (filenames, permissions, version history) of these files, while git-annex handles the storage, distribution, and integrity of their potentially massive content.
It supports a variety of storage locations, including local directories, network drives, and cloud storage providers. Key features include content-addressing (files are identified by their cryptographic hash, which makes corruption detectable), distributed availability (content can be stored in multiple locations), and efficient synchronization. git-annex is particularly valuable for projects involving large datasets, media files, or binaries, where storing raw content in Git would be impractical, and it enables collaborative work on large file collections.
CAVEATS
While powerful, git-annex introduces a new layer of complexity compared to plain Git. Users must understand the distinction between Git tracking metadata and git-annex managing content.
Annexed files are normally read-only, and attempting to modify them in place can fail or lead to unexpected behavior. To change a file, unlock it with git-annex unlock, edit it, and then add it again with git-annex add so that the new content is stored in the annex and can be committed.
Cross-platform symlink support (especially on Windows) can sometimes be challenging, though git-annex offers alternative mechanisms for these environments.
Managing available disk space and ensuring content is available when needed requires careful planning and use of commands like git-annex get and git-annex drop.
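The unlock, edit, and add cycle mentioned in the caveats above can be sketched as follows. This is a minimal example in a throwaway repository; report.dat, the description "demo", and the user identity are placeholders, and the script skips itself when git-annex is not installed.

```shell
#!/bin/sh
# Scratch repository for illustration; report.dat is a placeholder file name.
set -e
if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex is not installed; skipping"
    exit 0
fi

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git annex init "demo"

# Put a first version of the file into the annex.
printf 'version 1\n' > report.dat
git annex add report.dat
git commit -q -m "Add report.dat"

# Unlock the file to make it editable, then change it.
git annex unlock report.dat
printf 'version 2\n' >> report.dat

# Add the edited file again so the new content is stored in the annex.
git annex add report.dat
git commit -q -m "Update report.dat"
```

Both versions remain reachable through the Git history, each referring to its own annexed content.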
HOW IT WORKS
git-annex replaces large files in your Git repository with symlinks (or special 'pointer' files on some systems). These symlinks point to the actual file content, which is stored in the annex, a directory inside the repository's .git directory (.git/annex/objects). The content is identified by a key derived from its cryptographic hash (e.g., SHA256), which provides content-addressing and allows integrity checks. When you 'add' a file, its content is moved to the annex, and a symlink is created. When you 'get' a file, git-annex ensures its content is present locally by fetching it from a known location. This allows Git to track only the small symlink, keeping the repository itself small and fast, while git-annex handles data storage and retrieval across multiple locations.
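The mechanism described above can be imitated with ordinary shell tools. The sketch below does not use git-annex at all: the store directory and the key format are simplified stand-ins chosen for illustration (real git-annex keys also carry details such as the file size), and sha256sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Illustration only: mimics the "add" step with plain shell tools.
# The "store" directory and key format are simplified stand-ins,
# not git-annex's real on-disk layout.
set -e

work=$(mktemp -d)
cd "$work"
mkdir store

printf 'pretend this is a very large file\n' > big.dat

# 1. Derive a content-addressed key from the file's SHA-256 hash.
hash=$(sha256sum big.dat | cut -d ' ' -f 1)
key="SHA256--$hash"

# 2. Move the content into the store under that key and make it read-only.
mv big.dat "store/$key"
chmod a-w "store/$key"

# 3. Leave a small symlink behind; this is all version control would track.
ln -s "store/$key" big.dat

# Reading through the symlink still yields the original content.
cat big.dat

# 4. An integrity check recomputes the hash and compares it with the key.
actual=$(sha256sum "store/$key" | cut -d ' ' -f 1)
if [ "$actual" = "$hash" ]; then
    echo "content verified"
fi
```

Because the key is derived from the content, two files with identical content map to the same stored object, and any corruption changes the recomputed hash so it no longer matches the key.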
COMMON USE CASES
git-annex is ideal for:
Scientific Data: Managing large experimental results, simulations, or datasets in research projects.
Media Production: Versioning large video, audio, or image files in collaborative environments.
Archiving: Creating distributed, integrity-checked archives of large collections of documents or historical data.
Software Development: Handling large binaries, dependencies, or build artifacts that don't belong in the main Git repository.
Distributed Collaboration: Sharing large datasets among geographically dispersed teams without relying on a single central server for content.
HISTORY
git-annex was created by Joey Hess in 2010 to address Git's limitations with large files. Git is optimized for tracking changes in text files and handles large binary blobs poorly: every revision of such a file is kept in the repository history, binary content often compresses and deltas badly, and every clone has to carry all of it. Hess envisioned a system that could leverage Git's distributed metadata tracking while keeping the content itself outside of Git. The project has seen continuous development, evolving to support many kinds of storage and features such as distributed content availability and integrity checking, and it is widely used for data management in scientific research, archiving, and media production.