git-annex

Manage large files with Git

TLDR

Initialize a repo with Git annex

$ git annex init

Add a file
$ git annex add [path/to/file_or_directory]

Show the current status of a file or directory
$ git annex status [path/to/file_or_directory]

Synchronize a local repository with a remote
$ git annex sync [remote]

Get a file or directory
$ git annex get [path/to/file_or_directory]

Display help
$ git annex help

SYNOPSIS

git annex subcommand [options...] [files...]

PARAMETERS

init [name]
    Initialize a Git repository with annex

add [options] files
    Add files to the annex

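git rm files
    Remove files from the working tree and index. This is plain Git; git-annex has no rm subcommand of its own, and the content stays in the annex until it is dropped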

drop [options] files
    Remove content from local annex (keeps pointer)

get [options] files
    Retrieve file content to local annex

sync [options]
    Sync annex changes with remotes

copy/move [options] files
    Copy or move content to or from a remote (used with --to or --from)

lock/unlock files
    Lock files to prevent modification or unlock for editing

status
    Show annex status

unused
    Find unreferenced annex files

--auto
    With commands such as get, drop, and copy, act only on files as dictated by the preferred content and numcopies settings

--fast
    Skip expensive checks

--json
    Output in JSON format

--debug
    Enable debug logging

--help / -h
    Show help for command or subcommand

DESCRIPTION

git-annex extends Git to efficiently manage large files, datasets, and binaries without bloating the repository history. Instead of storing file contents directly in Git, it uses pointer files committed to Git, while actual content is stored in an annex with pluggable backends (local disk, SSH remotes, S3, Glacier, WebDAV, torrent, etc.).
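The pointer for an annexed file names its content by a key. As a minimal sketch, assuming the common key layout BACKEND-sSIZE--NAME (keys can carry extra fields that this does not handle), the pieces can be pulled apart in plain shell. The helper names annex_key_backend and annex_key_size and the sample key are made up for illustration.

```shell
# Sketch: split a git-annex key of the assumed form BACKEND-sSIZE--NAME.
# annex_key_backend and annex_key_size are hypothetical helper names.

annex_key_backend() {
    # Everything before the first "-" is the backend name.
    printf '%s\n' "${1%%-*}"
}

annex_key_size() {
    # Strip the leading "BACKEND-s", then everything from "--" onward.
    rest=${1#*-s}
    printf '%s\n' "${rest%%--*}"
}

# For a locked (symlinked) file, the key is the last path component of
# the symlink target, for example:
#   key=$(basename "$(readlink largefile.dat)")
# The key below is an invented sample, not a real hash.
key='SHA256E-s1048576--0123abcd.dat'
echo "backend: $(annex_key_backend "$key")"
echo "size: $(annex_key_size "$key") bytes"
```

Because the size is recorded in the key itself, the size of a file can be read from its pointer even when the content is not present locally.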

This enables distributed storage in which each repository tracks where content is available across locations. Key workflows include adding files (git annex add), syncing location information and metadata (git annex sync), retrieving content (git annex get), and removing local content while keeping pointers (git annex drop). Files are normally locked (symlinks into the annex) and can be unlocked for editing; the older direct mode, which checked files out directly, has been removed from current releases. git-annex also supports preferred content settings, repository groups, and automatic handling via --auto.
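Location tracking is what makes dropping content safe. A minimal sketch, assuming it runs inside an initialized annex repository: the helper name safe_drop is hypothetical, while whereis and drop are standard git-annex subcommands.

```shell
# Sketch: check where a file's content lives before dropping the local copy.
# safe_drop is a hypothetical helper name.

safe_drop() {
    file=$1
    # List the repositories known to hold this file's content.
    git annex whereis "$file" || return 1
    # drop refuses to remove the local copy unless it can verify that
    # enough other copies exist (the numcopies setting).
    git annex drop "$file"
}

# Usage (inside an annex repository):
#   safe_drop largefile.dat
```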

git-annex is well suited to scientific data, media libraries, backups, and collaborative projects, and it integrates with Git for version control of metadata. It requires Git 1.7.10+ and records location tracking and other annex state in a dedicated git-annex branch. It also underpins DataLad and similar reproducible-research stacks.

CAVEATS

Disk usage can grow quickly when many repositories each keep full copies of the content; steep learning curve for advanced backends and policies; requires a consistent Git workflow; direct mode, found only in older releases, alters checkout behavior.

MODES

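Indirect: Uses symlinks/pointers (default).
Direct: Files stored directly in the working tree (enabled with git annex direct in older releases; removed from current versions, where git annex unlock covers the same need).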

KEY CONCEPT: PREFERRED CONTENT

Each repository can declare which content it wants to keep by setting a preferred content expression with git annex wanted. Commands run with --auto then follow that expression.
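As a minimal sketch, assuming an initialized annex repository and a made-up policy of keeping only *.dat files locally, a preferred content expression could be set and applied like this. The helper name configure_wanted is hypothetical.

```shell
# Sketch: make this repository want only *.dat files, then let --auto
# fetch and drop content according to that expression.
# configure_wanted is a hypothetical helper name.

configure_wanted() {
    # "here" refers to the current repository.
    git annex wanted here 'include=*.dat' || return 1
    # Fetch wanted content that is missing locally.
    git annex get --auto .
    # Drop content that is no longer wanted (still subject to numcopies).
    git annex drop --auto .
}

# Usage (inside an annex repository):
#   configure_wanted
```

Running git annex wanted here without an expression prints the current setting, which is a quick way to check what was configured.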

EXAMPLE WORKFLOW

git annex init "myrepo"
git annex add largefile.dat
git commit -m "Add largefile.dat"
git annex sync origin
git annex copy largefile.dat --to origin
git annex drop largefile.dat

HISTORY

Developed by Joey Hess starting in 2010, out of his own need to handle large files with Git. It is actively maintained (v10+ in 2023) and powers tools like DataLad. Key milestones: special remotes (2011), direct mode (2014), encrypted special remotes.

SEE ALSO

git(1), rsync(1), ssh(1)
