nextflow
Run and manage computational pipelines
TLDR
Run a pipeline, reusing cached results from previous runs
Run a specific release of a remote workflow from GitHub
Run with a given work directory for intermediate files and save an execution report
Show details of previous runs in current directory
Remove cache and intermediate files for a specific run
List all downloaded projects
Pull the latest version of a remote workflow from Bitbucket
Update Nextflow
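The TLDR entries above might look like the following on the command line; pipeline names, paths, and run names are placeholders:

```shell
# Run a pipeline, reusing cached results from previous runs
nextflow run main.nf -resume

# Run a specific release of a remote workflow hosted on GitHub
nextflow run nextflow-io/hello -r v1.1

# Run with a given work directory and save an execution report
nextflow run main.nf -w /tmp/work -with-report report.html

# Show details of previous runs in the current directory
nextflow log

# Remove cache and intermediate files for a specific run
nextflow clean -f run_name

# List all downloaded projects
nextflow list

# Pull the latest version of a remote workflow from Bitbucket
nextflow pull user/repo -hub bitbucket

# Update Nextflow itself
nextflow self-update
```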
SYNOPSIS
nextflow <command> [<options>] [<arguments>]
nextflow run <pipeline> [<options>] [<params>]
PARAMETERS
-h
Displays help information for the command or global options.
-v, -version
Shows the Nextflow version number.
-C <file>
Uses the specified configuration file(s), ignoring all default config files (use -c to add a file on top of the defaults).
-bg
Runs the Nextflow pipeline execution in the background.
-q
Suppresses informational messages, showing only warnings or errors.
-ansi-log <true|false>
Enables or disables ANSI console log output (the rendered run progress display).
-r <revision>, -revision <revision>
Specifies the pipeline version to execute (e.g., Git branch, tag, or commit ID).
-profile <name>
Specifies one or more configuration profiles to apply (e.g., 'docker', 'conda', 'slurm').
-params-file <file>
Loads pipeline parameters from a YAML or JSON file.
-entry <name>
Specifies the entry point (workflow) to execute within the pipeline.
-resume
Resumes a previous pipeline execution from the last successful checkpoint.
-w <dir>, -work-dir <dir>
Sets the pipeline's work directory where intermediate files are stored.
-name <name>
Assigns a custom name to the pipeline execution, useful for tracking.
-latest
Forces the download of the latest revision of the pipeline from a repository.
-offline
Disables internet access for pipeline execution (e.g., no Git fetch or container pull).
-stub-run, -stub
Creates empty stub output files for processes, useful for testing pipeline logic.
-with-docker <image>
Executes processes using the specified Docker image for containerization.
-with-singularity <image>
Executes processes using the specified Singularity image for containerization.
-with-conda <env>
Enables Conda environment management for process dependencies.
-with-tower
Enables monitoring and logging of the pipeline execution with Nextflow Tower.
-with-podman <image>
Executes processes using the specified Podman image for containerization. There is no single container-engine flag; the engine is selected via the -with-<engine> options or the corresponding configuration scopes (docker, singularity, podman).
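A hypothetical invocation combining several of the run options above; the pipeline name, revision, run name, and parameters file are illustrative:

```shell
# Run a pinned pipeline revision with the 'docker' profile,
# resuming cached work, naming the run, and loading parameters
# from a YAML file
nextflow run nf-core/rnaseq -r 3.14.0 \
    -profile docker \
    -resume \
    -name my_run \
    -params-file params.yaml
```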
DESCRIPTION
Nextflow is an open-source workflow management system designed for building and deploying complex computational pipelines. It excels at handling large-scale data processing in a reproducible and portable manner across various computing environments. Built on the reactive programming paradigm, Nextflow simplifies the development of parallel and distributed processes. It manages dependencies, tracks outputs, and automatically retries failed tasks, ensuring robust execution. A core feature is its ability to seamlessly integrate with container technologies like Docker and Singularity, and resource managers such as Slurm, SGE, Kubernetes, and AWS Batch, enabling pipelines to run consistently from a local machine to a high-performance computing cluster or cloud environment. It's widely adopted in bioinformatics for its efficiency and reproducibility.
CAVEATS
While powerful, Nextflow has a learning curve, especially when writing complex pipelines or debugging issues related to executor and container environments. Resource management and optimization can require significant effort to get optimal performance on diverse infrastructures. The DSL2 (Domain Specific Language 2) for pipeline definition, while expressive, requires familiarity with Groovy-like syntax and concepts.
NEXTFLOW DSL2 AND MODULES
Nextflow pipelines are defined using a Groovy-based Domain Specific Language (DSL). DSL2, introduced in Nextflow version 20.07.0, revolutionized pipeline development by enabling the creation of modular components (processes and workflows) that can be easily shared, reused, and composed into larger pipelines. This greatly enhances pipeline organization, maintainability, and collaboration.
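A minimal DSL2 sketch, with illustrative names: a process can live in a module file, be imported, and be composed into a workflow.

```nextflow
// Illustrative DSL2 process; in practice this would sit in a
// module file and be pulled in with an `include` statement.
process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    // Feed two values through the process and print the results
    Channel.of('world', 'nextflow') | SAY_HELLO | view
}
```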
CONFIGURATION FILES
Nextflow uses a powerful configuration system, primarily via .config files. These files allow users to define various settings like executor types, container images, resource requests (CPU, memory), and custom parameters. Configurations can be layered, with system-wide, project-specific, and user-specific settings overriding each other, providing immense flexibility for different execution environments.
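A sketch of a nextflow.config illustrating these layers; executor, resource values, and the profile name are assumptions, not defaults:

```nextflow
params.outdir = 'results'          // custom pipeline parameter

process {
    executor = 'slurm'             // submit tasks via Slurm
    cpus     = 4                   // per-task resource requests
    memory   = '8 GB'
}

profiles {
    docker {
        docker.enabled = true      // run tasks in Docker containers
    }
}
```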
CHANNELS
The core mechanism for data flow in Nextflow is the "channel." Channels are asynchronous queues that connect processes, allowing them to exchange data without direct file system interaction. This reactive data-flow model is fundamental to Nextflow's ability to parallelize tasks efficiently and manage dependencies.
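As a sketch, a queue channel can be created from files on disk and transformed through operators before reaching a process; the glob pattern and tuple shape are illustrative:

```nextflow
// One channel item per matching file; each is paired with a
// sample identifier derived from its base name.
Channel
    .fromPath('data/*.fastq.gz')
    .map { file -> tuple(file.baseName, file) }
    .view()
```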
HISTORY
Nextflow was initially developed by Paolo Di Tommaso at the Centre for Genomic Regulation (CRG) in Barcelona, Spain, and publicly released in 2013. It emerged from the need for a more robust and reproducible way to manage complex data analysis pipelines in genomics. Its key innovation was the adoption of a reactive, data-flow programming model, inspired by functional programming, which made pipeline parallelization and error handling more intuitive. The introduction of DSL2 in recent years significantly improved pipeline modularity and reusability, further solidifying its position as a leading tool in scientific computing, particularly in bioinformatics and life sciences.
SEE ALSO
snakemake(1), cwltool(1), make(1), docker(1), singularity(1)