picard
Manipulate and analyze high-throughput sequencing (HTS) data
TLDR
Start Picard
Open a set of files
Display the version of Picard installed
SYNOPSIS
java -jar /path/to/picard.jar TOOL_NAME [OPTION1=VALUE1 OPTION2=VALUE2 ...]
Example: java -jar picard.jar MarkDuplicates I=input.bam O=output.bam M=metrics.txt
PARAMETERS
(General Note)
This section describes common parameters used across Picard tools. Picard is a suite, and each TOOL_NAME has its own specific set of arguments.
I=<FILE> or INPUT=<FILE>
Specifies the input file path, typically a SAM, BAM, or CRAM alignment file.
O=<FILE> or OUTPUT=<FILE>
Specifies the output file path where results will be written.
M=<FILE> or METRICS_FILE=<FILE>
Used by certain tools to output a metrics file summarizing the operation.
VALIDATION_STRINGENCY=<LEVEL>
Controls how strictly input data is validated. Valid levels are STRICT (the default), LENIENT, and SILENT.
TMP_DIR=<DIRECTORY>
Specifies a directory for temporary files created during processing.
CREATE_INDEX=<true|false>
A boolean option, available on tools that write BAM/CRAM output, that controls whether an index file (.bai or .crai) is created alongside the output.
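As a rough illustration of how these common options combine, the sketch below assembles a SortSam call and prints it instead of running it. The jar path, the file names, the temporary directory, and the helper name build_sortsam_cmd are all placeholders chosen for this example; SORT_ORDER is a SortSam-specific option rather than a common one.

```shell
# Sketch only: PICARD_JAR and every file name here are placeholders.
PICARD_JAR="${PICARD_JAR:-/path/to/picard.jar}"

# build_sortsam_cmd INPUT OUTPUT TMP_DIR
# Prints a SortSam command line that uses the common options described above.
build_sortsam_cmd() {
    printf '%s ' java -jar "$PICARD_JAR" SortSam \
        "I=$1" "O=$2" \
        SORT_ORDER=coordinate \
        VALIDATION_STRINGENCY=LENIENT \
        "TMP_DIR=$3" \
        CREATE_INDEX=true
    printf '\n'
}

# Print the command for review; run it directly once the paths are real.
build_sortsam_cmd input.bam sorted.bam /tmp/picard
```

Printing the command first makes it easy to check option spelling before committing to a long-running job.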
DESCRIPTION
Picard is a Java-based command-line toolkit developed by the Broad Institute for processing high-throughput sequencing (HTS) data, particularly in genomic research. It provides a wide array of tools for manipulating SAM/BAM/CRAM files (sequence alignment/map formats) and VCF files (variant call format), and for collecting quality-control metrics. Common tasks include marking PCR duplicates, adding or replacing read groups, sorting and indexing alignment files, and collecting sequencing quality metrics. Picard is part of many bioinformatics pipelines, where it is often used alongside GATK and Samtools to prepare data for variant calling.
CAVEATS
Java Dependency: Picard is a Java application and requires a Java Runtime Environment (JRE) or Java Development Kit (JDK) to be installed on the system. The required Java version depends on the Picard release: older releases run on Java 8, while the 3.x releases require Java 17.
Resource Intensive: Processing large genomic datasets can be computationally intensive, requiring significant RAM and CPU resources. I/O performance is also crucial.
Not a native Linux Command: Unlike standard Linux commands (e.g., ls, grep), Picard is typically invoked via java -jar and is not usually found directly in system PATHs without a wrapper script.
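Because memory is often the limiting resource, the JVM heap is commonly capped explicitly. The sketch below only prints the resulting command. The 8g default, the PICARD_HEAP variable, the helper name picard_heap_cmd, and the file names are illustrative assumptions, not recommendations taken from the Picard documentation; -Xmx itself is a standard JVM option.

```shell
# Sketch only: PICARD_JAR, PICARD_HEAP and the file names are placeholders.
PICARD_JAR="${PICARD_JAR:-/path/to/picard.jar}"
PICARD_HEAP="${PICARD_HEAP:-8g}"

# Prints a Picard command line with an explicit maximum heap size.
picard_heap_cmd() {
    # -Xmx sets the maximum JVM heap; it must come before -jar.
    echo java "-Xmx${PICARD_HEAP}" -jar "$PICARD_JAR" "$@"
}

picard_heap_cmd MarkDuplicates I=input.bam O=marked.bam M=metrics.txt TMP_DIR=/scratch/tmp
```

Pointing TMP_DIR at a fast local disk with enough free space is a common companion to the heap setting, since some tools spill intermediate data to disk.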
INSTALLATION AND USAGE
Picard is distributed as a single executable JAR file. To use it, download picard.jar from the Broad Institute's Picard releases page on GitHub. It is then invoked using java -jar picard.jar followed by the desired tool name and its specific arguments. It is common practice to create a shell alias or wrapper script for convenience, e.g., alias picard='java -jar /path/to/picard.jar'.
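A shell function can serve the same purpose as the alias while also checking that the jar and Java are present. This is a minimal sketch: the function name picard and the PICARD_JAR variable are conventions assumed here, and the jar path is a placeholder.

```shell
# Hypothetical wrapper; set PICARD_JAR to the real location of picard.jar.
PICARD_JAR="${PICARD_JAR:-/path/to/picard.jar}"

picard() {
    # Fail early with a clear message if the jar is missing.
    if [ ! -f "$PICARD_JAR" ]; then
        echo "picard: jar not found at $PICARD_JAR" >&2
        return 1
    fi
    # Fail early if no Java runtime is on the PATH.
    if ! command -v java >/dev/null 2>&1; then
        echo "picard: java not found in PATH" >&2
        return 1
    fi
    # Pass the tool name and all of its arguments through unchanged.
    java -jar "$PICARD_JAR" "$@"
}
```

With the function defined in a shell startup file, a call such as picard MarkDuplicates I=input.bam O=output.bam M=metrics.txt behaves like the full java -jar invocation.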
KEY SUBCOMMANDS (TOOLS)
Picard is a collection of many individual tools. Some of the most frequently used include:
MarkDuplicates: Identifies and marks duplicate reads in a BAM file.
AddOrReplaceReadGroups: Adds or replaces read group information in a BAM file.
CollectWgsMetrics: Gathers whole-genome sequencing metrics like coverage.
SortSam: Sorts SAM/BAM files by coordinate or query name.
BuildBamIndex: Creates a BAM index (.bai) file for random access.
MergeSamFiles: Merges multiple SAM/BAM files into one.
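Several of these tools are often chained. The dry-run sketch below prints one plausible order (read groups, sort, mark duplicates, index) without executing anything. The sample name, the intermediate file names, the read-group values, and the helper name print_pipeline are placeholders invented for illustration; real pipelines differ in order and options.

```shell
# Dry-run sketch: prints commands only. All names and values are placeholders.
PICARD_JAR="${PICARD_JAR:-/path/to/picard.jar}"

# print_pipeline SAMPLE
# Prints one Picard command per step for SAMPLE.bam.
print_pipeline() {
    # 1. Attach read-group information (values here are dummy examples).
    echo "java -jar $PICARD_JAR AddOrReplaceReadGroups I=$1.bam O=$1.rg.bam RGID=1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=$1"
    # 2. Sort by coordinate.
    echo "java -jar $PICARD_JAR SortSam I=$1.rg.bam O=$1.sorted.bam SORT_ORDER=coordinate"
    # 3. Mark duplicate reads and write a metrics file.
    echo "java -jar $PICARD_JAR MarkDuplicates I=$1.sorted.bam O=$1.dedup.bam M=$1.dup_metrics.txt"
    # 4. Index the final BAM for random access.
    echo "java -jar $PICARD_JAR BuildBamIndex I=$1.dedup.bam"
}

print_pipeline sample1
```

Each step reads the previous step's output, so the printed lines can be reviewed and then run one at a time.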
HISTORY
Picard was developed by the Broad Institute of MIT and Harvard, initially as an internal toolset to support their large-scale genomic sequencing efforts. It became an open-source project and quickly gained widespread adoption in the bioinformatics community due to its robust handling of common HTS data manipulation tasks. It is frequently updated and maintained, often in conjunction with the GATK (Genome Analysis Toolkit), forming a cornerstone of many standard genomic data processing pipelines, particularly those used for human genome sequencing.
SEE ALSO
samtools(1), bcftools(1), GATK, htslib