vcftools
Analyze and manipulate VCF variant files
SYNOPSIS
vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE ] [ --out OUTPUT_PREFIX ] [ FILTERING_OPTIONS ] [ OUTPUT_OPTIONS ]
PARAMETERS
--vcf
Specifies an uncompressed input VCF file. Every run needs one input option (--vcf, --gzvcf, or --bcf).
--gzvcf
Specifies the input gzipped VCF file (.vcf.gz).
--out
Defines the prefix for all output files generated by vcftools. If omitted, the prefix defaults to "out".
--recode
Instructs vcftools to output a new VCF file after applying filters or modifications.
--recode-INFO-all
Used with --recode to ensure all INFO fields from the original VCF are retained in the output.
--keep
Retains only the individuals whose IDs are listed in the specified file (one ID per line).
--remove
Removes individuals whose IDs are listed in the specified file.
--snps
Includes only SNPs whose IDs are listed in the specified file.
--exclude
Excludes SNPs whose IDs are listed in the specified file (one ID per line).
--positions
Includes only sites (chromosome and position) listed in the specified file.
--bed
Includes only sites within the genomic regions listed in the specified BED file (tab-delimited chromosome, start, and end columns, preceded by a header line).
--chr
Processes only variants on the specified chromosome.
--from-bp
Specifies the lower bound (in base pairs) of the range of sites to process. Must be combined with a single --chr.
--to-bp
Specifies the upper bound (in base pairs) of the range of sites to process. Must be combined with a single --chr.
--min-alleles
Includes only sites with at least the specified number of alleles.
--max-alleles
Includes only sites with at most the specified number of alleles. For example, --min-alleles 2 --max-alleles 2 keeps only biallelic sites.
--remove-indels
Excludes insertion/deletion variants, retaining only SNPs.
--maf
Includes only sites with a minor allele frequency (MAF) greater than or equal to the specified value (0.0-0.5).
--max-maf
Includes only sites with a minor allele frequency (MAF) less than or equal to the specified value (0.0-0.5).
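--max-missing
Excludes sites based on the proportion of missing genotypes. Despite the name, the value (0.0-1.0) is the minimum fraction of non-missing genotypes a site must have: 0 keeps sites that are completely missing, 1 keeps only sites with no missing data, and 0.5 keeps sites genotyped in at least 50% of individuals.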
--min-meanDP
Includes only sites whose mean depth of coverage across individuals is greater than or equal to the specified value. Requires per-genotype depth (DP) in the VCF.
--hardy
Calculates and outputs Hardy-Weinberg equilibrium (HWE) statistics for each site.
--freq
Calculates and outputs allele frequencies for each site.
--site-pi
Calculates and outputs nucleotide diversity (Pi) per site.
--window-pi
Calculates and outputs nucleotide diversity (Pi) in non-overlapping windows of specified size (in base pairs).
--weir-fst-pop
Calculates and outputs Weir and Cockerham's Fst between populations. Give the option once per population; each argument is a file listing the IDs of the individuals in that population, one per line.
--diff-site
Compares the input with a second VCF file, supplied with --diff or --gzdiff, and reports for each site whether it is shared or present in only one of the files.
--012
Outputs a genotype matrix where genotypes are represented as 0 (homozygous reference), 1 (heterozygous), or 2 (homozygous alternate), with -1 for missing data.
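--force-index
Not listed in the current vcftools manual; check vcftools --help for your installed version before relying on it. vcftools does not create tabix (.tbi) indexes. To index a bgzip-compressed VCF for fast region queries, use tabix(1).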
--remove-filtered-all
Removes all sites with a FILTER value other than PASS (i.e., sites marked as failed by previous filtering steps).
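
To make the --012 encoding concrete, the following sketch reproduces it with awk on a toy VCF. The file name and the sample IDs (S1-S3) are hypothetical, and the awk script only illustrates the encoding; it is not how vcftools is implemented. It prints one row per site for readability, whereas vcftools writes the matrix to <prefix>.012 alongside <prefix>.012.indv and <prefix>.012.pos.

```shell
# Toy VCF with three hypothetical samples (S1-S3) and two biallelic sites.
workdir=$(mktemp -d)
toy="$workdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\n'
  printf '1\t200\trs2\tC\tT\t50\tPASS\t.\tGT\t./.\t0/1\t0/0\n'
} > "$toy"

# Count non-reference alleles per genotype: 0, 1 or 2, and -1 if missing.
encoded=$(awk -F'\t' '
  /^#/ { next }                    # skip header lines
  {
    line = $1 ":" $2
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")            # GT is the first FORMAT subfield here
      n = split(f[1], a, /[\/|]/)  # alleles are separated by / or |
      code = 0
      for (j = 1; j <= n; j++) {
        if (a[j] == ".") { code = -1; break }
        if (a[j] != "0") code++
      }
      line = line " " code
    }
    print line
  }' "$toy")
printf '%s\n' "$encoded"

# If vcftools is installed, produce the real matrix from the same file.
if command -v vcftools >/dev/null 2>&1; then
  vcftools --vcf "$toy" --012 --out "$workdir/toy"
fi
```

For the toy file, the awk step prints "1:100 0 1 2" and "1:200 -1 1 0".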
DESCRIPTION
vcftools is a suite of command-line utilities for working with Variant Call Format (VCF) files, the standard format for storing genomic sequence variation. It can filter variants by quality score, allele frequency, or missing-data threshold; extract specific genomic regions or individuals; compare VCF files; convert to other formats; and calculate summary statistics such as allele frequencies, nucleotide diversity, and Fst. Filtering options and an output option are combined in a single invocation, and recoded output can be written to standard output with --stdout so that it can be piped into other tools. It is widely used in population genetics, medical genomics, and genetic association studies.
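
As a minimal sketch of combining a filter with recoding and piping the result onward, the following keeps only PASS sites and writes a gzipped VCF. The input and output file names are hypothetical; --stdout is the option the vcftools manual documents for writing to standard output. The vcftools call is guarded so that the snippet does nothing harmful when vcftools or the input file is absent.

```shell
in_vcf="input_file.vcf.gz"          # hypothetical input file
out_vcf="output_PASS_only.vcf.gz"   # hypothetical output file

# Collect the options once so the command can be inspected before it runs.
set -- --gzvcf "$in_vcf" --remove-filtered-all --recode --recode-INFO-all --stdout
preview="vcftools $*"
printf '%s\n' "$preview"

# Run only if vcftools is installed and the input exists.
if command -v vcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
  vcftools "$@" | gzip -c > "$out_vcf"
fi
```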
CAVEATS
While powerful, vcftools can be memory-intensive for extremely large VCF files, especially when performing certain global statistics. Its performance may vary significantly depending on the specific analyses and filters applied. Some advanced statistical calculations (e.g., Fst, population-specific analyses) require additional input files defining population assignments or sample groups. Users should ensure their VCF files adhere strictly to the VCF specification for reliable processing.
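
The population files needed for Fst are plain text lists of sample IDs. The sketch below creates two such files and passes each to --weir-fst-pop, the Fst option documented in the vcftools manual. The sample IDs, file names, and input VCF are hypothetical; the IDs must match the sample columns in the VCF header.

```shell
workdir=$(mktemp -d)

# One sample ID per line; these IDs are placeholders.
printf '%s\n' S1 S2 S3 > "$workdir/pop1.txt"
printf '%s\n' S4 S5 S6 > "$workdir/pop2.txt"

in_vcf="input_file.vcf"   # hypothetical input file

# Run only if vcftools is installed and the input exists.
if command -v vcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
  vcftools --vcf "$in_vcf" \
    --weir-fst-pop "$workdir/pop1.txt" \
    --weir-fst-pop "$workdir/pop2.txt" \
    --out "$workdir/pop1_vs_pop2"
  # Per-site estimates are written to pop1_vs_pop2.weir.fst
fi
```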
COMMON WORKFLOWS
vcftools is frequently used for quality control (e.g., removing low-quality variants or individuals, filtering by missing data or allele frequency), extracting subsets of data (e.g., specific chromosomes or regions), and calculating various population genetic statistics like allele frequencies, nucleotide diversity (Pi), and measures of population differentiation (Fst).
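
Missing-data filtering is a common source of confusion, because the value given to --max-missing is the minimum fraction of called genotypes a site must have. The sketch below computes that fraction with awk on a hypothetical four-sample toy VCF and lists the sites that would pass a threshold of 0.5. The awk step is an illustration of the rule, not vcftools' own code, and it treats any genotype containing "." as missing.

```shell
workdir=$(mktemp -d)
toy="$workdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n'
  printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/0\n'
  printf '1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t./.\t0/1\t./.\t0/0\n'
  printf '1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t./.\t./.\t./.\t0/1\n'
} > "$toy"

min_call_rate=0.5   # the value that would be passed to --max-missing

# Keep positions whose fraction of called genotypes is >= the threshold.
kept=$(awk -F'\t' -v min="$min_call_rate" '
  /^#/ { next }
  {
    called = 0
    total = NF - 9                      # sample columns start at field 10
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")
      if (f[1] !~ /\./) called++        # genotype has no missing allele
    }
    if (called / total >= min) print $2
  }' "$toy")
printf 'sites kept: %s\n' "$(echo $kept)"

# If vcftools is installed, apply the same threshold to the toy file.
if command -v vcftools >/dev/null 2>&1; then
  vcftools --vcf "$toy" --max-missing "$min_call_rate" \
    --recode --recode-INFO-all --out "$workdir/qc"
fi
```

Positions 100 (call rate 1.0) and 200 (call rate 0.5) pass; position 300 (call rate 0.25) is dropped.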
OUTPUT FILE NAMING
vcftools builds each output file name from the --out prefix plus a suffix that identifies the operation: allele frequencies are written to <prefix>.frq, Hardy-Weinberg equilibrium statistics to <prefix>.hwe, and a recoded VCF to <prefix>.recode.vcf. A log of the run is written to <prefix>.log.
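
A short sketch of how the prefix maps to file names, assuming a hypothetical input file and prefix. Besides the result file, vcftools also writes a run log to <prefix>.log.

```shell
in_vcf="input_file.vcf"   # hypothetical input file
prefix="chr1_analysis"    # hypothetical value for --out

# File names expected from a --freq run with this prefix.
freq_out="$prefix.frq"
log_out="$prefix.log"
printf '%s\n' "$freq_out" "$log_out"

# Run only if vcftools is installed and the input exists.
if command -v vcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
  vcftools --vcf "$in_vcf" --chr 1 --freq --out "$prefix"
  ls -l "$freq_out" "$log_out"
fi
```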
HISTORY
Developed as part of the 1000 Genomes Project, vcftools quickly became a widely adopted standard for initial processing and quality control of VCF files. Its development addressed the growing need for a flexible command-line toolkit to handle the rapidly increasing volume of genomic variation data. It remains actively maintained and widely used in population genetics and clinical genomics due to its comprehensive feature set and ease of integration into bioinformatics pipelines.
SEE ALSO
bcftools(1), plink(1), samtools(1), GATK (VariantFiltration)