csvstack
Vertically concatenate multiple CSV files
SYNOPSIS
csvstack [OPTIONS] FILE1 [FILE2 ...]
PARAMETERS
-h, --help
Show a help message and exit.
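-n GROUP_NAME, --group-name GROUP_NAME
Specify the name of the grouping column added to the output, for example 'year'. Defaults to 'group'. Has an effect only when a grouping column is being added.
-g GROUPS, --groups GROUPS
A comma-separated list of values to add as grouping factors, one per input file. The values are written to a new column in the output, so each row carries the label of the file it came from.
--filenames
Use the name of each input file as its grouping value. When given, -g is ignored.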
-K SKIP_LINES, --skip-lines SKIP_LINES
Skip the specified number of initial lines in each input file before the header row. Useful for skipping comments, copyright notices, or other leading metadata.
-d DELIMITER, --delimiter DELIMITER
Specify the character used to delimit fields in the input CSV files. Defaults to a comma.
-t, --tabs
Specify that the input files are tab-separated (TSV) instead of comma-separated.
-q QUOTECHAR, --quotechar QUOTECHAR
Specify the character used to quote fields containing special characters. Defaults to a double quote (").
-u {0,1,2,3}, --quoting {0,1,2,3}
Specify the quoting style used in the input CSV files: 0 = quote minimal, 1 = quote all, 2 = quote non-numeric, 3 = quote none.
-p ESCAPECHAR, --escapechar ESCAPECHAR
Specify the character used to escape the delimiter when --quoting 3 is given, and to escape the `QUOTECHAR` when --no-doublequote is given.
-l, --linenumbers
Insert a column of line numbers at the front of the output. Useful as a simple row identifier.
-e ENCODING, --encoding ENCODING
Specify the character encoding of the input files. Defaults to UTF-8.
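-z FIELD_SIZE_LIMIT, --maxfieldsize FIELD_SIZE_LIMIT
Specify the maximum length of a single field in the input CSV files.
-b, --no-doublequote
Specify that double quotes are not doubled inside quoted fields in the input CSV files.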
--zero
When interpreting or displaying column numbers, use zero-based numbering instead of the default 1-based numbering.
-H, --no-header-row
Specify that the input files have no header row. The first row of each file is treated as data and default column names (a, b, c, ...) are generated.
-v, --verbose
Print detailed tracebacks when errors occur.
DESCRIPTION
csvstack is a command-line utility from the csvkit suite that vertically concatenates (stacks) the rows of multiple CSV files into a single dataset. It can optionally add a grouping column that records where each row came from: pass one label per input file with -g, or use --filenames to label each row with the name of its input file. The column is named 'group' unless -n is given. No grouping column is added unless one of these options is used.
When input files have different columns, recent csvkit releases align them by name: the output header is the union of all column names, and any field that a given file does not provide is left empty in that file's rows. Older releases assume that every file has the same columns in the same order. Combined with a grouping column, this makes csvstack useful for consolidating data from sources with inconsistent schemas without losing track of where each row originated.
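The provenance column described above can be illustrated with a minimal Python sketch built on the standard library. This is not csvkit's implementation: the function name `stack_with_groups`, its `(label, csv_text)` input pairs, and the choice to place the label column first are assumptions made for the example, and every input is assumed to share the same header.

```python
import csv
import io


def stack_with_groups(sources, group_name):
    """Stack CSV texts that share a header, prefixing each row with a label.

    `sources` is a list of (label, csv_text) pairs. Both the function and
    its argument layout are hypothetical and exist only for illustration.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for label, text in sources:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        if not header_written:
            # Emit the header once, with the grouping column in front.
            writer.writerow([group_name] + header)
            header_written = True
        for row in reader:
            # Every data row carries the label of the input it came from.
            writer.writerow([label] + row)
    return out.getvalue()


stacked = stack_with_groups(
    [("2009", "name,total\nalice,3\n"), ("2010", "name,total\nbob,5\n")],
    group_name="year",
)
print(stacked)
```

Each output row starts with the label of its input, so the stacked data can still be filtered or grouped by origin afterwards.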
CAVEATS
When files with differing column sets are combined, the output header is the union of all column names, and columns missing from a file are left empty in the rows that come from it. Column names are compared as written, so 'ID' and 'id' produce two separate columns; check the output header to confirm that columns were matched as intended. csvstack does not infer or convert data types: values are copied through as text, so inconsistent formats for the same column across files (for example, differing date formats) are preserved rather than reconciled. With --no-header-row there are no names to match on, so columns can only be aligned by position. For large numbers of files or very large individual files, run time grows with the total input size, because every row must be read and rewritten.
COLUMN ALIGNMENT
When input files have different header rows, csvstack builds a combined header by taking the union of the column names across all input files. Each file's rows are then aligned to this combined header by column name. If an input file lacks a column that appears in another file, that column is left empty for the rows that come from that file.
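The alignment described above can be sketched in a few lines of Python using only the standard library. This is an illustration of the idea, not csvkit's code: `stack_union` is a hypothetical helper, it collects column names in first-seen order (an assumption), and it holds all rows in memory for simplicity.

```python
import csv
import io


def stack_union(texts):
    """Stack CSV texts whose headers may differ, aligning columns by name."""
    master = []  # union of column names, in first-seen order
    parsed = []  # one list of row dicts per input
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in master:
                master.append(name)
        parsed.append(list(reader))

    out = io.StringIO()
    # restval="" leaves a field empty when an input lacks that column.
    writer = csv.DictWriter(
        out, fieldnames=master, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for rows in parsed:
        writer.writerows(rows)
    return out.getvalue()


combined = stack_union(
    ["id,name\n1,alice\n", "id,email\n2,bob@example.com\n"]
)
print(combined)
```

The first input has no `email` column and the second has no `name` column, so each row has one empty field in the combined output.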
OUTPUT REDIRECTION
By default, csvstack writes its combined output to standard output (`stdout`). To save the result to a new file, you should redirect the output using your shell's redirection operator, for example: `csvstack file1.csv file2.csv > combined.csv`.
HISTORY
csvstack is a component of csvkit, a suite of command-line utilities for converting to and working with CSV files, originally developed by Christopher Groskopf. csvkit is written in Python and was created as an alternative to ad hoc shell scripting for common CSV tasks. It aims to simplify and standardize operations such as stacking, joining, and filtering CSV data, and it is widely used by analysts and data journalists.
SEE ALSO
csvjoin(1): Joins CSV files horizontally based on common key columns.
csvcut(1): Selects, reorders, or removes columns from CSV files.
csvsort(1): Sorts CSV files by one or more columns.
cat(1): A basic Unix command for concatenating files; it does not handle CSV headers or add grouping columns.
awk(1), sed(1): General-purpose text processing tools that can perform similar tasks with more complex scripting.


