csvstack

Vertically concatenate multiple CSV files

SYNOPSIS

csvstack [OPTIONS] FILE1 [FILE2 ...]

PARAMETERS

-h, --help
    Show a help message and exit.

-n GROUP_NAME, --group-name GROUP_NAME
    Name for the grouping column, for example 'year'. Has an effect only when a grouping column is requested with -g or --filenames. Defaults to 'group'.

-g GROUPS, --groups GROUPS
    A comma-separated list of values to add as grouping factors, one per input file, in the order the files are given. The values are written to a new column in the output.

--filenames
    Use the filename of each input file as its grouping value. When specified, -g is ignored.

-K SKIP_LINES, --skip-lines SKIP_LINES
    Skip the specified number of initial lines in each input file before the header row, useful for skipping comments or metadata.

-d DELIMITER, --delimiter DELIMITER
    Specify the character used to delimit fields in the input CSV files. Defaults to a comma.

-t, --tabs
    Specify that the input files are tab-separated (TSV) instead of comma-separated.

-q QUOTECHAR, --quotechar QUOTECHAR
    Specify the character used to quote fields containing special characters. Defaults to a double quote (").

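-u {0,1,2,3}, --quoting {0,1,2,3}
    Quoting style used in the input CSV files: 0 = quote minimal, 1 = quote all, 2 = quote non-numeric, 3 = quote none.

-p ESCAPECHAR, --escapechar ESCAPECHAR
    Specify the character used to escape the delimiter when --quoting 3 (quote none) is in effect, and to escape the `QUOTECHAR` when --no-doublequote is given.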

-l, --linenumbers
    Insert a column of line numbers at the front of the output. Useful when piping to grep or as a simple primary key.

-e ENCODING, --encoding ENCODING
    Specify the character encoding of the input files. Defaults to UTF-8.

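-z FIELD_SIZE_LIMIT, --maxfieldsize FIELD_SIZE_LIMIT
    Maximum length of a single field in the input CSV files.

-b, --no-doublequote
    Specify that quote characters inside quoted fields are not doubled in the input CSV files.

    Compressed input needs no dedicated flag: csvkit chooses a decompressor from the file extension (for example .gz or .bz2).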

--zero
    When interpreting or displaying column numbers, use zero-based numbering instead of the default 1-based numbering.

-S, --skipinitialspace
    Ignore whitespace immediately following the delimiter.

-H, --no-header-row
    Specify that the input files have no header row, so the first row of each file is treated as data. Default headers (a, b, c, ...) are created instead.

-v, --verbose
    Print detailed tracebacks when errors occur.

-V, --version
    Display version information and exit.

DESCRIPTION

csvstack is a command-line utility from the csvkit suite that vertically concatenates multiple CSV files into a single dataset: the header row is written once, followed by the rows of each input file in the order given. It can optionally add a grouping column that records where each row came from, using either values supplied with -g (one per file) or the input filenames with --filenames. The column is named 'group' unless -n is given.

Recent csvkit releases also accept input files whose columns differ in name or order: the output header is the union of all column names, and cells for columns absent from a given file are left empty. Older releases expect every file to have the same columns in the same order. This makes csvstack useful for consolidating data from several sources without losing track of where each row originated.
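To make the grouping column concrete, here is a minimal sketch in Python using only the standard `csv` module. It is not csvkit's implementation: the function name `stack_with_groups` and the sample data are hypothetical, and it assumes every input shares the same header.

```python
import csv
import io


def stack_with_groups(sources, groups, group_name):
    """Stack CSV texts that share a header, prefixing each row with a group value.

    `sources` is a list of CSV strings; `groups` holds one label per source.
    (Hypothetical helper for illustration, not part of csvkit.)
    """
    if len(groups) != len(sources):
        raise ValueError("need exactly one grouping value per input")

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, (text, group) in enumerate(zip(sources, groups)):
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        if index == 0:
            # Write the header once, with the grouping column in front.
            writer.writerow([group_name] + header)
        for row in reader:
            # Label every data row with the group value for its source.
            writer.writerow([group] + row)
    return out.getvalue()


stacked = stack_with_groups(
    ["name,score\nann,1\n", "name,score\nbob,2\n"],
    groups=["2019", "2020"],
    group_name="year",
)
print(stacked)
```

On the command line, the corresponding call would look something like `csvstack -g 2019,2020 -n year a.csv b.csv`, where `a.csv` and `b.csv` are placeholder file names.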

CAVEATS

When input files have differing column sets, the output contains the union of all column names, and columns missing from a file are left empty for that file's rows. csvstack does not infer or convert types: values are copied as text, so a column that is formatted differently across files (for example, dates in two styles) is not normalized. The number of values passed to -g must equal the number of input files. With many or very large files, run time grows with the total input size, since every row is read and rewritten.

COLUMN ALIGNMENT

In recent csvkit releases, csvstack handles input files with different header rows. It builds a combined header from the union of all column names across the input files, in order of first appearance, and aligns each file's rows to it. If an input file lacks a column that another file has, that column is left empty for the rows from that file in the final output.
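The alignment step can be sketched with Python's standard `csv` module. This is an illustration of the idea under stated assumptions, not csvkit's own code: `stack_union` and the sample data are hypothetical, and missing cells are written as empty strings.

```python
import csv
import io


def stack_union(sources):
    """Stack CSV texts whose headers differ, aligning rows to a union header.

    (Hypothetical helper for illustration, not part of csvkit.)
    """
    readers = [csv.DictReader(io.StringIO(text)) for text in sources]

    # Build the combined header: every distinct name, in order of first appearance.
    header = []
    for reader in readers:
        for name in reader.fieldnames or []:
            if name not in header:
                header.append(name)

    out = io.StringIO()
    # restval="" leaves a cell empty when a row's file lacks that column.
    writer = csv.DictWriter(out, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    for reader in readers:
        for row in reader:
            writer.writerow(row)
    return out.getvalue()


combined = stack_union(["id,name\n1,ann\n", "name,city\nbob,Oslo\n"])
print(combined)
```

Here the first input has no `city` column and the second has no `id` column, so each row carries an empty cell in the column its file lacks.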

OUTPUT REDIRECTION

By default, csvstack writes its combined output to standard output (`stdout`). To save the result to a new file, you should redirect the output using your shell's redirection operator, for example: `csvstack file1.csv file2.csv > combined.csv`.

HISTORY

csvstack is a component of csvkit, a collection of utilities for working with CSV files from the command line, created by Christopher Groskopf. csvkit is written in Python and, according to its documentation, was inspired by pdftk, GDAL, and the original csvcut tool by Joe Germuska and Aaron Bycoffe. It aims to simplify and standardize operations such as stacking, joining, and filtering CSV data, and is widely used by data journalists, data scientists, and analysts.

SEE ALSO

csvjoin(1): Joins CSV files horizontally based on common key columns.
csvcut(1): Selects, reorders, or removes columns from CSV files.
csvsort(1): Sorts CSV files by one or more columns.
cat(1): A basic Unix command for concatenating files, but does not handle CSV headers or add grouping columns.
awk(1), sed(1): General-purpose text processing tools that can perform similar tasks with more complex scripting.
