LinuxCommandLibrary

csvjoin

Join CSV files based on common columns

SYNOPSIS

csvjoin [OPTIONS] FILE1 [FILE2 ...]

PARAMETERS

-h, --help
    Show help message and exit.

-d DELIMITER, --delimiter DELIMITER
    The delimiting character of the input CSV files.

-t, --tabs
    Specify that the input CSV file is tab-separated.

-q QUOTECHAR, --quotechar QUOTECHAR
    Character used to quote fields in the input CSV files.

-u {0,1,2,3}, --quoting {0,1,2,3}
    Control quoting behavior. 0=QUOTE_MINIMAL, 1=QUOTE_ALL, 2=QUOTE_NONNUMERIC, 3=QUOTE_NONE.

-b, --doublequote
    Treat double quotes as delimiters.

-p ESCAPECHAR, --escapechar ESCAPECHAR
    Character used to escape the delimiter if --quoting 3 is used.

-z FIELD_SIZE_LIMIT, --field-size-limit FIELD_SIZE_LIMIT
    Maximum length of a field in the input CSV files.

-e ENCODING, --encoding ENCODING
    Specify the encoding of the input CSV files.

-L LOCALE, --locale LOCALE
    Specify the locale of the input CSV files.

-S, --skipinitialspace
    Skip spaces immediately following the delimiter.

--skip-header
    Skip the first line of the input CSV files. Useful if the header is not a header but part of the data.

--no-header-row
    Specify that the input CSV files do not have header rows.

--snifflimit SNIFFLIMIT
    Limit CSV dialect sniffing to the specified number of bytes.

--no-inference
    Disable type inference when parsing CSV files, treating all data as strings.

--zero
    Enable zero-based column indexing instead of one-based.

-v, --verbose
    Print more information to stderr during execution.

-H, --linenumbers
    Output line numbers as a new column.

-V, --version
    Show program's version number and exit.

-c COLUMNS, --columns COLUMNS
    A comma-separated list of column names or 1-based indices to join on. When joining more than two files, the columns apply sequentially (e.g., first column for FILE1-FILE2, second for (FILE1+FILE2)-FILE3).

--left
    Perform a left outer join. Returns all rows from the left file, and matched rows from the right.

--outer
    Perform a full outer join. Returns all rows when there is a match in either file.

--inner
    Perform an inner join (default). Returns rows only when there is a match in both files.

--right
    Perform a right outer join. Returns all rows from the right file, and matched rows from the left.

--keep-extra-columns
    Keep extra columns from the right-hand file(s) even if they have the same name as columns in the left-hand file(s). Duplicate names will be prefixed.

DESCRIPTION

csvjoin is a powerful command-line utility from the csvkit suite that enables users to perform SQL-style join operations on two or more CSV files. It merges rows from multiple CSVs based on common column values, acting much like a relational database system but directly on flat files.

Users can specify the columns to join on, choose different join types (inner, outer, left, right), and handle files with or without header rows. This tool simplifies complex data integration tasks, making it accessible to data analysts and developers working with CSV data on the command line. It's particularly useful for combining datasets from various sources into a unified view for further analysis or reporting.

CAVEATS

Performance and Memory: For extremely large CSV files, csvjoin can be memory-intensive as it often loads files into memory for processing. This can become a bottleneck, potentially impacting performance or causing out-of-memory errors.

Column Name Conflicts: By default, csvjoin will drop columns with duplicate names from subsequent files unless --keep-extra-columns is used. When used, duplicate column names from the right-hand file(s) will be prefixed to ensure uniqueness (e.g., FILE2_column_name). Users must be mindful of naming conventions or use this flag carefully.

Strict Key Matching: Joins require exact matches on specified key columns. Differences in case, leading/trailing spaces, or variations in data types (if not handled by csvkit's inference or explicit settings) will prevent matches and lead to missing data in the joined output.

JOIN TYPES EXPLAINED

csvjoin supports several types of SQL-style joins, allowing flexible data merging:

Inner Join (default): Returns rows only when there is a match in the join columns of both CSV files. Non-matching rows from either file are excluded.

Left Outer Join (--left): Returns all rows from the 'left' (first specified) CSV file, and the matched rows from the 'right' (subsequent) CSV file. If there is no match in the right CSV, NULL (empty) values are used for columns from the right file.

Right Outer Join (--right): Returns all rows from the 'right' (subsequent) CSV file, and the matched rows from the 'left' (first specified) CSV file. If there is no match in the left CSV, NULL (empty) values are used for columns from the left file.

Full Outer Join (--outer): Returns all rows when there is a match in one of the CSV files. If there is no match for a row in either the left or right file, NULL (empty) values are used for the columns from the non-matching file.

HISTORY

csvjoin is an integral part of the csvkit suite of command-line tools, a project originally conceived and developed by Christopher Groskopf. The initiative for csvkit began around 2011 with a clear objective: to provide an easy-to-use, powerful, and Unix-philosophy-compliant set of utilities specifically designed for manipulating CSV data.

The development of csvkit and csvjoin was driven by a recognized need to bridge the gap between basic Unix text processing tools (such as cut, sort, and the traditional join command) and more complex database or scripting solutions. It offers SQL-like capabilities directly on flat CSV files, making sophisticated data wrangling tasks accessible to data analysts, journalists, and developers who work extensively with CSV data on the command line. Its ongoing development reflects a commitment to providing robust and user-friendly tools in an era where CSV remains a ubiquitous data format.

SEE ALSO

csvcut(1), csvgrep(1), csvstack(1), join(1), awk(1), cut(1)

Copied to clipboard