join
Join lines of two files on a common field
TLDR
Join two files on the first (default) field
Join two files using a comma (instead of a space) as the field separator
Join field3 of file1 with field1 of file2
Produce a line for each unpairable line in file1
Join a file from stdin
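The tasks listed above could be carried out as in the sketch below. The file names (file1, file2, file1.csv, file2.csv, file3) and their contents are hypothetical sample data, created in a scratch directory so that the commands can be run as written.

```shell
# Hypothetical sample data; every file is already sorted on its join field.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '1 apple\n2 banana\n3 cherry\n' > file1
printf '1 red\n3 dark\n4 green\n'      > file2
printf '1,apple\n2,banana\n'           > file1.csv
printf '1,red\n2,yellow\n'             > file2.csv
printf 'x y 1\nx y 3\n'                > file3

# Join two files on the first (default) field
join file1 file2

# Join two files using a comma (instead of a space) as the field separator
join -t ',' file1.csv file2.csv

# Join field 3 of the first file with field 1 of the second file
join -1 3 -2 1 file3 file2

# Produce a line for each unpairable line in file1
join -a 1 file1 file2

# Join a file from stdin ("-" stands for standard input)
join - file2 < file1
```

With this data, the first command prints the lines for keys 1 and 3 only, while the -a 1 variant additionally prints "2 banana".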
SYNOPSIS
join [OPTION]... FILE1 FILE2
PARAMETERS
-a FILENUM
In addition to the default output, print a line for each unpairable line in file FILENUM (1 or 2).
-e EMPTY
Replace empty output fields with the string EMPTY.
-i, --ignore-case
Ignore differences in case when comparing fields.
-j FIELD
Equivalent to -1 FIELD -2 FIELD, joining on the specified FIELD in both files.
-o FORMAT
Construct each output line according to FORMAT, a comma- or blank-separated list of field specifications. Each specification is either FILENUM.FIELD (e.g., '1.1 2.2' for the first field of file 1 and the second field of file 2) or 0, which stands for the join field.
-t CHAR
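Use CHAR as the input and output field separator. By default, input fields are separated by runs of blanks (leading blanks are ignored) and output fields are separated by a single space.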
-v FILENUM
Like -a FILENUM, but suppress paired output lines. Only unpairable lines from FILENUM are printed.
-1 FIELD
Join on the specified FIELD of FILE1.
-2 FIELD
Join on the specified FIELD of FILE2.
--check-order
Check that the input files are correctly sorted on the join field, even if all lines are pairable. Exit with an error if they are not.
--nocheck-order
Do not check that input files are sorted. This can lead to incorrect output if files are unsorted.
--header
Treat the first line of each file as field headers, which are not compared for joining but are printed as part of the output.
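As a rough illustration of two of the options above, the sketch below uses -v to list lines that have no partner and --header to carry column headings through the join. The file names and contents are hypothetical, and --header is assumed to be available (it is a GNU extension).

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '1 alice\n2 bob\n3 carol\n' > users
printf '1 Oslo\n3 Lima\n4 Rome\n'  > cities

# -v 1: only the lines of the first file that have no match ("2 bob")
join -v 1 users cities

# -v 1 -v 2: the unpairable lines of both files, without any paired output
join -v 1 -v 2 users cities

# --header: the first line of each file is joined as a heading, not compared
printf 'id name\n1 alice\n3 carol\n' > users_h
printf 'id city\n1 Oslo\n3 Lima\n'   > cities_h
join --header users_h cities_h
```

Using -v with one file number behaves like an anti-join: it answers "which keys in this file are missing from the other one".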
DESCRIPTION
The join command performs an operation similar to a relational database join on two text files: it merges lines from the two files that share a common field, called the join field or key. By default, fields are separated by blanks (spaces and tabs), leading blanks are ignored, and the first field of each file is the join key. Both input files must be sorted on the join field, in the collation order of the current locale (the order that sort produces); otherwise join may produce incorrect or incomplete output. For every pair of lines with equal keys, one from FILE1 and one from FILE2, join prints one output line: the join field, then the remaining fields of the FILE1 line, then the remaining fields of the FILE2 line. A key that occurs on several lines in both files therefore yields every combination of those lines. Lines whose key appears in only one file are not printed by default. Options allow inclusion of unmatched lines, custom delimiters, specific join fields, and custom output formatting.
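The pairing behavior is easiest to see with repeated keys. In the sketch below, the files left and right are hypothetical: key a occurs twice in each file, while keys b and c occur in only one file.

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'a 1\na 2\nb 3\n' > left
printf 'a x\na y\nc z\n' > right

# Every "a" line of left is paired with every "a" line of right;
# the unmatched keys b and c produce no output.
join left right
```

This prints four lines (a 1 x, a 1 y, a 2 x, a 2 y), which mirrors how an inner join in a database treats duplicate keys.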
CAVEATS
Sorted Input is Crucial: The most common pitfall is failing to sort both input files on the join key. With unsorted input, join can produce incorrect or incomplete results. GNU join reports the problem by default only when it also encounters unpairable lines, so a bad sort order can go unnoticed; --check-order makes the check unconditional. Use the sort command beforehand to prepare your files.
Default Delimiter: By default, join treats a run of blanks as a single delimiter and ignores leading blanks. If fields may be empty, or every single space must count as a separator, specify -t ' ' explicitly. If the fields themselves contain spaces, use a different delimiter, such as a tab or a comma, with -t.
Handling Missing Fields: When using -o FORMAT, if a specified field doesn't exist on a line, join will output an empty string by default. This can be controlled with the -e option.
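One way to combine -o with -e is a full outer join, sketched below. The files stock and prices and the placeholder string NA are assumptions chosen for illustration.

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '1 apple\n2 banana\n' > stock
printf '2 0.25\n3 0.40\n'    > prices

# -a 1 -a 2 keeps unpairable lines from both files, -o fixes the column
# layout (0 is the join field), and -e fills the missing fields with NA.
join -a 1 -a 2 -e NA -o 0,1.2,2.2 stock prices
```

Using 0 rather than 1.1 for the key column matters here: for a line that exists only in prices, 1.1 would be missing and would be replaced by NA as well.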
DEFAULT BEHAVIOR
By default, join uses the first field of each file as the join key and treats any run of blanks (spaces and tabs) as a single field delimiter, ignoring leading blanks. With no options, it prints one line for each pair of input lines with matching keys: the join field, followed by the remaining fields from FILE1, then the remaining fields from FILE2.
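The effect of the default blank splitting shows up as soon as a field contains a space. The sketch below uses hypothetical tab-separated files and asks for the second field of the first file, once with the default delimiter and once with an explicit tab.

```shell
# Hypothetical tab-separated sample data in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '1\tNew York\n2\tSan Jose\n' > cities.tsv
printf '1\tUSA\n2\tCosta Rica\n'    > countries.tsv
tab=$(printf '\t')

# Default splitting: "New York" counts as two fields, so 1.2 is "New"
join -o 1.2 cities.tsv countries.tsv

# Explicit tab delimiter: 1.2 is the whole "New York" field
join -t "$tab" -o 1.2 cities.tsv countries.tsv
```

Storing the tab in a variable is only one way to pass it; it avoids relying on shell-specific quoting such as $'\t'.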
SORTING FILES FOR JOIN
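It is critical to sort both input files on the join key before passing them to join. Restrict the sort key to the join field itself (-k 1b,1 rather than -k1, which would sort on everything from field 1 to the end of the line); the b modifier ignores leading blanks, as join does. For example, to join on the first field:
sort -k 1b,1 file1.txt > file1_sorted.txt
sort -k 1b,1 file2.txt > file2_sorted.txt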
join file1_sorted.txt file2_sorted.txt
If joining on a different field, say the third:
sort -k 3b,3 file1.txt > file1_sorted.txt
sort -k 3b,3 file2.txt > file2_sorted.txt
join -j 3 file1_sorted.txt file2_sorted.txt
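The sort step can also feed join directly through a pipe. The sketch below assumes hypothetical unsorted files, pins one locale for both commands so that they agree on the collation order, and then uses the GNU --check-order option to show that the unsorted original is rejected.

```shell
# Hypothetical unsorted sample data in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '3 cherry\n1 apple\n2 banana\n' > file1.txt
printf '2 yellow\n3 red\n1 green\n'    > file2.txt

# Use one locale for sort and join so both use the same ordering.
export LC_ALL=C

# Sort the second file to disk and stream the sorted first file on stdin.
sort -k 1b,1 file2.txt > file2_sorted.txt
sort -k 1b,1 file1.txt | join - file2_sorted.txt

# --check-order (GNU): out-of-order input is a fatal error.
if join --check-order file1.txt file2_sorted.txt >/dev/null 2>&1; then
    echo "file1.txt accepted"
else
    echo "file1.txt rejected: not sorted"
fi
```

Setting LC_ALL=C is a design choice for reproducibility; any locale works as long as sort and join run under the same one.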
OUTPUT FORMAT CUSTOMIZATION
The -o FORMAT option provides fine-grained control over the output. Fields are referenced as FILENUM.FIELDNUM (e.g., '1.3' for the third field of the first file). You can specify the order and inclusion of fields. For instance, to output the join key, the second field of file 1, and the third field of file 2:
join -o '1.1 1.2 2.3' file1 file2
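To see the format specification applied to actual data, the sketch below creates two hypothetical three-column files and runs the same command on them.

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '1 alice admin\n2 bob user\n' > file1
printf '1 x Oslo\n2 y Lima\n'        > file2

# Keep the key, the second field of file1, and the third field of file2;
# the third field of file1 and the second field of file2 are dropped.
join -o '1.1 1.2 2.3' file1 file2
```

Here 1.1 doubles as the join key because the join uses the default first field; with a different join field, 0 refers to the key regardless of which column it is in.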
HISTORY
The join command is a long-standing Unix utility, part of the core toolset since early versions of Unix. Its design reflects the philosophy of providing focused text processing tools that can be chained together, and its relational-join functionality shows how early Unix systems offered sophisticated data manipulation through command-line tools. It is specified by POSIX (Portable Operating System Interface), which ensures its widespread availability and consistent behavior across Unix-like operating systems.