LinuxCommandLibrary

comm

Compare two sorted files line by line

TLDR

Produce three tab-separated columns: lines only in first file, lines only in second file and common lines

$ comm [file1] [file2]

Print only lines common to both files
$ comm -12 [file1] [file2]

Print only lines common to both files, reading one file from stdin
$ cat [file1] | comm -12 - [file2]

Get lines only found in first file, saving the result to a third file
$ comm -23 [file1] [file2] > [file1_only]

Print lines only found in second file, when the files aren't sorted
$ comm -13 <(sort [file1]) <(sort [file2])

SYNOPSIS

comm [OPTION]... FILE1 FILE2

PARAMETERS

-1
    Suppress printing of column 1 (lines unique to FILE1).

-2
    Suppress printing of column 2 (lines unique to FILE2).

-3
    Suppress printing of column 3 (lines common to both files).

--output-delimiter=STR
    Separate columns with STR. The default is a single tab character.

--nocheck-order
    Do not check that the input is correctly sorted. With this option comm prints no warning, and the output may be incorrect if the files are not sorted.

--zero-terminated, -z
    Line delimiter is NUL, not newline. Input lines can contain newlines.

--help
    Display help message and exit.

--version
    Output version information and exit.
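
As a small illustration of the -z option described above, the following sketch compares NUL-terminated records, one of which contains an embedded newline. The file names x.bin and y.bin and their contents are hypothetical, and the example assumes a GNU comm that supports -z.

```shell
# Work in a throwaway directory so no existing files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Each record ends with NUL; the first record of each file is "a<newline>b".
printf 'a\nb\0c\0' > x.bin
printf 'a\nb\0d\0' > y.bin

# Print only the common records, then turn NUL into "|" to make the result visible.
common=$(LC_ALL=C comm -z -12 x.bin y.bin | tr '\0' '|')
printf '%s\n' "$common"
```

The record with the embedded newline is reported as a single common entry, which would not be possible with newline-terminated input.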

DESCRIPTION

The comm command compares two already sorted files, line by line. It outputs three columns by default: lines unique to the first file, lines unique to the second file, and lines common to both files.

This utility is useful for finding the differences and the overlap between two datasets, such as two lists of names or paths. Because comm reads both files in a single pass, both inputs must be sorted in the same collating sequence (the same locale); otherwise comm may produce incorrect or incomplete output. The options -1, -2, and -3 suppress the corresponding columns and can be combined, so -12 prints only the common lines and -23 prints only the lines unique to the first file. One of the two files may be given as a hyphen (-) to read it from standard input.
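
The behaviour described above can be sketched with two small, already sorted files. The names a.txt and b.txt and their contents are assumptions made for this illustration only.

```shell
# Work in a throwaway directory so no existing files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Two small, already sorted sample files (hypothetical names and contents).
printf '%s\n' apple banana cherry > a.txt
printf '%s\n' banana cherry date > b.txt

# Default output: column 1 is not indented, column 2 is preceded by
# one tab, and column 3 by two tabs.
all=$(comm a.txt b.txt)
printf '%s\n' "$all"

# Suppress columns 1 and 2 to keep only the lines common to both files.
common=$(comm -12 a.txt b.txt)
printf '%s\n' "$common"
```

In this sketch "apple" lands in column 1, "date" in column 2, and "banana" and "cherry" in column 3.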

CAVEATS

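Both input files must be sorted, and sorted in the same collating sequence that comm itself uses, for the output to be reliable. GNU comm prints a warning on standard error when it detects out-of-order input (unless --nocheck-order is given), but the output it produces in that case is still likely to be incorrect.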
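
One way to satisfy the sorting requirement without process substitution is to sort each input into a new file first. The sketch below uses hypothetical file names and sets LC_ALL=C for both sort and comm, which is one possible choice for keeping the collating sequence identical.

```shell
# Work in a throwaway directory so no existing files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Unsorted sample inputs (hypothetical names and contents).
printf '%s\n' pear Apple fig > first.txt
printf '%s\n' fig kiwi Apple > second.txt

# Sort both inputs under the same locale that comm will run under.
LC_ALL=C sort first.txt > first.sorted
LC_ALL=C sort second.txt > second.sorted

# Lines that appear only in the second file.
only_second=$(LC_ALL=C comm -13 first.sorted second.sorted)
printf '%s\n' "$only_second"
```

Sorting under one locale and comparing under another is a common cause of "not in sorted order" warnings, so the locale is fixed explicitly here.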

comm compares whole lines, and GNU comm has no option to ignore whitespace or letter case. Lines that differ only in leading or trailing whitespace, or only in case, are treated as different unless the input is normalized before comparison.
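
A minimal sketch of normalizing input before comparison follows. The helper function normalize and the file names are hypothetical; lowercasing and trimming are only one possible normalization, and the right one depends on the data.

```shell
# Work in a throwaway directory so no existing files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample inputs that differ in case and trailing whitespace (hypothetical contents).
printf '%s\n' 'Alice ' 'bob' 'Carol' > left.txt
printf '%s\n' 'alice' 'BOB' 'dave' > right.txt

# normalize (hypothetical helper): lowercase the text, strip leading and
# trailing whitespace from each line, then sort in the C locale.
normalize() {
    tr '[:upper:]' '[:lower:]' < "$1" |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        LC_ALL=C sort
}

normalize left.txt > left.norm
normalize right.txt > right.norm

# Lines common to both files after normalization.
common=$(LC_ALL=C comm -12 left.norm right.norm)
printf '%s\n' "$common"
```

Without the normalization step, comm would report no common lines for these two inputs.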

It is not designed for comparing unsorted files or for complex diffing scenarios (like showing line changes within a block), for which diff is more appropriate.

COLUMN STRUCTURE

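By default, comm outputs three columns: Column 1 contains lines found only in FILE1; Column 2 contains lines found only in FILE2; and Column 3 contains lines found in both FILE1 and FILE2. Columns are distinguished by leading tab characters (configurable with --output-delimiter): a line in column 2 is preceded by one delimiter and a line in column 3 by two. Suppressing a column with -1, -2, or -3 also removes its delimiter from the remaining columns.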
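
The column layout can be made visible by replacing the tab with a printable delimiter. This sketch assumes GNU comm (for --output-delimiter) and uses hypothetical file names and contents.

```shell
# Work in a throwaway directory so no existing files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Two small, already sorted sample files (hypothetical names and contents).
printf '%s\n' apple banana > a.txt
printf '%s\n' banana cherry > b.txt

# Use a comma instead of a tab: column 2 gets one leading comma, column 3 gets two.
csv=$(comm --output-delimiter=, a.txt b.txt)
printf '%s\n' "$csv"

# With column 1 suppressed, column 2 is no longer indented and
# column 3 is preceded by a single tab.
no_col1=$(comm -1 a.txt b.txt)
printf '%s\n' "$no_col1"
```

The number of leading delimiters, not a fixed field position, identifies the column, which matters when the output is parsed by another program.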

STANDARD INPUT USAGE

One of the input files can be specified as a hyphen (-), which indicates that comm should read from standard input for that file. For example, comm file1 - reads the content for the second file from standard input.
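
As a sketch of the standard input form, the following pipeline sorts an unsorted stream on the fly and passes it to comm as FILE1. The file name known.txt and the sample words are hypothetical.

```shell
# Work in a throwaway directory so no existing files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# A sorted reference list (hypothetical name and contents).
printf '%s\n' alpha beta gamma > known.txt

# Sort the incoming stream, then read it as FILE1 via "-".
# -23 keeps only the lines that are in the stream but not in known.txt.
new=$(printf '%s\n' delta beta alpha | LC_ALL=C sort | LC_ALL=C comm -23 - known.txt)
printf '%s\n' "$new"
```

This pattern avoids a temporary file for the streamed side while still meeting the sorted-input requirement.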

HISTORY

comm is a long-standing Unix utility and is specified by POSIX. The implementation found on most Linux systems is part of the GNU Core Utilities, which adds long options such as --output-delimiter and --nocheck-order. Because both inputs are sorted, the comparison needs only a single pass over each file.

SEE ALSO

sort(1), uniq(1), diff(1), join(1)
