LinuxCommandLibrary

datamash

Perform basic numeric/string grouping calculations

TLDR

Get max, min, mean, and median of a single column of numbers

$ seq 3 | datamash max 1 min 1 mean 1 median 1

Get the mean of a single column of floats (datamash follows the locale's decimal mark, so the tr step converts "." to "," for locales that use ",")
$ echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7' | tr '.' ',' | datamash mean 1

Get the mean of a single column of numbers, rounded to a given number of decimal places (here, 3)
$ echo -e '1\n2\n3\n4\n5\n5' | datamash --round 3 mean 1

Get the mean of a single column of numbers ignoring "Na" and "NaN" (literal) strings
$ echo -e '1\n2\nNa\n3\nNaN' | datamash --narm mean 1

SYNOPSIS

datamash [OPTION...] [OPERATION...] [FILE...]

PARAMETERS

-t, --field-separator=CHAR
    Use CHAR as the input field delimiter (default: TAB)

-g, --group=X[,Y,Z]
    Group input by the given field(s); multiple keys are comma-separated

--output-delimiter=CHAR
    Use CHAR as the output field delimiter (default: same as the input delimiter)

-s, --sort
    Sort input before grouping (removes the need to pipe through sort(1) first)

--narm
    Skip NA/NaN values when computing ('NA removal')

-H, --headers
    Same as '--header-in --header-out'

--header-in
    First input line is column headers; do not treat it as data

--header-out
    Print header line in output

-R, --round=N
    Round numeric output to N decimal places

-f, --full
    Print the entire input line before the operation results (default: print only grouping keys and results)

-W, --whitespace
    Use whitespace (one or more spaces and/or tabs) as field separators

--no-strict
    Allow lines with a varying number of fields

--format=FORMAT
    Print numeric values with the given printf-style floating-point FORMAT

-h, --help
    Display help and exit

-V, --version
    Output version info and exit

DESCRIPTION

Datamash is a command-line utility for performing statistical computations on tabular data. It reads input from files or standard input, treating lines as records and tab-separated (or custom-delimiter) fields within them. Users specify operations such as sum, mean, min, max, median, and standard deviation, optionally grouping results by one or more key fields.

It excels at aggregating data efficiently without needing scripting languages like awk or Perl. For example, it can compute the average of column 2 grouped by column 1 in a single invocation. It supports pre-sorting input, header handling, and numeric rounding, making it well suited to data-analysis pipelines, CSV/TSV processing, and quick statistics on logs or datasets.
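The group-by average just mentioned can be sketched like this (tab-separated sample data invented for illustration; assumes GNU datamash is installed):

```shell
# Mean of column 2 for each distinct key in column 1.
# Fields are tab-separated, datamash's default delimiter.
command -v datamash >/dev/null 2>&1 || exit 0  # skip quietly if not installed

printf 'A\t1\nA\t3\nB\t5\n' | datamash -g 1 mean 2
```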

Unlike full-featured tools, datamash focuses on speed and simplicity for common aggregations, handling large inputs via streaming.

CAVEATS

Does not parse quoted CSV fields; assumes uniform delimiters. Grouping merges only consecutive runs of identical keys, so input must be pre-sorted by the group key unless -s is used. Limited to predefined operations; custom expressions require awk or similar.
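The sorted-input caveat can be seen directly (invented sample data; assumes GNU datamash is installed):

```shell
command -v datamash >/dev/null 2>&1 || exit 0  # skip quietly if not installed

# Without -s, only consecutive identical keys are merged,
# so the key B appears twice in the output:
printf 'B\t1\nA\t2\nB\t3\n' | datamash -g 1 sum 2

# With -s, input is sorted first and each key yields one row:
printf 'B\t1\nA\t2\nB\t3\n' | datamash -s -g 1 sum 2
```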

COMMON OPERATIONS

sum FIELD, mean FIELD, min FIELD, max FIELD, median FIELD, count FIELD, countunique FIELD, unique FIELD, collapse FIELD (comma-joined values), sstdev FIELD (sample std dev), pstdev FIELD (population std dev), first FIELD, last FIELD.
Example: datamash -g 1 mean 2
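Multiple operations can be given in one invocation; each appends an output column per group (sample data invented; assumes GNU datamash is installed):

```shell
command -v datamash >/dev/null 2>&1 || exit 0  # skip quietly if not installed

# For each key in column 1: row count, sum, and maximum of column 2.
printf 'A\t1\nA\t3\nB\t5\nB\t7\n' | datamash -g 1 count 2 sum 2 max 2
```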

EXAMPLE USAGE

datamash sum 2 < data.txt sums column 2 of a tab-separated file.
datamash -t, -g 1 -s count 1 < csvfile.csv sorts a CSV by column 1 and counts the rows in each group.
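A related sketch distinguishing count from countunique (illustrative CSV without a header; assumes GNU datamash is installed):

```shell
command -v datamash >/dev/null 2>&1 || exit 0  # skip quietly if not installed

# Column 1 is the group key; count gives rows per group,
# countunique gives distinct values of column 2 per group.
printf 'a,x\na,x\na,y\nb,z\n' | datamash -t, -g 1 count 2 countunique 2
```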

HISTORY

Developed by Assaf Gordon; first released in 2014 as a GNU project to provide efficient, standalone statistics without scripting dependencies. It has since been packaged by most major distributions.

SEE ALSO

awk(1), cut(1), paste(1), sort(1), uniq(1)
