LinuxCommandLibrary

datamash

Perform basic numeric/string grouping calculations

TLDR

Get max, min, mean and median of a single column of numbers

$ seq 3 | datamash max 1 min 1 mean 1 median 1

Get the mean of a single column of float numbers (the decimal separator follows the current locale; this example converts "." to "," for a locale that uses a decimal comma)
$ echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7' | tr '.' ',' | datamash mean 1

Get the mean of a single column of numbers with a given decimal precision
$ echo -e '1\n2\n3\n4\n5\n5' | datamash -R number_of_decimals_wanted mean 1

Get the mean of a single column of numbers ignoring "Na" and "NaN" (literal) strings
$ echo -e '1\n2\nNa\n3\nNaN' | datamash --narm mean 1
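The decimal-separator and rounding entries above can be tried without depending on the system locale. The following is a minimal sketch, assuming GNU datamash is installed and supports -R; the sample numbers are made up for illustration.

```shell
# LC_ALL=C forces '.' as the decimal separator for this one command.
mean_c=$(printf '1.0\n2.5\n' | LC_ALL=C datamash mean 1)
printf '%s\n' "$mean_c"

# -R 2 rounds the numeric output to two decimal places.
mean_rounded=$(printf '1\n2\n2\n' | LC_ALL=C datamash -R 2 mean 1)
printf '%s\n' "$mean_rounded"
```

Setting LC_ALL=C only for the datamash process keeps the rest of the shell session in its usual locale.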

SYNOPSIS

datamash [OPTIONS] operation [field...]
Example: datamash -s -g 1 sum 2 < input.txt
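As a rough illustration of the synopsis example, the sketch below feeds a small made-up table through the same command. The printf pipe stands in for input.txt, and the keys and values are hypothetical.

```shell
# Hypothetical input: tab-separated "key<TAB>value" lines, not sorted by key.
# -s sorts on the group field, -g 1 groups on field 1, sum 2 adds up field 2.
grouped=$(printf 'b\t10\na\t1\nb\t5\na\t2\n' | datamash -s -g 1 sum 2)
printf '%s\n' "$grouped"
```

Each output line holds one group key followed by the sum of field 2 for that group.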

PARAMETERS

-h, --help
    Show help message and exit.

-V, --version
    Show version information and exit.

-t, --field-separator=X
    Use X instead of TAB as field delimiter.

-W, --whitespace
    Use whitespace (one or more spaces/tabs) for field delimiters. Overrides -t.

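-d, --decimal-separator=X
    This option does not appear in the option list of the GNU datamash manual; check datamash --help for your version before relying on it. The decimal separator is normally taken from the current locale (for example, ',' in many European locales).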

-s, --sort
    Sort the input before grouping. Necessary when input is not pre-sorted on the group fields.

-H, --headers
    Treat the first input line as column headers and print a header line in the output. Same as --header-in --header-out.

-i, --ignore-case
    Ignore case when comparing strings (grouping or string operations).

-f, --full
    Print the entire input line before the operation results. By default only the grouping keys are printed.

-g, --group=X[,Y,Z]
    Group via fields X, Y, Z. Can be given multiple times.

operations (e.g., sum, min, max, mean, median, count, first, last...)
    Operation to perform. Field numbers specify which fields to use in the calculation. Check manual page for full operations list.
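A few of the delimiter options can be combined as in the sketch below. It assumes GNU datamash is installed; --header-in (read a header line without printing one) is taken from the GNU manual rather than from the list above, and the sample data is made up.

```shell
# -t, : comma-separated fields. --header-in : consume the header line on input.
# The input is already sorted on field 1, so -s is not needed here.
csv_means=$(printf 'name,score\nann,3\nann,4\nbob,5\n' \
  | LC_ALL=C datamash -t, --header-in -g 1 mean 2)
printf '%s\n' "$csv_means"

# -W : fields separated by runs of spaces and/or tabs.
ws_sum=$(printf 'x   1\nx 2\n' | datamash -W -g 1 sum 2)
printf '%s\n' "$ws_sum"
```

The first command reports the mean score per name; the second sums field 2 for the single key x.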

DESCRIPTION

Datamash is a command-line tool that performs basic numeric, textual, and statistical computations on tabular text input (TAB-separated by default). It focuses on grouped aggregations: groups are defined by one or more input fields, and each operation is computed once per group. Supported operations include sum, min, max, mean, median, count, first, and last; the manual page lists the rest. General-purpose tools such as awk or R can do the same work, but Datamash expresses common summaries as a single short command, which makes it convenient for reporting, quick data checks, and shell pipelines over delimited text files.
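To make the per-group behaviour concrete, here is a small sketch that applies several operations to the same field in one call. It assumes GNU datamash is installed; the two groups and their values are invented for the example.

```shell
# Input is already sorted on field 1, so no -s is required.
# For each key, report count, min, max and mean of field 2.
per_group=$(printf 'a\t1\na\t3\nb\t10\n' \
  | LC_ALL=C datamash -g 1 count 2 min 2 max 2 mean 2)
printf '%s\n' "$per_group"
```

Each output line starts with the group key, followed by one column per requested operation in the order given on the command line.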

CAVEATS

Datamash expects consistently formatted input. A non-numeric value in a field used by a numeric operation is reported as an error. Grouping combines only consecutive lines that share the same key, so input that is not already sorted on the group fields must be sorted first, either with -s or with sort(1); sorting a very large input adds to the run time. Datamash covers basic summaries and may not be suitable for complex statistical analyses.
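The effect of unsorted input on grouping can be seen in a short sketch like the following, assuming GNU datamash is installed. The three-line table is hypothetical.

```shell
# Key "a" appears on two non-adjacent lines.
data=$(printf 'a\t1\nb\t2\na\t3\n')

# Without -s, the two "a" lines are treated as separate groups.
unsorted=$(printf '%s\n' "$data" | datamash -g 1 sum 2)
printf 'without -s:\n%s\n' "$unsorted"

# With -s, the input is sorted first and each key forms one group.
sorted=$(printf '%s\n' "$data" | datamash -s -g 1 sum 2)
printf 'with -s:\n%s\n' "$sorted"
```

Comparing the two outputs is a quick way to check whether a data set needs -s before grouping.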

EXIT STATUS

Datamash returns 0 on success, and non-zero on errors, such as incorrect command-line options, missing input data, or invalid data types.
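In a script, the exit status can be checked as sketched below. This assumes GNU datamash is installed; the exact non-zero value and the error message text are not specified here and may vary between versions.

```shell
# Valid numeric input: expected to succeed.
printf '1\n2\n' | datamash sum 1 >/dev/null 2>&1
ok_status=$?

# A non-numeric value in a field used by a numeric operation: expected to fail.
printf '1\nabc\n' | datamash sum 1 >/dev/null 2>&1
bad_status=$?

printf 'valid input: %s, invalid input: %s\n' "$ok_status" "$bad_status"
```

Capturing $? immediately after the pipeline matters, because any later command overwrites it.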

FIELD NUMBERS

Field numbers are 1-based, not 0-based. Field numbers can be repeated in an operation to do multiple calculations against a field. Field ranges such as 2-4 are also supported.
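The field-number rules above might be exercised as in this sketch, which assumes a GNU datamash version with field-range support. The three-column table is made up.

```shell
# Field 1 is used twice (sum and mean); "sum 2-3" applies sum to fields 2 and 3.
fields=$(printf '1\t10\t100\n2\t20\t200\n' \
  | LC_ALL=C datamash sum 1 mean 1 sum 2-3)
printf '%s\n' "$fields"
```

With no -g option, the whole input is one group, so the result is a single line with one column per operation and field.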

SEE ALSO

awk(1), sort(1), cut(1), uniq(1)
