awk

Process and transform text-based data

TLDR

Print the fifth column (a.k.a. field) in a space-separated file

$ awk '{print $5}' [path/to/file]

Print the second column of the lines containing "foo" in a space-separated file

$ awk '/foo/ {print $2}' [path/to/file]

Print the last column of each line in a file, using a comma (instead of space) as a field separator

$ awk -F ',' '{print $NF}' [path/to/file]

Sum the values in the first column of a file and print the total

$ awk '{s+=$1} END {print s}' [path/to/file]

Print every third line starting from the first line

$ awk 'NR%3==1' [path/to/file]

Print different values based on conditions

$ awk '{if ($1 == "foo") print "Exact match foo"; else if ($1 ~ "bar") print "Partial match bar"; else print "Baz"}' [path/to/file]

Print all lines in which the 10th column value is between a min and a max

$ awk '($10 >= [min_value] && $10 <= [max_value])' [path/to/file]

Print a table of users with UID >= 1000, with a header and formatted output, using a colon as the separator (%-20s means a left-aligned string padded to 20 characters; %6s means a right-aligned string padded to 6 characters)

$ awk 'BEGIN {FS=":";printf "%-20s %6s %25s\n", "Name", "UID", "Shell"} $3 >= 1000 {printf "%-20s %6d %25s\n", $1, $3, $7}' /etc/passwd

SYNOPSIS

awk [options] 'program' [file ...]
awk [options] -f program-file [file ...]

-F fs
    Sets the input field separator to fs, which may be a single character or a regular expression. By default, fields are separated by runs of whitespace (spaces and tabs).

-v var=val
    Assigns the value val to the variable var before the BEGIN block of the awk program is executed. Useful for passing shell variables.

-f program-file
    Reads the awk program source from the specified program-file instead of from the command line.

-W option
    Used with GNU awk (gawk) to specify compatibility options or extensions, e.g., -W posix for strict POSIX compliance or -W traditional for original awk behavior.

--help
    Displays a help message and exits.

--version
    Displays version information and exits.
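
As a minimal sketch of how -F and -v combine, the following filters comma-separated records against a threshold taken from a shell variable. The inline data and the names threshold, min, and adults are made up for illustration.

```shell
# Illustrative inline data (assumed for this example): name,age
data='alice,30
bob,17
carol,42'

# Hypothetical shell variable to pass into the awk program.
threshold=18

# -F ',' splits each line on commas; -v min=... assigns the awk variable
# min before the program starts, so it is set when the first line is read.
adults=$(printf '%s\n' "$data" | awk -F ',' -v min="$threshold" '$2 >= min {print $1}')
printf '%s\n' "$adults"
```

Passing the value with -v avoids splicing shell text into the awk program itself, which keeps quoting simple.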

DESCRIPTION

awk is a text-processing language built around pattern scanning. It reads input files (or standard input) line by line and compares each line against the patterns in the program. For every pattern that a line matches, awk executes the corresponding action.

At its core, awk works by splitting each input line into fields, accessible via variables like $1 (first field), $2 (second field), and $0 (the entire line). It uses whitespace as the default field separator, but this can be customized. awk provides a rich set of built-in variables (e.g., NR for record number, NF for number of fields), arithmetic and string functions, and control flow statements (like if, else, and loops), making it a full-fledged programming language for data manipulation. It's commonly used for data extraction, report generation, reformatting data, and performing various text-based analytical tasks.
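
The field variables and control flow described above can be sketched with a short loop that prints the fields of a line in reverse order. The sample line and the variable names out and reversed are assumptions for this example.

```shell
# Illustrative input; note the double space between "quick" and "brown".
line='the quick  brown fox'

# With the default separator, a run of spaces counts as one delimiter,
# so the line has four fields. The loop walks from field NF-1 down to 1.
reversed=$(printf '%s\n' "$line" | awk '{
    out = $NF
    for (i = NF - 1; i >= 1; i--)
        out = out " " $i
    print out
}')
printf '%s\n' "$reversed"
```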

CAVEATS

awk processes input one record at a time, so memory use normally stays small regardless of file size. It grows with the input only when a program accumulates lines or fields in arrays, which can become expensive on very large datasets. The compact syntax can also be hard to read for newcomers.

There are subtle behavioral differences between awk implementations, chiefly the original AT&T awk, nawk (new awk), and gawk (GNU awk). Many Linux distributions install gawk as awk, and it offers the largest set of extensions along with a POSIX mode, but others default to a different implementation such as mawk or BusyBox awk. Scripts meant to be portable should avoid implementation-specific extensions.

PROGRAM STRUCTURE

An awk program consists of a sequence of pattern-action statements. It typically follows the structure:
BEGIN { actions }
pattern { actions }
...
END { actions }

The BEGIN block is executed once before any input lines are processed. Pattern-action pairs are applied to each input line. The END block is executed once after all input lines have been processed.
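
A minimal sketch of this structure, using an invented log excerpt: the BEGIN block prints a header, a regular-expression pattern selects lines, and the END block prints a count. The data and the names log, report, and n are assumptions.

```shell
# Hypothetical log-like input, invented for this sketch.
log='INFO start
ERROR disk full
INFO running
ERROR timeout'

report=$(printf '%s\n' "$log" | awk '
BEGIN    { print "Errors:" }        # runs once, before any input is read
/^ERROR/ { n++; print "  " $0 }     # runs for each line starting with ERROR
END      { print "Total: " n }      # runs once, after the last line
')
printf '%s\n' "$report"
```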

BUILT-IN VARIABLES

awk provides several useful built-in variables:
NR: Number of the current record (line), counted across all input files.
NF: Number of fields in the current record.
$0: The entire current record (line).
$1, $2, ...: Individual fields of the current record.
FS: Input field separator (default is whitespace).
RS: Input record separator (default is newline).
OFS: Output field separator (default is space).
ORS: Output record separator (default is newline).
FILENAME: Name of the current input file being processed.
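
Several of these variables can be seen together in a short sketch. The colon-separated records below are made up, and the " | " output separator is chosen only to make the effect of OFS visible.

```shell
# Illustrative colon-separated records; the data is an assumption.
records='root:0:/bin/sh
daemon:1:/usr/sbin/nologin'

# FS splits each input line on colons. OFS is inserted between the
# comma-separated arguments of print. NR is the record number and
# NF is the field count of the current record.
table=$(printf '%s\n' "$records" | awk 'BEGIN {FS = ":"; OFS = " | "} {print NR, $1, NF}')
printf '%s\n' "$table"
```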

PATTERNS AND ACTIONS

A pattern can be a regular expression (e.g., /regex/), a relational expression (e.g., $3 > 10), a range (/start_regex/,/end_regex/), or a combination of these joined with &&, ||, and !. If no pattern is specified, the action is performed for every line. If no action is specified, each matching line is printed. An action is a sequence of statements enclosed in curly braces {}, which can include printing, variable assignments, conditional logic, loops, and function calls.
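
The pattern forms can be sketched on a small invented input with a label in the first column and a number in the second. The markers START and STOP and the variable names are assumptions chosen for this example.

```shell
# Made-up input: a label in column 1 and a number in column 2.
input='a 5
START 0
b 20
STOP 0
c 30'

# Relational pattern with no action: matching lines are printed as-is.
big=$(printf '%s\n' "$input" | awk '$2 > 10')

# Range pattern: selects lines from the first match of /^START/
# through the next match of /^STOP/, both inclusive.
block=$(printf '%s\n' "$input" | awk '/^START/,/^STOP/ {print $1}')

# Combined pattern: a regular expression and a relational test joined by &&.
both=$(printf '%s\n' "$input" | awk '/^[a-z]/ && $2 >= 20 {print $1}')

printf '%s\n' "$big" "$block" "$both"
```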

HISTORY

The awk language was created at Bell Laboratories in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan, whose surnames form the acronym. It was designed as a tool for text processing and reporting within the Unix environment, embodying the Unix philosophy of small, specialized tools that can be combined.

In the mid-1980s, a new version known as nawk (new awk) was developed, introducing significant enhancements such as user-defined functions, dynamic regular expressions, and more robust error handling. Later, the GNU Project developed gawk (GNU awk), which aims for full compatibility with nawk while adding further extensions. It is now one of the most commonly used awk implementations on Linux and other free Unix-like systems.
