gawk
Process and manipulate text-based data
TLDR
Print the fifth column (a.k.a. field) in a space-separated file
Print the second column of the lines containing "foo" in a space-separated file
Print the last column of each line in a file, using a comma (instead of space) as a field separator
Sum the values in the first column of a file and print the total
Print every third line starting from the first line
Print different values based on conditions
Print all lines whose 10th column value is between a minimum and a maximum
Print a table of users with UID >= 1000, with a header and formatted output, using a colon as the separator (%-20s means a string left-aligned in 20 characters; %6s means a string right-aligned in 6 characters)
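The descriptions above can be sketched as one-liners. This is a minimal illustration, not the original commands: the sample data, the temporary file, and the min and max variable names are assumptions, and the column numbers are adapted to a three-column sample.

```shell
# Fall back to the system awk if gawk is not installed
# (assumption: only POSIX awk features are used below).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical sample data: name, score, tag (space-separated).
data=$(mktemp)
printf '%s\n' 'alice 10 foo' 'bob 20 bar' 'carol 30 foo' 'dave 40 qux' > "$data"

# Print the second column of each line.
gawk '{ print $2 }' "$data"

# Print the first column of the lines containing "foo".
gawk '/foo/ { print $1 }' "$data"

# Print the last column, using a comma as the field separator.
printf 'a,b,c\nd,e,f\n' | gawk -F ',' '{ print $NF }'

# Sum the values in the second column and print the total.
gawk '{ sum += $2 } END { print sum }' "$data"

# Print every third line, starting from the first.
gawk 'NR % 3 == 1' "$data"

# Print different values based on conditions.
gawk '{ if ($3 == "foo") print $1, "exact foo"; else if ($3 ~ /^ba/) print $1, "starts with ba"; else print $1, "other" }' "$data"

# Print the lines whose second column lies between a minimum and a maximum.
gawk -v min=15 -v max=35 '$2 >= min && $2 <= max' "$data"

# Print a formatted table of users with UID >= 1000 (passwd-style input, UID in field 3).
printf 'root:x:0:0:root:/root:/bin/bash\nann:x:1000:1000:Ann:/home/ann:/bin/zsh\n' |
  gawk -F ':' 'BEGIN { printf "%-20s %6s\n", "Name", "UID" } $3 >= 1000 { printf "%-20s %6d\n", $1, $3 }'
```

Passing the bounds with -v keeps the program text free of shell variable interpolation.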
SYNOPSIS
gawk [options] [-f program-file | --source program-text] [file ...]
gawk [options] program-text [file ...]
PARAMETERS
-F fs
Sets the field separator for input records to fs. This can be a single character or a regular expression.
-v var=value
Assigns a value to a variable var before program execution begins. Useful for passing external data into the gawk script.
-f program-file
Reads the gawk program source from the specified program-file, allowing for larger, more complex scripts.
-W lint, --lint
Issues warnings about constructs that are not portable to other awk implementations or might indicate a mistake in the script logic.
-W posix, --posix
Disables GNU extensions and uses POSIX awk features exclusively, ensuring strict compatibility with the POSIX standard.
-W traditional, --traditional
Behaves like original awk, disabling most GNU extensions for backward compatibility.
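-E program-file, --exec program-file
Reads the gawk program source from program-file like -f, but ends option processing and disallows command-line variable assignments. Intended mainly for #! scripts and CGI programs, where arguments may come from an untrusted source.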
--help
Prints a summary of the gawk command-line options and usage information.
--version
Prints gawk's version number and configuration information to standard output.
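As a rough sketch of how -F, -v, and -f combine, the following uses a hypothetical input file, a hypothetical program file named filter.awk, and a made-up variable named threshold.

```shell
# Fall back to the system awk if gawk is not installed
# (assumption: only POSIX awk features are used below).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

workdir=$(mktemp -d)

# Hypothetical colon-separated input: user, uid.
printf '%s\n' 'root:0' 'ann:1000' 'ben:1001' > "$workdir/users.txt"

# Hypothetical program file, loaded with -f.
cat > "$workdir/filter.awk" <<'EOF'
# Print users whose UID is at least the threshold passed in with -v.
$2 >= threshold { print $1 }
EOF

# -F sets the field separator, -v sets a variable before the program runs,
# and -f reads the program from a file instead of the command line.
gawk -F ':' -v threshold=1000 -f "$workdir/filter.awk" "$workdir/users.txt"
```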
DESCRIPTION
gawk (GNU Awk) is a powerful pattern scanning and processing language. It reads input files line by line, attempting to match each line against a set of user-defined patterns. For each line that matches a pattern, gawk executes a corresponding action. This makes it exceptionally useful for data extraction, text transformation, and report generation from structured or semi-structured text files like logs, CSVs, or configuration files.
At its core, gawk operates on "fields" within a line, which are by default separated by whitespace. It provides built-in variables for the current record ($0), the number of fields (NF), the current record number (NR), and individual fields ($1, $2, etc.). It supports regular expressions for sophisticated pattern matching, offers conditional statements, loops, associative arrays, and user-defined functions, making it a complete programming language for text manipulation. gawk is the GNU Project's implementation of the awk programming language, designed to be compatible with the original awk while adding many GNU-specific extensions and features.
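The record and field variables described above can be seen in a small sketch. The shell function name show_fields and the input text are illustrative assumptions.

```shell
# Fall back to the system awk if gawk is not installed
# (assumption: only POSIX awk features are used below).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# For each record, report its number (NR), its field count (NF),
# the first field ($1), and the last field ($NF).
show_fields() {
  gawk '{ print NR ": " NF " fields, first=" $1 ", last=" $NF }'
}

printf 'one two three\nfour five\n' | show_fields
```

Because NF holds the field count, $NF always refers to the last field, however many fields a record has.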
CAVEATS
gawk processes input one record at a time, so file size alone rarely matters; memory use grows when a program accumulates data, for example in large associative arrays. Its concise syntax can produce scripts that are hard to read once the logic becomes complex, especially for those new to awk. Locale settings can significantly affect string comparison, character classes, and regular expression behavior; setting LC_ALL=C in the environment is the usual way to get consistent results across systems.
PATTERNS AND ACTIONS
An awk program consists of a series of rules, each defined as pattern { action }. gawk reads each input line, and if the line matches the specified pattern, the action block is executed. If no pattern is given, the action is performed for every line. If no action is given, the line matching the pattern is simply printed. Patterns can be regular expressions, arithmetic relational expressions, or combinations thereof. Actions can include variables, loops, conditionals, and built-in or user-defined functions.
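The three rule forms (pattern with action, pattern only, action only) might look like this in practice. The function name rules_demo and the sample lines are assumptions made for illustration.

```shell
# Fall back to the system awk if gawk is not installed
# (assumption: only POSIX awk features are used below).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

rules_demo() {
  gawk '
    /error/  { print "regex match: " $0 }   # regex pattern with an action
    $2 > 100                                # relational pattern, no action: prints the line
             { print "seen: " $1 }          # no pattern: runs for every line
  '
}

printf '%s\n' 'ok 50' 'error 150' 'ok 200' 'error 10' | rules_demo
```

Every rule is tested against every line, in program order, so a single input line can trigger several actions.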
BEGIN AND END BLOCKS
gawk provides special patterns: BEGIN and END. The BEGIN block's action is executed once before any input lines are read and processed. It's typically used for initializing variables or printing headers. The END block's action is executed once after all input lines have been read. It's often used for final calculations, summaries, or printing footers. These blocks do not operate on input lines directly.
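A minimal sketch of a report with a header and a summary line, assuming a two-column input of item names and quantities (the function name report and the data are hypothetical):

```shell
# Fall back to the system awk if gawk is not installed
# (assumption: only POSIX awk features are used below).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

report() {
  gawk '
    BEGIN { print "item qty" }            # runs once, before any input is read
          { total += $2; print $1, $2 }   # runs for every input line
    END   { print "total", total }        # runs once, after the last line
  '
}

printf 'apple 3\npear 4\n' | report
```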
BUILT-IN VARIABLES
gawk provides several useful built-in variables that provide information about the input or control program behavior. Key ones include:
NR (Number of Records): The number of input records read so far, which is also the current record number.
NF (Number of Fields): The total number of fields in the current input record.
FS (Field Separator): The input field separator (default is a single space, which splits fields on runs of whitespace).
OFS (Output Field Separator): The output field separator (default is a space).
RS (Record Separator): The input record separator (default is newline).
ORS (Output Record Separator): The output record separator (default is newline).
$0: The entire current input record.
$1, $2, ...: Individual fields of the current input record.
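The separator variables listed above can be sketched as follows. The function names to_csv and join_records and the input strings are illustrative assumptions.

```shell
# Fall back to the system awk if gawk is not installed
# (assumption: only POSIX awk features are used below).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# FS and OFS: read colon-separated fields, write comma-separated ones.
# Assigning $1 = $1 makes awk rebuild $0 using OFS.
to_csv() {
  gawk 'BEGIN { FS = ":"; OFS = "," } { $1 = $1; print NR, NF, $0 }'
}

# RS and ORS: read records separated by ";" and write them separated by "|".
join_records() {
  gawk 'BEGIN { RS = ";"; ORS = "|" } { print }'
}

printf 'a:b:c\nd:e\n' | to_csv
printf 'x;y;z' | join_records
```

The second command writes no trailing newline, because ORS replaces the newline that print would otherwise append.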
HISTORY
The awk programming language was developed at Bell Labs in the 1970s by Alfred Aho, Peter Weinberger, and Brian Kernighan, whose initials give the language its name. It was designed for text processing and report generation. gawk, the GNU Project's implementation, was written as a free software alternative, offering compatibility with the original awk while adding numerous extensions. First released in the mid-1980s, gawk is the default awk on many Linux distributions and is widely available on other Unix-like systems, owing to its feature set, adherence to POSIX, and continuous development.