Text Processing

Substitute Text

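sed applies editing commands to each line of its input. The substitute command s/old/new/ replaces the first match on each line; the g flag replaces every match, and the I flag (a GNU extension) ignores case. -i edits the file in place instead of printing the result; BSD and macOS sed require an explicit backup suffix, as in -i ''.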

$ sed 's/old/new/g' [file]

$ sed -i 's/old/new/g' [file]

$ sed -i 's/old/new/gI' [file]

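sd does the same job with simpler syntax: the pattern (a regex by default) and the replacement are separate arguments, every match is replaced, and a file named on the command line is edited in place.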

$ sd "old" "new" [file]

Any character other than a backslash or newline can delimit the s command: s|/usr/bin|/usr/local/bin| avoids escaping slashes in paths.
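As a rough illustration of the flags and the alternate delimiter, the sketch below runs three substitutions on one made-up line. The sample text and the variable names (line, first, all, paths) are assumptions for the example only.

```shell
# Made-up sample line; not taken from any real file.
line='old path: /usr/bin/old and /usr/bin/OLD'

# Without g, only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/old/new/')

# With g, every match is replaced; matching is case-sensitive, so OLD stays.
all=$(printf '%s\n' "$line" | sed 's/old/new/g')

# Using | as the delimiter avoids escaping the slashes in the path.
paths=$(printf '%s\n' "$line" | sed 's|/usr/bin|/usr/local/bin|g')

printf '%s\n' "$first" "$all" "$paths"
```

Piping through sed without -i leaves the input untouched, which makes it a safe way to preview a substitution before running it in place.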

Delete or Print Specific Lines

d deletes matching lines; -n with p prints only selected lines. Addresses can be patterns, line numbers, or ranges.

$ sed '/pattern/d' [file]

$ sed -i '/^$/d' [file]

$ sed -n '5,10p' [file]

$ sed -n '/pattern/p' [file]
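A small sketch of the same commands on a throwaway file may help; the six-line sample and the variable names are invented for illustration, and the temporary file comes from mktemp.

```shell
# Build a made-up six-line sample; line 2 is blank, line 4 is a comment.
tmp=$(mktemp)
printf 'alpha\n\nbeta\n# note\ngamma\ndelta\n' > "$tmp"

# Delete blank lines (without -i the file itself is not modified).
no_blank=$(sed '/^$/d' "$tmp")

# Delete lines matching a pattern: here, lines that start with '#'.
no_comments=$(sed '/^#/d' "$tmp")

# Print only lines 3 through 5; -n suppresses the default output.
range=$(sed -n '3,5p' "$tmp")

# Print only lines matching a pattern, much like grep.
matches=$(sed -n '/^[bg]/p' "$tmp")

rm -f "$tmp"
printf '%s\n' "$range"
```

Forgetting -n with p prints every selected line twice, once by default and once by the p command.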

Extract Fields

awk splits every line into fields: $1 is the first field, $0 the whole line, NR the line number. -F changes the field separator from whitespace to anything else.

$ awk '{print $1}' [file]

$ awk -F: '{print $1, $3}' /etc/passwd

$ awk '{print NR, $0}' [file]

cut is the lightweight alternative for simple column extraction, by delimiter (-d, -f) or character position (-c).

$ cut -d: -f1 [file]

$ cut -d',' -f1,3 [file]

$ cut -c1-10 [file]
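To compare awk and cut side by side, the sketch below extracts the same fields from a few invented records in /etc/passwd style. The data and variable names are assumptions for the example.

```shell
# Made-up records in name:password:uid:gid form.
records='root:x:0:0
alice:x:1000:1000
bob:x:1001:1001'

# awk prints fields 1 and 3; the comma in print inserts a space between them.
awk_out=$(printf '%s\n' "$records" | awk -F: '{print $1, $3}')

# cut selects the same fields but keeps the original delimiter.
cut_out=$(printf '%s\n' "$records" | cut -d: -f1,3)

# NR prefixes each line with its line number.
numbered=$(printf '%s\n' "$records" | awk '{print NR, $0}')

# cut -c selects by character position instead of by field.
chars=$(printf '%s\n' "$records" | cut -c1-3)

printf '%s\n' "$awk_out" "$cut_out"
```

One practical difference: cut always emits fields in file order, while awk can reorder them, for example with {print $3, $1}.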

Filter with Conditions

An awk program is condition { action }: lines matching the condition run the action (default: print the line).

$ awk '$3 > 100' [file]

$ awk '/pattern/ {print $2}' [file]

$ awk 'NR>=5 && NR<=10' [file]

Aggregate across lines with variables and an END block.

$ awk '{sum += $1} END {print sum}' [file]

$ awk '{sum += $1} END {print sum/NR}' [file]
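The following sketch applies a condition, a pattern with an action, and an END aggregate to a small invented table. The column layout (name, department, amount) and the variable names are assumptions; the NR > 0 guard is an addition that avoids dividing by zero on empty input.

```shell
# Made-up data: name, department, amount.
data='ann sales 120
bob ops 80
cy sales 200
dee ops 40'

# A bare condition prints every line for which it holds.
big=$(printf '%s\n' "$data" | awk '$3 > 100')

# A pattern plus an action prints one field from the matching lines.
sales=$(printf '%s\n' "$data" | awk '/sales/ {print $1}')

# Accumulate in the main block, report in END.
total=$(printf '%s\n' "$data" | awk '{sum += $3} END {print sum}')

# In END, NR holds the total number of lines read.
mean=$(printf '%s\n' "$data" | awk '{sum += $3} END {if (NR > 0) print sum / NR}')

printf 'total=%s mean=%s\n' "$total" "$mean"   # total=440 mean=110
```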

Sort Lines

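sort orders lines alphabetically by default, according to the current locale; -n sorts numerically, -r reverses, and -u drops duplicates. -t sets the field delimiter and -k selects the sort key: -k3 runs from field 3 to the end of the line, while -k3,3 limits the key to field 3. -h understands human-readable sizes like 2K and 1G.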

$ sort [file]

$ sort -n [file]

$ sort -r [file]

$ sort -t: -k3 -n [file]

$ sort -u [file]
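The difference between text and numeric ordering is easiest to see on a tiny input. The sketch below uses invented numbers and name:group:score records; LC_ALL=C is set only to make the text ordering byte-wise and repeatable, and -k3,3n restricts the numeric key to field 3 alone.

```shell
# Made-up numbers: as text, "100" sorts before "25" and "9".
nums='9
100
25'
as_text=$(printf '%s\n' "$nums" | LC_ALL=C sort)
as_numbers=$(printf '%s\n' "$nums" | sort -n)

# Made-up name:group:score records, with one duplicated line.
records='carol:dev:9
alice:ops:100
bob:dev:25
alice:ops:100'

# -u keeps one copy of each identical line.
unique=$(printf '%s\n' "$records" | LC_ALL=C sort -u)

# Sort on field 3 only, numerically; an r on the key reverses it.
by_score=$(printf '%s\n' "$unique" | sort -t: -k3,3n)
by_score_desc=$(printf '%s\n' "$unique" | sort -t: -k3,3nr)

printf '%s\n' "$by_score"
```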

Find Duplicate Lines

uniq only compares neighboring lines, so sort first. -c counts occurrences, -d shows only duplicated lines, -u only unique ones.

$ sort [file] | uniq

$ sort [file] | uniq -c | sort -rn

$ sort [file] | uniq -d

The sort | uniq -c | sort -rn pipeline is the classic frequency counter: it ranks every distinct line by how often it occurs.
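A short sketch of the frequency counter on an invented word list follows. The extra awk step is an assumption added for tidiness: uniq -c pads its counts with leading spaces, and the amount of padding varies between implementations.

```shell
# Made-up word list with non-adjacent duplicates.
words='apple
pear
apple
fig
pear
apple'

# Without sorting, uniq removes nothing: no two equal lines are neighbors.
unsorted=$(printf '%s\n' "$words" | uniq)

# Count, rank by count, then normalize the padding to "count word".
freq=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -c | sort -rn | awk '{print $1, $2}')

# -d lists lines that occur more than once; -u lists lines that occur exactly once.
dups=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -d)
singles=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -u)

printf '%s\n' "$freq"
```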

Translate or Delete Characters

tr maps characters from one set to another; -d deletes the listed characters and -s squeezes runs of a repeated character into one. tr reads only standard input, hence the < redirection.

$ tr 'a-z' 'A-Z' < [file]

$ tr -d '[:digit:]' < [file]

$ tr -s ' ' < [file]

$ tr '\n' ' ' < [file]
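The four operations can be tried on a two-line made-up string, as sketched below. The sample text and variable names are assumptions; note that converting newlines to spaces also converts the final newline, so the result ends with a space.

```shell
# Made-up input: two spaces on the first line, three on the second.
text='id7  ok
id42   fail'

# Map lowercase letters to uppercase.
upper=$(printf '%s\n' "$text" | tr 'a-z' 'A-Z')

# Delete every digit.
no_digits=$(printf '%s\n' "$text" | tr -d '[:digit:]')

# Squeeze each run of spaces into a single space.
squeezed=$(printf '%s\n' "$text" | tr -s ' ')

# Join lines by turning each newline into a space.
one_line=$(printf '%s\n' "$text" | tr '\n' ' ')

printf '%s\n' "$upper" "$squeezed"
```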

Compare Files

diff -u is the standard patch-style comparison; -y shows files side by side; cmp compares bytes and is ideal for binary files.

$ diff -u [file1] [file2]

$ diff -y [file1] [file2]

$ cmp [file1] [file2]

comm prints three columns: lines only in the first sorted file, lines only in the second, and lines they share. Suppress columns by number: -12 hides columns 1 and 2, leaving only the shared lines.

$ comm [file1] [file2]

$ comm -12 [file1] [file2]

Both comm and join require sorted input: comm on whole lines, join on the join field.
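As a sketch of column suppression, the example below compares two small sorted files created in a temporary directory. The file contents and variable names are invented. The cmp -s call goes slightly beyond the commands listed above: -s silences output so that only the exit status reports whether the files are identical.

```shell
# Two made-up, already sorted files in a scratch directory.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/a"
printf 'banana\ncherry\ndate\n' > "$dir/b"

# Hide columns 1 and 2: lines present in both files.
both=$(comm -12 "$dir/a" "$dir/b")

# Hide columns 2 and 3: lines only in the first file.
only_a=$(comm -23 "$dir/a" "$dir/b")

# Hide columns 1 and 3: lines only in the second file.
only_b=$(comm -13 "$dir/a" "$dir/b")

# cmp exits 0 when the files are byte-for-byte identical.
if cmp -s "$dir/a" "$dir/b"; then same=yes; else same=no; fi

rm -rf "$dir"
printf 'both:\n%s\n' "$both"
```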

Combine Files

paste glues files together line by line; -s joins all lines of one file into a single line. join matches lines from two files on a common field, like a database join.

$ paste [file1] [file2]

$ paste -d',' [file1] [file2]

$ paste -s [file]

$ join [file1] [file2]

$ join -t: -1 1 -2 3 [file1] [file2]
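The contrast between positional pasting and key-based joining shows up clearly when the files have different lengths. The sketch below uses two invented files keyed by a numeric id in the first field; file names and variable names are assumptions.

```shell
# Made-up files, both sorted on the first field.
dir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$dir/names"
printf '1 dev\n3 ops\n' > "$dir/teams"

# paste pairs lines by position, so line 2 mixes ids 2 and 3.
pasted=$(paste -d',' "$dir/names" "$dir/teams")

# paste -s joins all lines of one file into a single line.
serial=$(paste -s -d',' "$dir/teams")

# join matches on the first field; id 2 has no partner and is dropped.
joined=$(join "$dir/names" "$dir/teams")

rm -rf "$dir"
printf '%s\n' "$joined"
```

Because paste ignores content, the shorter file simply runs out and leaves an empty field, while join silently omits unmatched lines.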