Sed & Awk
As anyone who’s ever tried to write a serious program in shell knows, /bin/sh—or even extended languages like bash and ksh—can’t do very much on its own. Nearly everything in a shell script, besides a smattering of control structures, is a call to another program. Two of the tools on the shell programmer’s palette are sed and awk, grouped together not because they are used in conjunction, but because they do about the same thing. Sed is a descendant of early UNIX line editors and is useful for creating filters for input streams, and awk is a small programming language designed to treat input as a collection of fields and records, useful for processing formatted data sets.
Regular Expressions
Regular expressions are a powerful tool used to match and manipulate strings. Many high-level languages include some regular expression mechanism as part of the language, and sed and awk use them as the basis of their string operations.
Regular expressions match regular languages, and as such are a representation of a deterministic finite automaton. Outside of theory class, though, it’s usually good enough to think of them as able to match any string where you don’t need to count. For example, regular expressions can’t match an S-expression, since that would require counting parentheses in order to know how many closing parentheses to expect. The same is true of XML, since tags can be nested arbitrarily deep, and since DFAs have no stack, you can’t know how many closing tags to expect at any given point.
This isn’t to say that regular expressions are useless, just limited. One common use for regular expressions is matching lines in log files. For example, the Apache httpd “common” log format is matched by the following:
^([[:alnum:]-_.]+) ([^[:space:]]+) ([^[:space:]]+) (\[[^]]*\]) ("[^"]*") ([[:digit:]]{3}) ([[:digit:]]*)$
The first character, ^, represents the beginning of the line, so this regular expression must occur at the beginning of a line in the access_log. Likewise, the $ at the end of the expression is the end of the line, so this regular expression matches an entire line of the logs.
The parentheses are used to group elements and can be used to pull individual elements out of a string. In sed, these parenthesized elements are referenced as \number in the replacement text, where number is the position of the opening parenthesis, counting from the left. For example, \1 in this example would be the host that made the HTTP request. Standard awk’s sub() and gsub() do not support these references, although GNU awk’s gensub() does.
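As a rough illustration of group references, the sketch below runs the expression over a made-up log line and keeps only the first and sixth groups. The sample line is hypothetical, the hyphen in the first bracket expression has been moved to the end to keep it unambiguous, and the -E option (extended syntax in sed) is assumed to be available, as it is in current GNU and BSD seds.

```shell
# Hypothetical sample line in Apache "common" log format.
line='192.0.2.10 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

# The expression from the text; the hyphen is placed last in the first
# bracket expression so that it cannot be read as a range.
re='^([[:alnum:]_.-]+) ([^[:space:]]+) ([^[:space:]]+) (\[[^]]*\]) ("[^"]*") ([[:digit:]]{3}) ([[:digit:]]*)$'

# -E (assumed available) selects the extended syntax.
# \1 is the first group (the host) and \6 the sixth (the status code).
summary=$(printf '%s\n' "$line" | sed -E "s/$re/host=\1 status=\6/")
echo "$summary"    # host=192.0.2.10 status=200
```

Because the expression is anchored at both ends, the whole line is replaced by the two extracted fields.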
The expression inside the square brackets is the first part that actually starts to match characters. Square brackets are used to match a single character, and that character can be any from the list of characters inside the brackets. So, something like [abcd] would match a single a, b, c or d. Ranges can also be used, so [0-9] would match a single digit. The part with the colons is a special extension of this syntax; [:type:] can be given instead of a character, and it will match any character of the given type in the current locale, using the istype() ctype function. So [[:alnum:]] matches any character where isalnum() would return true. [[:alnum:]-_.] is any of these characters, in addition to hyphens, underscores and periods; i.e., characters that might appear in a hostname. Strictly speaking, a hyphen is only guaranteed to be literal when it is the first or last character in the list, so [[:alnum:]_.-] is the more portable spelling.
Now that we have a character to match, we can specify how many times to match it. The + sign after that first square bracket expression says that the bracket expression should be matched one or more times, so the entire thing in the first set of parentheses will be the whole hostname.
The second and third parenthesized expressions are similar to the hostname one, except that they specify what not to match. Using ^ as the first character inside square brackets negates the meaning, so [^[:space:]] matches any character that isn’t whitespace. Similarly, [^]] matches anything that isn’t a closing square bracket, for the date field. A closing bracket can be included by making it the first character of a bracket expression (after the ^, if there is one): a bracket expression can’t be empty, so the parser knows that a ] in that position is literal. An opening square bracket is an ordinary character inside a bracket expression, unless it begins something like [:alnum:].
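A small sketch of negation and of literal brackets, using sed's s command (covered in more detail later) on made-up input strings:

```shell
# Delete everything that is not a digit: ^ negates the bracket expression.
digits=$(echo 'tel: +1 (555) 010-9999' | sed 's/[^[:digit:]]//g')
echo "$digits"      # 15550109999

# A closing bracket placed first in the list is taken literally, so [][]
# matches either kind of square bracket.
brackets=$(echo 'a[1]b' | sed 's/[][]/|/g')
echo "$brackets"    # a|1|b
```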
Other things of note are {number} to specify an exact number of repetitions, and * to specify zero or more repetitions. The curly bracket expression can also be given as a range, so you could also use something like {3,17} for anywhere between three and seventeen matches of an expression, or {47,}, for forty-seven or more matches. * is equivalent to {0,}, and can be thought of as somewhat similar to the * character in shell globs.
Other special characters in regular expressions are “?”, which is equivalent to {0,1}; “.”, which matches any single character; and “|”, which is used to specify alternation. | is especially useful: (http|ftp) would match one of “http” or “ftp”, and might be seen as part of a specialized URL parser. “.” is commonly used when you don’t care what is matched, and the expression .* will match any string, including the empty string.
To match any of these special characters as the character itself, just precede it with a backslash. \. would be a literal period, \( is an opening parenthesis, \\ is a backslash, and so on.
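The following sketch tries alternation, “?” and an exact repetition count with grep -E, which uses the extended syntax described here. The input lines are invented for the example.

```shell
# Hypothetical inputs, one per line.
urls='http://example.org
ftp://example.org
gopher://example.org'

# Alternation: keep lines starting with http or ftp; s? allows https too.
web=$(printf '%s\n' "$urls" | grep -E '^(https?|ftp)://')
echo "$web"

# Interval: exactly three digits, and nothing else, on the line.
codes=$(printf '%s\n' 7 42 200 4040 | grep -E '^[[:digit:]]{3}$')
echo "$codes"    # 200
```

The first command keeps the http and ftp lines and drops the gopher one.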
Basic vs. Extended Syntax
Just in case things weren’t confusing enough, nearly every program has its own variation on the regular expression syntax. The major division is between “basic” and “extended” POSIX regular expressions. The extended regular expression syntax is that described above, where certain characters like ( and + have special meaning. The basic syntax is different in that it tries to remove most of the backslashes for the case when a constant string is being matched; for every special character except “.”, “[”, “*”, “^” and “$”, the character must be preceded by a backslash to be given special meaning. For example, grouping is instead done with \(...\). Also, basic regular expressions do not have “+”, “?”, or “|”.
The log example rewritten as a BRE would look like this:
^\([[:alnum:]-_.]\{1,\}\) \([^[:space:]]\{1,\}\) \([^[:space:]]\{1,\}\) \(\[[^]]*\]\) \("[^"]*"\) \([[:digit:]]\{3\}\) \([[:digit:]]*\)$
Sed, as defined by POSIX, uses basic regular expressions, but GNU sed restores some of the features of EREs through “\+”, “\?” and “\|”, and also provides a “-r” option (also spelled “-E”) to switch to extended regular expressions. Awk uses extended regular expressions, though some older implementations leave out the curly-brace syntax.
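To make the difference concrete, here is a minimal sketch using grep, which uses the basic syntax by default and the extended syntax with -E. The input strings are arbitrary.

```shell
input='aaa
ab
b'

# Basic syntax (grep's default): the interval braces need backslashes.
bre=$(printf '%s\n' "$input" | grep '^a\{2,\}$')

# Extended syntax (grep -E): the same braces are special without them.
ere=$(printf '%s\n' "$input" | grep -E '^a{2,}$')

echo "BRE: $bre"    # BRE: aaa
echo "ERE: $ere"    # ERE: aaa

# In the basic syntax an unescaped + is an ordinary character.
literal=$(printf '%s\n' 'a+b' 'aab' | grep 'a+b')
echo "literal plus: $literal"    # literal plus: a+b
```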
Everyday sed usage
Sed is a complete, if not particularly expressive, programming language, but most people will only ever use one sed command: s/regexp/replacement/. This will replace an occurrence of the regular expression on a given line with the replacement text. \number can be used in the replacement text to reference a parenthesized expression, and “&” can be used to reference the entire matched text. By default, sed only acts on the first match for a given line; add “g” after the final slash to replace every match instead.
If not given a filename, sed will read from standard input, and it will always print to standard output. For some examples, the following replaces all the hyphens in input with underscores:
sed 's/-/_/g'
This adds shell-style comments to every line in a file:
sed 's/.*/# &/'
This removes the carriage return characters from the ends of lines in MS-DOS formatted files (this works in GNU sed; with other seds you may need to find a way to specify an actual carriage return character):
sed 's/\r$//'
Slash needn’t be the delimiter used, and, in fact, is an annoying choice when manipulating path names. Whatever character is used as the first character after the “s” will be used as the delimiter. The following example strips the directory components of a path name, like the basename utility:
sed 's!.*/!!'
Regular expressions match greedily, so “.*” will match as much as it can, up to the last slash on the line.
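A quick way to see the greedy behaviour is to compare it with an expression that cannot cross a slash. The path below is only an example.

```shell
path='/usr/local/bin/sed'

# Greedy: .* runs to the last slash, leaving only the final component.
base=$(echo "$path" | sed 's!.*/!!')
echo "$base"    # sed

# Excluding slashes from the repeated part stops the match at the first one,
# so only the leading slash is removed.
rest=$(echo "$path" | sed 's![^/]*/!!')
echo "$rest"    # usr/local/bin/sed
```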
The s command can also be used in conjunction with -n to filter out lines that don’t match an expression. “-n” tells sed not to print every line, and a “p” character after the last slash (or other delimiter) tells sed to print any line that matched the expression. This example only prints lines that begin with two floating-point numbers, and also swaps them first:
sed -n 's/^\([+-]\{0,1\}[[:digit:]]\{1,\}\(\.[[:digit:]]\{1,\}\)\{0,1\}\) \([+-]\{0,1\}[[:digit:]]\{1,\}\(\.[[:digit:]]\{1,\}\)\{0,1\}\)/\3 \1/p'
As you can see, doing even seemingly simple things in sed can quickly result in large, difficult expressions, making most usage of sed hard to understand. Sed works well for simple cases of string filtering, but awk can often be a more maintainable choice.
Sed has a couple of other features useful in everyday programming. Commands can be preceded with a line address, which will only run the command for given lines of input. For example, sed '12s/abc/def/' will only run on line 12. The address can be a line number, a regular expression, or a range. Another useful command is “d”, which deletes a line. sed '/^#/d' would delete every line beginning with a # character.
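The three kinds of address can be sketched as follows, on a made-up five-line input:

```shell
# Hypothetical five-line input.
input='# header
alpha
beta
# note
gamma'

# Regular expression address: delete comment lines.
no_comments=$(printf '%s\n' "$input" | sed '/^#/d')
echo "$no_comments"

# Line-number range: substitute only on lines 2 through 3.
ranged=$(printf '%s\n' "$input" | sed '2,3s/a/A/g')
echo "$ranged"

# Range between two regular expressions, printed with -n and p.
between=$(printf '%s\n' "$input" | sed -n '/^alpha$/,/^# note$/p')
echo "$between"
```

The second command turns “alpha” into “AlphA” and “beta” into “betA” but leaves “gamma” alone, since it is on line 5.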
Everyday awk usage
Though awk can be used as a powerful text-processing tool, its most common use is to perform the same task as cut. You may have seen something like the following:
shell stuff | awk '{print $3;}'
This would take the input from the shell code and print the third field of each line. Using awk for this is often more convenient than cut, since awk uses any run of whitespace as a delimiter by default instead of requiring single character delimiters, and it doesn’t require large, unwieldy regular expressions like sed.
This print example takes advantage of the way that awk views data: as a stream of records, each of which is a set of fields. By default a record is a line, and a field is any non-whitespace string of characters separated by whitespace. The record can be referenced as $0, and the individual fields as $field-number.
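A short sketch of field splitting, on invented input. NF (the number of fields in the current record) and the -F option (which sets the field separator) are standard awk features that are not otherwise covered here.

```shell
# Leading spaces, repeated spaces and a tab all count as plain separators.
third=$(printf '  alice   42\tengineering\n' | awk '{ print $3 }')
echo "$third"    # engineering

# NF holds the number of fields in the current record.
count=$(printf '  alice   42\tengineering\n' | awk '{ print NF }')
echo "$count"    # 3

# -F sets a different field separator, here a colon.
shell=$(echo 'root:x:0:0:root:/root:/bin/sh' | awk -F: '{ print $7 }')
echo "$shell"    # /bin/sh
```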
Awk programs that take up more than a line
Awk is a bit more than a fancy version of cut, of course. Awk is a “data-driven” programming language, where, instead of specifying a list of instructions to execute, you specify a set of rules and procedures to run each time the input matches the given conditions. Each rule is of the form:
PATTERN { ACTION }
The pattern can be any conditional statement, like $2 ~ /^d/ for regular expression matching, $5 >= 47 for numeric comparison, or the special patterns “BEGIN” and “END,” executed once at the beginning and end of input, respectively. BEGIN is often used to set up awk’s special variables, like FS, the field separator, and END is often used to print the final output.
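As a minimal sketch of the three kinds of rule working together, the program below sets FS in BEGIN, counts records that pass a numeric test, and reports in END. The data and the variable names big and total are made up for the example.

```shell
# Hypothetical colon-separated input: a name, then a size in the second field.
data='a.txt:10
b.txt:47
c.txt:100'

report=$(printf '%s\n' "$data" | awk '
BEGIN    { FS = ":" }                              # once, before any input
$2 >= 47 { big++; total += $2 }                    # each record with size >= 47
END      { print big " files, " total " total" }   # once, after all input
')
echo "$report"    # 2 files, 147 total
```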
Awk is a fairly simple language, and users of perl will probably find much of it familiar, since perl took many of its ideas from awk. Variables are automatically converted between numbers and strings, depending on the context in which they are used. Arithmetic and number comparison use the familiar +, -, >, == and other operators, and strings can be compared using ==, >, < and such, matched against a regular expression with the ~ operator, or concatenated. There is no concatenation operator, so something like 'str1 str2' would be the concatenation of the variables str1 and str2.
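The conversions and the operator-less concatenation can be seen in a few lines. This is only a sketch, and the variable names are arbitrary.

```shell
result=$(awk 'BEGIN {
    str1 = "foo"; str2 = "bar"
    joined = str1 str2          # juxtaposition concatenates: "foobar"
    n = "3" + 4                 # the string "3" is used as a number: 7
    label = n " items"          # the number 7 is used as a string
    print joined, label
}')
echo "$result"    # foobar 7 items
```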
The following is a simple example to print statistics from maillog on the number of incoming and outgoing emails. Not all awks can sort arrays, so I didn’t, and the results are printed in a fairly random order.
# Match the lines where it actually delivers the mail
$0 ~ /status=sent/ {
    # $1 is the month, $2 is the day
    date = $1 " " $2;
    # remember every date seen, so that days with only outgoing mail
    # are reported too
    dates[date] = 1;
    # uninitialized variables default to 0 when used as integers, so just
    # start adding
    if ($0 ~ /relay=local/)
        mails_in[date]++;
    else
        mails_out[date]++;
}
END {
    for (x in dates) {
        # the +0's ensure that empty values show up as 0 and not an empty string
        print x ": in=" (mails_in[x]+0) " out=" (mails_out[x]+0)
    }
}
The seamy underbelly of sed
Sed can be used as a programming language, though it lacks many things that one might expect in a language, like variables. Sed has two memory spaces, the hold space and the pattern space. For each cycle, the pattern space is cleared, a line of input is read into the pattern space, the program is run, and, if the -n flag was not given, the final contents of the pattern space are written to the output. This repeats until all input is read, or until execution is terminated with the “q” command. Nothing is ever automatically placed in the hold space, but there are several commands to manipulate it.
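As a small illustration of the two spaces, the classic one-liner below prints its input in reverse order. It relies on two hold-space commands: h, which copies the pattern space into the hold space, and G, which appends the hold space to the pattern space after a newline. Treat it as a sketch; the input is made up.

```shell
# 1!G  on every line but the first, append the hold space (the lines seen
#      so far, already reversed) to the line just read
# h    save the result back into the hold space
# $p   on the last line, print the accumulated pattern space
reversed=$(printf '%s\n' one two three | sed -n '1!G;h;$p')
echo "$reversed"
```

With the input above this prints three, two and one, each on its own line.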
The “s” command, in addition to being useful for text processing, can also be used for conditional branches. A branch point can be defined using : LABEL, and the command t LABEL will branch to this label if a successful substitution has been made since the last input line was read or the last t branch was taken. b LABEL is the unconditional counterpart. If no label is given to either t or b, they branch to the end of the script, which ends the current cycle.
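A loop built from a label and t might look like the following sketch, which inserts thousands separators by repeating one substitution until it stops matching. The label name a and the helper name add_commas are arbitrary.

```shell
# Repeat the substitution for as long as it keeps succeeding:
# :a defines the label, and ta jumps back to it after a successful s.
add_commas() {
    sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta'
}

big=$(echo 1234567 | add_commas)
echo "$big"      # 1,234,567

small=$(echo 12 | add_commas)
echo "$small"    # 12
```

Each pass splits off the last run of four or more digits, so the loop ends once no such run is left.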
Using all of this, powerful, incomprehensible programs may be written, like the implementation of the dc calculator shipped with the GNU sed source, or the following very short text adventure:
# Should be runnable either with or without -n
# Only commands supported are directions, since I didn't want this to get
# three miles long
#
# Trying very hard to use only BREs
#
# Look text shamelessly stolen from Infocom's ZORK
# restore state
# x exchanges hold and pattern spaces
# Each room must exchange back to read input
x
s/room0/&/
t room0
s/room1/&/
t room1
s/room2/&/
t room2
# default
b room0
# North goes to room1, south goes back to room0, southeast goes to room2
: room0
x
# i\ outputs text up to first line without trailing '\'
# '{' and '}' commands are used to create groups matched by a
# single address
# expression matches line containing word "look" optionally surrounded by
# whitespace, and nothing else
/^[[:space:]]*look[[:space:]]*$/{
i\
Maze\
You are in a maze of twisty little passages, all alike
b end
}
# Matches optional leading "go" and word "n" or "north"
# directions work by putting room name in pattern space, and if substitution
# was made, the room name is copied to the hold space and the pattern space
# cleared
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Nn]\([Oo][Rr][Tt][Hh]\)\{0,1\}[[:space:]]*$/room1/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room0/
# No '|' in BREs, so need two expressions for 'se' and 'southeast'
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss][Ee][[:space:]]*$/room2/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss][Oo][Uu][Tt][Hh][Ee][Aa][Ss][Tt][[:space:]]*$/room2/
t copyend
b badend
# South goes back to room0, North goes to room2
: room1
x
# Matches a line containing only the word "look", as in room0
/^[[:space:]]*look[[:space:]]*$/{
i\
West of House\
You are standing in an open field west of a white house, with a boarded\
front door.\
There is a small mailbox here.
b end
}
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Nn]\([Oo][Rr][Tt][Hh]\)\{0,1\}[[:space:]]*$/room2/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room0/
t copyend
b badend
# East wins and quits, West goes to room0, South goes to room1
: room2
x
/^[[:space:]]*look[[:space:]]*$/{
i\
Stone Barrow\
You are standing in front of a massive barrow of stone. In the east face is a\
huge stone door which is open. You cannot see into the dark of the tomb.
b end
}
# delete input so not printed when quitting
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ee]\([Aa][Ss][Tt]\)\{0,1\}[[:space:]]*$//
t win
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ww]\([Ee][Ss][Tt]\)\{0,1\}[[:space:]]*$/room0/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room1/
t copyend
b badend
: win
i\
You win!
# d starts a new cycle, so is not good for deleting pattern space and quitting
# there will be an extra newline printed out at the end if -n not used
q
: badend
# assumes all unknown commands are directions, for brevity
# strips off leading "go", prints out rest
# does nothing if there is no input
/./s/\(^[[:space:]]*go[[:space:]]*\)\{0,1\}\(.*\)/There is no exit to the \2/p
b end
: copyend
# h replaces the hold space with the contents of the pattern space
h
: end
# delete whatever is left in the pattern space so it is not printed
d
Interaction with this little script may look something like the following, where look, go southeast, look and e are the lines typed by the player and the rest is output:
$ sed -f adventure.sed
look
Maze
You are in a maze of twisty little passages, all alike
go southeast
look
Stone Barrow
You are standing in front of a massive barrow of stone. In the east face is a
huge stone door which is open. You cannot see into the dark of the tomb.
e
You win!
$
If you ever find yourself using branches in sed, it may be time to consider another language.