The Citrine Citadel
Posted 2024-06-29
I have successfully ported several nontrivial awk programs to work on quite a lot of real-world awk implementations. If I have learned anything from this process, it is that even though almost every system of interest has a POSIX-like “new awk” implementation, which is a great tool for writing portable scripts, there is quite a lot of variation in details. There is simply no substitute for real-world interoperability testing.
This article describes actual issues I have encountered, and suggests workarounds for each. Examples mention specific operating systems and versions to aid in reproducing results. They indicate which systems I actually ran the examples on to demonstrate a particular implementation behaviour. They are not intended to suggest that a particular problem occurs only on that one OS version (even though that may indeed be the case).
For discussion of traditional (pre-POSIX, “old awk”) implementations, the GNU Autoconf manual has a lot of detail, but it is relatively sparse in its coverage of portability problems amongst POSIX “new awk” implementations.
POSIX specifies that awk -f - reads its program from standard input. However, AIX 7.2 awk reads from a file named - instead:
aix72% awk -f -
awk: 0602-546 Cannot find or open file -.
To work around this problem, pass the program directly as an argument or write it to a named file first.
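A minimal sketch of the named-file workaround, assuming a POSIX shell and that the mktemp utility is available (it is common, but not specified by POSIX). The helper name run_awk_stdin is hypothetical.

```shell
# run_awk_stdin: hypothetical helper that emulates "awk -f -".
# It saves the program arriving on standard input to a temporary file,
# then runs awk on that file. Any arguments are passed on to awk.
run_awk_stdin() {
    prog=$(mktemp) || return 1
    cat >"$prog"
    # Standard input was consumed by the program text, so awk gets /dev/null.
    awk -f "$prog" "$@" </dev/null
    status=$?
    rm -f "$prog"
    return $status
}

echo 'BEGIN { print "hello" }' | run_awk_stdin
```

Because the program text uses up standard input, this sketch only suits programs that take their data from file operands or need no input at all.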
Solaris 8 nawk requires a space between the -f option and its argument:
solaris8% echo 'BEGIN { print "hello"; }' | nawk -f-
nawk: no program filename
solaris8% echo 'BEGIN { print "hello"; }' | nawk -f -
hello
ULTRIX 4.5 nawk does not support the -v option. You can typically use an assignment right after the program instead, for example:
ultrix45% echo | nawk '{ print var; }' var=hello
hello
but note that such assignments are performed after BEGIN actions.
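The timing difference can be seen with a short sketch like the following. The function name is hypothetical, and the behaviour shown is what a POSIX-like awk is expected to do with an assignment operand.

```shell
# The operand var=hello is processed only when awk reaches it in the
# operand list, which happens after all BEGIN actions have run.
show_assignment_timing() {
    echo | awk 'BEGIN { print "BEGIN sees [" var "]" }
                { print "main sees [" var "]" }' var=hello
}

show_assignment_timing
```

The BEGIN action sees an empty value, while the main action sees the assigned one.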
POSIX specifies that backslash escape sequences are evaluated in command-line variable assignments as if they appeared in a string literal in the awk program. However, ULTRIX 4.5 nawk interprets backslashes literally, for example:
gnu% echo | gawk '{ print var; }' var='\\'
\
ultrix45% echo | nawk '{ print var; }' var='\\'
\\
Replace backslashes with some other character(s) to avoid this problem. You can use gsub to restore them inside the awk program (but see notes on substituting literal backslashes, below).
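One possible shape for this workaround is sketched below. It makes several assumptions: the placeholder "@" never occurs in the value, the character set is ASCII-based (so that character code 92 is a backslash), and the helper names are hypothetical. To sidestep the gsub differences entirely, this variant restores the backslashes with index and substr rather than gsub.

```shell
# pass_with_backslashes: hypothetical helper. It swaps each backslash for
# a placeholder before the assignment, then restores them inside awk.
pass_with_backslashes() {
    enc=$(printf '%s\n' "$1" | sed 's/\\/@/g')
    echo | awk '
    # restore: replace every occurrence of mark in s with bs, literally.
    function restore(s, mark, bs,    out, i) {
        out = ""
        while ((i = index(s, mark)) > 0) {
            out = out substr(s, 1, i - 1) bs
            s = substr(s, i + length(mark))
        }
        return out s
    }
    {
        bs = sprintf("%c", 92)   # backslash, assuming an ASCII-based character set
        print restore(var, "@", bs)
    }' var="$enc"
}

pass_with_backslashes 'C:\dir\file'
```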
Normally, assignment to $0 recomputes NF. However, with AIX 7.2 awk, this does not happen for such assignments in END actions. In this case, NF retains its prior value:
aix72% echo a b c | awk 'END { $0 = "x"; print NF; }'
3
You can use the split function instead to work around this issue.
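A minimal sketch of the split-based alternative, with hypothetical names: the return value of split plays the role that NF would have played after the assignment.

```shell
# Instead of assigning to $0 in END and reading NF, split the string
# into an array and use the element count that split returns.
count_fields_in_end() {
    echo a b c | awk 'END { n = split("x y", f); print n, f[1], f[2] }'
}

count_fields_in_end
```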
ULTRIX 4.5 nawk has a bug where sometimes the wrong number of characters is copied if $0 is assigned to another variable after it has been directly modified by the program. For example:
ultrix45% echo x | nawk '{ $0 = "hello"; x = $0; print x "rld"; }'
hrld
ultrix45% echo xx | nawk '{ $0 = "hello"; x = $0; print x "rld"; }'
herld
This bug only occurs with $0, and can be avoided with an intervening assignment to one of the field variables, or if the assignment uses $0 in a slightly more complex expression, such as:
ultrix45% echo x | nawk '{ $0 = "hello"; $1 = $1; x = $0; print x "rld"; }'
hellorld
ultrix45% echo x | nawk '{ $0 = "hello"; x = "" $0; print x "rld"; }'
hellorld
AIX 7.2 awk fails to substitute a replacement string containing "\1" (start-of-heading) characters with either the sub or gsub functions. Any "\1" characters in the replacement are silently changed to ampersands instead:
aix72% awk 'BEGIN { s="x"; sub("x","\1",s); sub("\1","x",s); print s; }'
&
The issue only affects characters in the replacement text. If some other character can be used instead of "\1", there is no problem with (g)sub. Otherwise, use index, match and/or substr instead of (g)sub.
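As a rough sketch of the match/substr approach, the hypothetical function subst_first below replaces the first match of a regexp with a replacement that is inserted literally, so a "\1" character in it needs no special treatment. The regexp is passed as a string, and "&" has no special meaning here, unlike in sub.

```shell
replace_demo() {
    awk '
    # subst_first: hypothetical helper; replaces the first match of the
    # regexp string re in s with repl, inserted literally.
    function subst_first(s, re, repl) {
        if (match(s, re))
            return substr(s, 1, RSTART - 1) repl substr(s, RSTART + RLENGTH)
        return s
    }
    BEGIN {
        s = subst_first("axb", "x", "\1")
        # Report the length of the result and the position of the "\1" character.
        print length(s), index(s, "\1")
    }'
}

replace_demo
```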
ULTRIX 4.5 nawk does not understand octal escapes in ERE literals, but it works as expected when a string containing such characters is converted to a regexp. For example:
ultrix45% echo '\01' | nawk '/\1/ { print "match"; }'
awk: syntax error "number in \[0-9] invalid" in /\1/
Context is:
>>> /\1/ <<<
ultrix45% echo '\01' | nawk '$0 ~ "\1" { print "match"; }'
match
Various awk implementations differ in their handling of backslashes in the replacement strings passed to sub or gsub.
If the replacement string is "\\\\\\\\" (i.e., contains four consecutive backslashes), then GNU gawk and ULTRIX 4.5 nawk will substitute two backslashes, while most other systems substitute four:
gnu% echo /x/ | gawk 'gsub(/x/, "\\\\\\\\")'
/\\/
gnu% echo /x/ | POSIXLY_CORRECT=1 gawk 'gsub(/x/, "\\\\\\\\")'
/\\/
aix72% echo /x/ | awk 'gsub(/x/, "\\\\\\\\")'
/\\\\/
ultrix45% echo /x/ | nawk 'gsub(/x/, "\\\\\\\\")'
/\\/
If the replacement string is "\\\\" (two backslashes), then GNU gawk (in POSIX-conforming mode) and ULTRIX 4.5 nawk will substitute one backslash, while most other systems substitute two:
gnu% echo /x/ | gawk 'gsub(/x/, "\\\\")'
/\\/
gnu% echo /x/ | POSIXLY_CORRECT=1 gawk 'gsub(/x/, "\\\\")'
/\/
aix72% echo /x/ | awk 'gsub(/x/, "\\\\")'
/\\/
ultrix45% echo /x/ | nawk 'gsub(/x/, "\\\\")'
/\/
If the replacement string is "\\" (one backslash), then most implementations will substitute a single backslash, except ULTRIX 4.5 nawk, which substitutes nothing and then, for good measure, also deletes the rest of the input string:
gnu% echo /x/ | gawk 'gsub(/x/, "\\")'
/\/
ultrix45% echo /x/ | nawk 'gsub(/x/, "\\")'
/
To work around all of these differences, construct a replacement string based on a runtime probe of what actually happens:
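BEGIN {
  # Probe how this awk treats a two-backslash replacement string.
  bs = "x"; sub(/x/, "\\\\", bs);
  # If the probe produced one backslash, two are needed to substitute one.
  bs = (length(bs) == 1 ? "\\\\" : "\\");
}
gsub(/x/, bs) # portably substitute a single backslash
gsub(/y/, bs bs) # portably substitute two consecutive backslashes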
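For trying the probe on a given system, a self-contained sketch along these lines may help. The function name and the sample input are illustrative only.

```shell
probe_demo() {
    printf '%s\n' '/x/' '/y/' | awk '
    BEGIN {
        # Probe: substitute a two-backslash replacement and measure the result.
        bs = "x"; sub(/x/, "\\\\", bs)
        bs = (length(bs) == 1 ? "\\\\" : "\\")
    }
    # Replace x with one backslash and y with two, then print the line.
    { gsub(/x/, bs); gsub(/y/, bs bs); print }'
}

probe_demo
```

If the probe does its job, the first output line contains one backslash between the slashes and the second contains two.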
POSIX specifies that if a function call has fewer arguments than the number of parameters in the function definition, then the additional parameters are treated as uninitialized scalars or arrays depending on how they are used in the function body.
However, ULTRIX 4.5 nawk treats all such excess parameters as scalars, and using them as arrays in the function body leads to unpredictable results.
To work around this problem, use a global array with a unique name instead, and explicitly delete all its elements at the beginning or the end of the function (such as by writing split("",global_array_name)).
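A small sketch of this pattern follows. The function count_unique and the "_cu_" name prefix are hypothetical, and the prefix is simply assumed not to clash with anything else in the program. Scalar locals are still declared as extra parameters, since only array use is affected.

```shell
count_unique_demo() {
    printf '%s\n' 'a b a c' 'x x' | awk '
    # count_unique: count the distinct words in s. The scratch arrays are
    # globals, since array-valued locals are unreliable on the affected awk.
    function count_unique(s,    n, i, c) {
        split("", _cu_seen)        # delete all elements left by earlier calls
        n = split(s, _cu_word)     # split replaces the contents of _cu_word
        c = 0
        for (i = 1; i <= n; i++) {
            if (_cu_word[i] in _cu_seen)
                continue
            _cu_seen[_cu_word[i]] = 1
            c++
        }
        return c
    }
    { print count_unique($0) }'
}

count_unique_demo
```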
HP-UX 11 awk exits with an error if any input line contains more than 199 fields. If this might be a problem, set FS to some garbage and, if necessary, use the split function.
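A sketch of what this might look like, assuming the byte "\001" never occurs in the input (any separator that cannot appear would do) and using a hypothetical function name. With such an FS each record is a single field, and split then divides it on whitespace on demand.

```shell
last_word() {
    awk 'BEGIN { FS = "\001" }    # a byte assumed never to occur in the input
         { n = split($0, f, " "); print NF, n, f[n] }'
}

printf 'a b c\n' | last_word
```

The output shows NF as 1, followed by the element count from split and the last element.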
HP-UX 11 awk exits with an error if any input line exceeds 3070 bytes. This applies to normal input and all variations of the getline function. You might be able to preprocess the input to split long lines before further processing in awk.
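Where breaking lines at arbitrary points is acceptable for the data in question, one possible preprocessing step is the POSIX fold utility. The sketch below is only an illustration: the function name is hypothetical and the default width of 1000 bytes is an arbitrary value chosen to stay well below the limit.

```shell
# split_long_lines: break lines longer than the given number of bytes
# (default 1000). Each piece then reaches awk as a separate record.
split_long_lines() {
    fold -b -w "${1:-1000}"
}

# Demonstration with a tiny width so the effect is visible.
printf 'abcdefgh\n' | split_long_lines 3 | awk '{ print NR ": " $0 }'
```

Since every piece becomes its own record, the awk program has to cope with records that no longer correspond to whole input lines.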
ULTRIX 4.5 nawk regular expression matching fails if too much data would be matched by the * or + regex operators. The exact limit varies depending on the regex, but for example .* fails to match a substring longer than 4093 bytes:
ultrix45% nawk 'BEGIN { x="x"; for (i = 0; i < 12; i++) x = (x x);
match(x, /...*/); print length(x), RSTART, RLENGTH;
match(x, /....*/); print length(x), RSTART, RLENGTH;
}'
4096 0 0
4096 1 4096
HP-UX 11 awk misparses most expressions where the unary ! operator is used as the operand of a binary operator, for example:
hpux11% awk 'BEGIN { print !0 + 1; }'
0
hpux11% awk 'BEGIN { print 1 + !0; }'
syntax error The source line is 1.
The error context is
BEGIN { print 1 + >>> ! <<< 0 }
awk: The statement cannot be correctly parsed.
The source line is 1.
Add parentheses to avoid the problem:
hpux11% awk 'BEGIN { print 1 + (!0), (!0) + 1; }'
2 2
POSIX specifies that pattern expressions are evaluated in a boolean context and that in such contexts nonempty strings are true and empty strings are false. However, ULTRIX 4.5 nawk treats pattern expressions as integers, so strings which convert to a nonzero integer are true and all other strings are false:
gnu% echo x | gawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
1
ultrix45% echo x | nawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
0
ultrix45% echo 9 | nawk 'BEGIN { x=0; } $0 { x=1; } END { print x; }'
1
The same bug also occurs if a string is used as the operand of any of the logical operators (!, && or ||), but the bug does not occur if a string is used as the first operand of the ?: operator, or if a string is used as the conditional expression of an if, for or while statement.
Use an explicit string comparison (e.g., $0 != "") to work around this problem.
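Applied to the earlier example, the explicit comparison might look like this sketch (the function name is hypothetical):

```shell
# Set the flag whenever some input line is a nonempty string, without
# relying on how the implementation converts strings to booleans.
nonempty_flag() {
    awk 'BEGIN { x = 0 } $0 != "" { x = 1 } END { print x }'
}

echo x | nonempty_flag
echo | nonempty_flag
```

The first call prints 1 and the second, whose only input line is empty, prints 0.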
To include a literal closing bracket in a character class, busybox awk accepts only the form []], while [\]] is interpreted as “backslash, followed by a closing bracket”. Meanwhile, Solaris 8 nawk accepts only the form [\]] and []] fails to match any input. GNU awk accepts either form as equivalent. For example:
alpine% printf 'a]\nb\\]\n' | busybox awk '/^.[\]]$/; /^.[]]$/'
a]
b\]
gnu% printf 'a]\nb\\]\n' | gawk '/^.[\]]$/; /^.[]]$/'
a]
a]
solaris8% printf 'a]\nb\\]\n' | nawk '/^.[\]]$/; /^.[]]$/'
a]
For a normal (non-complemented) character class, you can use the equivalent (]|[xyz]) instead. For a complemented class, in general there is no equivalent (and portable) regular expression, so the code must be restructured to avoid using such classes (probably by using some of awk’s other string manipulation features).
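A brief sketch of the alternation form for the non-complemented case, with a hypothetical function name and sample input. It selects lines made of any one character followed by a closing bracket, x, y or z.

```shell
# Equivalent of the class "closing bracket, x, y or z", written as an
# alternation so the bracket never appears inside a bracket expression.
match_bracket_or_letters() {
    awk '/^.(]|[xyz])$/'
}

printf 'a]\nax\nab\n' | match_bracket_or_letters
```

Of the three sample lines, only the first two should be printed.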
ULTRIX 4.5 nawk will prefer the left alternative of the | regular expression operator (rather than the POSIX-specified longest matching substring) in cases where both alternatives match but the left alternative is shorter:
ultrix45% echo abcd | nawk 'sub(/a|abc/, "#") { print; }'
#bcd
ultrix45% echo abcd | nawk 'sub(/abc|a/, "#") { print; }'
#d
ultrix45% echo abcd | nawk 'sub(/(abc|a)d/, "#") { print; }'
#
In many situations this does not actually make a difference but it can affect, for example, the result of the sub, gsub and match functions.
Try to arrange for the alternatives to be mutually exclusive, or for the
left alternative to match at least as much text as the right.
Busybox awk exits with an error if you attempt to use * in printf or sprintf conversions, for example:
alpine% busybox awk 'BEGIN { printf "%*s\n", 10, "hello"; }'
awk: cmd. line:1: %*x formats are not supported
Generate the format string dynamically to work around this issue.
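For instance, the field width can be concatenated into the format string before it is used. In this sketch, the function names pad and pad_demo are hypothetical.

```shell
pad_demo() {
    awk '
    # pad: right-justify s in a field of the given width, building the
    # conversion (e.g. "%10s") at run time instead of using "%*s".
    function pad(width, s) {
        return sprintf("%" width "s", s)
    }
    BEGIN { print "[" pad(10, "hello") "]" }'
}

pad_demo
```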
Busybox awk does not support using getline <"-" to read a line from standard input. It will read from a file named - instead. On the other hand, specifying a filename of - on the command line or by modifying the ARGV array does work to read from standard input.
Normally it is easy enough to structure the program so that it is not required to read from standard input while the normal awk input is something else, in which case this is not a serious limitation.
If a workaround for this issue is truly needed, "cat" | getline can be used with busybox awk to read from standard input.
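A minimal sketch of the cat-based approach, using a BEGIN-only program so that awk itself does not consume standard input. The function name is hypothetical.

```shell
read_stdin_line() {
    awk 'BEGIN {
        # The "cat" child process inherits the standard input of awk.
        if (("cat" | getline line) > 0)
            print "got: " line
        close("cat")
    }'
}

echo from-stdin | read_stdin_line
```

Note that cat may read ahead and consume more of standard input than the one line that getline returns.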
ULTRIX 4.5 nawk behaves unpredictably if the third argument to split is an ERE literal. Use a string (which is converted to a regexp) instead.
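For example, a separator that would otherwise be written as the ERE literal /[0-9]+/ can be passed as a string, as in this sketch (the function name is hypothetical):

```shell
split_on_digits() {
    awk 'BEGIN {
        # The separator is a string, which split converts to a regexp.
        n = split("a1b22c", part, "[0-9]+")
        print n, part[1], part[2], part[3]
    }'
}

split_on_digits
```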