Perl / Unix One-liner Cage Match, Part 1


A shell (like Bash) provides built-in commands and scripting features to easily solve and automate various tasks. External commands like grep, sed, Awk, sort, find, or parallel can be combined to work with each other. Sometimes you can use Perl either as a single replacement or a complement to them for specific use cases.

Perl is a robust, portable option for text processing. It has a feature-rich regular expression engine, built-in functions, and an extensive ecosystem. However, Perl can be slower than specialized tools and more verbose.

One-liners or scripts?

For assembly-level testing of a digital signal processing (DSP) chip, I had to replicate the same scenario for multiple address ranges. My working knowledge of the Linux command line was limited at that time and I didn’t know how to use sed or Awk. I used Vim and Perl for all sorts of text processing needs.

I didn’t know about Perl’s options for one-liners, so I used to modify a script whenever I had to do substitutions for multiple files. Once, I even opened the files as Vim buffers and applied a bufdo command to see if that would make my workflow simpler. If I had known about Perl one-liners, I could have easily utilized find and Bash globs to make my life easier, for example:

$ perl -i -pe 's/0xABCD;/0x1234;/; s/0xDEAD;/0xBEEF;/' *.tests

The -i option writes the changes back to the source files. If needed, I can pass an argument to create a backup of each original file. For example, -i.bkp creates ip.txt.bkp as the backup when ip.txt is the input file. I can also put the backups in another existing directory. The * in the argument expands to the original filename:

$ mkdir backups
$ perl -i'backups/*' -pe 's/SEARCH/REPLACE/g' *.txt

Powerful regexp features

Perl regexps are much more powerful than either basic or extended regular expressions used by utilities. The common features I often use are non-greedy and possessive quantifiers, lookarounds, the /e flag, subexpression calls, and (*SKIP)(*FAIL). Here are some examples from StackOverflow threads that I have answered over the years.

Skip some matches

This question needed to convert avr-asm to arm-gnu comments. The starting file looks like this:

ABC r1,';'
ABC r1,";" ; comment
  ;;;

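I need to change the ; that starts a comment to @, but a ; within single or double quotes shouldn’t be affected. I can match a quoted ; in the first branch of the alternation and use (*SKIP)(*F) to not replace those. Without /g, only the first unquoted ; on each line changes, so the rest of the comment stays as it is: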

$ perl -pe 's/(?:\x27;\x27|";")(*SKIP)(*F)|;/@/' ip.txt
ABC r1,';'
ABC r1,";" @ comment
  @;;

I use (*SKIP)(*F) so often that I wish it had a shorter syntax, (*SF) for example.

Replace a string with an incrementing value

I can replace strings with an incrementing value. The /e on a substitution allows me to treat the replacement side as Perl code. Whatever that code evaluates to is the replacement. That can be a variable that I increment:

$ echo 'a | a | a | a | a | a | a | a' | perl -pe 's/ *\| */$i++/ge'
a0a1a2a3a4a5a6a

Reverse a substring

I also used the /e trick to reverse the text matched by a pattern:

$ echo 'romarana789:qwerty12543' | perl -pe 's/\d+$/reverse $&/e'
romarana789:qwerty34521

Do some arithmetic

Adding another /e to get /ee means there are two rounds of evaluation. Perl evaluates the replacement side to get a string, then evaluates that string as Perl code. In Arithmetic replacement in a text file, I need to find simple arithmetic, like 25100+10, and replace that with its arithmetic result:

id=25100+10
xyz=1+
abc=123456
conf_string=LMN,J,IP,25100+1,0,3,1

I can do that with one /e by matching the numbers and doing some Perl on the replacement side:

$ perl -pe 's/(\d+)\+(\d+)/$1+$2/ge' ip.txt
id=25110
xyz=1+
abc=123456
conf_string=LMN,J,IP,25101,0,3,1

But instead of matching the numbers separately, I can match the whole expression. The match is in $&, so the first /e interpolates that to 25100+10. The second round runs that as Perl, which is addition:

$ perl -pe 's/\d+\+\d+/$&/gee' ip.txt
id=25110
xyz=1+
abc=123456
conf_string=LMN,J,IP,25101,0,3,1

That would also make it easier to handle a set of operators:

$ echo '2+3 10-3 8*8 11/5' | perl -pe 's|\d+[+/*-]\d+|$&|gee'
5 7 64 2.2

Handling the newline

I want to un-hyphenate this text:

Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

Unlike sed and Awk, you can choose to preserve the record separator in Perl. That makes it easier to solve this problem:

$ perl -pe 's/-\n//' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

See remove dashes and replace newlines with spaces for a similar problem and to compare the Perl solution with sed/Awk.

Multiline fixed-string substitution

Escaping regexp metacharacters is simpler with built-in features in Perl. Combined with slurping the entire input file as a single string, I can easily perform multiline fixed-string substitutions. Consider this sample input:

This is a multiline
sample input with lots
of special characters
like . () * [] $ {}
^ + ? \ and ' and so on.

Say you have a file containing the lines you wish to match:

like . () * [] $ {}
^ + ? \ and ' and so on.

And a file containing the replacement string:

---------------------
$& = $1 + $2 / 3 \ 4
=====================

Here’s one way to do it with Perl:

$ perl -0777 -ne '$#ARGV==1 ? $s=$_ : $#ARGV==0 ? $r=$_ :
                  print s/\Q$s/$r/gr' search.txt replace.txt ip.txt
This is a multiline
sample input with lots
of special characters
---------------------
$& = $1 + $2 / 3 \ 4
=====================

Note that in the above solution, contents of search.txt and replace.txt are also processed by the Perl command. Avoid using shell variables to save their contents, since trailing newlines and ASCII NUL characters will require special attention.

Awk and sed do not have an equivalent option to slurp the entire input file content. Sed is Turing complete and Awk is a programming language, so you can write code for it if you wish, in addition to the code you’d need for escaping the metacharacters.

Better regexp support

Some other regexp libraries have problems tied to whatever they use to implement them. GNU versions, for example, may have some bugs that other implementations may not have. Which version you use can give different results. Perl, however, has the same bugs everywhere.

Back references

There’s a problem with backreferences in glibc that I found and reported for grep. This bug is seen in at least GNU implementations of grep and sed. As far as I know, no implementation of Awk supports backreferences within regexp definition.

I wanted to get words having two occurrences of consecutive repeated characters. This example takes some time and results in no output:

$ grep -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words

It does work when the nesting is unrolled or PCRE is used:

$ grep -xiE '[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
...

$ grep -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
...

Here’s the Perl, which is the original regexp:

$ perl -ne 'print if /^([a-z]*([a-z])\2[a-z]*){2}$/i' /usr/share/dict/words
Abbott
Annabelle
...

Word boundaries

Why doesn’t this sed command replace the 3rd-to-last “and”? shows another interesting bug when word boundaries and group repetition are involved. This bug shows up in anything that uses the regexp code from glibc, as most tools on Linux do.

This incorrectly matches, even though there is no word boundary in the middle of “cocoa”:

$ sed --version
sed (GNU sed) 4.8
$ echo 'cocoa' | sed -nE '/(\bco){2}/p'
cocoa

Without the quantifier, sed has no problem and finds no match. Perl finds no match even with the quantifier:

$ echo 'cocoa' | sed -nE '/\bco\bco/p'
$ echo 'cocoa' | perl -ne 'print if /(\bco){2}/'

Here’s another example from GNU sed. This modifies the line because it thinks it finds “it” as a separate word two times after “with”, but the second is really in the middle of “sit”:

$ echo 'it line with it here sit too' | sed -E 's/with(.*\bit\b){2}/XYZ/'
it line XYZ too

Change the pattern to get rid of the quantifier and it works correctly:

$ echo 'it line with it here sit too' | sed -E 's/with.*\bit\b.*\bit\b/XYZ/'
it line with it here sit too
$ echo 'it line with it here sit too it a' | sed -E 's/with.*\bit\b.*\bit\b/XYZ/'
it line XYZ a

# Perl doesn't need such workarounds
$ echo 'it line with it here sit too' | perl -pe 's/with(.*\bit\b){2}/XYZ/'
it line with it here sit too
$ echo 'it line with it here sit too it a' | perl -pe 's/with(.*\bit\b){2}/XYZ/'
it line XYZ a

Stay tuned

I’ll have more in Part 2, where I’ll delve into XML, JSON, and CSV.

[image from Dim Sum! on Flickr, (CC BY-NC-ND 2.0)]

Sundeep Agarwal

Sundeep Agarwal is addicted to writing books and reading fiction (mostly fantasy and sci-fi).
