How to parse binary data with Perl

Parsing binary data is one of those tasks that seems to come up rarely, but is useful to know. Many common file types like images, music, timestamps, network packets and auth logs all come in binary flavors. Unfortunately it’s nowhere near as exciting as the fictitious depictions from Hackers. The good news though is parsing binary data with Perl is easy using the unpack function. I’m going to walk you through the three steps you’ll need when working with binary data.

1. Open a binary filehandle

Start things off right by opening a filehandle to a binary file:

use autodie;
open my $fh, '<:raw', '/usr/share/zoneinfo/America/New_York';

This is a suitably Modern Perlish beginning. I start by importing autodie, which makes built-in functions such as open and read die if they fail. This avoids repetitive ... or die "IO failed" constructs.

Next I use the :raw IO layer to open a filehandle to a binary file. This will avoid newline translation issues. No need for binmode here. The file I’m opening is a history of New York timezone changes, from the tz database.
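For comparison, here is a minimal sketch of the two ways to get a binary filehandle: the :raw layer in the open call, and a plain open followed by binmode. It writes four arbitrary bytes to a scratch file (created with File::Temp purely for illustration) and reads them back both ways. The variable names are my own.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempfile);

# Write four arbitrary bytes, including CR and LF, to a scratch file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\x00\x0d\x0a\xff";
close $tmp_fh;

# Style 1: request the :raw layer in the open call.
open my $raw_fh, '<:raw', $tmp_path;
read $raw_fh, my $via_layer, 4;
close $raw_fh;

# Style 2: plain open, then binmode before the first read.
open my $plain_fh, '<', $tmp_path;
binmode $plain_fh;
read $plain_fh, my $via_binmode, 4;
close $plain_fh;

print $via_layer eq $via_binmode ? "identical\n" : "different\n";
```

Both reads return the same four bytes; the :raw form just saves a line.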

2. Read a few bytes

All binary files have a specific format that they follow. In the case of the zoneinfo files, the first 44 bytes/octets are the header, so I’ll grab that:

use autodie;
open my $fh, '<:raw', '/usr/share/zoneinfo/America/New_York';

my $bytes_read = read $fh, my $bytes, 44;
die "Got $bytes_read bytes but expected 44" unless $bytes_read == 44;

Here I use read to read in 44 bytes of data into the variable $bytes. The read function returns the number of bytes read; it’s good practice to check this as read may not return the expected number of bytes if it reaches the end of the file. In this case, if the file ends before the header does, we know we’ve got bad data and bail out.
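If you read several fixed-size chunks, the length check can live in one place. The following is a sketch only: read_exactly is a hypothetical helper name, and the in-memory filehandle is stand-in data so the example runs without a zoneinfo file.

```perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: read $want bytes or die with a useful message.
sub read_exactly {
    my ($fh, $want) = @_;
    my $got = read $fh, my $buffer, $want;
    die "Got $got bytes but expected $want\n" unless $got == $want;
    return $buffer;
}

# Stand-in data: an in-memory filehandle holding only ten bytes.
my $data = "TZif2\0\0\0\0\0";
open my $fh, '<:raw', \$data;

my $first = read_exactly($fh, 4);
print "$first\n";    # TZif

# Only six bytes remain, so asking for 44 must fail.
my $ok = eval { read_exactly($fh, 44); 1 };
print "short read detected\n" unless $ok;
```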

3. Unpack bytes into variables

Now comes the fun part. I’ve got to split out the data in $bytes into separate Perl variables. The tzfile man page defines the header format:

Timezone information files begin with the magic characters “TZif” to identify them as timezone information files, followed by a character identifying the version of the file’s format (as of 2005, either an ASCII NUL (’\0’) or a ‘2’) followed by fifteen bytes containing zeros reserved for future use, followed by six four-byte values of type long

Tzfile manual

The unpack function takes a template of the binary data to read (this is defined in the pack documentation) and returns Perl variables. I’m going to match up the header description with the template codes to design the template.

Description   Example   Type     Length (bytes)   Template code
Magic chars   TZif      String   4                a4
Version       2         String   1                a
Reserved      0         Ignore   15               x15
Numbers       244       Long     4 each (x 6)     N N N N N N

The header begins with the magic chars “TZif”, which take up 4 bytes. The template code a4 matches this. Next is the version, a single ASCII character matched by a (because the strings are not space or null padded, I could have used A instead). The next 15 bytes are reserved and can be ignored, so I use x15 to skip over them. Finally there are 6 numbers of type long, each a 4-byte big endian value matched by N. I write N six times to mirror the six header fields; a repeat count (N6) would return the same six values.
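A quick way to check a template is to pack some known values with it and unpack them again. This sketch builds a synthetic 44-byte header (the six counts are made-up values, not real tzfile data) and also shows how a and A differ when the data ends in null bytes.

```perl
use strict;
use warnings;

# Build a synthetic header; in pack, x15 writes 15 null bytes.
my $header = pack 'a4 a x15 N N N N N N', 'TZif', '2', 1, 2, 3, 4, 5, 6;
printf "%d bytes\n", length($header);    # 44 bytes

# Unpack it with the same template.
my ($magic, $version, @counts) = unpack 'a4 a x15 N N N N N N', $header;
print "$magic $version @counts\n";       # TZif 2 1 2 3 4 5 6

# a returns the bytes unchanged; A strips trailing spaces and nulls.
my ($kept)     = unpack 'a4', "ab\0\0";
my ($stripped) = unpack 'A4', "ab\0\0";
printf "%d %d\n", length($kept), length($stripped);    # 4 2
```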

use autodie;
open my $fh, '<:raw', '/usr/share/zoneinfo/America/New_York';

my $bytes_read = read $fh, my $bytes, 44;
die "Got $bytes_read bytes but expected 44" unless $bytes_read == 44;

my ($magic, $version, @numbers) = unpack 'a4 a x15 N N N N N N', $bytes;

This code passes my template to unpack and it returns the variables we asked for. Now that they’re in Perl variables, the hard part is done. In the case of a tzfile, the header defines the length of the body of the file, so I can use these variables to calculate how much more data to read from the file.
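As a rough sketch of that calculation: the six header numbers are counts, and each count multiplies a fixed record size. The field order and record sizes below are my reading of the tzfile man page for the version 1 data block, and the hash keys are my own labels, so check both against the man page before relying on them. The header is rebuilt from the counts visible in the hexdump output later in this article, so the example runs without the file.

```perl
use strict;
use warnings;

# Synthetic header using the counts from the New York hexdump.
my $header = pack 'a4 a x15 N N N N N N', 'TZif', '2', 5, 5, 0, 236, 5, 20;

my ($magic, $version, @numbers) = unpack 'a4 a x15 N N N N N N', $header;
die "Not a tzfile\n" unless $magic eq 'TZif';

# Label the counts in header order (labels are assumptions, see above).
my %count;
@count{qw(isutcnt isstdcnt leapcnt timecnt typecnt charcnt)} = @numbers;

# Size in bytes of the version 1 data block that follows the header.
my $body_length =
      $count{timecnt} * 4    # transition times, 4 bytes each
    + $count{timecnt}        # transition type indexes, 1 byte each
    + $count{typecnt} * 6    # local time type records, 6 bytes each
    + $count{charcnt}        # abbreviation characters
    + $count{leapcnt} * 8    # leap second records, 8 bytes each
    + $count{isstdcnt}       # standard/wall indicators, 1 byte each
    + $count{isutcnt};       # UT/local indicators, 1 byte each

print "$body_length\n";      # 1240
```

With a real filehandle, the next step would be a read of $body_length bytes, checked the same way as the header read.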

If you’re interested in how to parse the rest of a tzfile, check out the source code of my module Time::Tzfile.

Troubleshooting

Sometimes you’ll unpack some binary data and get garbage. This happens when the template passed to unpack doesn’t match the binary data. The first thing you can do is print the binary data to the terminal with hexdump.

Here are the first 44 bytes of the New York tzfile:

$ hexdump -c -n 44 /usr/share/zoneinfo/America/New_York
0000000   T   Z   i   f   2  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000010  \0  \0  \0  \0  \0  \0  \0 005  \0  \0  \0 005  \0  \0  \0  \0
0000020  \0  \0  \0 354  \0  \0  \0 005  \0  \0  \0 024

This gives you a chance to inspect the data byte by byte and see if it matches your template. To create a template to match binary data, take it one value at a time. Consider the type of value you’re trying to match. Get the right bit length and for numbers, be sure to know if it is signed or unsigned.
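If hexdump isn’t available, or you want to inspect bytes already held in a Perl variable, unpack can produce a similar view. This is a sketch: hex_rows is a hypothetical helper, and $bytes is stand-in data rather than a real file read.

```perl
use strict;
use warnings;

# Stand-in for bytes read from a file: magic, version, 15 nulls, one count.
my $bytes = "TZif2" . ("\0" x 15) . pack('N', 5);

# Hypothetical helper: format bytes as rows of 16 two-digit hex values,
# each prefixed with the row's offset in hex.
sub hex_rows {
    my ($data) = @_;
    my @rows;
    for (my $offset = 0; $offset < length($data); $offset += 16) {
        my $chunk = substr $data, $offset, 16;
        my $hex   = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $chunk;
        push @rows, sprintf('%07x  %s', $offset, $hex);
    }
    return @rows;
}

print "$_\n" for hex_rows($bytes);
# 0000000  54 5a 69 66 32 00 00 00 00 00 00 00 00 00 00 00
# 0000010  00 00 00 00 00 00 00 05
```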

The other thing to be aware of is the endianness of the data. Often man pages will say a value is in “standard” or “network” byte order. This means big endian. Tzfiles have several 32 bit signed integers in big endian order. No single unpack template letter matches that type, so I need to use l>. The l matches signed 32 bit integers and the > is a modifier which tells Perl the value is big endian.
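To see why signedness matters, here is a small sketch that unpacks the same four bytes with N and with l>. The bytes are the big endian encoding of -18000 (five hours west of UTC, in seconds), chosen by me as an illustrative value. The > modifier needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Four bytes as they would sit in a file: big endian encoding of -18000.
my $raw = "\xff\xff\xb9\xb0";

my ($unsigned) = unpack 'N',  $raw;    # unsigned 32 bit, big endian
my ($signed)   = unpack 'l>', $raw;    # signed 32 bit, big endian

print "$unsigned $signed\n";           # 4294949296 -18000
```

N reads the right bytes in the right order but treats them as unsigned, so a negative value comes back as a large positive number.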

Between Perl’s built-in template types and the modifiers, you can match any binary data.

This article was originally posted on PerlTricks.com.

David Farrell

David is a professional programmer who regularly tweets and blogs about code and the art of programming.
