Perl Unicode Cookbook: Unicode Text in Stubborn Libraries

Jun 18, 2012 by Tom Christiansen

℞ 42: Unicode text in DBM hashes, the tedious way

While Perl 5 has long been very careful about handling Unicode correctly inside the world of Perl itself, every time you leave the Perl internals, you cross a boundary at which something may need to handle decoding and encoding. This happens when performing I/O across a network or to files, when speaking to a database, or even when using XS to call into a shared library from Perl.
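As a minimal sketch of what crossing that boundary looks like, the snippet below encodes a character string to UTF-8 bytes and decodes it back. The sample string and variable names are assumptions chosen for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string containing one non-ASCII code point (U+00E9).
my $chars = "caf\x{E9}";

# Leaving Perl: characters become UTF-8 encoded bytes.
my $bytes = encode("UTF-8", $chars);

# Coming back in: bytes become characters again.
my $round_trip = decode("UTF-8", $bytes);

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
```

This prints "characters: 4, bytes: 5", because U+00E9 takes two bytes in UTF-8. External code sees only the five bytes, which is why something at the boundary has to agree on the encoding.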

For example, consider the core module DB_File, which lets you use Berkeley DB files from Perl as persistent storage for key/value pairs.
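For readers who have not used DB_File before, here is a small sketch of the tied-hash interface with plain ASCII data. The temporary directory and the file name colors.db are hypothetical; any writable path would do, and the sketch assumes DB_File is installed with Berkeley DB support.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/colors.db";    # hypothetical file name

# First session: create the file and store one ASCII pair.
tie my %dbhash, "DB_File", $path, O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie $path: $!";
$dbhash{apple} = "red";
untie %dbhash;

# Second session: reopen the same file; the pair is still there.
tie my %again, "DB_File", $path, O_RDWR, 0644, $DB_HASH
    or die "Cannot reopen $path: $!";
my $color = $again{apple};
untie %again;

print "apple => $color\n";
```

Everything stored through the tied hash ends up as bytes on disk, which is where the trouble with non-ASCII text begins.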

Using a regular Perl string as a key or value for a DBM hash will trigger a wide character exception if any codepoints won’t fit into a byte. Here’s how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

 # STORE

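    # assume $uni_key and $uni_value are abstract Unicode strings;
    # a CHECK argument of 1 (Encode::FB_CROAK) makes encode() die on bad data,
    # and Encode may modify its input in place unless LEAVE_SRC is also set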
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

 # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);

By performing this manual encoding and decoding yourself, you know that your storage file will have a consistent representation of your data. The correct encoding depends on the type of data you store and the capabilities of the external code, of course.
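Putting the two halves together, a complete round trip might look like the sketch below. The helpers store_unicode and fetch_unicode are hypothetical names, not part of DB_File, and the temporary file path is an assumption. The sketch also passes Encode::LEAVE_SRC alongside Encode::FB_CROAK so that the caller's strings are left unmodified.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use Encode qw(encode decode);

# Die on malformed data, but leave the source string untouched.
my $CHECK = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical helper: encode key and value before storing them.
sub store_unicode {
    my ($db, $uni_key, $uni_value) = @_;
    my $enc_key = encode("UTF-8", $uni_key, $CHECK);
    $db->{$enc_key} = encode("UTF-8", $uni_value, $CHECK);
    return;
}

# Hypothetical helper: encode the key, then decode the fetched value.
sub fetch_unicode {
    my ($db, $uni_key) = @_;
    my $enc_key   = encode("UTF-8", $uni_key, $CHECK);
    my $enc_value = $db->{$enc_key};
    return defined $enc_value ? decode("UTF-8", $enc_value, $CHECK) : undef;
}

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/demo.db";    # hypothetical file name
tie my %dbhash, "DB_File", $path, O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie $path: $!";

my $uni_key   = "\x{3B1}\x{3B2}";         # Greek alpha and beta
my $uni_value = "na\x{EF}ve \x{263A}";    # includes U+263A, which needs three bytes
store_unicode(\%dbhash, $uni_key, $uni_value);
my $fetched = fetch_unicode(\%dbhash, $uni_key);

untie %dbhash;
```

Keeping the encoding and decoding in one pair of helpers means every key and value passes through the same translation, so the file never ends up with a mix of representations.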

