June 2012 Archives

This series has shown you several features of Unicode by example, as well as several techniques for working with Unicode correctly and easily with recent releases of Perl 5. By now you know more than many programmers do about Unicode, but your journey to mastery continues.

Perl 5 includes several pieces of documentation which explain Unicode and Perl's Unicode support. See perlunicode, perluniprops, perlre, perlrecharclass, perluniintro, perlunitut and perlunifaq.

Perl 5 and the CPAN provide several modules and distributions to allow the effective use of Unicode. As of Perl 5.16, many of these are in the core library. Many of them work just as well with earlier versions of Perl 5, though for the best and most correct support for Unicode as a whole, consider using Perl 5.14 or 5.16.

These modules include:

The CPAN distribution Unicode::Tussle includes many command-line programs for working with Unicode, including these programs to fully or partly replace standard utilities: tcgrep instead of egrep, uniquote instead of cat -v or hexdump, uniwc instead of wc, unilook instead of look, unifmt instead of fmt, and ucsort instead of sort. For exploring Unicode character names and character properties, see its uniprops, unichars, and uninames programs. It also supplies these programs, all of which are general filters that do Unicode-y things: unititle and unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd, and nfkc; and uc, lc, and tc.

Finally, see the published Unicode Standard (page numbers are from version 6.0.0), including these specific annexes and technical reports:

Tom Christiansen <tchrist@perl.com> wrote this series, with occasional kibitzing from Larry Wall and Jeffrey Friedl in the background.

Most of these examples came from the current edition of the "Camel Book"; that is, from the 4th Edition of Programming Perl, Copyright © 2012 Tom Christiansen et al., 2012-02-13 by O'Reilly Media. The code itself is freely redistributable, and you are encouraged to transplant, fold, spindle, and mutilate any of the examples in this series however you please for inclusion into your own programs without any encumbrance whatsoever. Acknowledgement via code comment is polite but not required.

Previous: ℞ 44: Demo of Unicode Collation and Printing

Series Index: The Standard Preamble

℞ 44: PROGRAM: Demo of Unicode collation and printing

The past several weeks of Unicode recipes have explained how Unicode works and shown how to use it in your programs. If you've gone through those recipes, you now understand more than most programmers.

How about putting everything together?

Here's a full program showing how to make use of locale-sensitive sorting, Unicode casing, and managing print widths when some of the characters take up zero or two columns, not just one column each time. When run, the following program produces this nicely aligned output (though the quality of the alignment depends on the quality of your Unicode font, of course):

    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50

Here's that program; tested on v5.14.

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting and unicode_strings
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 # find the widest allowed width for the name column
 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale->new( locale => "ja" );

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str     =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

Simple enough, isn't it? Put together, everything just works nicely.
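To make the "zero or two columns" point concrete, here is a minimal sketch of what a column-width calculation has to account for. It is not how Unicode::GCString works internally; it is a rough approximation built only on core Unicode properties, and the helper names approx_colwidth and approx_pad are hypothetical. It counts combining marks and format characters as zero columns, East Asian wide and fullwidth characters as two, and everything else as one.

```perl
use v5.14;
use utf8;
use warnings;

# Hypothetical helper: approximate the number of terminal columns a
# string occupies. A real program should prefer Unicode::GCString.
sub approx_colwidth {
    my ($str) = @_;
    my $cols = 0;
    for my $ch (split //, $str) {
        if ($ch =~ /\p{Mn}|\p{Me}|\p{Cf}/) {
            $cols += 0;    # combining marks and format controls
        }
        elsif ($ch =~ /\p{East_Asian_Width=Wide}|\p{East_Asian_Width=Fullwidth}/) {
            $cols += 2;    # CJK ideographs, kana, hangul syllables
        }
        else {
            $cols += 1;
        }
    }
    return $cols;
}

# Hypothetical helper with the same shape as pad() in the program above.
sub approx_pad {
    my ($str, $width, $padchar) = @_;
    return $str . ($padchar x ($width - approx_colwidth($str)));
}

binmode(STDOUT, ":encoding(UTF-8)");
say approx_pad($_, 12, "."), " ", approx_colwidth($_)
    for "Pears", "cre\x{300}me", "寿司";
```

Padding by length() instead would misalign the second and third lines, because length() counts codepoints rather than columns.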

℞ 43: Unicode text in DBM hashes, the easy way

Some Perl libraries require you to jump through hoops to handle Unicode data. Would that everything worked as easily as Perl's open pragma!

For DBM files, here's how to implicitly manage the translation; all encoding and decoding is done automatically, just as with streams that have a particular encoding attached to them. The DBM_Filter module allows you to apply filters to keys and values to manipulate their contents before storing or fetching. The module includes a "utf8" filter. Use it like:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8");  # this is the magic bit: filters keys and values

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

    # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};
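The following is a minimal, self-contained sketch of the same idea that can run without a Berkeley DB installation. It assumes SDBM_File (bundled with Perl) as a stand-in for DB_File, stores its files in a temporary directory, and uses the hypothetical file name "menu". It calls Filter_Push, which DBM_Filter documents as applying a filter to both keys and values, so non-ASCII keys round-trip as well.

```perl
use v5.14;
use utf8;
use warnings;
use Fcntl      qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;                       # stand-in for DB_File in this sketch
use DBM_Filter;

my $dir   = tempdir(CLEANUP => 1);   # throwaway location for the DBM files
my $dbobj = tie my %dbhash, "SDBM_File", "$dir/menu", O_RDWR | O_CREAT, 0666
    or die "cannot tie: $!";
$dbobj->Filter_Push("utf8");         # encode on store, decode on fetch

my ($uni_key, $uni_value) = ("smørbrød", "寿司");
$dbhash{$uni_key} = $uni_value;      # no manual encode() needed

my $fetched = $dbhash{$uni_key};     # arrives already decoded
say "characters fetched: ", length($fetched);

undef $dbobj;                        # drop the object before untying
untie %dbhash;
```

The value comes back as a character string, so length() reports characters rather than UTF-8 bytes.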

℞ 42: Unicode text in DBM hashes, the tedious way

While Perl 5 has long been very careful about handling Unicode correctly inside the world of Perl itself, every time you leave the Perl internals, you cross a boundary at which something may need to handle decoding and encoding. This happens when performing IO across a network or to files, when speaking to a database, or even when using XS to use a shared library from Perl.

For example, consider the core module DB_File, which allows you to use Berkeley DB files from Perl—persistent storage for key/value pairs.

Using a regular Perl string as a key or value for a DBM hash will trigger a wide character exception if any codepoints won't fit into a byte. Here's how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

    # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);  # decode the bytes that came back

By performing this manual encoding and decoding yourself, you know that your storage file will have a consistent representation of your data. The correct encoding depends on the type of data you store and the capabilities of the external code, of course.
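As a sketch of how the two steps fit together, the fragment below wraps them in a pair of helpers and checks the round trip. A plain hash stands in for the tied DBM hash so that the example runs anywhere; the names store_text, fetch_text, and %byte_store are hypothetical. It also passes LEAVE_SRC alongside FB_CROAK, since Encode documents that a CHECK value without LEAVE_SRC may modify the input string in place.

```perl
use v5.14;
use utf8;
use warnings;
use Encode qw(encode decode);

# A plain hash stands in for the tied DBM hash; it only ever sees bytes.
my %byte_store;

# Croak on malformed data, but leave the caller's strings untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical wrapper around the STORE step.
sub store_text {
    my ($uni_key, $uni_value) = @_;
    $byte_store{ encode("UTF-8", $uni_key, $check) }
        = encode("UTF-8", $uni_value, $check);
    return;
}

# Hypothetical wrapper around the FETCH step.
sub fetch_text {
    my ($uni_key) = @_;
    my $enc_value = $byte_store{ encode("UTF-8", $uni_key, $check) };
    return defined $enc_value ? decode("UTF-8", $enc_value, $check) : undef;
}

store_text("γύρος", "gyros, €6.50");
my $back = fetch_text("γύρος");

binmode(STDOUT, ":encoding(UTF-8)");
say $back;
```

Keeping the encode and decode calls in one place makes it harder to decode the wrong variable or to forget a step at one of the call sites.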

℞ 41: Unicode linebreaking

If you've ever tried to fit a large amount of text into a display area too narrow for the full width of the text, you've dealt with the joy of linebreaking (or word wrapping). As you may have come to expect from Unicode now, the specification provides a Unicode Line Breaking Algorithm which respects the available line breaking opportunities provided by Unicode text.

Unicode characters, of course, may have properties which influence these rules.

As you have come to expect from Perl, a module implements the Unicode Line Breaking Algorithm. Install Unicode::LineBreak. This module respects direct and indirect break points as well as the grapheme width of the string. Its basic use is simple:

 use Unicode::LineBreak;
 use charnames qw(:full);

 my $para = "This is a super\N{HYPHEN}long string. " x 20;
 my $fmt  = Unicode::LineBreak->new;
 print $fmt->break($para), "\n";

In list context, as in the print statement here, break() returns a list of lines broken at valid points; in scalar context it returns the broken text as a single string. (The default maximum number of columns is 76, so this example works well for email and console use. See the module's documentation for other configuration options.)

℞ 40: Case- and accent-insensitive locale comparisons

You now know how to compare Unicode strings while ignoring case and accent differences. This approach uses the standard Unicode collation algorithm. To perform a similar comparison while respecting a specific locale's rules, use Unicode::Collate::Locale:

 use Unicode::Collate::Locale;
 my $de = Unicode::Collate::Locale->new(
            locale => "de__phonebook",
            level  => 1,   # primary strength: ignore case and accents
          );

 # now this is true:
 $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS
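To see what the locale contributes, the sketch below runs the same comparison through a root (untailored) collator and through the German phonebook collator, both at level 1. The variable names are illustrative only. Under the root rules ü counts as a plain u at the primary level, so the two spellings differ; under the phonebook rules ü counts as ue.

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# Both collators compare at primary strength (case and accents ignored).
my $root = Unicode::Collate->new(level => 1);
my $de   = Unicode::Collate::Locale->new(
    locale => "de__phonebook",
    level  => 1,
);

my @pair = ("tschüß", "TSCHUESS");
say "root collation: ", ($root->eq(@pair) ? "equal" : "different");
say "de__phonebook:  ", ($de->eq(@pair)   ? "equal" : "different");
```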

℞ 39: Case- and accent-insensitive comparisons

As you've noticed by now, many Unicode strings have multiple possible representations. Comparing two Unicode strings for equality requires far more than merely comparing their codepoints. Not only must you account for multiple representations, you must decide which types of differences are significant: do you care about the case of individual characters? How about the presence or absence of accents?

Use a collator object to compare Unicode text by character instead of by codepoint. To perform comparisons without regard for case or accent differences, choose the appropriate comparison level. Unicode::Collate's eq() method offers customizable Unicode-aware equality:

 use Unicode::Collate;
 my $es = Unicode::Collate->new(
     level         => 1,
     normalization => undef
 );

 # now both are true:
 $es->eq("García",  "GARCIA" );
 $es->eq("Márquez", "MARQUEZ");

℞ 38: Making cmp work on text instead of codepoints

Even with Perl 5.12's "unicode_strings" feature, some of Perl's core operations do not perform as expected on Unicode strings by default. For example, how is the cmp operator to know whether its arguments are octets, larger codepoints, or graphemes, or whether a specific collation should be in effect?

Where you might write:

 @srecs = sort {
     $b->{AGE}   <=>  $a->{AGE}
                 ||
     $a->{NAME}  cmp  $b->{NAME}
 } @recs;

... a Unicode-aware comparison should instead use Unicode::Collate:

 my $coll = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

This module's getSortKey() method returns a sort key for a given Unicode string that respects the chosen collation (and collation level). The key is a binary string, so a plain cmp on two keys orders them the same way the collator would order the original strings.
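Here is a small runnable sketch of that pattern with made-up records, showing how the result differs from a codepoint comparison. The data and the variable names @by_codepoint and @by_sort_key are assumptions for illustration.

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;

my @recs = (
    { NAME => "zebra",  AGE => 30 },
    { NAME => "éclair", AGE => 30 },
    { NAME => "Apple",  AGE => 30 },
    { NAME => "young",  AGE => 40 },
);

# Compute each sort key once, up front, rather than inside the comparator.
my $coll = Unicode::Collate->new();
$_->{NAME_key} = $coll->getSortKey($_->{NAME}) for @recs;

my @by_codepoint = sort {
    $b->{AGE} <=> $a->{AGE} || $a->{NAME} cmp $b->{NAME}
} @recs;

my @by_sort_key = sort {
    $b->{AGE} <=> $a->{AGE} || $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

binmode(STDOUT, ":encoding(UTF-8)");
say "codepoint: ", join ", ", map { $_->{NAME} } @by_codepoint;
say "sort key:  ", join ", ", map { $_->{NAME} } @by_sort_key;
```

Within the 30-year-olds, the codepoint comparison puts "éclair" after "zebra" because é has a higher codepoint than z, while the sort keys place it between "Apple" and "zebra".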

℞ 37: Unicode locale collation

As you've already seen, Unicode-aware sorting respects Unicode character properties. You can't sort by codepoint and expect to get accurate results, not even if you stick with pure ASCII.

The world is a complicated place. Some locales have their own special sorting rules.

The module Unicode::Collate::Locale provides a sort() method which supports locale-specific rules:

 use Unicode::Collate::Locale;

 my $col  = Unicode::Collate::Locale->new(locale => "de__phonebook");
 my @list = $col->sort(@old_list);

This module is part of the Perl 5 core distribution as of Perl 5.14. If you're using an older version of Perl, install the Unicode::Collate distribution from the CPAN to take advantage of it.

The ucsort program mentioned in Perl Unicode recipe 35 accepts a --locale parameter.
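As an illustration of what a locale tailoring changes, this sketch sorts three made-up German words with the root collation and with the phonebook collation. The word list is an assumption chosen to show the difference: the phonebook rules treat Ä like AE, which moves "Ärger" ahead of "Afrika".

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

my @names = ("Afrika", "Ärger", "Adam");

# Root collation: Ä sorts with A, differing only by its accent.
my @root  = Unicode::Collate->new->sort(@names);

# Phonebook collation: Ä sorts as if it were spelled AE.
my @phone = Unicode::Collate::Locale->new(locale => "de__phonebook")
                                    ->sort(@names);

binmode(STDOUT, ":encoding(UTF-8)");
say "root:      @root";
say "phonebook: @phone";
```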

℞ 36: Case- and accent-insensitive Unicode sort

The Unicode Collation Algorithm defines several levels of collation strength by which you can specify certain character properties as relevant or irrelevant to the collation ordering. In simple terms, you can use collation strength to tell a UCA-aware sort to ignore case or diacritics.

In Perl, use the Unicode::Collate module to perform your sorting. To sort Unicode strings while ignoring case and diacritics (that is, to examine only the base characters), use a collation strength of level 1:

 use Unicode::Collate;
 my $col = Unicode::Collate->new(level => 1);
 my @list = $col->sort(@old_list);

Level 2 adds diacritic comparisons to the ordering algorithm. Level 3 adds case ordering. Level 4 adds a tiebreaking comparison of probably more detail than most people will ever care to know. Level 4 is the default.
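The effect of the first three levels is easiest to see with eq(). The sketch below builds one collator per level and compares a word against a case variant and an accented variant; the sample words are arbitrary.

```perl
use v5.14;
use utf8;
use warnings;
use Unicode::Collate;

# One collator per collation strength.
my %coll = map { $_ => Unicode::Collate->new(level => $_) } 1 .. 3;

binmode(STDOUT, ":encoding(UTF-8)");
for my $level (1 .. 3) {
    my $c = $coll{$level};
    printf "level %d: resume/RESUME %s, resume/résumé %s\n", $level,
        ($c->eq("resume", "RESUME") ? "equal" : "differ"),
        ($c->eq("resume", "résumé") ? "equal" : "differ");
}
```

At level 1 both pairs compare equal. At level 2 the accented pair differs while the case pair is still equal. At level 3 both pairs differ.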

℞ 35: Unicode collation

Sorting, even of pure ASCII, seems easy, at least if you know the alphabet song. Yet even something this simple gets complicated if you sort merely by codepoint. You get numbers coming in the midst of letters. You get "ZZZ" coming before "aaa". You get much worse problems, too. (How do you sort punctuation, for example?)

Sorting Unicode data seems much more difficult: the rules for each character specify its relationship to other characters. These collation rules guide the sorting and comparison of data with respect to case sensitivity, accent marks, character width, and other Unicode properties.

A simple sort of Unicode data—based on codepoint—produces nothing in a sensible alphabetic order. A sensible sorting must respect the Unicode Collation Algorithm (UCA) instead. The CPAN module Unicode::Collate implements UCA. Its simple use is:

 use Unicode::Collate;
 my $col  = Unicode::Collate->new();
 my @list = $col->sort(@old_list);

See also the ucsort program from the Unicode::Tussle CPAN distribution for a convenient command-line interface to this module.

In fact, a UCA-aware sort orders even pure ASCII text better than a plain codepoint sort does, because UCA accounts for numbers, punctuation, and other non-alphanumerics.
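A short sketch with an arbitrary ASCII word list shows the difference. Perl's built-in sort compares codepoints, so every uppercase letter precedes every lowercase letter; the collator orders by letter first and case second.

```perl
use v5.14;
use warnings;
use Unicode::Collate;

my @words = qw(ZZZ aaa Bob apple);

my @by_codepoint = sort @words;                          # built-in sort
my @by_uca       = Unicode::Collate->new->sort(@words);  # UCA ordering

say "codepoint: @by_codepoint";   # Bob ZZZ aaa apple
say "UCA:       @by_uca";         # aaa apple Bob ZZZ
```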
