Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.

℞ 24: Disabling Unicode-awareness in builtin charclasses

Many regex tutorials gloss over the fact that builtin character classes include far more than ASCII characters. In particular, classes such as "word character" (\w), "word boundary" (\b), "whitespace" (\s), and "digit" (\d) respect Unicode.

Perl 5.14 added the /a regex modifier, which stops \w, \b, \s, \d, and the POSIX classes from matching non-ASCII Unicode characters. This restricts these classes to match only ASCII characters. Use the re pragma to restrict these character classes in a lexical scope:

 use v5.14;
 use re "/a";

... or use the /a modifier to affect a single regex:

 my($num) = $str =~ /(\d+)/a;

You may always use specific un-Unicode properties, such as \p{ahex} and \p{POSIX_Digit}. Properties still work normally no matter what charset modifiers (/d /u /l /a /aa) are in effect.
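
To make the difference concrete, here is a minimal sketch that compares \d with and without /a. The digit_runs() helper and the sample string (ASCII digits followed by ARABIC-INDIC DIGITS) are hypothetical, chosen only for illustration:

```perl
use v5.14;
use warnings;

# Hypothetical helper: return every run of digits found in $str.
# When $ascii_only is true, the /a modifier restricts \d to 0-9.
sub digit_runs {
    my ($str, $ascii_only) = @_;
    return $ascii_only
        ? ($str =~ /(\d+)/ga)
        : ($str =~ /(\d+)/g);
}

# "42" followed by ARABIC-INDIC DIGITS ONE, TWO, THREE
my $sample = "42 \x{661}\x{662}\x{663}";

my @unicode_runs = digit_runs($sample, 0);
my @ascii_runs   = digit_runs($sample, 1);

say "runs without /a: ", scalar @unicode_runs;  # 2
say "runs with /a:    ", scalar @ascii_runs;    # 1

# Properties ignore the charset modifier: \p{Nd} still matches
# a non-ASCII decimal digit even under /a.
my $arabic_three = "\x{663}";
say "Nd under /a: ", ($arabic_three =~ /\p{Nd}/a ? "match" : "no match");
```

The last check illustrates the point about properties: only the backslash classes and POSIX classes change under /a, while \p{...} keeps its usual meaning.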

℞ 23: Get character category

Unicode is a set of characters and a list of rules and properties applied to those characters. The Unicode Character Database collects those properties. The core module Unicode::UCD provides access to these properties.

These general properties sort characters into categories, such as upper- or lowercase letters, punctuation symbols, math symbols, and more. (See Unicode::UCD's general_categories() for more information.)

The charinfo() function returns a hash reference containing a wealth of information about the Unicode character in question. In particular, its category value contains the short name of a character's category.

To find the general category of a numeric codepoint:

 use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)->{category};  # "Lu"

To translate this category into something more human friendly:

 use Unicode::UCD qw( charinfo general_categories );
 my $categories = general_categories();
 my $cat        = charinfo(0x3A3)->{category};  # "Lu"
 my $full_cat   = $categories->{ $cat }; # "UppercaseLetter" (general_categories() returns a hash reference)
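
When you start from a character rather than a numeric codepoint, the two lookups can be wrapped in one small function. This is only a sketch: category_name() is a hypothetical helper, not part of Unicode::UCD, and it assumes its argument is a single character.

```perl
use v5.14;
use warnings;
use Unicode::UCD qw( charinfo general_categories );

# Hypothetical helper: map a single character to the long name of
# its general category, or undef if the codepoint is unassigned.
sub category_name {
    my ($char) = @_;
    my $info = charinfo( ord $char )   # charinfo() wants a codepoint
        or return;                     # undef for unassigned codepoints
    my $categories = general_categories();   # hash reference
    return $categories->{ $info->{category} };
}

say category_name("\x{3A3}");   # UppercaseLetter
say category_name("a");         # LowercaseLetter
say category_name("5");         # DecimalNumber
```

Note that general_categories() returns a hash reference, so the lookup needs the arrow dereference.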

℞ 22: Match Unicode linebreak sequence in regex

Unicode defines several characters as providing vertical whitespace, like the carriage return or newline characters. Unicode also gathers several characters under the banner of a linebreak sequence. A Unicode linebreak matches the two-character CRLF grapheme or any of the seven vertical whitespace characters.

As documented in perldoc perlrebackslash, the \R regex backslash sequence matches any Unicode linebreak sequence. (Similarly, the \v sequence matches any single character of vertical whitespace.)

This is useful for dealing with text files coming from different operating systems:

 s/\R/\n/g;  # normalize all linebreaks to \n
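
As a rough illustration of why \R is the better choice for this job than \v, the sketch below normalizes a string with mixed line endings both ways. The helper names and the sample text are hypothetical:

```perl
use v5.14;
use warnings;

# Hypothetical helper: replace every Unicode linebreak sequence
# with "\n". \R treats CRLF as one unit.
sub normalize_linebreaks {
    my ($text) = @_;
    $text =~ s/\R/\n/g;
    return $text;
}

# For contrast: \v matches one vertical whitespace character at a
# time, so a CRLF pair becomes two newlines.
sub replace_vertical_space {
    my ($text) = @_;
    $text =~ s/\v/\n/g;
    return $text;
}

# Windows (CRLF), classic Mac (CR), Unix (LF), and LINE SEPARATOR
my $mixed = "one\r\ntwo\rthree\nfour\x{2028}five";

my @lines_R = split /\n/, normalize_linebreaks($mixed);
my @lines_v = split /\n/, replace_vertical_space($mixed);

say "lines after \\R: ", scalar @lines_R;   # 5
say "lines after \\v: ", scalar @lines_v;   # 6 (empty line from CRLF)
```

The extra line in the second result comes from the CRLF pair, which \v sees as two separate characters.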