May 2012 Archives

℞ 34: Unicode column-width for printing

Perl's printf, sprintf, and format think all codepoints take up 1 print column, but many codepoints take 0 or 2. If you use any of these builtins to align text, you may find that Perl's idea of the width of any codepoint doesn't match what you think it ought to.

The Unicode::GCString module's columns() method considers the width of each codepoint and returns the number of columns the string will occupy. Use this to determine the display width of a Unicode string.
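For example, a fullwidth character occupies two print columns, so length() and columns() disagree. Here is a minimal sketch of that contrast, assuming the series' standard preamble (use utf8 and a Perl new enough for say); the string is purely illustrative:

 use Unicode::GCString;

 my $wide = "ＰＥＲＬ";    # four FULLWIDTH LATIN CAPITAL LETTER codepoints
 say length $wide;                              # 4 codepoints
 say Unicode::GCString->new($wide)->columns;    # 8 print columns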

To show that normalization makes no difference to the number of columns of a string, we print out both forms:

 # cpan -i Unicode::GCString
 use Unicode::GCString;
 use Unicode::Normalize;

 my @words = qw/crème brûlée/;
 @words    = map { NFC($_), NFD($_) } @words;

 for my $str (@words) {
     my $gcs  = Unicode::GCString->new($str);
     my $cols = $gcs->columns;
     my $pad  = " " x (10 - $cols);
     say $str, $pad, " |";
 }

... generates this to show that it pads correctly no matter the normalization:

 crème      |
 crème      |
 brûlée     |
 brûlée     |

Previous: ℞ 33: String Length in Graphemes

Series Index: The Standard Preamble

Next: ℞ 35: Unicode Collation

℞ 33: String length in graphemes

If you learn nothing else about Unicode, remember this: characters are not bytes are not graphemes are not codepoints. A user-visible symbol (a grapheme) may be composed of multiple codepoints. Multiple combinations of codepoints may produce the same user-visible graphemes.

To keep all of these entities clear in your mind, be careful and specific about what you're trying to do at which level.

As a concrete example, the string brûlée has six graphemes but up to eight codepoints. Now suppose you want its length. What does length mean? If your string has been normalized to a form with one codepoint per grapheme, length() gives the answer you expect, but consider:

 use Unicode::Normalize;
 my $str = "brûlée";
 say length $str;          # 6 codepoints when the literal is composed (NFC)
 say length NFD( $str );   # 8 codepoints

To measure the length of a string by counting graphemes rather than codepoints:

 my $str   = "brûlée";
 my $count = 0;
 while ($str =~ /\X/g) { $count++ }

Alternately (or on older versions of Perl), the CPAN module Unicode::GCString is useful:

 use Unicode::GCString;
 my $gcs   = Unicode::GCString->new($str);
 my $count = $gcs->length;

Previous: ℞ 32: Reverse String by Grapheme

Series Index: The Standard Preamble

Next: ℞ 34: Unicode Column Width for Printing

℞ 32: Reverse string by grapheme

Because bytes and characters are not isomorphic in Unicode—and what you may see as a user-visible character (a grapheme) is not necessarily a single codepoint in a Unicode string—every string operation must be aware of the difference between codepoints and graphemes.

Consider the Perl builtin reverse. Reversing a string by codepoints messes up diacritics, mistakenly converting crème brûlée into éel̂urb em̀erc instead of into eélûrb emèrc; so reverse by grapheme instead.

As one option, use Perl's \X regex metacharacter to extract graphemes from a string, then reverse that list:

 $str = join("", reverse $str =~ /\X/g);

As another option, use Unicode::GCString to treat a string as a sequence of graphemes, not codepoints:

 use Unicode::GCString;
 $str = reverse Unicode::GCString->new($str);

Both of these approaches work correctly no matter what normalization form the string is in. Remember that \X is fully reliable only in Perl 5.12 and later.

Previous: ℞ 31: Extract by Grapheme Instead of Codepoint (substr)

Series Index: The Standard Preamble

Next: ℞ 33: String Length in Graphemes

℞ 31: Extract by grapheme instead of by codepoint (substr)

The Unicode Standard Annex #29 discusses the boundaries between grapheme clusters—what users might perceive as "characters". The CPAN module Unicode::GCString allows you to treat a Unicode string as a sequence of these grapheme clusters.

While you may use \X to extract graphemes within a regex, Unicode::GCString provides a substr() method to extract a series of grapheme clusters:

 # cpan -i Unicode::GCString
 use Unicode::GCString;

 my $gcs        = Unicode::GCString->new($str);
 my $first_five = $gcs->substr(0, 5);

The module also provides an iterator interface to grapheme clusters within a string.
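A rough sketch of that iterator interface, using the eos() and next() methods described in the module's documentation (treat the exact calling convention as an assumption and check the docs for your installed version):

 use Unicode::GCString;

 my $gcs = Unicode::GCString->new($str);
 until ($gcs->eos) {
     my $cluster = $gcs->next;   # one grapheme cluster per iteration
     say $cluster;
 }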

Previous: ℞ 30: Extract by Grapheme Instead of Codepoint (regex)

Series Index: The Standard Preamble

Next: ℞ 32: Reverse String by Grapheme

℞ 30: Extract by grapheme instead of by codepoint (regex)

Remember that Unicode defines a grapheme as "what a user thinks of as a character". A codepoint is an integer value in the Unicode codespace. While ASCII conflates the two, effective Unicode use respects the difference between user-visible characters and their representations.

Use the \X regex metacharacter when you need to extract graphemes from a string instead of codepoints:

 # match and grab the first five graphemes
 my ($first_five) = $str =~ /^ ( \X{5} ) /x;
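To see the difference, compare codepoint-wise and grapheme-wise extraction on a decomposed string. A small sketch, assuming the standard preamble (including use utf8):

 use Unicode::Normalize;

 my $decomposed = NFD("brûlée");                            # 8 codepoints, 6 graphemes
 my ($five_codepoints) = $decomposed =~ /^ ( .{5}  ) /x;    # "brûl"  -- only 4 graphemes
 my ($five_graphemes)  = $decomposed =~ /^ ( \X{5} ) /x;    # "brûlé" -- what a user expects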

Previous: ℞ 29: Match Unicode Grapheme Cluster in Regex

Series Index: The Standard Preamble

Next: ℞ 31: Extract by Grapheme Instead of Codepoint (substr)

℞ 29: Match Unicode grapheme cluster in regex

In the days of ASCII, we spoke of characters and bytes. We saw few differences between them. In the Unicode world, characters are far more than seven bits of data. Far better to speak of collections of raw bytes and characters—or even Unicode codepoints.

Programmer-visible "characters" are codepoints matched by /./s, but user-visible "characters" are graphemes matched by /\X/.

That is to say, the \X regex metacharacter matches what Unicode calls an "extended grapheme cluster". Where the user may see a single character (such as a letter with an accent), the Unicode representation may be that base letter followed by one or more combining marks. Use \X to match the entire sequence:

 # Find a vowel *plus* any combining diacritics, underlining, etc.
 use Unicode::Normalize;
 my $nfd = NFD($orig);
 $nfd =~ / (?=[aeiou]) \X /xi;

Previous: ℞ 28: Convert non-ASCII Unicode Numerics

Series Index: The Standard Preamble

Next: ℞ 30: Extract by Grapheme Instead of Codepoint (regex)

℞ 28: Convert non-ASCII Unicode numerics

Unicode digits encompass far more than the ASCII characters 0 - 9.

Unless you've used /a or /aa, \d matches more than just the ASCII digits. That's good! Unfortunately, Perl's implicit string-to-number conversion does not currently recognize Unicode digits. Here's how to convert such strings manually.

As usual, the Unicode::UCD module provides access to the Unicode character database. Its num() function can numify Unicode digits—and strings of Unicode digits.

 use v5.14;  # needed for num() function
 use Unicode::UCD qw(num);
 my $str = "got Ⅻ and ४५६७ and ⅞ and here";
 my @nums = ();
 while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
    push @nums, num($1);
 }
 say "@nums";   #     12      4567      0.875

 use charnames qw(:full);
 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

As num()'s documentation warns, the function errs on the side of safety: not all collections of Unicode digits form valid numbers. You may also want to normalize complex Unicode strings before performing numification.

Previous: ℞ 27: Unicode Normalization

Series Index: The Standard Preamble

Next: ℞ 29: Match Unicode Grapheme Cluster in Regex

℞ 27: Unicode normalization

Prescription one reminded you to always decompose and recompose Unicode data at the boundaries of your application. Unicode::Normalize can do much more for you. It supports multiple Unicode Normalization Forms.

Normalization, of course, takes Unicode data of arbitrary forms and canonicalizes it to a standard representation. (Where a composite character may be composed of multiple characters, normalized decomposition arranges those characters in a canonical order. Normalized composition combines those characters to a single composite character, where possible. Without this normalization, you can imagine the difficulty of determining whether one string is logically equivalent to another.)

Typically, you should render your data into NFD (the canonical decomposition form) on input and NFC (canonical decomposition followed by canonical composition) on output. Using NFKC or NFKD functions improves recall on searches, assuming you've already done the same normalization to the text to be searched.

Note that this normalization is about much more than just splitting or joining pre-combined compatibility glyphs; it also reorders marks according to their canonical combining classes and weeds out singletons.

 use Unicode::Normalize;
 my $nfd  = NFD($orig);
 my $nfc  = NFC($orig);
 my $nfkd = NFKD($orig);
 my $nfkc = NFKC($orig);
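To see why this matters for comparisons, note that two spellings of é compare equal only after both have been normalized to the same form:

 use Unicode::Normalize;

 my $composed   = "\x{E9}";     # é as a single codepoint
 my $decomposed = "e\x{301}";   # e followed by COMBINING ACUTE ACCENT
 say $composed eq $decomposed           ? "equal" : "not equal";   # not equal
 say NFC($composed) eq NFC($decomposed) ? "equal" : "not equal";   # equal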

Previous: ℞ 26: Custom Character Properties

Series Index: The Standard Preamble

Next: ℞ 28: Convert non-ASCII Unicode Numerics

℞ 26: Custom character properties

Match Unicode Properties in Regex explained that every Unicode character has one or more properties, specified by the Unicode Consortium. You may extend these rules to define your own properties so that Perl can use them.

A custom property is a function given a name beginning with In or Is which returns a string conforming to a special format. The "User-Defined Character Properties" section of perldoc perlunicode describes this format in more detail.

To define at compile-time your own custom character properties for use in regexes:

 # using private-use characters
 sub In_Tengwar { "E000\tE07F\n" }

 if (/\p{In_Tengwar}/) { ... }

 # blending existing properties
 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
 +utf8::IsLatin
 +utf8::IsGreek
 &utf8::IsTitle
 END_OF_SET

 if (/\p{Is_GraecoRoman_Title}/) { ... }

Previous: ℞ 25: Match Unicode Properties in Regex

Series Index: The Standard Preamble

Next: ℞ 27: Unicode Normalization

℞ 25: Match Unicode properties in regex with \p, \P

Every Unicode codepoint has one or more properties, indicating the rules which apply to that codepoint. Perl's regex engine is aware of these properties: use the \p{} metacharacter sequence to match a codepoint possessing a given property, and its inverse \P{} to match a codepoint lacking it.

Each property has a short name and a long name. For example, to match any codepoint which has the Letter property, you may use \p{Letter} or \p{L}. Similarly, you may use \P{Uppercase} or \P{Upper}. perldoc perlunicode's "Unicode Character Properties" section describes these properties in greater detail.

Examples of these properties useful in regex include:

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower}
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}
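A few of these properties in complete matches, as a sketch assuming the series' standard preamble (including use utf8 for the Greek literal):

 say "perl"    =~ / ^ \p{Lowercase}+ $ /x ? "all lowercase" : "mixed";
 say "Σίσυφος" =~ / ^ \p{Greek}+     $ /x ? "Greek script"  : "other";
 say "42"      =~ / ^ \P{Letter}+    $ /x ? "no letters"    : "letters";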

Previous: ℞ 24: Disable Unicode-awareness in Builtin Character Classes

Series Index: The Standard Preamble

Next: ℞ 26: Custom Character Properties

℞ 24: Disabling Unicode-awareness in builtin charclasses

Many regex tutorials gloss over the fact that builtin character classes include far more than ASCII characters. In particular, classes such as "word character" (\w), "word boundary" (\b), "whitespace" (\s), and "digit" (\d) respect Unicode.

Perl 5.14 added the /a regex modifier to keep \w, \b, \s, \d, and the POSIX classes from matching beyond ASCII. This restricts these classes to match only ASCII characters. Use the re pragma to restrict these character classes throughout a lexical scope:

 use v5.14;
 use re "/a";

... or use the /a modifier to affect a single regex:

 my($num) = $str =~ /(\d+)/a;

You may always use specific ASCII-only properties, such as \p{ahex} and \p{POSIX_Digit}. Properties still work normally no matter which charset modifiers (/d /u /l /a /aa) are in effect.
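For example, a property escape keeps matching non-ASCII digits even under /a, while \d does not. A small sketch, assuming use utf8 for the literal digits:

 my $digits = "٤٢";   # ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO
 say $digits =~ / ^ \d+     $ /xu ? "match" : "no match";   # match
 say $digits =~ / ^ \d+     $ /xa ? "match" : "no match";   # no match
 say $digits =~ / ^ \p{Nd}+ $ /xa ? "match" : "no match";   # match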

Previous: ℞ 23: Get Character Categories

Series Index: The Standard Preamble

Next: ℞ 25: Match Unicode Properties in Regex

℞ 23: Get character category

Unicode is a set of characters and a list of rules and properties applied to those characters. The Unicode Character Database collects those properties. The core module Unicode::UCD provides access to these properties.

These general categories group characters into classes, such as upper- and lowercase letters, punctuation, math symbols, and more. (See Unicode::UCD's general_categories() for more information.)

The charinfo() function returns a hash reference containing a wealth of information about the Unicode character in question. In particular, its category value contains the short name of a character's category.

To find the general category of a numeric codepoint:

 use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)->{category};  # "Lu"

To translate this category into something more human friendly:

 use Unicode::UCD qw( charinfo general_categories );
 my $categories = general_categories();
 my $cat        = charinfo(0x3A3)->{category};  # "Lu"
 my $full_cat   = $categories->{ $cat }; # "UppercaseLetter"

Previous: ℞ 22: Match Unicode Linebreak Sequence

Series Index: The Standard Preamble

Next: ℞ 24: Disable Unicode-awareness in Builtin Character Classes

℞ 22: Match Unicode linebreak sequence in regex

Unicode defines several characters as providing vertical whitespace, like the carriage return or newline characters. Unicode also gathers several characters under the banner of a linebreak sequence. A Unicode linebreak matches the two-character CRLF grapheme or any of the seven vertical whitespace characters.

As documented in perldoc perlrebackslash, the \R regex backslash sequence matches any Unicode linebreak sequence. (Similarly, the \v sequence matches any single character of vertical whitespace.)

This is useful for dealing with textfiles coming from different operating systems:

 s/\R/\n/g;  # normalize all linebreaks to \n
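The same sequence works for splitting already-decoded text into lines, whatever line-ending convention the source used (here $contents stands for slurped, decoded file contents):

 my @lines = split /\R/, $contents;   # CRLF, CR, LF, etc. each count as one break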

Previous: ℞ 21: Case-insensitive Comparisons

Series Index: The Standard Preamble

Next: ℞ 23: Get Character Categories

℞ 21: Unicode case-insensitive comparisons

Unicode is more than an expanded character set. Unicode is a set of rules about how characters behave and a set of properties about each character.

Comparing strings for equivalence often requires normalizing them to a standard form. That normalized form often requires that all characters be in a specific case. ℞ 20: Unicode casing demonstrated that converting between upper- and lower-case Unicode characters is more complicated than simply mapping [A-Z] to [a-z]. (Remember also that many characters have a title case form!)

The proper solution for normalized comparisons is to perform casefolding rather than mapping some subset of characters onto another. Perl 5.16 added the fc(), or "foldcase", feature to perform the same Unicode casefolding the /i pattern modifier has always used. This feature is available for earlier Perls thanks to the CPAN module Unicode::CaseFold:

 use feature "fc"; # fc() function is from v5.16
 # OR
 use Unicode::CaseFold;

 # sort case-insensitively
 my @sorted = sort { fc($a) cmp fc($b) } @list;

 # both are true:
 fc("tschüß")  eq fc("TSCHÜSS")
 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

Fold cases properly goes into more detail about case folding in Perl.

Previous: ℞ 20: Unicode Casing

Series Index: The Standard Preamble

Next: ℞ 22: Match Unicode Linebreak Sequence

℞ 20: Unicode casing

Unicode casing is very different from ASCII casing. Some of the complexity of Unicode comes about because Unicode characters may change dramatically when changing from upper to lower case and back. For example, the Greek language has two characters for the lower case sigma, depending on whether the letter is in a medial (σ) or final (ς) position in a word. Greek only has a single upper case sigma (Σ). (Some classical Greek texts from the Hellenistic period use a crescent-shaped variant of the sigma called the lunate sigma, or ϲ.)

Unicode casing is important for changing case and for performing case-insensitive matching:

 uc("henry ⅷ")  # "HENRY Ⅷ"
 uc("tschüß")   # "TSCHÜSS"  notice ß => SS

 # both are true:
 "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
 "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

Previous: ℞ 19: Specify a File's Encoding

Series Index: The Standard Preamble

Next: ℞ 21: Case-insensitive Comparisons

℞ 19: Open file with specific encoding

While setting the default Unicode encoding for IO is sensible, sometimes the default encoding is not correct. In this case, specify the encoding for a filehandle manually in the mode option to open or with the binmode operator. Perl's IO layers will handle encoding and decoding for you. This is the normal way to deal with encoded text, not by calling low-level functions.

To specify the encoding of a filehandle opened for input:

      open(my $in_file, "< :encoding(UTF-16)", "wintext");
      # OR
      open(my $in_file, "<", "wintext");
      binmode($in_file, ":encoding(UTF-16)");

     # ...
     my $line = <$in_file>;

To specify the encoding of a filehandle opened for output:

      open(my $out_file, "> :encoding(cp1252)", "wintext");
     # OR
     open(my $out_file, ">", "wintext");
     binmode($out_file, ":encoding(cp1252)");

     # ...
     print $out_file "some text\n";

More layers than just the encoding can be specified here. For example, the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF handling. See PerlIO for more details.
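For instance, to read a UTF-16LE file written on Windows with its CRLF line endings translated for you (a sketch using the layers named above; the filename is illustrative):

      open(my $in_file, "< :raw :encoding(UTF-16LE) :crlf", "wintext");
      while (my $line = <$in_file>) {
          # ... process the decoded line ...
      }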

Previous: ℞ 18: Make All I/O Default to UTF-8

Series Index: The Standard Preamble

Next: ℞ 20: Unicode Casing

℞ 18: Make all I/O and args default to utf8

The core rule of Unicode handling in Perl is "always encode and decode at the edges of your program".

If you've configured everything such that all incoming and outgoing data uses the UTF-8 encoding, you can make Perl perform the appropriate encoding and decoding for you. As documented in perldoc perlrun, the -C flag and the PERL_UNICODE environment variable are available. Use the S option to make the standard input, output, and error filehandles use UTF-8 encoding. Use the D option to make all other filehandles use UTF-8 encoding. Use the A option to decode @ARGV elements as UTF-8:

      $ perl -CSDA ...
      # or
      $ export PERL_UNICODE=SDA

Within your program, you can achieve the same effects with the open pragma to set default encodings on filehandles and the Encode module to decode the elements of @ARGV:

     use open qw(:std :utf8);
     use Encode qw(decode_utf8);
     @ARGV = map { decode_utf8($_, 1) } @ARGV;

Previous: ℞ 17: Make File I/O Default to UTF-8

Series Index: The Standard Preamble

Next: ℞ 19: Specify a File's Encoding

℞ 17: Make file I/O default to utf8

If you've ever had the misfortune of seeing the Unicode warning "Wide character in print", you may have realized that something in your program forgot to set the appropriate Unicode-capable encoding on a filehandle. Remember that the rule of Unicode handling in Perl is "always encode and decode at the edges of your program".

You can easily decode STDIN, STDOUT, and STDERR as UTF-8 by default, decode them according to your locale settings by default, or use binmode to set the encoding on a specific filehandle.

Alternately, you can set the default encoding on all filehandles through the entire program, or on a lexical basis. As documented in perldoc perlrun, the -C flag and the PERL_UNICODE environment variable are available. Use the D option to make all filehandles default to UTF-8 encoding. That is, files opened without an encoding argument will be in UTF-8:

     $ perl -CD ...
     # or
     $ export PERL_UNICODE=D

The open pragma configures the default encoding of all filehandle operations in its lexical scope:

     use open qw(:utf8);

Note that the open pragma is currently incompatible with the autodie pragma.

Previous: ℞ 16: Decode Standard Filehandles as Locale Encoding

Series Index: The Standard Preamble

Next: ℞ 18: Make All I/O Default to UTF-8
