Perl Unicode Cookbook: Get Character Categories

May 11, 2012 by Tom Christiansen

℞ 23: Get character category

Unicode is a set of characters and a list of rules and properties applied to those characters. The Unicode Character Database collects those properties. The core module Unicode::UCD provides access to these properties.

These general properties group characters into groups, such as upper- or lowercase characters, punctuation symbols, math symbols, and more. (See Unicode::UCD’s general_categories() for more information.)

The charinfo() function returns a hash reference containing a wealth of information about the Unicode character in question. In particular, its category value contains the short name of a character’s category.

To find the general category of a numeric codepoint:

 use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)->{category};  # "Lu"

To translate this category into something more human friendly:

 use Unicode::UCD qw( charinfo general_categories );
 my $categories = general_categories();
 my $cat        = charinfo(0x3A3)->{category};  # "Lu"
 my $full_cat   = $categories{ $cat }; # "UppercaseLetter"

Previous: ℞ 22: Match Unicode Linebreak Sequence

Series Index: The Standard Preamble

Next: ℞ 24: Disable Unicode-awareness in Builtin Character Classes

Tags

unicode