Perl Unicode Cookbook: Case-insensitive Comparisons

℞ 21: Unicode case-insensitive comparisons

Unicode is more than an expanded character set. Unicode is a set of rules about how characters behave and a set of properties about each character.

Comparing strings for equivalence often requires normalizing them to a standard form. That normalized form often requires that all characters be in a specific case. ℞ 20: Unicode casing demonstrated that converting between upper- and lower-case Unicode characters is more complicated than simply mapping [A-Z] to [a-z]. (Remember also that many characters have a title case form!)
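
A minimal sketch of why plain case mapping falls short, assuming a UTF-8 source file with use utf8 in effect. The variable names are illustrative only. It lowercases both spellings of the German word used later in this recipe, then shows a character whose upper-case and title-case forms differ.

```perl
use v5.14;   # enables say and Unicode string semantics
use utf8;    # string literals in this file are UTF-8

my $lower = "tschüß";
my $upper = "TSCHÜSS";

# lc() leaves ß alone but turns SS into ss, so the two results differ
my $lc_equal = lc($lower) eq lc($upper);

# U+01C6 (dž) has distinct upper-case and title-case forms
my $small      = "\x{01C6}";
my $upper_form = uc($small);        # U+01C4
my $title_form = ucfirst($small);   # U+01C5

say $lc_equal ? "lc: equal" : "lc: different";
printf "upper U+%04X, title U+%04X\n", ord($upper_form), ord($title_form);
# prints:
#   lc: different
#   upper U+01C4, title U+01C5
```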

The proper solution for normalized comparisons is to casefold both strings rather than map one subset of characters onto another. Perl 5.16 added the fc() ("foldcase") function, which performs the same Unicode casefolding that the /i pattern modifier has always used. Earlier Perls can get the same function from the CPAN module Unicode::CaseFold:

 use feature "fc"; # fc() function is from v5.16
 # OR
 use Unicode::CaseFold;

 # sort case-insensitively
 my @sorted = sort { fc($a) cmp fc($b) } @list;

 # both comparisons are true (the literals need "use utf8" in effect):
 fc("tschüß")  eq fc("TSCHÜSS");
 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ");
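
The fragments above can be assembled into a complete script. This is only a sketch, assuming Perl 5.16 or later and a UTF-8 source file. The eq_fold helper and the sample list are hypothetical names introduced for illustration.

```perl
use v5.16;                            # enables say and the fc feature
use utf8;                             # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));   # encode output as UTF-8

# Hypothetical helper: true when two strings are equal after case folding.
sub eq_fold {
    my ($left, $right) = @_;
    return fc($left) eq fc($right);
}

my @list   = qw(Zebra apple Mango);
my @plain  = sort @list;                         # code point order
my @folded = sort { fc($a) cmp fc($b) } @list;   # case-insensitive order

say "plain:  @plain";     # plain:  Mango Zebra apple
say "folded: @folded";    # folded: apple Mango Zebra

say "tschüß matches TSCHÜSS"  if eq_fold("tschüß", "TSCHÜSS");
say "Σίσυφος matches ΣΊΣΥΦΟΣ" if eq_fold("Σίσυφος", "ΣΊΣΥΦΟΣ");
```

Folding inside the sort block keeps the original strings intact; only the comparison keys are folded.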

The article "Fold cases properly" goes into more detail about case folding in Perl.

Previous: ℞ 20: Unicode Casing

Series Index: The Standard Preamble

Next: ℞ 22: Match Unicode Linebreak Sequence
