Perl Unicode Cookbook: Always Decompose and Recompose

Apr 3, 2012 by Tom Christiansen

℞ 1: Generic Unicode-savvy filter

Unicode allows multiple representations of the same characters: é, for instance, can be the single code point U+00E9 or the letter e followed by the combining acute accent U+0301. Comparing such strings for equivalence (sorting, searching, exact matching) requires care, including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.
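The mismatch between equivalent representations can be seen directly. The following is a minimal sketch, not part of the original recipe; the variable names are illustrative, and it uses only the NFD and NFC functions exported by Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of the same character: one precomposed code point versus
# a base letter followed by a combining mark. (Names are illustrative.)
my $precomposed = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";   # U+0065 + U+0301 COMBINING ACUTE ACCENT

# A plain string comparison sees different code points...
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# ...but once both sides are in the same normalization form, they agree.
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

printf "raw: %d  NFD: %d  NFC: %d\n", $raw_equal, $nfd_equal, $nfc_equal;
# prints "raw: 0  NFD: 1  NFC: 1"
```

Either form works for comparison, as long as both strings use the same one.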

To handle Unicode effectively, always decompose on the way in, then recompose on the way out.

 use Unicode::Normalize;

 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
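     # ... work on the decomposed text in $_ here ...
     # (a bare "..." statement compiles on Perl 5.12+ but dies at run time)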
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }
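One reason to do the work between those two calls is that, in decomposed text, accents are separate combining marks that a pattern can skip over. The sketch below illustrates this under stated assumptions: `filter_line` is a hypothetical helper rather than part of the recipe, and the bracketing of "cafe" is only an example task.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: bracket every "cafe", accented or not.
sub filter_line {
    my ($line) = @_;

    my $text = NFD($line);    # decompose on the way in

    # After NFD each accent is its own nonspacing mark (\p{Mn}),
    # so the pattern allows optional marks after every base letter.
    $text =~ s/(c\p{Mn}*a\p{Mn}*f\p{Mn}*e\p{Mn}*)/[$1]/gi;

    return NFC($text);        # recompose on the way out
}

# The input mixes a precomposed e-acute with a decomposed one.
my $out = filter_line("Caf\x{E9} or cafe\x{301}?");
print "$out\n";
# $out is "[Caf\x{E9}] or [caf\x{E9}]?": both spellings matched,
# and both leave in precomposed form.
```

Without the NFD step, the pattern would miss the precomposed spelling; without the NFC step, the output would carry whichever mix of forms the processing left behind.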

See the Unicode Normalization FAQ for more details.

Series Index: The Standard Preamble

Next: ℞ 2: Fine-Tuning Unicode Warnings


License

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
