Perl Unicode Cookbook: Further Resources

Jun 29, 2012 by Tom Christiansen

This series has shown you several features of Unicode by example, as well as several techniques for working with Unicode correctly and easily with recent releases of Perl 5. By now you know more than many programmers do about Unicode, but your journey to mastery continues.

Perl 5 includes several pieces of documentation which explain Unicode and Perl’s Unicode support. See perlunicode, perluniprops, perlre, perlrecharclass, perluniintro, perlunitut and perlunifaq.

Perl 5 and the CPAN provide several modules and distributions to allow the effective use of Unicode. As of Perl 5.16, many of these are in the core library. Many of them work just as well with earlier versions of Perl 5, though for the best and most correct support for Unicode as a whole, consider using Perl 5.14 or 5.16.

These modules include:

The CPAN distribution Unicode::Tussle includes many command-line programs for working with Unicode, including these programs to fully or partly replace standard utilities: tcgrep instead of egrep, uniquote instead of cat -v or hexdump, uniwc instead of wc, unilook instead of look, unifmt instead of fmt, and ucsort instead of sort. For exploring Unicode character names and character properties, see its uniprops, unichars, and uninames programs. It also supplies these programs, all of which are general filters that do Unicode-y things: unititle and unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd, and nfkc; and uc, lc, and tc.
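To give a feel for what a normalization filter such as nfd or nfc does to its input, here is a minimal sketch built on the core Unicode::Normalize module. It is not the Unicode::Tussle implementation; the helper name normalize_text is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper (not part of Unicode::Tussle): return the text
# in the requested normalization form, defaulting to NFC.
sub normalize_text {
    my ($form, $text) = @_;
    return $form eq 'nfd' ? NFD($text) : NFC($text);
}

my $composed   = "caf\x{E9}";     # e-acute as the single code point U+00E9
my $decomposed = "cafe\x{301}";   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# NFD splits the precomposed letter; NFC joins the pair back together.
printf "NFD length: %d\n", length normalize_text('nfd', $composed);     # 5
printf "NFC length: %d\n", length normalize_text('nfc', $decomposed);   # 4

# Both spellings compare equal once they share a normalization form.
print "equal after NFC\n"
    if normalize_text('nfc', $composed) eq normalize_text('nfc', $decomposed);
```

A real filter would read from standard input or named files and print the normalized text; the fixed strings here only keep the sketch self-contained.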

Finally, see the published Unicode Standard (page numbers are from version 6.0.0), including these specific annexes and technical reports:

§3.13 Default Case Algorithms, page 113
§4.2 Case, pages 120-122
Case Mappings, pages 166-172, especially Caseless Matching starting on page 170
UAX #44: Unicode Character Database
UTS #18: Unicode Regular Expressions
UAX #15: Unicode Normalization Forms
UTS #10: Unicode Collation Algorithm
UAX #29: Unicode Text Segmentation
UAX #14: Unicode Line Breaking Algorithm
UAX #11: East Asian Width
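Two of the reports listed above surface directly in everyday Perl: the \X regex escape matches an extended grapheme cluster as described in UAX #29, and the core Unicode::Collate module implements the collation algorithm of UTS #10. The following is a small sketch with sample strings chosen only for illustration, not a complete treatment of either report.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# UAX #29: \X matches one extended grapheme cluster, so a base letter
# plus its combining mark counts as a single user-perceived character.
my $text      = "cafe\x{301}";      # 5 code points
my @graphemes = $text =~ /\X/g;     # 4 grapheme clusters
printf "%d code points, %d graphemes\n", length $text, scalar @graphemes;

# UTS #10: compare plain code point order with Unicode Collation
# Algorithm order for the same words.
my @words        = ("zebra", "\x{E9}clair", "apple", "Eagle");
my @by_codepoint = sort @words;
my @by_uca       = Unicode::Collate->new->sort(@words);

print "code point order: @by_codepoint\n";   # Eagle apple zebra éclair
print "UCA order:        @by_uca\n";         # apple Eagle éclair zebra
```

The default sort places the uppercase and accented words by raw code point value, while the collator orders them alphabetically, as a reader would expect.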

Tom Christiansen <tchrist@perl.com> wrote this series, with occasional kibitzing from Larry Wall and Jeffrey Friedl in the background.

Most of these examples came from the current edition of the “Camel Book”; that is, from the 4th Edition of Programming Perl, Copyright © 2012 Tom Christiansen et al., 2012-02-13 by O’Reilly Media. The code itself is freely redistributable, and you are encouraged to transplant, fold, spindle, and mutilate any of the examples in this series however you please for inclusion into your own programs without any encumbrance whatsoever. Acknowledgement via code comment is polite but not required.

Previous: ℞ 44: Demo of Unicode Collation and Printing

Series Index: The Standard Preamble
