Perl Unicode Cookbook: Disable Unicode-awareness in Builtin Character Classes

℞ 24: Disabling Unicode-awareness in builtin charclasses

Many regex tutorials gloss over the fact that builtin character classes include far more than ASCII characters. In particular, classes such as "word character" (\w), "word boundary" (\b), "whitespace" (\s), and "digit" (\d) respect Unicode.
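As a small illustration of that default behavior, the sketch below tests three non-ASCII characters against \d, \w, and \s. The variable names and sample code points are chosen here for illustration only. They are written as \x{...} escapes so the source file stays pure ASCII.

```perl
use v5.14;

# Illustrative sample characters (assumptions, not from the recipe itself).
my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $greek_alpha  = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $em_space     = "\x{2003}";   # EM SPACE

# Without /a, each builtin class matches its non-ASCII sample character.
my %default_match = (
    digit => ($arabic_three =~ /\A\d\z/ ? 1 : 0),
    word  => ($greek_alpha  =~ /\A\w\z/ ? 1 : 0),
    space => ($em_space     =~ /\A\s\z/ ? 1 : 0),
);

say "$_: $default_match{$_}" for sort keys %default_match;
```

Each line should report 1, since all three characters fall inside the Unicode-aware definitions of these classes.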

Perl 5.14 added the /a regex modifier to disable the Unicode-awareness of \w, \b, \s, \d, and the POSIX classes, restricting them to match only ASCII characters. Use the re pragma to apply this restriction to every regex in a lexical scope:

 use v5.14;
 use re "/a";

... or use the /a modifier to affect a single regex:

 my($num) = $str =~ /(\d+)/a;
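To compare the two forms side by side, here is a minimal sketch that runs the same pattern with and without /a. The sample string and variable names are made up for illustration.

```perl
use v5.14;

# Hypothetical input: Arabic-Indic digits 3 and 4, then the ASCII number 42.
my $str = "id \x{0663}\x{0664} then 42";

my ($unicode_num) = $str =~ /(\d+)/;    # default rules: first run of any Unicode digits
my ($ascii_num)   = $str =~ /(\d+)/a;   # /a: first run of ASCII digits only

my $ascii_scoped;
{
    use re "/a";                        # every regex compiled in this block gets /a
    ($ascii_scoped) = $str =~ /(\d+)/;
}

# Print code points rather than raw characters to keep the output ASCII.
say "default:    ", join " ", map { sprintf "U+%04X", ord } split //, $unicode_num;
say "/a:         $ascii_num";
say "use re /a:  $ascii_scoped";
```

The default match captures the two Arabic-Indic digits, while both /a forms skip past them and capture 42. Outside the inner block the pragma no longer applies, so later regexes return to the default rules.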

You may always use specific un-Unicode properties, such as \p{ahex} and \p{POSIX_Digit}. Properties still work normally no matter what charset modifiers (/d /u /l /a /aa) are in effect.
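The following sketch illustrates both points: a Unicode property keeps matching non-ASCII characters under /a, and the ASCII-only properties stay ASCII-only even without it. The sample characters and variable names are assumptions chosen for the example.

```perl
use v5.14;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $fullwidth_a  = "\x{FF21}";   # FULLWIDTH LATIN CAPITAL LETTER A (a Hex_Digit, but not ASCII)

# /a restricts \d, but leaves \p{...} properties alone.
my $d_under_a  = $arabic_three =~ /\A\d\z/a      ? 1 : 0;   # 0
my $nd_under_a = $arabic_three =~ /\A\p{Nd}\z/a  ? 1 : 0;   # 1

# ASCII-only properties reject non-ASCII characters with no modifier at all.
my $posix_digit_nonascii = $arabic_three =~ /\A\p{POSIX_Digit}\z/ ? 1 : 0;   # 0
my $ahex_nonascii        = $fullwidth_a  =~ /\A\p{ahex}\z/        ? 1 : 0;   # 0
my $ahex_ascii           = "cafe"        =~ /\A\p{ahex}+\z/       ? 1 : 0;   # 1

say "\\d under /a:              $d_under_a";
say "\\p{Nd} under /a:          $nd_under_a";
say "\\p{POSIX_Digit} U+0663:   $posix_digit_nonascii";
say "\\p{ahex} U+FF21:          $ahex_nonascii";
say "\\p{ahex}+ on 'cafe':      $ahex_ascii";
```

Choosing an explicit property makes the intent visible at the point of use, whichever charset modifier happens to be in scope.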

Previous: ℞ 23: Get Character Categories

Series Index: The Standard Preamble

Next: ℞ 25: Match Unicode Properties in Regex
