Perl Unicode Cookbook: Match Unicode Properties in Regex

May 16, 2012 by Tom Christiansen

℞ 25: Match Unicode properties in regex with `\p`, `\P`

Every Unicode codepoint has one or more properties, which indicate the rules that apply to that codepoint. Perl’s regex engine is aware of these properties: use the \p{} metacharacter sequence to match a codepoint that has a given property, and its inverse, \P{}, to match a codepoint that lacks it.

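Each property has a short name and a long name. For example, to match any codepoint which has the Letter property, you may use \p{Letter} or \p{L}. Similarly, to match any codepoint which lacks the Uppercase property, you may use \P{Uppercase} or \P{Upper}. When the property name is a single letter, you may omit the braces and write \pL instead of \p{L}. The “Unicode Character Properties” section of perldoc perlunicode describes these properties in greater detail.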
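As a minimal sketch of the long and short forms and of the \P{} inverse, the following filters a sample string one character at a time. The helper name matching_chars and the sample text are illustrative assumptions, not part of the original recipe; the string is written with \x{} escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return the characters of $text that match $re,
# testing one character at a time.
sub matching_chars {
    my ($text, $re) = @_;
    return join '', grep { /$re/ } split //, $text;
}

# Sample text: E-acute, "t", e-acute, " 42 ", Greek capital alpha,
# Greek small beta, "!".
my $text = "\x{C9}t\x{E9} 42 \x{391}\x{3B2}!";

# The long and short names select the same characters.
my $letters       = matching_chars($text, qr/\p{Letter}/);
my $letters_short = matching_chars($text, qr/\p{L}/);

# \P{...} inverts the test: keep everything that is not Uppercase.
my $not_upper = matching_chars($text, qr/\P{Uppercase}/);

print "letters:       $letters\n";
print "letters (\\pL): $letters_short\n";
print "not uppercase: $not_upper\n";
```

Matching one character at a time keeps the sketch easy to follow; in real code you would more likely use the property inside a larger pattern, such as /\p{L}+/.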

Examples of these properties useful in regex include:

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower}
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}
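
To make a few of these concrete, here is a small table-driven sketch that pairs each property with one character expected to match and one expected not to. The sample characters, the labels, and the helper name failed_cases are this example's own choices rather than anything from the recipe, and the results assume a Perl recent enough to ship all of these properties.

```perl
use strict;
use warnings;

# Each row: label, anchored property pattern, a character expected to
# match, and a character expected not to match.
my @cases = (
    [ 'Lt',                     qr/\A\p{Lt}\z/,                     "\x{01C5}", 'D'        ],  # titlecase digraph vs. uppercase D
    [ 'Ps',                     qr/\A\p{Ps}\z/,                     '(',        ')'        ],  # open vs. close punctuation
    [ 'Sk',                     qr/\A\p{Sk}\z/,                     '^',        '+'        ],  # modifier symbol vs. math symbol
    [ 'Latin',                  qr/\A\p{Latin}\z/,                  'a',        "\x{03B1}" ],  # Latin a vs. Greek alpha
    [ 'script=Greek',           qr/\A\p{script=Greek}\z/,           "\x{03B1}", 'a'        ],
    [ 'East_Asian_Width=Wide',  qr/\A\p{East_Asian_Width=Wide}\z/,  "\x{4E2D}", 'A'        ],  # CJK ideograph vs. ASCII letter
    [ 'Line_Break=Hyphen',      qr/\A\p{Line_Break=Hyphen}\z/,      '-',        '_'        ],
    [ 'Numeric_Value=4',        qr/\A\p{Numeric_Value=4}\z/,        "\x{0664}", '5'        ],  # Arabic-Indic digit four vs. ASCII 5
);

# Hypothetical helper: return the labels of cases that did not behave
# as the table expects.
sub failed_cases {
    my @failed;
    for my $case (@_) {
        my ($label, $re, $yes, $no) = @$case;
        push @failed, $label unless $yes =~ $re && $no !~ $re;
    }
    return @failed;
}

my @failed = failed_cases(@cases);
print @failed ? "unexpected results for: @failed\n" : "all properties behaved as expected\n";
```

The patterns are anchored with \A and \z so that each test checks exactly one character against exactly one property.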

Previous: ℞ 24: Disable Unicode-awareness in Builtin Character Classes

Series Index: The Standard Preamble

Next: ℞ 26: Custom Character Properties
