Perl Unicode Cookbook: Match Unicode Properties in Regex

℞ 25: Match Unicode properties in regex with \p, \P

Every Unicode codepoint has one or more properties, which indicate the rules that apply to that codepoint. Perl’s regex engine is aware of these properties: use the \p{} metacharacter sequence to match a codepoint that has a given property, and its inverse, \P{}, to match a codepoint that lacks it.

Each property has a short name and a long name. For example, to match any codepoint which has the Letter property, you may use \p{Letter} or \p{L}. Similarly, to match any codepoint which lacks the Uppercase property, you may use \P{Uppercase} or \P{Upper}. The “Unicode Character Properties” section of perldoc perlunicode describes these properties in greater detail.
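As a minimal sketch of the short and long forms, the following counts matches in a sample string. The string and the count_matches helper are assumptions made up for illustration, not part of the original recipe; the sample is written with escapes so it does not depend on the encoding of the source file.

```perl
use v5.14;
use warnings;

# Hypothetical sample: U-umlaut, n, i-diaeresis, c, o-umlaut, d, e-acute,
# then " 42 " and U+2603 SNOWMAN.
my $text = "\x{DC}n\x{EF}c\x{F6}d\x{E9} 42 \x{2603}";

# Hypothetical helper: count the codepoints in $str that match $re.
sub count_matches {
    my ($str, $re) = @_;
    my $count = () = $str =~ /$re/g;   # list assignment in scalar context
    return $count;
}

my $letters_short = count_matches($text, qr/\p{L}/);        # short name
my $letters_long  = count_matches($text, qr/\p{Letter}/);   # long name
my $non_letters   = count_matches($text, qr/\P{L}/);        # inverse of \p{L}
my $not_upper     = count_matches($text, qr/\P{Upper}/);    # lacks Uppercase

say "letters (short name): $letters_short";
say "letters (long name):  $letters_long";
say "non-letters:          $non_letters";
say "not uppercase:        $not_upper";
```

The short and long names should report the same count, and the \p{L} and \P{L} counts together should add up to the length of the string.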

Examples of these properties useful in regex include:

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower}
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}
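
A few of the properties listed above can be tried out as follows. This is only a sketch: the sample strings and variable names are assumptions chosen for illustration, and the codepoints are written as escapes so the example does not depend on the encoding of the source file.

```perl
use v5.14;
use warnings;

# Script: U+03B1 GREEK SMALL LETTER ALPHA is Greek; ASCII "a" is not.
my $alpha_is_greek = "\x{3B1}" =~ /\A\p{Greek}\z/        ? 1 : 0;
my $a_is_greek     = "a"       =~ /\A\p{script=Greek}\z/ ? 1 : 0;

# Numeric value: ASCII 4, U+0664 ARABIC-INDIC DIGIT FOUR, and
# U+2463 CIRCLED DIGIT FOUR all carry the value 4; ASCII 5 does not.
my $fours = () = "4 \x{664} \x{2463} 5" =~ /\p{NV=4}/g;

# East Asian width: the CJK ideograph U+4E2D is Wide; ASCII "a" is not.
my $cjk_is_wide = "\x{4E2D}" =~ /\p{East_Asian_Width=Wide}/ ? 1 : 0;
my $a_is_wide   = "a"        =~ /\p{EA=W}/                  ? 1 : 0;

# General category Ps (open punctuation): "(" and "[" match; ")" and "]" do not.
my $openers = () = "f(x[0])" =~ /\p{Ps}/g;

say "alpha is Greek:        $alpha_is_greek";
say "'a' is Greek:          $a_is_greek";
say "codepoints worth 4:    $fours";
say "U+4E2D is Wide:        $cjk_is_wide";
say "'a' is Wide:           $a_is_wide";
say "opening punctuation:   $openers";
```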

Previous: ℞ 24: Disable Unicode-awareness in Builtin Character Classes

Series Index: The Standard Preamble

Next: ℞ 26: Custom Character Properties
