Perl Unicode Cookbook: Match Unicode Grapheme Cluster in Regex

May 22, 2012 by Tom Christiansen

℞ 29: Match Unicode grapheme cluster in regex

In the days of ASCII, we spoke of characters and bytes. We saw few differences between them. In the Unicode world, characters are far more than seven bits of data. Far better to speak of collections of raw bytes and characters—or even Unicode codepoints.

Programmer-visible “characters” are codepoints matched by /./s, but user-visible “characters” are graphemes matched by /\X/.

That is to say, the \X regex metacharacter matches what Unicode calls an “extended grapheme cluster”. Where the user may see a single character (such as a consonant with an accent), the Unicode representation may be that consonant followed by one or more combining characters, including the accent mark. Use \X to match the entire sequence:

 use Unicode::Normalize;  # provides NFD()

 # Find vowel *plus* any combining diacritics, underlining, etc.
 my $nfd = NFD($orig);
 $nfd =~ / (?=[aeiou]) \X /xi;
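
To see the difference between codepoints and graphemes in practice, here is a minimal, self-contained sketch. The sample string and the variable names are assumptions chosen for illustration, not part of the recipe. It decomposes a word containing a vowel that carries two combining marks, counts codepoints with /./s and graphemes with /\X/, and then collects each vowel together with its marks.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical sample: "Việt", with the vowel given as the single
# precomposed codepoint U+1EC7 (e with circumflex and dot below).
my $orig = "Vi\x{1EC7}t";

# NFD splits U+1EC7 into "e" + U+0323 (dot below) + U+0302 (circumflex).
my $nfd = NFD($orig);

# /./s matches one codepoint at a time; /\X/ matches one grapheme cluster.
my $codepoints = () = $nfd =~ /./gs;
my $graphemes  = () = $nfd =~ /\X/g;

# The lookahead demands a vowel as the base character; \X then consumes
# that vowel plus every combining mark attached to it.
my @vowel_clusters = $nfd =~ / (?=[aeiou]) \X /xgi;

print "codepoints: $codepoints\n";                  # codepoints: 6
print "graphemes:  $graphemes\n";                   # graphemes:  4
print "vowel clusters: ", scalar(@vowel_clusters), "\n";   # vowel clusters: 2

# Show each cluster as a list of codepoints to avoid wide-character output.
for my $cluster (@vowel_clusters) {
    print join(" ", map { sprintf "U+%04X", ord } split //, $cluster), "\n";
}
```

The lookahead matters because \X on its own matches any grapheme, consonants included. Checking the first codepoint without consuming it restricts the match to clusters whose base is a vowel, while still letting \X take the whole cluster.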

Previous: ℞ 28: Convert non-ASCII Unicode Numerics

Series Index: The Standard Preamble

Next: ℞ 30: Extract by Grapheme Instead of Codepoint (regex)
