Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (substr)

℞ 31: Extract by grapheme instead of by codepoint (substr)

Unicode Standard Annex #29 defines the boundaries between grapheme clusters, which are what users perceive as “characters”. The CPAN module Unicode::GCString lets you treat a Unicode string as a sequence of these grapheme clusters.

While you may use \X to extract graphemes within a regex, Unicode::GCString provides a substr() method whose offset and length are counted in grapheme clusters rather than code points:

 # cpan -i Unicode::GCString
 use Unicode::GCString;

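 # $str is assumed to hold decoded characters, not raw UTF-8 bytes
 my $gcs        = Unicode::GCString->new($str);
 # offset and length are counted in grapheme clusters
 my $first_five = $gcs->substr(0, 5);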
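
To see why the unit of counting matters, here is a minimal sketch that uses only core Perl. It compares the built-in substr(), which counts code points, with a grapheme-based slice built on \X. The helper name gc_substr and the sample string are assumptions made for illustration; they are not part of Unicode::GCString.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return $len grapheme
# clusters starting at grapheme offset $off, using only the core \X escape.
sub gc_substr {
    my ($s, $off, $len) = @_;
    return $s =~ /\A\X{$off}(\X{0,$len})/ ? $1 : undef;
}

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: the final "e" and its
# accent are two code points but a single grapheme cluster.
my $str = "cafe\x{301} au lait";

my $by_codepoint = substr($str, 0, 4);      # cuts the accent off the "e"
my $by_grapheme  = gc_substr($str, 0, 4);   # keeps "e" and accent together

binmode(STDOUT, ':encoding(UTF-8)');
printf "code-point slice: %d code points\n", length $by_codepoint;   # 4
printf "grapheme slice:   %d code points\n", length $by_grapheme;    # 5
```

The code-point slice silently drops the accent, while the grapheme slice returns five code points for four user-perceived characters. Unicode::GCString's substr() gives you the second behaviour without hand-written regexes.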

The module also provides an iterator interface, through its next() and eos() methods, for walking the grapheme clusters of a string one at a time.
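
As a rough sketch of that iterator interface, the loop below collects each grapheme cluster in turn. It assumes that Unicode::GCString is installed from CPAN and that next(), eos(), and length() behave as described in the module's documentation. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::GCString;

binmode(STDOUT, ':encoding(UTF-8)');

# Decomposed "crème brûlée": each accented letter is a base letter
# followed by a combining mark (sample data, chosen for illustration).
my $str = "cre\x{300}me bru\x{302}le\x{301}e";
my $gcs = Unicode::GCString->new($str);

my @clusters;
until ($gcs->eos) {             # true once the iterator reaches the end
    my $gc = $gcs->next;        # next grapheme cluster; advances the position
    push @clusters, "$gc";      # stringify the returned cluster
}

printf "%d grapheme clusters, %d code points\n",
    scalar(@clusters), length $str;
print join('|', @clusters), "\n";
```

Each element of @clusters holds a base letter together with any combining marks attached to it, so joining the elements reproduces the original string.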

Previous: ℞ 30: Extract by Grapheme Instead of Codepoint (regex)

Series Index: The Standard Preamble

Next: ℞ 32: Reverse String by Grapheme
