Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (substr)
℞ 31: Extract by grapheme instead of by codepoint (substr)
Unicode Standard Annex #29 (UAX #29) defines the boundaries between grapheme clusters, which approximate what users perceive as “characters”. The CPAN module Unicode::GCString lets you treat a Unicode string as a sequence of these grapheme clusters rather than as a sequence of codepoints.
While you may use \X to extract graphemes within a regex, Unicode::GCString provides a substr() method to extract a series of grapheme clusters:
# Install first: cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);   # $str holds a decoded character string
my $first_five = $gcs->substr(0, 5);      # offset and length count grapheme clusters
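To see why the unit of counting matters, here is a minimal sketch using only core Perl. The sample string and the variable names are illustrative assumptions, not part of the original recipe. It compares the built-in substr(), which counts codepoints, with a \X match, which counts grapheme clusters:

```perl
use strict;
use warnings;

# Illustrative sample: "crème brûlée" in decomposed form, so each accent
# is a separate combining codepoint following its base letter.
my $str = "cre\x{300}me bru\x{302}le\x{301}e";

# The built-in substr() counts codepoints: this takes c, r, e, U+0300, m.
my $by_codepoint = substr($str, 0, 5);

# \X matches one extended grapheme cluster, so \X{0,5} takes up to
# five user-perceived characters from the start of the string.
my ($by_grapheme) = $str =~ /\A(\X{0,5})/;

# Count the grapheme clusters in each result.
my $n_codepoint_result = () = $by_codepoint =~ /\X/g;
my $n_grapheme_result  = () = $by_grapheme  =~ /\X/g;

binmode(STDOUT, ':encoding(UTF-8)');
printf "substr: %d codepoints, %d graphemes\n",
    length($by_codepoint), $n_codepoint_result;
printf "\\X:     %d codepoints, %d graphemes\n",
    length($by_grapheme), $n_grapheme_result;
```

The codepoint-based extract stops one visible character short because the combining accent uses up one of its five slots. The grapheme-based extract returns five visible characters made of six codepoints.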
The module also provides an iterator interface to grapheme clusters within a string.
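A hedged sketch of that iterator interface follows. It assumes Unicode::GCString is installed and relies on the eos() and next() methods as described in the module's documentation. The sample string and the @clusters array are illustrative names only:

```perl
use strict;
use warnings;
use Unicode::GCString;

# Illustrative sample: "crème" in decomposed form (e + U+0300).
my $str = "cre\x{300}me";
my $gcs = Unicode::GCString->new($str);

# Walk the string one grapheme cluster at a time.
# eos() reports whether the iterator has reached the end;
# next() returns the next cluster and advances the position.
my @clusters;
until ($gcs->eos) {
    my $gc = $gcs->next;
    push @clusters, "$gc";    # stringify the returned cluster
}

binmode(STDOUT, ':encoding(UTF-8)');
printf "%d clusters from %d codepoints\n", scalar(@clusters), length($str);
print join('|', @clusters), "\n";
```

Stringifying each cluster before storing it keeps the result a plain list of Perl strings, which is convenient when the rest of the program does not need GCString objects.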
Previous: ℞ 30: Extract by Grapheme Instead of Codepoint (regex)
Series Index: The Standard Preamble