℞ 31: Extract by grapheme instead of by codepoint (substr)
The Unicode Standard Annex #29 discusses the boundaries between grapheme clusters—what users might perceive as "characters". The CPAN module Unicode::GCString allows you to treat a Unicode string as a sequence of these grapheme clusters.
While you may use \X to extract graphemes within a regex, Unicode::GCString provides a substr() method to extract a series of grapheme clusters:
    # cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $first_five = $gcs->substr(0, 5);
The module also provides an iterator interface to grapheme clusters within a string.
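For comparison, the \X approach mentioned above can be sketched with core Perl alone. This is only a minimal illustration, not part of Unicode::GCString: the helper name first_graphemes and the sample string are assumptions made up for this example. It shows how the built-in substr() counts code points and can cut a combining accent off its base letter, while \X keeps each grapheme cluster whole.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): return the first $n
# grapheme clusters of $str by matching \X at most $n times.
sub first_graphemes {
    my ($str, $n) = @_;
    my ($prefix) = $str =~ /\A(\X{0,$n})/;
    return $prefix;
}

# "resume" with two combining acute accents (U+0301):
# 8 code points, but only 6 grapheme clusters.
my $str = "re\x{301}sume\x{301}";

# Built-in substr() works on code points, so the accent is left behind.
my $by_codepoint = substr($str, 0, 2);

# \X works on grapheme clusters, so "e" keeps its accent.
my $by_grapheme = first_graphemes($str, 2);

# Count grapheme clusters by counting \X matches in list context.
my $grapheme_count = () = $str =~ /\X/g;

printf "%d code points, %d graphemes\n", length($str), $grapheme_count;
printf "substr: %d code points; \\X: %d code points\n",
    length($by_codepoint), length($by_grapheme);
```

A regex like this is enough for taking a prefix. For arbitrary offsets and lengths counted in grapheme clusters, the substr() method of Unicode::GCString shown earlier is likely the more convenient choice.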