Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (substr)

℞ 31: Extract by grapheme instead of by codepoint (substr)

Unicode Standard Annex #29 (Unicode Text Segmentation) defines the boundaries between grapheme clusters, the units users perceive as single "characters". The CPAN module Unicode::GCString lets you treat a Unicode string as a sequence of these grapheme clusters rather than as a sequence of code points.
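
To make the distinction concrete, here is a minimal sketch using only core Perl. The sample string is an assumption chosen for illustration: an "e" followed by a combining acute accent, which a reader sees as one character.

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points that
# a reader perceives as the single character "e with acute".
my $str = "e\x{301}";

my $codepoints = length $str;          # built-in length counts code points
my $graphemes  = () = $str =~ /\X/g;   # \X matches one extended grapheme cluster

printf "%d code points, %d grapheme\n", $codepoints, $graphemes;
# prints "2 code points, 1 grapheme"
```

Any operation that counts or slices by code point, including the built-in substr(), can therefore split a user-visible character in two.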

While you may use \X to extract graphemes within a regex, Unicode::GCString provides a substr() method to extract a series of grapheme clusters:

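 # cpan -i Unicode::GCString
 use Unicode::GCString;

 # $str is expected to be a decoded character string, not raw UTF-8 bytes
 my $gcs        = Unicode::GCString->new($str);
 my $first_five = $gcs->substr(0, 5);   # first five grapheme clusters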
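
As a rough comparison with the built-in substr(), the sketch below slices the same string both ways. The sample string is an assumption for illustration, and the code assumes that the value returned by the module's substr() stringifies to an ordinary Perl string when interpolated, which is how the module's overloading appears to behave.

```perl
use strict;
use warnings;
use Unicode::GCString;

# "cafe" + U+0301 COMBINING ACUTE ACCENT + "s":
# six code points, five grapheme clusters.
my $str = "cafe\x{301}s";

# Built-in substr counts code points, so the accent is cut off.
my $by_codepoint = substr($str, 0, 4);

# Unicode::GCString counts grapheme clusters, so the accent stays attached.
my $gcs         = Unicode::GCString->new($str);
my $by_grapheme = $gcs->substr(0, 4);

# Interpolate to get a plain string before measuring or printing it.
my $text = "$by_grapheme";

printf "%d vs %d code points\n", length($by_codepoint), length($text);
```

Both calls ask for four units, but the grapheme-based result carries one extra code point because the combining accent travels with its base letter.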

The module also provides an iterator interface to grapheme clusters within a string.
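
A minimal sketch of that iterator interface follows, assuming the next() and eos() methods described in the module's documentation. The sample string and the @lengths array are hypothetical, used only to show what each step yields.

```perl
use strict;
use warnings;
use Unicode::GCString;

# Sample string (an assumption): "cafe" + combining acute accent + "s".
my $gcs = Unicode::GCString->new("cafe\x{301}s");

# Walk the string one grapheme cluster at a time.
my @lengths;
until ($gcs->eos) {                    # eos() is true at the end of the string
    my $cluster = $gcs->next;          # returns the next cluster and advances
    push @lengths, length "$cluster";  # code points inside this cluster
}

print join(",", @lengths), "\n";
```

Each iteration yields one user-perceived character, so the cluster holding the accented letter reports two code points while the others report one.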
