Perl Unicode Cookbook: String Length in Graphemes

℞ 33: String length in graphemes

If you learn nothing else about Unicode, remember this: characters are not bytes are not graphemes are not codepoints. A user-visible symbol (a grapheme) may be composed of multiple codepoints. Multiple combinations of codepoints may produce the same user-visible graphemes.
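To make the last point concrete, here is a minimal sketch (not part of the original recipe) that builds the grapheme é in two ways. The variable names are only illustrative. The two strings differ codepoint by codepoint, yet compare equal once both are normalized:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two spellings of the same user-visible grapheme (names are illustrative).
my $precomposed = "\N{U+00E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\N{U+0301}";    # "e" followed by COMBINING ACUTE ACCENT

# A raw comparison works codepoint by codepoint, so these differ.
print "raw: ", ($precomposed eq $decomposed ? "equal" : "different"), "\n";

# After normalizing both sides to the same form, they match.
print "NFC: ", (NFC($precomposed) eq NFC($decomposed) ? "equal" : "different"), "\n";
```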

To keep all of these entities clear in your mind, be careful and specific about what you’re trying to do at which level.

As a concrete example, the string brûlée has six graphemes but up to eight codepoints. Now suppose you want to get its length. What does length mean? If your string has been normalized to a form with exactly one codepoint per grapheme, length() gives the same answer either way, but consider:

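 use Unicode::Normalize;
 my $str = "brûlée";
 say length $str;           # 6 if the source text is precomposed and "use utf8" is in effect
 say length NFD( $str );    # 8: û and é each decompose into a base letter plus a combining mark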
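The result of the snippet above depends on how the source file happens to be encoded and normalized. As a hedged sketch, the same comparison can be written with explicit codepoint escapes so that it behaves the same everywhere. The second half is an added illustration: normalizing to NFC does not always yield one codepoint per grapheme, because some combinations (here x with an acute accent) have no precomposed codepoint.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "brûlée" spelled with precomposed codepoints, independent of file encoding.
my $nfc = "br\N{U+00FB}l\N{U+00E9}e";
my $nfd = NFD($nfc);

print length($nfc), "\n";    # 6 codepoints
print length($nfd), "\n";    # 8 codepoints

# x + COMBINING ACUTE ACCENT has no precomposed form, so NFC keeps both codepoints.
my $x_acute = NFC("x\N{U+0301}");
print length($x_acute), "\n";    # 2 codepoints, still one grapheme
```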

To measure the length of a string by counting graphemes rather than codepoints:

 my $str   = "brûlée";
 my $count = 0;
 # \X matches one extended grapheme cluster per iteration
 while ($str =~ /\X/g) { $count++ }
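The loop can also be wrapped in a small reusable function. The sketch below assumes a hypothetical helper named grapheme_length (it is not from any module) and uses the list-assignment counting idiom instead of an explicit loop. It then prints the length of the same word at three levels: UTF-8 bytes, codepoints, and graphemes.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: number of extended grapheme clusters in a string.
sub grapheme_length {
    my ($str) = @_;
    # Assigning the list of matches to () in scalar context yields the match count.
    my $count = () = $str =~ /\X/g;
    return $count;
}

my $nfc = "br\N{U+00FB}l\N{U+00E9}e";    # precomposed "brûlée"
my $nfd = NFD($nfc);                     # decomposed form of the same text

for my $s ($nfc, $nfd) {
    printf "%d bytes, %d codepoints, %d graphemes\n",
        length(encode('UTF-8', $s)), length($s), grapheme_length($s);
}
# prints "8 bytes, 6 codepoints, 6 graphemes"
# then   "10 bytes, 8 codepoints, 6 graphemes"
```

Only the grapheme count stays the same across the two forms, which is why it is the right measure when the question is how many symbols the user sees.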

Alternatively (or on older versions of Perl), the CPAN module Unicode::GCString is useful:

 use Unicode::GCString;
 my $gcs   = Unicode::GCString->new($str);
 my $count = $gcs->length;
