Perl Unicode Cookbook: String Length in Graphemes

May 30, 2012 by Tom Christiansen

℞ 33: String length in graphemes

If you learn nothing else about Unicode, remember this: characters are not bytes are not graphemes are not codepoints. A user-visible symbol (a grapheme) may be composed of multiple codepoints. Multiple combinations of codepoints may produce the same user-visible graphemes.
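
To make the last point concrete, here is a minimal sketch showing two different codepoint sequences that render as the same grapheme, é. The variable names are illustrative only; the escapes are the standard codepoints for the precomposed letter and for the base letter plus a combining accent.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two spellings of the same user-visible grapheme.
my $precomposed = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";   # U+0065 plus U+0301 COMBINING ACUTE ACCENT

# Compared codepoint by codepoint, the strings differ...
my $raw_equal = ( $precomposed eq $decomposed ) ? 1 : 0;

# ...but they are equal once both are normalized to the same form.
my $nfc_equal = ( NFC($precomposed) eq NFC($decomposed) ) ? 1 : 0;

print "raw: $raw_equal, normalized: $nfc_equal\n";   # raw: 0, normalized: 1
```

Comparing strings only after normalizing both sides to the same form is the usual way to avoid being surprised by this.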

To keep all of these entities clear in your mind, be careful and specific about what you’re trying to do at which level.

As a concrete example, the string brûlée has six graphemes but up to eight codepoints. Now suppose you want to get its length. What does length mean? If your string has been normalized to a form in which every grapheme is a single codepoint, length() also gives you the number of graphemes, but consider:

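 use Unicode::Normalize;
 my $str = "brûlée";       # assumes use utf8 (see the Standard Preamble)
 say length $str;          # 6 if the literal is precomposed (NFC)
 say length NFD( $str );   # 8: û and é each decompose into two codepoints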
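
The same comparison can be extended one level down to bytes. This is only a sketch: it builds the string from escapes so that it does not depend on the source file's encoding, and it assumes UTF-8 as the byte encoding; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# "brûlée" spelled with escapes, then put into each normalization form.
my $nfc = NFC("br\x{FB}l\x{E9}e");
my $nfd = NFD($nfc);

# length() on a character string counts codepoints.
my $nfc_codepoints = length $nfc;   # 6
my $nfd_codepoints = length $nfd;   # 8

# length() on the encoded octets counts bytes instead.
my $nfc_bytes = length encode( 'UTF-8', $nfc );   # 8
my $nfd_bytes = length encode( 'UTF-8', $nfd );   # 10

print "NFC: $nfc_codepoints codepoints, $nfc_bytes bytes\n";
print "NFD: $nfd_codepoints codepoints, $nfd_bytes bytes\n";
```

None of these four numbers is guaranteed to equal the six graphemes a reader sees; the NFC codepoint count only happens to match here because û and é both have precomposed forms.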

To measure the length of a string by counting graphemes rather than codepoints:

 my $str   = "brûlée";
 my $count = 0;
 while ($str =~ /\X/g) { $count++ }
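
If you need the count in several places, the loop can be wrapped in a small helper. The following is a sketch, not part of the original recipe: grapheme_length is a hypothetical name, and it uses the list-assignment counting idiom in place of the explicit loop.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: number of extended grapheme clusters in a string.
sub grapheme_length {
    my ($string) = @_;
    # Assigning the list of \X matches to an empty list in scalar
    # context yields the number of matches.
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $str = "br\x{FB}l\x{E9}e";   # "brûlée", precomposed

print grapheme_length($str), "\n";          # 6
print grapheme_length( NFD($str) ), "\n";   # 6, although length() reports 8
```

The result is the same for both normalization forms, which is the point of counting by grapheme.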

Alternately (or on older versions of Perl), the CPAN module Unicode::GCString is useful:

 use Unicode::GCString;
 my $gcs   = Unicode::GCString->new($str);
 my $count = $gcs->length;

Previous: ℞ 32: Reverse String by Grapheme

Series Index: The Standard Preamble

Next: ℞ 34: Unicode Column Width for Printing
