Perl Unicode Cookbook: Reverse String by Grapheme

℞ 32: Reverse string by grapheme

Because bytes and characters are not isomorphic in Unicode—and what you may see as a user-visible character (a grapheme) is not necessarily a single codepoint in a Unicode string—every string operation must be aware of the difference between codepoints and graphemes.

Consider the Perl builtin reverse. Reversing a string by codepoints messes up diacritics, mistakenly converting crème brûlée into éel̂urb em̀erc instead of into eélûrb emèrc; so reverse by grapheme instead.
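To see the problem concretely, here is a minimal sketch. It assumes the string is in decomposed form (NFD), so each accented letter is a base letter followed by a combining mark; the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;                           # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);

binmode(STDOUT, ":encoding(UTF-8)");

# Decompose so that each accented letter becomes base letter + combining mark.
my $str = NFD("crème brûlée");

# reverse in scalar context works codepoint by codepoint, so every combining
# mark lands after the wrong base letter.
my $by_codepoint = reverse $str;

print "$by_codepoint\n";
```

In the reversed string, the acute accent follows the final "e" rather than the one before it, and the circumflex attaches to the "l".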

As one option, use Perl's \X regex metacharacter to extract graphemes from a string, then reverse that list:

 $str = join("", reverse $str =~ /\X/g);
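As a self-contained sketch of the same idea, the one-liner can be wrapped in a small subroutine. The name reverse_graphemes is hypothetical, chosen only for this illustration, and the input is decomposed with NFD on the assumption that this is the case where the difference matters most.

```perl
use strict;
use warnings;
use utf8;                           # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);

binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical helper: split the text into extended grapheme clusters
# with \X, reverse that list, and join the pieces back together.
sub reverse_graphemes {
    my ($text) = @_;
    return join("", reverse $text =~ /\X/g);
}

my $reversed = reverse_graphemes(NFD("crème brûlée"));

print "$reversed\n";
```

Because each \X match keeps a base letter together with its combining marks, the accents stay on the letters they belong to.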

As another option, use Unicode::GCString to treat a string as a sequence of graphemes, not codepoints:

 use Unicode::GCString;
 $str = reverse Unicode::GCString->new($str);

Both approaches work correctly regardless of which normalization form the string is in. Note that \X is fully reliable only in Perl 5.12 and later.
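The normalization claim can be checked with a short sketch. It assumes Perl 5.12 or later and exercises only the \X approach; the variable names are illustrative.

```perl
use 5.012;
use strict;
use warnings;
use utf8;                           # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFC NFD);

my $text = "crème brûlée";

# The same text in composed and decomposed form.
my $nfc = NFC($text);
my $nfd = NFD($text);

# Reverse each form by grapheme.
my $rev_nfc = join("", reverse $nfc =~ /\X/g);
my $rev_nfd = join("", reverse $nfd =~ /\X/g);

# The two results differ in codepoints but are canonically equivalent,
# so they compare equal once both are brought to the same form.
my $same = NFC($rev_nfc) eq NFC($rev_nfd);

print $same ? "equivalent\n" : "different\n";
```

Comparing after normalizing both results is a choice made for this sketch; it sidesteps the fact that the composed and decomposed strings have different lengths in codepoints.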

