What's Wrong with sort and How to Fix It

Aug 31, 2011 by Tom Christiansen

By now, you may have read Considerations on Using Unicode Properly in Modern Perl Applications. Still think doing things correctly is easy? Tom Christiansen demonstrates that even sorting can be trickier than you think.

NOTE: The following is an excerpt from the draft manuscript of Programming Perl, 4ᵗʰ edition

Calling sort without a comparison function is quite often the wrong thing to do, even on plain text. That’s because if you use a bare sort, you can get really strange results. It’s not just Perl either: almost all programming languages work this way, even the shell command. You might be surprised to find that with this sort of nonsense sort, ‹B› comes before ‹a› not after it, ‹é› comes before ‹ｄ›, and ‹ﬀ› comes after ‹zz›. There’s no end to such silliness, either; see the default sort tables at the end of this article to see what I mean.

There are situations when a bare sort is appropriate, but fewer than you think. One scenario is when every string you’re sorting contains nothing but the 26 lowercase (or uppercase, but not both) Latin letters from ‹a-z›, without any whitespace or punctuation.

Another occasion when a simple, unadorned sort is appropriate is when you have no other goal but to iterate in an order that is merely repeatable, even if that order should happen to be completely arbitrary. In other words, yes, it’s garbage, but it’s the same garbage this time as it was last time. That’s because the default sort resorts to an unmediated cmp operator, which has the “predictable garbage” characteristics I just mentioned.

The last situation is much less frequent than the first two. It requires that the things you’re sorting be special‐purpose, dedicated binary keys whose bit sequences have with excruciating care been arranged to sort in some prescribed fashion. This is also the strategy for any reasonable use of the cmp operator.

So what’s wrong with `sort` anyway?

I know, I know. I can hear everyone saying, “But it’s called sort, so how could that ever be wrong?” Sure it’s called sort, but you still have to know how to use it to get useful results out. Probably the most surprising thing about sort is that it does not by default do an alphabetic, an alphanumeric, or a numeric sort. What it actually does is something else altogether, and that something else is of surprisingly limited usefulness.

Imagine you have an array of records. It does you virtually no good to write:

@sorted_recs = sort @recs;

Because Perl’s cmp operator does only a bit comparison not an alphabetic one, it does nearly as little good to write your record sort this way:

@srecs = sort {
    $b->{AGE}      <=>  $b->{AGE}
                   ||
    $a->{SURNAME}  cmp  $b->{SURNAME}
} @recs;

The problem is that that cmp for the record’s SURNAME field is not an alphabetic comparison. It’s merely a code point comparison. That means it works like C’s strcmp function or Java’s String.compareTo method. Although commonly referred to as a “lexicographic” comparison, this is a gross misnomer: it’s about as far away from the way real lexicographers sort dictionary entries as you can get without flipping a coin.

Fortunately, you don’t have to come up with your own algorithm for dictionary sorting, because Perl provides a standard class to do this for you: Unicode::Collate. Don’t let the name throw you, because while it was first invented for Unicode, it works great on regular ASCII text, too, and does a better job at making lexicographers happy than a plain old sort ever manages.

If you have code that purports to sort text that looks like this:

@sorted_lines = sort @lines;

Then all you have to get a dictionary sort is write instead:

use Unicode::Collate;
@sorted_lines = Unicode::Collate::->new->sort(@lines);

For structured records, like those with ages and surnames in them, you have to be a bit fancier. One way to fix it would be to use the class’s own cmp operator instead of the built‐in one.

use Unicode::Collate;
my $collator = Unicode::Collate::->new();
@srecs = sort {
    $b->{AGE}  <=>  $b->{AGE}
          ||
    $collator->cmp( $a->{SURNAME}, $b->{SURNAME} )
} @recs;

However, that makes a fairly expensive method call for every possible comparison. Because Perl’s adaptive merge sort algorithm usually runs in O(n · log n) time given n items, and because each comparison requires two different computed keys, that can be a lot of duplicate effort. Our sorting class therefore provide a convenient getSortKey method that calculates a special binary key which you can cache and later pass to the normal cmp operator on your own. This trick lets you use cmp yet get a truly alphabetic sort out of it for a change.

Here is a simple but sufficient example of how to do that:

use Unicode::Collate;
my $collator = Unicode::Collate::->new();

# first calculate the magic sort key for each text field, and cache it
for my $rec (@recs) {
    $rec->{SURNAME_key} = $collator->getSortKey( $rec->{SURNAME} );
} 

# now sort the records as before, but for the surname field,
# use the cached sort key instead
@srecs = sort {
    $b->{AGE}          <=>  $b->{AGE}
                      ||
    $a->{SURNAME_key}  cmp  $b->{SURNAME_key}
} @recs;

That’s what I meant about very carefully preparing a mediated sort key that contains the precomputed binary key.

English Card Catalogue Sorts

The simple code just demonstrated assumes you want to sort names the same way you do regular text. That isn’t a good assumption, however. Many countries, languages, institutions, and sometimes even librarians have their own notions about how a card catalogue or a phonebook ought to be sorted.

For example, in the English language, surnames with Scottish patronymics starting with ‹Mc› or ‹Mac›, like MacKinley and McKinley, not only count as completely identical synonyms for sorting purposes, they go before any other surname that begins with ‹M›, and so precede surnames like Mables or Machado.

Yes, really.

That means that the following names are sorted correctly – for English:

Lewis, C.S.
McKinley, Bill
MacKinley, Ron
Mables, Martha
Machado, José
Macon, Bacon

Yes, it’s true. Check out your local large English‐language bookseller or library – presuming you can find one. If you do, best make sure to blow the dust off first.

Sorting Spanish Names

It’s a good thing those names follow English rules for sorting names. If this were Spanish, we would have to deal with double‐barrelled surnames, where the patronym sorts before the matronym, which in turn sorts before any given names. That means that if Señor Machado’s full name were, like the poet’s, Antonio Cipriano José María y Francisco de Santa Ana Machado y Ruiz, then you would have to sort him with the other Machados but then consider Ruiz before Antonio if there were any other Machados. Similarly, the poet Federico del Sagrado Corazón de Jesús García Lorca sorts before the writer Gabriel José de la Concordia García Márquez.

On the other hand, if your records are not full multifield hashes but only simple text that don’t happen to be surnames, your task is a lot simpler, since now all you have to is get the cmp operator to behave sensibly. That you can do easily enough this way:

use Unicode::Collate;
@sorted_text = Unicode::Collate::->new->sort(@text);

Sorting Text, Not Binary

Imagine you had this list of German‐language authors:

@germans = qw{
    Böll
    Born
    Böhme
    Bodmer
    Brandis
    Böttcher
    Borchert
    Bobrowski
};

If you just sorted them with an unmediated sort operator, you would get this utter nonsense:

Bobrowski
Bodmer
Borchert
Born
Brandis
Brant
Böhme
Böll
Böttcher

Or maybe this equally nonsensical answer:

Bobrowski
Bodmer
Borchert
Born
Böll
Brandis
Brant
Böhme
Böttcher

Or even this still completely nonsensical answer:

Bobrowski
Bodmer
Borchert
Born
Böhme
Böll
Brandis
Brant
Böttcher

The crucial point to all that is that it’s text not binary, so not only can you never judge what its bit patterns hold just by eyeballing it, more importantly, it has special rules to make it sort alphabetically (some might say sanely), an ordering no naïve code‐point sort will never come even close to getting right, especially on Unicode.

The correct ordering is:

Bobrowski
Bodmer
Böhme
Böll
Borchert
Born
Böttcher
Brandis
Brant

And that is precisely what

use Unicode::Collate;
@sorted_germans = Unicode::Collate::->new->sort(@german_names);

gives you: a correctly sorted list of those Germans’ names.

Sorting German Names

Hold on, though.

Correct in what language? In English, yes, the order given is now correct. But considering that these authors wrote in the German language, it is quite conceivable that you should be following the rules for ordering German names in German, not in English. That produces this ordering:

Bobrowski
Bodmer
Böhme
Böll
Böttcher
Borchert
Born
Brandis
Brant

How come Böttcher now came before Borchert? Because Böttcher is supposed to be the same as Boettcher. In a German phonebook or other German list of German names, things like ‹ö› and ‹oe› are considered synonyms, which is not at all how it works in English. To get the German phonebook sort, you merely have to modify your constructor this way:

use Unicode::Collate::Locale;
@sorted_germans = Unicode::Collate::Locale::
                      ->new(locale => "de_phonebook")
                      ->sort(@german_names);

Isn’t this fun?

Be glad you’re not sorting names. Sorting names is hard.

Default Sort Tables

Here are most of the Latin letters, ordered using the default sort:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j 
k l m n o p q r s t u v w x y z ª º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ 
Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö 
ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě 
Ĝ ĝ Ğ ğ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ 
ŀ Ł ł Ń ń Ņ ņ Ň ň Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť 
ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ ƀ Ɓ Ƃ ƃ Ƈ ƈ Ɖ Ɗ Ƌ 
ƌ ƍ Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ƥ ƥ Ʀ ƫ Ƭ ƭ Ʈ Ư ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ 
ƹ ƺ ƾ ƿ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ ǡ Ǣ ǣ 
Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ ǰ Ǳ ǲ ǳ Ǵ ǵ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ 
ȉ Ȋ ȋ Ȍ ȍ Ȏ ȏ Ȑ ȑ Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ Ƞ ȡ Ȥ ȥ Ȧ ȧ Ȩ ȩ Ȫ ȫ Ȭ ȭ Ȯ 
ȯ Ȱ ȱ Ȳ ȳ ȴ ȵ ȶ ȷ Ⱥ Ȼ ȼ Ƚ Ⱦ ɐ ɑ ɒ ɓ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ ɠ ɡ ɢ ɣ ɤ ɥ ɦ 
ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ ɰ ɱ ɲ ɳ ɴ ɶ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʀ ʁ ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ 
ʎ ʏ ʐ ʑ ʒ ʓ ʙ ʚ ʛ ʜ ʝ ʞ ʟ ʠ ʣ ʤ ʥ ʦ ʧ ʨ ʩ ʪ ʫ ˡ ˢ ˣ ᴀ ᴁ ᴂ ᴃ ᴄ ᴅ ᴆ ᴇ ᴈ ᴉ 
ᴊ ᴋ ᴌ ᴍ ᴎ ᴏ ᴑ ᴓ ᴔ ᴘ ᴙ ᴚ ᴛ ᴜ ᴝ ᴞ ᴟ ᴠ ᴡ ᴢ ᴣ ᴬ ᴭ ᴮ ᴯ ᴰ ᴱ ᴲ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ 
ᴻ ᴼ ᴾ ᴿ ᵀ ᵁ ᵂ ᵃ ᵄ ᵅ ᵆ ᵇ ᵈ ᵉ ᵊ ᵋ ᵌ ᵍ ᵎ ᵏ ᵐ ᵑ ᵒ ᵖ ᵗ ᵘ ᵙ ᵚ ᵛ ᵢ ᵣ ᵤ ᵥ ᵫ ᵬ ᵭ 
ᵮ ᵯ ᵰ ᵱ ᵲ ᵳ ᵴ ᵵ ᵶ Ḁ ḁ Ḃ ḃ Ḅ ḅ Ḇ ḇ Ḉ ḉ Ḋ ḋ Ḍ ḍ Ḏ ḏ Ḑ ḑ Ḓ ḓ Ḕ ḕ Ḗ ḗ Ḙ ḙ Ḛ 
ḛ Ḝ ḝ Ḟ ḟ Ḡ ḡ Ḣ ḣ Ḥ ḥ Ḧ ḧ Ḩ ḩ Ḫ ḫ Ḭ ḭ Ḯ ḯ Ḱ ḱ Ḳ ḳ Ḵ ḵ Ḷ ḷ Ḹ ḹ Ḻ ḻ Ḽ ḽ Ḿ 
ḿ Ṁ ṁ Ṃ ṃ Ṅ ṅ Ṇ ṇ Ṉ ṉ Ṋ ṋ Ṍ ṍ Ṏ ṏ Ṑ ṑ Ṓ ṓ Ṕ ṕ Ṗ ṗ Ṙ ṙ Ṛ ṛ Ṝ ṝ Ṟ ṟ Ṡ ṡ Ṣ 
ṣ Ṥ ṥ Ṧ ṧ Ṩ ṩ Ṫ ṫ Ṭ ṭ Ṯ ṯ Ṱ ṱ Ṳ ṳ Ṵ ṵ Ṷ ṷ Ṹ ṹ Ṻ ṻ Ṽ ṽ Ṿ ṿ Ẁ ẁ Ẃ ẃ Ẅ ẅ Ẇ 
ẇ Ẉ ẉ Ẋ ẋ Ẍ ẍ Ẏ ẏ Ẑ ẑ Ẓ ẓ Ẕ ẕ ẖ ẗ ẘ ẙ ẚ ẛ ẞ ẟ Ạ ạ Ả ả Ấ ấ Ầ ầ Ẩ ẩ Ẫ ẫ Ậ 
ậ Ắ ắ Ằ ằ Ẳ ẳ Ẵ ẵ Ặ ặ Ẹ ẹ Ẻ ẻ Ẽ ẽ Ế ế Ề ề Ể ể Ễ ễ Ệ ệ Ỉ ỉ Ị ị Ọ ọ Ỏ ỏ Ố 
ố Ồ ồ Ổ ổ Ỗ ỗ Ộ ộ Ớ ớ Ờ ờ Ở ở Ỡ ỡ Ợ ợ Ụ ụ Ủ ủ Ứ ứ Ừ ừ Ử ử Ữ ữ Ự ự Ỳ ỳ Ỵ 
ỵ Ỷ ỷ Ỹ ỹ K Å Ⅎ ⅎ Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ ⅰ ⅱ ⅲ ⅳ ⅴ 
ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ
Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ
ｊ ｋ ｌ ｍ ｎ ｏ ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ

As you can see, those letters are scattered all over the place. Sure, it’s not completely random, but it’s not useful either, because it is full of arbitrary placement that makes no alphabetical sense. That’s because it is not an alphabetic sort at all. However, with the special kind of sort I’ve just shown you above, the ones that call the sort method from the Unicode::Collate class, you do get an alphabetic sort. Using that method, the Latin letters I just showed you now come out in alphabetical order, which is like this:

a ａ A Ａ ª ᵃ ᴬ á Á à À ă Ă ắ Ắ ằ Ằ ẵ Ẵ ẳ Ẳ â Â ấ Ấ ầ Ầ ẫ Ẫ ẩ Ẩ ǎ Ǎ å Å 
Å ǻ Ǻ ä Ä ǟ Ǟ ã Ã ȧ Ȧ ǡ Ǡ ą Ą ā Ā ả Ả ȁ Ȁ ȃ Ȃ ạ Ạ ặ Ặ ậ Ậ ḁ Ḁ æ Æ ᴭ ǽ Ǽ 
ǣ Ǣ ẚ ᴀ Ⱥ ᴁ ᴂ ᵆ ɐ ᵄ ɑ ᵅ ɒ b ｂ B Ｂ ᵇ ᴮ ḃ Ḃ ḅ Ḅ ḇ Ḇ ʙ ƀ ᴯ ᴃ ᵬ ɓ Ɓ ƃ Ƃ c 
ｃ ⅽ C Ｃ Ⅽ ć Ć ĉ Ĉ č Č ċ Ċ ç Ç ḉ Ḉ ᴄ ȼ Ȼ ƈ Ƈ ɕ d ｄ ⅾ D Ｄ Ⅾ ᵈ ᴰ ď Ď ḋ 
Ḋ ḑ Ḑ ḍ Ḍ ḓ Ḓ ḏ Ḏ đ Đ ð Ð ǳ ʣ ǲ Ǳ ǆ ǅ Ǆ ʥ ʤ ᴅ ᴆ ᵭ ɖ Ɖ ɗ Ɗ ƌ Ƌ ȡ ẟ e ｅ E 
Ｅ ᵉ ᴱ é É è È ĕ Ĕ ê Ê ế Ế ề Ề ễ Ễ ể Ể ě Ě ë Ë ẽ Ẽ ė Ė ȩ Ȩ ḝ Ḝ ę Ę ē Ē ḗ 
Ḗ ḕ Ḕ ẻ Ẻ ȅ Ȅ ȇ Ȇ ẹ Ẹ ệ Ệ ḙ Ḙ ḛ Ḛ ᴇ ǝ Ǝ ᴲ ə Ə ᵊ ɛ Ɛ ᵋ ɘ ɚ ɜ ᴈ ᵌ ɝ ɞ ʚ ɤ 
f ｆ F Ｆ ḟ Ḟ ﬀ ﬃ ﬄ ﬁ ﬂ ʩ ᵮ ƒ Ƒ ⅎ Ⅎ g ｇ G Ｇ ᵍ ᴳ ǵ Ǵ ğ Ğ ĝ Ĝ ǧ Ǧ ġ Ġ ģ 
Ģ ḡ Ḡ ɡ ɢ ǥ Ǥ ɠ Ɠ ʛ ɣ Ɣ h ｈ H Ｈ ᴴ ĥ Ĥ ȟ Ȟ ḧ Ḧ ḣ Ḣ ḩ Ḩ ḥ Ḥ ḫ Ḫ ẖ ħ Ħ ʜ 
ƕ ɦ ɧ i ｉ ⅰ I Ｉ Ⅰ ᵢ ᴵ í Í ì Ì ĭ Ĭ î Î ǐ Ǐ ï Ï ḯ Ḯ ĩ Ĩ İ į Į ī Ī ỉ Ỉ ȉ 
Ȉ ȋ Ȋ ị Ị ḭ Ḭ ⅱ Ⅱ ⅲ Ⅲ ĳ Ĳ ⅳ Ⅳ ⅸ Ⅸ ı ɪ ᴉ ᵎ ɨ Ɨ ɩ Ɩ j ｊ J Ｊ ᴶ ĵ Ĵ ǰ ȷ ᴊ 
ʝ ɟ ʄ k ｋ K K Ｋ ᵏ ᴷ ḱ Ḱ ǩ Ǩ ķ Ķ ḳ Ḳ ḵ Ḵ ᴋ ƙ Ƙ ʞ l ｌ ⅼ L Ｌ Ⅼ ˡ ᴸ ĺ Ĺ 
ľ Ľ ļ Ļ ḷ Ḷ ḹ Ḹ ḽ Ḽ ḻ Ḻ ł Ł ŀ Ŀ ǉ ǈ Ǉ ʪ ʫ ʟ ᴌ ƚ Ƚ ɫ ɬ ɭ ȴ ɮ ƛ ʎ m ｍ ⅿ M 
Ｍ Ⅿ ᵐ ᴹ ḿ Ḿ ṁ Ṁ ṃ Ṃ ᴍ ᵯ ɱ n ｎ N Ｎ ᴺ ń Ń ǹ Ǹ ň Ň ñ Ñ ṅ Ṅ ņ Ņ ṇ Ṇ ṋ Ṋ ṉ 
Ṉ ǌ ǋ Ǌ ɴ ᴻ ᴎ ᵰ ɲ Ɲ ƞ Ƞ ɳ ȵ ŋ Ŋ ᵑ o ｏ O Ｏ º ᵒ ᴼ ó Ó ò Ò ŏ Ŏ ô Ô ố Ố ồ 
Ồ ỗ Ỗ ổ Ổ ǒ Ǒ ö Ö ȫ Ȫ ő Ő õ Õ ṍ Ṍ ṏ Ṏ ȭ Ȭ ȯ Ȯ ȱ Ȱ ø Ø ǿ Ǿ ǫ Ǫ ǭ Ǭ ō Ō ṓ 
Ṓ ṑ Ṑ ỏ Ỏ ȍ Ȍ ȏ Ȏ ớ Ớ ờ Ờ ỡ Ỡ ở Ở ợ Ợ ọ Ọ ộ Ộ œ Œ ᴏ ᴑ ɶ ᴔ ᴓ p ｐ P Ｐ ᵖ 
ᴾ ṕ Ṕ ṗ Ṗ ᴘ ᵱ ƥ Ƥ q ｑ Q Ｑ ʠ ĸ r ｒ R Ｒ ᵣ ᴿ ŕ Ŕ ř Ř ṙ Ṙ ŗ Ŗ ȑ Ȑ ȓ Ȓ ṛ 
Ṛ ṝ Ṝ ṟ Ṟ ʀ Ʀ ᴙ ᵲ ɹ ᴚ ɺ ɻ ɼ ɽ ɾ ᵳ ɿ ʁ s ｓ S Ｓ ˢ ś Ś ṥ Ṥ ŝ Ŝ š Š ṧ Ṧ ṡ 
Ṡ ş Ş ṣ Ṣ ṩ Ṩ ș Ș ſ ẛ ß ẞ ﬆ ﬅ ᵴ ʂ ʃ ʅ ʆ t ｔ T Ｔ ᵗ ᵀ ť Ť ẗ ṫ Ṫ ţ Ţ ṭ Ṭ 
ț Ț ṱ Ṱ ṯ Ṯ ʨ ƾ ʦ ʧ ᴛ ŧ Ŧ Ⱦ ᵵ ƫ ƭ Ƭ ʈ Ʈ ȶ ʇ u ｕ U Ｕ ᵘ ᵤ ᵁ ú Ú ù Ù ŭ Ŭ 
û Û ǔ Ǔ ů Ů ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ǖ Ǖ ű Ű ũ Ũ ṹ Ṹ ų Ų ū Ū ṻ Ṻ ủ Ủ ȕ Ȕ ȗ Ȗ ư Ư 
ứ Ứ ừ Ừ ữ Ữ ử Ử ự Ự ụ Ụ ṳ Ṳ ṷ Ṷ ṵ Ṵ ᴜ ᴝ ᵙ ᴞ ᵫ ʉ ɥ ɯ Ɯ ᵚ ᴟ ɰ ʊ Ʊ v ｖ ⅴ V 
Ｖ Ⅴ ᵛ ᵥ ṽ Ṽ ṿ Ṿ ⅵ Ⅵ ⅶ Ⅶ ⅷ Ⅷ ᴠ ʋ Ʋ ʌ w ｗ W Ｗ ᵂ ẃ Ẃ ẁ Ẁ ŵ Ŵ ẘ ẅ Ẅ ẇ Ẇ ẉ 
Ẉ ᴡ ʍ x ｘ ⅹ X Ｘ Ⅹ ˣ ẍ Ẍ ẋ Ẋ ⅺ Ⅺ ⅻ Ⅻ y ｙ Y Ｙ ý Ý ỳ Ỳ ŷ Ŷ ẙ ÿ Ÿ ỹ Ỹ ẏ 
Ẏ ȳ Ȳ ỷ Ỷ ỵ Ỵ ʏ ƴ Ƴ z ｚ Z Ｚ ź Ź ẑ Ẑ ž Ž ż Ż ẓ Ẓ ẕ Ẕ ƍ ᴢ ƶ Ƶ ᵶ ȥ Ȥ ʐ ʑ 
ʒ Ʒ ǯ Ǯ ᴣ ƹ Ƹ ƺ ʓ ȝ Ȝ þ Þ ƿ Ƿ

Isn’t that much nicer?

Romani Ite Domum

In case you’re wondering what that last row of distinctly un‐Roman Latin letters might possibly be, they’re called respectively ezh ʒ, yogh ȝ, thorn þ, and wynn ƿ. They had to go somewhere, so they ended up getting stuck after ‹z›

Some are still used in certain non‐English (but still Latin) alphabets today, such as Icelandic, and even though you probably won’t bump into them in contemporary English texts, you might see some if you’re reading the original texts of famous medieval English poems like Beowulf, Sir Gawain and the Green Knight, or Brut.

The last of those, Brut, was written by a fellow named Laȝamon, a name whose third letter is a yogh. Famous though he was, I wouldn’t suggest changing your name to ‹Laȝamon› in his honor, as I doubt the phone company would be amused.

Tags

unicode

sorting

unicode

Tom Christiansen

Browse their articles

Feedback

Something wrong with this article? Help us out by opening an issue or pull request on GitHub

TPRF Gold Sponsor

TPRF Silver Sponsor

Perl Resources

What's Wrong with sort and How to Fix It

So what’s wrong with `sort` anyway?

English Card Catalogue Sorts

Sorting Spanish Names

Sorting Text, Not Binary

Sorting German Names

Default Sort Tables

Romani Ite Domum

Tom Christiansen

Browse their articles

Feedback

Site Map

Contact Us

License

Legal

What's Wrong with sort and How to Fix It

So what’s wrong with sort anyway?

English Card Catalogue Sorts

Sorting Spanish Names

Sorting Text, Not Binary

Sorting German Names

Default Sort Tables

Romani Ite Domum

Tom Christiansen

Browse their articles

Feedback

Site Map

Contact Us

License

Legal

So what’s wrong with `sort` anyway?