Making Dictionaries with Perl
by Sean M. Burke
Sorting and Duplicate Headwords
So how do we take entries in whatever order, and put them into alphabetical order? A first hack is something like this:
my %headword2entry;
foreach my $entry ($lex->entries) {
  my %e = $entry->as_list;
  $headword2entry{ $e{'hw'} } = \%e;
}
foreach my $headword (sort keys %headword2entry) {
  my %e = %{ $headword2entry{$headword} };
  # ...and print it here...
}
And that indeed works fine. But suppose one of the linguists comes by and adds these three entries into our little database:
\hw gíi
\pos auxiliary verb
\engl already; always; often

\hw gu
\pos postposition
\engl there

\hw gíi
\pos verb
\engl swim away [of fish]
When we run our program, there's trouble with the output:

First off, the second "gíi" (the verb for fish swimming away) was stored as $headword2entry{'gíi'} and that overwrote the first "gíi" entry (the one that means already, always, or often). And secondly, "gíi" got sorted after "gu"!
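Both problems can be reproduced without the lexicon file. The following is a minimal sketch, assuming a hand-built @entries array of plain hashes as a stand-in for $lex->entries (that array and its contents are illustrative, not part of the real program):

```perl
use strict;
use warnings;
use utf8;                                 # this source text contains "í"
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for $lex->entries: plain hashes, not lexicon objects.
my @entries = (
    { hw => 'gíi', pos => 'auxiliary verb', engl => 'already; always; often' },
    { hw => 'gu',  pos => 'postposition',   engl => 'there' },
    { hw => 'gíi', pos => 'verb',           engl => 'swim away [of fish]' },
);

# One slot per headword: the later "gíi" silently replaces the earlier one.
my %headword2entry;
$headword2entry{ $_->{hw} } = $_ for @entries;

# Perl's default sort compares character codes, and "í" has a higher
# code than "u", so "gu" comes out first.
my @order = sort keys %headword2entry;
print "$_ ($headword2entry{$_}{pos})\n" for @order;
```

Only two headwords survive, "gu" is listed first, and the "gíi" that remains is the verb.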
The first problem can be solved by changing from the current data structure, which is like this:
$headword2entry{ 'gíi' } = ...one_entry...;
over to a new data structure, which is like this:
$headword2entries{ 'gíi' } =
[ ...one_entry... , ...another_entry..., ...maybe_even_another... ];
...even though in most cases that list will hold just one entry.
That's simple to graft into our program, even if the syntax for dereferencing gets a bit thick:
my %headword2entries;
foreach my $entry ($lex->entries) {
  my %e = $entry->as_list;
  push @{ $headword2entries{ $e{'hw'} } }, \%e;
}
foreach my $headword (sort keys %headword2entries) {
  foreach my $entry ( @{ $headword2entries{$headword} } ) {
    # ...code to print the entry...
  }
}
And that works just right: both "gíi" entries show up.
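To see the list-per-headword structure on its own, here is a small sketch, again assuming a hypothetical @entries array of plain hashes in place of $lex->entries:

```perl
use strict;
use warnings;
use utf8;                                 # this source text contains "í"

# Hypothetical stand-in for $lex->entries.
my @entries = (
    { hw => 'gíi', pos => 'auxiliary verb' },
    { hw => 'gu',  pos => 'postposition' },
    { hw => 'gíi', pos => 'verb' },
);

# push through @{ ... } creates the array reference on first use
# (autovivification), so no "does this key exist yet?" check is needed.
my %headword2entries;
push @{ $headword2entries{ $_->{hw} } }, $_ for @entries;

my @gii = @{ $headword2entries{'gíi'} };
printf "%d entries: %s\n", scalar @gii, join ' / ', map { $_->{pos} } @gii;
```

The two homographs are kept in the order they were read, so the auxiliary verb is listed before the verb.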
Now how to get sort keys %headword2entries to sort "gíi" before "gu"? The default sort() that Perl uses just sorts ASCIIbetically, where "í" comes not just after "u", but actually after all the unaccented letters. We can get Perl to use a smarter sort() if we add a "use locale;" line and see about changing our current locale to French or German or something that'd know that "í" sorts before "u". This approach works in some cases, but suppose that you're dealing with a language that uses "dh" as a combined letter that comes after "d". You'd be out of luck, since (as far as I know) there aren't any existing locales that have "dh" as a letter after "d", and since under most operating systems you can't define your own locales.
But CPAN, once again, comes to the rescue. The CPAN module Sort::ArbBiLex lets you state a sort order and get back a function that sorts according to that order. We can just pull this example from the docs:
use Sort::ArbBiLex (
'custom_sort' => # that's the function name to define
"
a A à À á Á â Â ã Ã ä Ä å Å æ Æ
b B
c C ç Ç
d D ð Ð
e E è È é É ê Ê ë Ë
f F
g G
h H
i I ì Ì í Í î Î ï Ï
j J
k K
l L
m M
n N ñ Ñ
o O ò Ò ó Ó ô Ô õ Õ ö Ö ø Ø
p P
q Q
r R
s S ß
t T þ Þ
u U ù Ù ú Ú û Û ü Ü
v V
w W
x X
y Y ý Ý ÿ
z Z
"
);
And if we need that "dh" to be a new letter between "d" and "e", it's a simple matter of adding a line to the above code:
...
d D ð Ð
dh Dh
e E è È é É ê Ê ë Ë
...
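To get a feel for what a declared sort order has to do, including treating "dh" as a single letter, here is a hand-rolled sketch. It is not how Sort::ArbBiLex is implemented; the names @families, sort_keys() and sketch_sort() are made up for illustration, and the letter table is a trimmed-down version of the one above. The idea is two-level: compare words by which letter line each glyph belongs to, and only on a tie compare by position within the line.

```perl
use strict;
use warnings;
use utf8;                                 # this source text contains accented letters

# Hypothetical, trimmed-down sort order: one "letter family" per row.
my @families = (
    [qw( d D ð Ð )],
    [qw( dh Dh )],
    [qw( e E è È é É ê Ê ë Ë )],
    [qw( g G )],
    [qw( i I ì Ì í Í î Î ï Ï )],
    [qw( u U ù Ù ú Ú û Û ü Ü )],
);

# Record each glyph's family (major rank) and place in its family (minor rank).
my ( %major, %minor );
for my $f ( 0 .. $#families ) {
    for my $g ( 0 .. $#{ $families[$f] } ) {
        $major{ $families[$f][$g] } = $f;
        $minor{ $families[$f][$g] } = $g;
    }
}

# Try longer glyphs first, so "dh" is matched before "d".
my $glyph_re = join '|', map { quotemeta }
               sort { length($b) <=> length($a) } keys %major;

# Turn a word into two comparison strings, one per level.
# Characters missing from the table are simply skipped in this sketch.
sub sort_keys {
    my ($word) = @_;
    my @glyphs = $word =~ /($glyph_re)/g;
    return ( join( '', map { chr( 1 + $major{$_} ) } @glyphs ),
             join( '', map { chr( 1 + $minor{$_} ) } @glyphs ) );
}

sub sketch_sort {
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
           map  { [ $_, sort_keys($_) ] } @_;
}

my @sorted = sketch_sort(qw( gu dhi gíi di e gii ));
```

With this table, @sorted comes out as di, dhi, e, gii, gíi, gu: "dh" words follow all "d" words, and the accent on "í" only matters as a tie-breaker.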
And if the above sort order isn't right, we can fix this by just moving things around. For example, a few Haida words use an x-circumflex character for an odd pharyngeal sound, and since that character isn't in Latin-1, the folks working on Haida use a special font that replaces the Latin-1 þ character with the x-circumflex. To have that sort as a letter after x, we'd rearrange the end of the above sort-order to read like this:
...
t T
u U ù Ù ú Ú û Û ü Ü
v V
w W
x X
þ Þ
y Y ý Ý ÿ
z Z
Once we get the big use Sort::ArbBiLex (...); statement set up just the way we like it, we can just replace the "sort" in our "sort keys" with "custom_sort", like so:
foreach my $headword (custom_sort keys %headword2entries) {
  foreach my $entry ( @{ $headword2entries{$headword} } ) {
    # ...code to print the entry...
  }
}
With that in place, our entries sort just right:

Reverse Indexing
The last thing anyone wants to do when they've finished working on a dictionary is to turn right around and write another one -- but that's exactly the problem that comes up in lexicography: you've been compiling a Haida-to-English dictionary, and then someone says "Gee, it'd be really handy to have an English-to-Haida one, too!"
In the bad old days before people used databases for their lexicons, this process of "reversing the dictionary" was manual. Now that we have databases, we just need a way to see the entry that expresses "gu" = "there" in our main lexicon, and then make an entry in a reverse lexicon that expresses "there" = "gu".
The reverse lexicon could be just %english2native with entries like:
$english2native{'there'} = "gu";
But there could be several words that mean "there" -- like "gyaasdáan" -- so we'd have to use an array here, just as we did in %headword2entries, like this:
$english2native{'there'} = [ "gu", "gyaasdáan" ];
We can implement this by changing our initial lexicon-scanning routine to add a line to push to @{$english2native{each_english_bit}}, like so:
foreach my $entry ($lex->entries) {
  my %e = $entry->as_list;
  push @{ $headword2entries{ $e{'hw'} } }, \%e;
  foreach my $engl ( reversables( $e{'engl'} ) ) {
    push @{ $english2native{ $engl } }, $e{'hw'};
  }
}
And later on, we can spit out the contents of %english2native after the main dictionary:
$rtf->paragraph( "\n\nEnglish to Haida Index\n" );
foreach my $engl ( custom_sort keys %english2native ) {
  my $n = join "; ", custom_sort @{ $english2native{ $engl } };
  $rtf->paragraph( "$engl: $n" );
}
All we need now is a routine, reversables(), that can take the string "already; always; often" (from the gíi entry) and turn it into the list ("already", "always", "often"), and take the string "the shelter of a tree" and turn it into the one-item list ("shelter of a tree"). (If we left the "the" on there, we'd have a huge bunch of entries under "the"!)
This function is a decent first hack:
sub reversables {
  my $in = shift || return();
  my @english;
  foreach my $term ( split /\s*;\s*/, $in ) {
    $term =~ s/^(a|an|the|to)\s+//i;
    # Take the "to" off of "to swim away [of fish]",
    # and the "the" off of "the shelter of a tree"
    push @english, $term;
  }
  return @english;
}
However, consider the entry anáa: "inside a house; at home" -- our reversables() function will return this as the list ("inside a house", "at home"). That seems passable, but if I were looking for a word like this in the English end of the dictionary, I'd probably want it to be under "home" and "house" as well.
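The behavior just described is easy to check by running reversables() on a few glosses from the text. This sketch repeats the function so that it runs on its own; the loop around it is only illustrative:

```perl
use strict;
use warnings;

# Same function as in the article, repeated so this snippet is self-contained.
sub reversables {
    my $in = shift || return();
    my @english;
    foreach my $term ( split /\s*;\s*/, $in ) {
        $term =~ s/^(a|an|the|to)\s+//i;    # drop one leading article or "to"
        push @english, $term;
    }
    return @english;
}

for my $gloss ( 'already; always; often',
                'the shelter of a tree',
                'inside a house; at home' ) {
    print "$gloss => ", join( ' | ', reversables($gloss) ), "\n";
}
```

Note that "at home" is left alone: the pattern requires whitespace right after the article, so the "a" in "at" does not count.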
Now, there are four alternatives here for how to have finer control over the reversing:
- Just don't bother, and instead just do this all manually in the editing of the final draft.
  This is a bad approach because, in my experience, the people working on the lexicon get so used to the just-passable reversing algorithm that they end up thinking it's no big deal, and so in the end its effects never get fixed.
- Don't do automatic reversing, but have a mandatory field in each entry that says what English headword(s) should point to this native entry.
  For example, if we call the field "ehw" (for "English headword"), then the entry for "inside a house; at home" could say something like: "\ehw home, at; house, inside a". However, having this be mandatory is a real drag for simple entries like "gu", where you'd have to do:
  \hw gu
  \engl there
  \ehw there
- Make an "ehw" field optional, and when it's absent, use a smart reversing algorithm.
  So when we have an entry like "\hw gu \engl there", of course the reversing algorithm would know to infer a "\ehw there". And it would somehow be smart enough to know to index "wave a piece of cloth" under "wave" and "cloth" but not under "a", "piece", or "of". The problem with very smart fallback algorithms like this is that people have to understand them completely, so that they can know whether the result is good enough or whether it should be overridden with an explicit "\ehw" field. But since nobody can remember all the hacks that get built into the smart algorithm, they either err on the side of doubt by always putting in a "\ehw" field (thus making the whole algorithm pointless), or never put one in, or, worse, end up with some unpredictable and headachy mix of the two. So ironically, a smart fallback algorithm is often a bad idea. That leads us to the final alternative:
- Make an "ehw" field optional, and when it's absent, use a dumb reversing algorithm.
  By "dumb," I mean a maximum of two rules -- if it's any more complex than that, people will forget how it works and won't know when they should key in an explicit "\ehw" field.
So while we could add more and more things to our reversables() algorithm, it seems wisest to refrain from doing this, to be content with our one s/^(a|an|the|to)\s+//i rule, and instead just add support for an "\ehw" field. We can do that simply by changing the call to reversables(), from this:
foreach my $engl ( reversables( $e{'engl'} ) ) {
  push @{ $english2native{ $engl } }, $e{'hw'};
}
to this:
my @reversed = $e{'ehw'} ? split( m/\s*;\s*/, $e{'ehw'} )
                         : reversables( $e{'engl'} );
foreach my $engl ( @reversed ) {
  push @{ $english2native{ $engl } }, $e{'hw'};
}
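The override-or-fallback logic can be tried in isolation. This is a sketch under a few assumptions: @entries is a hypothetical stand-in for the lexicon, Perl's plain sort is used in place of custom_sort so that nothing beyond core Perl is needed, and print stands in for $rtf->paragraph:

```perl
use strict;
use warnings;
use utf8;                                 # this source text contains accented letters
binmode STDOUT, ':encoding(UTF-8)';

# The article's one-rule "dumb" fallback, repeated so the snippet runs alone.
sub reversables {
    my $in = shift || return();
    my @english;
    foreach my $term ( split /\s*;\s*/, $in ) {
        $term =~ s/^(a|an|the|to)\s+//i;
        push @english, $term;
    }
    return @english;
}

# Hypothetical stand-in entries; only "anáa" carries an explicit ehw field.
my @entries = (
    { hw => 'gu',        engl => 'there' },
    { hw => 'gyaasdáan', engl => 'there' },
    { hw => 'gíi',       engl => 'already; always; often' },
    { hw => 'anáa',      engl => 'inside a house; at home',
                         ehw  => 'home, at; house, inside a' },
);

my %english2native;
foreach my $e (@entries) {
    # An explicit ehw field wins; otherwise fall back to reversables().
    my @reversed = $e->{ehw} ? split( m/\s*;\s*/, $e->{ehw} )
                             : reversables( $e->{engl} );
    push @{ $english2native{$_} }, $e->{hw} for @reversed;
}

# Plain sort here; the real program would use custom_sort.
foreach my $engl ( sort keys %english2native ) {
    print "$engl: ", join( '; ', sort @{ $english2native{$engl} } ), "\n";
}
```

The "anáa" entry is indexed under "home, at" and "house, inside a" rather than under its raw glosses, while the entries without an "ehw" field go through the one-rule fallback.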
With that in place (and with a "\ehw home, at; house, inside a" line in the "anáa" entry just to get the ball rolling), our program runs and spits out an English index after the Haida dictionary:


