Announcing Geo::libpostal

libpostal is a C library for normalizing and parsing international street addresses. It’s built from OpenStreetMap data, supports normalization in over 60 languages and can parse addresses from over 100 countries. It’s blindingly fast, and now you can use it from Perl with Geo::libpostal, a new module I wrote.

Normalizing an address

Let’s say you support an application with a customer sign up process where the customer provides their address. One way to prevent duplicate sign-ups is by allowing only one customer per address. But how do you handle the scenario where the customer types their address slightly differently every time?

One answer is to use libpostal’s normalization capability to expand a single address string into its valid variants. If you already have a customer whose address matches one of the variants, you know you’ve got a duplicate sign-up. Let’s say you have a customer with the address “216 Park Avenue Apt 17D, New York, NY 10022”. Then another customer comes along with the ever-so-similar address “216 Park Ave Apt 17D, New York, NY 10022”. Here’s how you can test for that with Perl:

use Geo::libpostal 'expand_address';

my @original_variants = expand_address("216 Park Avenue Apt 17D, New York, NY 10022");

# @original_variants contains:
#   216 park avenue apartment 17d new york new york 10022
#   216 park avenue apartment 17d new york ny 10022

my @new_variants = expand_address("216 Park Ave Apt 17D, New York, NY 10022");

for my $address (@new_variants) {
  if (grep { $address eq $_ } @original_variants) {
    print "Duplicate address found!\n";
    last; # one matching variant is enough, so stop looking
  }
}
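The nested grep above rescans @original_variants for every new variant. With many stored addresses, a hash lookup may be a better fit. The following is a minimal sketch, not part of Geo::libpostal: shares_variant is a hypothetical helper, and the variant lists are hard-coded so the example runs without the module. The second list is assumed to be what expand_address() returns for the “Park Ave” spelling.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the two variant lists share at least one string.
sub shares_variant {
    my ($known, $candidate) = @_;
    my %seen = map { $_ => 1 } @$known;   # index the known variants once
    for my $variant (@$candidate) {
        return 1 if $seen{$variant};
    }
    return 0;
}

# Copied from the expand_address() output shown earlier
my @original_variants = (
    '216 park avenue apartment 17d new york new york 10022',
    '216 park avenue apartment 17d new york ny 10022',
);

# Assumption: the "Park Ave" spelling expands to the same variants
my @new_variants = @original_variants;

print "Duplicate address found!\n"
    if shares_variant(\@original_variants, \@new_variants);
```

In a real application the known variants would more likely live in a database index than in an in-memory hash, but the idea is the same: store every variant once and look each new variant up directly.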

expand_address() supports a ton of options, including returning results in multiple languages, expanding only certain components of an address, and controlling the format of the expanded addresses.

Parsing an address

libpostal can also parse an address string into its constituent parts, such as house name, number, city and postcode. This can be useful for all sorts of things, from information extraction to simplifying web forms. This is how to parse an address string with Perl:

use Geo::libpostal 'parse_address';

my %address = parse_address("216 Park Avenue Apt 17D, New York, NY 10022");

# %address contains:
#    road         => 'park avenue apt 17d',
#    city         => 'new york',
#    postcode     => '10022',
#    state        => 'ny',
#    house_number => '216'
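As one illustration of the web form use case, a sign-up handler could check which components the parser found before accepting an address. This is only a sketch: missing_components is a hypothetical helper, the list of required components is an arbitrary choice, and the hash is hard-coded from the output above so the example runs without the module.

```perl
use strict;
use warnings;

# Hypothetical helper: names of required components that are absent or empty.
sub missing_components {
    my ($address, @required) = @_;
    return grep { !defined $address->{$_} || $address->{$_} eq '' } @required;
}

# Copied from the parse_address() output shown above
my %address = (
    road         => 'park avenue apt 17d',
    city         => 'new york',
    postcode     => '10022',
    state        => 'ny',
    house_number => '216',
);

# Assumption: this application wants a country as well
my @missing = missing_components(\%address,
    qw(house_number road city postcode country));

print "Missing: @missing\n";   # prints "Missing: country"
```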

A slow starter

To be as fast as possible, libpostal uses setup functions to create lookup tables in memory. These can take several seconds to construct, so under the hood Geo::libpostal lazily calls the setup functions for you. This means that the first call to expand_address or parse_address is a lot slower than usual as the setup functions are running as well:

use Geo::libpostal 'expand_address';

# this is slow: the first call also runs the setup functions
my @addresses = expand_address("216 Park Avenue Apt 17D, New York, NY 10022");

# this is fast!
@addresses = expand_address("76 Ninth Avenue, New York, NY 10111");

Similarly, libpostal has teardown functions which unload the lookup tables. Geo::libpostal has an internal function, _teardown, that is automatically called in an END block, but you can call it directly too. The only effect is that the next call to expand_address or parse_address will be slower, as the setup functions are called again. With the latest version of libpostal it is safe to call setup or teardown multiple times in a process.
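The lazy setup and END block teardown described here is a general Perl pattern. The sketch below shows its shape in pure Perl. It is not Geo::libpostal’s actual code: the lookup table, expand_token and the $setup_calls counter are all made up for illustration.

```perl
use strict;
use warnings;

my $tables;            # stays undef until the first call needs it
my $setup_calls = 0;   # illustrative counter of how often setup has run

# Stand-in for an expensive setup step
sub _setup {
    $setup_calls++;
    $tables = { ave => 'avenue', apt => 'apartment' };
}

# Unload the tables; the next call will have to set them up again
sub _teardown {
    undef $tables;
}

sub expand_token {
    my ($token) = @_;
    _setup() unless $tables;              # lazy: pay the cost only when needed
    return $tables->{ lc $token } // lc $token;
}

# Runs automatically when the program exits
END { _teardown() }

print expand_token('Ave'), "\n";   # first call triggers _setup
print expand_token('Apt'), "\n";   # second call reuses the tables
```

Calling _teardown() by hand in this sketch has the same consequence as in the module: nothing breaks, but the next expand_token call runs the setup step again.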

This article was originally posted on PerlTricks.com.

David Farrell

David is the editor of Perl.com. An organizer of the New York Perl Meetup, he works for ZipRecruiter as a software developer, and sometimes tweets about Perl and Open Source.
