Sign In/My Account | View Cart  
advertisement


Listen Print

Optimizing Your Perl
by Robert Spier | Pages: 1, 2

The Importance of Good Interfaces

One of the things that made this fiddling possible was a good design for the Graph module's interface. The interface hid all the details of the internals from me, so that different implementations could be plugged in. (This is actually how I did the testing, I had a regular Graph, a Graph::Cached, and a Graph::Fast. Once I decided that I liked Graph::Fast the best, I renamed it to just Graph.)

I'm glad I designed Graph the way I did. I put simplicity and design first. There is a phrase you might have heard: "Premature optimization is the root of all evil." If I had optimized the Graph module first, then I might have ended up with more complicated code that didn't work in all cases. My initial cut of Graph was slow, but worked consistently. Optimization happened later, after I had a baseline I could return to, to guarantee proper operation.

Example #2: Duplicate Message Filter

You don't always need to try and optimize code to the fullest extent. Optimizing code takes your time and energy, and it isn't always worth it. If you spend two days to shave two seconds off a program, then it's probably not worth it. (Unless of course, that program is running hundreds of times each day.) If a program is "fast enough", then making it faster isn't worth it.

Our second example is a duplicate message filter for e-mail. This filter is useful when you receive cc's of messages also sent to mailing lists you are on, but you don't want to see both.

The general idea behind this filter is simple.


  skip $message if seen $message->id;

It's only the seen function that interests us. It searches some form of cache of id's. How we implement it can have a large effect on the speed of our mailfilter. (For more information on filtering mail with perl, take a look at Mail::Audit.)

The two most obvious methods that come to mind for implementing seen are a simple linear search (O(n)) or the use of a persistent database/hash table to do the lookup. O(1). I'll ignore the cost of maintaining the database, eliminating old Message-Id's, etc.

A linear approach writes out the message id's with some sort of separator. Some programs use null bytes, others use newlines.

The database/hash method stores the message id's in a binary database like a DBM or Berkeley DB file. At the expense of using more disk space, lookups are generally faster.

But - there's one more factor to take into account - overhead. The linear search has little overhead, while the DB file has a large overhead. (Overhead here means time to open the file and initialize the appropriate data structures.)


 TABLE: Comparison
 
  Records | Comment
  -----------------
    100   | linear is 415% faster
    250   | linear is 430% faster
    500   | hash is 350% faster
   1000   | hash is 690% faster

I performed a set of timings with a rough example implementation on my P2/400, and received the above results. The hash lookup speeds remained constant, while the linear lookup dropped off rapidly after 250 items in the cache.

Returning to the big picture, we need to take our message-id cache size into account when determining the method to use. For this kind of cache, there's usually not a reason to keep a huge amount of entries around, only two or three days worth should catch most duplicate messages. (And probably even one day's worth in most cases.)

For this problem, an O(1) solution is possible, but the constant time it takes might be higher than an O(n) solution. Both cases are trivial to implement, but in some situations it will be difficult to implement O(1) - and not worth it either.

Conclusion

Perl is a powerful language, but it doesn't prevent the programmer from making poor decisions and shooting him/herself in the foot. One mistake is to choose the wrong data structure or algorithm for the job. By using a little bit of complexity theory, it's possible to optimize your code and get huge speed gains.

I've only touched the surface of big-O notation and complexity theory. It's a highly researched topic that delves into the deep dark corners of math and computers. But, hopefully there's enough here to give you a basic tool set.

As the second example showed, you don't necessarily want to spend too much time thinking about optimization. Over-optimization isn't always worth it.

For More Information

Google
 http://www.google.com/search?q=big-O+complexity
Introduction to the Theory of Computation
Part Three of Sipser, Michael. "Introduction to the Theory of Computation". PWS Publishing Company. Boston, MA. 1997.

Algorithms and Complexity, Internet Edition
http://www.cis.upenn.edu/~wilf/AlgComp3.html

Special Thanks

Walt Mankowski