Optimizing Your Perl
by Robert Spier
|
Pages: 1, 2
The Importance of Good Interfaces
One of the things that made this fiddling possible was a good design for the Graph module's interface. The interface hid all the details of the internals from me, so that different implementations could be plugged in. (This is actually how I did the testing, I had a regular Graph, a Graph::Cached, and a Graph::Fast. Once I decided that I liked Graph::Fast the best, I renamed it to just Graph.)
I'm glad I designed Graph the way I did. I put simplicity and design first. There is a phrase you might have heard: "Premature optimization is the root of all evil." If I had optimized the Graph module first, then I might have ended up with more complicated code that didn't work in all cases. My initial cut of Graph was slow, but worked consistently. Optimization happened later, after I had a baseline I could return to, to guarantee proper operation.
Example #2: Duplicate Message Filter
You don't always need to try and optimize code to the fullest extent. Optimizing code takes your time and energy, and it isn't always worth it. If you spend two days to shave two seconds off a program, then it's probably not worth it. (Unless of course, that program is running hundreds of times each day.) If a program is "fast enough", then making it faster isn't worth it.
Our second example is a duplicate message filter for e-mail. This filter is useful when you receive cc's of messages also sent to mailing lists you are on, but you don't want to see both.
The general idea behind this filter is simple.
skip $message if seen $message->id;
It's only the seen function that interests us. It searches some form of cache of id's. How we implement it can have a large effect on the speed of our mailfilter. (For more information on filtering mail with perl, take a look at Mail::Audit.)
The two most obvious methods that come to mind for implementing seen are a simple linear search (O(n)) or the use of a persistent database/hash table to do the lookup. O(1). I'll ignore the cost of maintaining the database, eliminating old Message-Id's, etc.
A linear approach writes out the message id's with some sort of separator. Some programs use null bytes, others use newlines.
The database/hash method stores the message id's in a binary database like a DBM or Berkeley DB file. At the expense of using more disk space, lookups are generally faster.
But - there's one more factor to take into account - overhead. The linear search has little overhead, while the DB file has a large overhead. (Overhead here means time to open the file and initialize the appropriate data structures.)
TABLE: Comparison
Records | Comment
-----------------
100 | linear is 415% faster
250 | linear is 430% faster
500 | hash is 350% faster
1000 | hash is 690% faster
I performed a set of timings with a rough example implementation on my P2/400, and received the above results. The hash lookup speeds remained constant, while the linear lookup dropped off rapidly after 250 items in the cache.
Returning to the big picture, we need to take our message-id cache size into account when determining the method to use. For this kind of cache, there's usually not a reason to keep a huge amount of entries around, only two or three days worth should catch most duplicate messages. (And probably even one day's worth in most cases.)
For this problem, an O(1) solution is possible, but the constant time
it takes might be higher than an O(n) solution. Both cases are
trivial to implement, but in some situations it will be difficult to
implement O(1) - and not worth it either.
Conclusion
Perl is a powerful language, but it doesn't prevent the programmer from making poor decisions and shooting him/herself in the foot. One mistake is to choose the wrong data structure or algorithm for the job. By using a little bit of complexity theory, it's possible to optimize your code and get huge speed gains.
I've only touched the surface of big-O notation and complexity theory. It's a highly researched topic that delves into the deep dark corners of math and computers. But, hopefully there's enough here to give you a basic tool set.
As the second example showed, you don't necessarily want to spend too much time thinking about optimization. Over-optimization isn't always worth it.
For More Information
- Google
-
http://www.google.com/search?q=big-O+complexity
- Introduction to the Theory of Computation
- Part Three of Sipser, Michael. "Introduction to the Theory of Computation". PWS Publishing Company. Boston, MA. 1997.
- Algorithms and Complexity, Internet Edition
- http://www.cis.upenn.edu/~wilf/AlgComp3.html
Special Thanks
Walt Mankowski

