May 2000 Archives

This Week on p5p 2000/05/28



Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to mjd-perl-thisweek-YYYYMM@plover.com where YYYYMM is the current year and month.

This week's report is a little early, because tomorrow I have to leave to go on the Perl Whirl. Next week's report will be late, for the same reason.

Quite a lot of discussion this week, much of it rather pointless. Not one of our better weeks, I'm afraid.

Regex Engine Enhancements

Ben Tilly, Ilya Zakharevich, and François Désarménien had a discussion about an alternative implementation of the regex engine. I'm going to try to summarize the necessary background as briefly as possible.

The regex engine is a state machine. The engine looks through the characters in the target string one at a time and makes a state transition on each one; at the end of the string it looks to see if it is in an `accepting state' and if so, the pattern matches. What regex you get depends on how the states are arranged and what the transitions between them are.

The basic problem that all regex engines face is that a certain state might have two different transitions on the same condition. For example, suppose you are matching /foo(b*)bar/ and you have seen foo already. You are now in a state that expects to see an upcoming b. When you see the b, however, you get two choices: You can go back to the same state and look for another b, or you can go on to the next state and look for an a. If the string is foobbbar then the first choice is correct; if the string is foobar then the second choice is correct.

There are basically two ways to deal with this. One way is to use a representation of the state machine that keeps track of all the states that the machine could be in at once. In the example above, it would note that the machine might be in either of the two possible states. Future transitions might lead to more uncertainty about the state that the machine was actually in. At the end of the string, the engine just checks to see if any of the possible result states are accepting states, and if so, it reports that there is a match. This is called the 'DFA' approach.

The other approach is to take one of the two choices arbitrarily and remember that if the match does not work out, you can backtrack and try the other alternative instead. This is called the 'NFA' approach, and it is what Perl does. It chooses to go back and try the current state first, before it tries going on to the next state; that is what makes Perl's * operator greedy. For the non-greedy *? operator it just chooses the other alternative first.
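
A minimal illustration of the choice order:

        my $s = "foobbbar";
        $s =~ /foo(b*)b/;     # greedy: tries "stay and match another b" first,
                              # so $1 is "bb" after one forced backtrack
        $s =~ /foo(b*?)b/;    # non-greedy: tries "move on" first, so $1 is ""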

Both approaches have upsides and downsides. Downsides of the DFA approach: It is generally slower at run-time. It is more difficult to handle non-greedy matching and backreferences. Downsides of the NFA approach: It is prone to take a very long time for certain regexes, because of its habit of trying many equivalent alternatives one at a time. Also, it is very hard to specify that you want the longest possible match---consider writing a Perl regex that matches whichever of /elsif|\w+/ is longer. This is easy to implement if you use a DFA.
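
To see the alternation behavior concretely:

        "elsify" =~ /elsif|\w+/;
        print $&, "\n";    # prints "elsif": the backtracking engine takes the
                           # first alternative that succeeds; a longest-match
                           # engine would prefer "elsify"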

Ben's idea is that you would start with an NFA and then apply a well-known transformation to turn it into a DFA. The well-known transformation is that each state in the DFA corresponds to a set of states that the original NFA could have been in; Ben's idea is that you can retain the information about the order in which the states would have been visited, and use those ordered sets of NFA states as single DFA states. The problem with this sort of construction is that under some circumstances the number of states explodes exponentially---if the original NFA had n states then the resulting DFA could have up to 2^n states. With Ben's idea this is even worse, because the resulting DFA might have up to (n+1)! states. But as Ben points out, in practice it is usually well-behaved. You would be trading off an unusual run-time bad behavior for a (probably) more unusual compile-time bad behavior. On the other hand, if Ben's scheme went bad at compile time, it would go really bad and eat all your memory. Also, it is really unclear how to handle backreferences with his scheme. Ben waved his hands over this and Ilya did not have any ideas about how to do it.

There was some discussion of various other regex matters. Ben had formerly suggested a (?\/) operator that matches the empty string but which inhibits backtracking; if the engine tries to backtrack past it, it fails. SNOBOL had something like this, called FENCE. Ilya said he did not want to do this because he thought all the useful uses were already covered by (?>...) and because he did not want to implement a feature that was only explainable in terms of backtracking. He also said you could get this behavior by doing

        (?:|(?{die}))

and putting the regex match into an eval block. (Did you get that? It first tries to match the empty string, to the left of the |, and if that doesn't work, it tries to match (?{die}), which throws an exception.)
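
Here is the whole trick in context; the pattern is my own illustration, not one from the thread:

        my $matched = eval {
            "aaac" =~ /a*(?:|(?{ die "fence\n" }))b/;
        };
        # The first pass through the group matches the empty string.  When b
        # fails against the c, the engine backtracks into the group, runs the
        # (?{die}) alternative, and the surrounding eval traps the exception.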

Ben suggested that the regex engine could have a hook in it that would call an alternative regex engine, handing it the string to be matched and the current position, and the subengine would return a value saying how far it had matched up to; this would facilitate trying out alternative implementations or new features.

Ilya spent some time discussing new features he thought might be useful. One such message.

When Ilya mentioned SNOBOL I went into SNOBOL-berserk mode and posted a twelve-page excerpt from the SNOBOL book about SNOBOL pattern-matching optimizations, which was not particularly relevant to the rest of the discussion.

Root of this thread

Perl in the News

The News

Dick Hardt was quoted out of context. This led to a really boring and offtopic advocacy discussion. Several people were asked to take the discussion to the advocacy mailing list. This meant that they cc'ed the discussion to both lists. Brilliant.

Folks, I love the advocacy list because then people who want to have interminable discussions about why Perl is considered slow and bloated have somewhere to do it where I don't have to hear it. Maybe someday I will commit a heinous crime and be sentenced to produce a weekly summary of discussion on the Perl Advocacy mailing list, and I will be unable to plea-bargain my way to a lesser offense that carries the death penalty. Until then, please do me a favor and keep advocacy discussion out of p5p.

Doctor, Doctor, it Hurts When I Do This!

Garrett Goebel:
my $val = shift; 
substr($bitstring,2,4) = pack('p',$val);

Ilya Zakharevich:

Do not.
Hope this helps,
Ilya

eq and UTF8 operator

Clark Cooper asked why a UTF8 string will not compare eq to a non-UTF8 string whose bytes are identical.

Ilya replied that the short answer is that they should not compare eq because they represent different sequences of characters.

He then elaborated and said that the internal representation does not matter; that a string is a sequence of characters, and a character is just an integer in some range. In old Perls the range was 0..255; now the range is 0..(2**64-1), and the details about how these integers are actually represented is not part of the application's business.
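
A small example of what that means in practice (assuming a perl new enough to understand \x{...} characters):

        my $char  = "\x{263A}";        # one character, number 0x263A
        my $bytes = "\xE2\x98\xBA";    # three characters: the UTF-8 bytes
        print $char eq $bytes ? "eq\n" : "ne\n";            # prints "ne"
        print length($char), " ", length($bytes), "\n";     # prints "1 3"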

Caching of the get*by* Functions

Last week Steven Parkes complained that when he used the LWP::UserAgent module, each new agent caused a call to one of the getprotoby* functions, which opened the /etc/protocols file and searched it. Many agents, many searches. He pointed out that there is no way to get LWP::Protocol::http or IO::Socket::INET::_sock_info, the culprit functions, to cache the protocol information.

Ben Tilly suggested adding a caching layer to those functions, or having the standard modules use a cached version. Tom pointed out that the uncached call only takes 1/6000 second on his machine, so it is unlikely to be a real problem in practice, and that the caching is hard to get right in general. Russ pointed out that it is a very bad idea to have application-level caching of gethostby* and getaddrby* calls, because such caching ignores the TTL in the DNS record. Other points raised: Caching of DNS information is already done in the named, and correctly. Any caching of DNS information done at the gethostby* level is guaranteed to be wrong.
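
A caching layer along the lines Ben suggested might look something like this; cached_getprotobyname is a hypothetical helper, not anything in the standard modules:

        my %proto_cache;
        sub cached_getprotobyname {
            my $name = shift;
            $proto_cache{$name} = [ getprotobyname($name) ]
                unless exists $proto_cache{$name};
            return @{ $proto_cache{$name} };
        }

Protocol entries are static, so caching them is safe in a way that caching gethostby* results is not.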

Forbidden Declarations Continued

Last week I complained about the error

        In string, @X::Y must now be written as \@X::Y...

In the ensuing discussion, I suggested that this error message be removed. We have been threatening since 1993 that

Someday it will simply assume that an unbackslashed @ interpolates an array.

Sarathy said that he thought this should have happened two years ago, so I provided a patch. Interpolating an undeclared array into a string is no longer a fatal error; instead, it will raise the warning

        Array @X::Y will be interpolated in string

if you have warnings enabled.
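
So, for example, a script like this (a sketch) now warns instead of dying:

        #!/usr/bin/perl -w
        my $address = "mjd@plover.com";
        # Formerly a fatal error; now it warns "Array @plover will be
        # interpolated in string" and interpolates the (empty) @plover,
        # yielding "mjd.com".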

The patch.

This article has a detailed explanation of the history of this error message.

Previous Discussion

readonly Pragma Continues

Previous Discussion

Mark Summerfield does not like any of the alternatives that were proposed. There are several tie-based solutions, which are too slow. He does not like William Setzer's Const module because it happens at run time, so you have to declare the readonly variable separately from the place you make it constant. (I wonder if

        const my $x = 12;

would work?)

He also complained that even if you mark a scalar as readonly, someone else can go into the glob and replace the entire scalar, with

        *pi = \3;

A couple of people replied that if someone wants to change your read-only value so badly that they are willing to hack the symbol table to do it, then they should be allowed to do so.

Tom posted what I think was a quote from Larry saying that

        my $PI : constant = 4 * atan2(1,1);

was coming up eventually.

Magical Autoincrement

Vadim Konovalov complained that

        $a = 'a';
        $a==5;
        $a++;

increments $a to 1 instead of to b. This is as documented:

The auto-increment operator has a little extra builtin magic to it. If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment.
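
A quick demonstration of both behaviors:

        my $x = 'aa';
        $x++;              # 'ab': the magical string increment
        my $y = 'aa';
        my $n = $y + 0;    # a numeric use of $y; its numeric value is 0
        $y++;              # now 1, not 'ab', just as in Vadim's example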

use strict 'formatting'

Ben Tilly suggested a use strict 'formatting' declaration that would tell Perl to issue a diagnostic whenever the indentation and the braces were inconsistent. However, he did not provide a patch.

Complex Expressions in Formats

H. Merijn Brand points out that complex variable references, such as $e[1]{101} seem to be illegal in formats. However, he did not provide a patch.

pack("U")

Meng Wong pointed out that pack("U"), which is documented as

        [pack] A Unicode character number.  Encodes to UTF-8
        internally.  Works even if C<use utf8> is not in effect.

does not produce a UTF8 string. Simon provided a patch for that. Then a discussion between Ilya, Sarathy, and Gisle Aas ensued about whether this was the right thing to do. Gisle said it was not, and asked what pack("UI") should produce. Sarathy said that "U" packs a character in the UTF-8 encoding, and UTF-8 is encoded with bytes; therefore the result of the pack should be bytes, not another UTF-8 string. Ilya asked then what should happen if someone did

        $a = pack "A20 I", $string, $number;

where $string contained characters with values larger than 255. Sarathy said that "A" is defined to be ASCII, so it should die. Then Ilya pointed out that if it did that there would be no way to take a UTF8 string and insert it into a 20-byte fixed-width field. Discussion went on, and I was not able to discern a clear conclusion.
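
(If I understand Simon's patch correctly, the documented behavior above now actually holds; a sketch:)

        my $smiley = pack "U", 0x263A;    # WHITE SMILING FACE
        print length $smiley;             # 1: one character, however many
                                          # bytes the internal UTF-8 takes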

Various

A large collection of bug reports, bug fixes, non-bug reports, questions, answers, and a small amount of spam. No serious flamage, however.

Until next week I remain, your humble and obedient servant,


Mark-Jason Dominus

This Week on p5p 2000/05/21



Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to mjd-perl-thisweek-YYYYMM@plover.com where YYYYMM is the current year and month.

my $x if 0; Trick

This week's big big thread (81 messages) had the unfortunate title 'Is perlbug broken', because Ben Tilly's original message to perlbug bounced and he resent it with that new subject. The subject should have been 'Improperly persistent lexicals'.

The original question concerned the following example:

        use strict;
        &persistent_lex() foreach 1..5;

        sub persistent_lex {
            my @array = () if 0;
            push @array, "x";
            print @array, "\n";
        }

Here the array behaves like a static variable, and is not created afresh each time persistent_lex is called; it accumulates more and more xes each time. A number of people said that this was a feature, and that its behavior is obvious if you understand the implementation. Several explanations were advanced, but most of them were wrong. I think Nat Torkington got it right; earlier respondents didn't.

The obvious explanation is: my creates a lexical variable, and then at run time, initializes it each time the block is entered; the if 0 suppresses the runtime initialization so that the variable retains its value. To see why this is not a sufficient explanation, consider the following example, which creates two closures:

        use strict;
        my $a =  make_persistent_lex();
        $a->()  foreach 1..5;
        my $b =  make_persistent_lex();
        $b->()  foreach 1..5;

        sub make_persistent_lex {
            return sub {
              my @array if 0;
              push @array, "x";
              print @array, "\n";
            };
        }

Here the variable @array is shared between the two closures. If you omit the if 0; then $a and $b get separate variables.

Nat Torkington did provide an explanation which I think is correct, with Jan Dubois filling in some of the technical details. Jan's description of the guts.

Some people suggested documenting this 'feature'. Tom said that Larry had told him not to do that because he didn't want to commit to supporting it. Larry affirmed this:

Jan Dubois: I think this behavior is a side effect of the implementation and shouldn't be documented as a feature. It should at best be "undefined" behavior that may change in the future (IMHO).

Larry: In Ada parlance, use of this feature would be considered erroneous.

Simon Cozens said to me later that if p5p can't understand the 'feature', it should not be documented as a feature. I think there's some truth in that. Barrie Slaymaker suggested that Perl issue a warning in this case.
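
For what it's worth, the supported way to get a persistent lexical is to enclose it in a scope outside the subroutine, as in this sketch:

        {
            my @array;                   # created once, shared by every call
            sub persistent_lex {
                push @array, "x";
                print @array, "\n";
            }
        }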

The thread also contained a side discussion about whether or not 5.6.0 is stable and reliable.

Zero-padded numbers in formats

John Peacock provided a patch that allows you to specify that numbers displayed in formats can be zero-filled. (Remember formats? You print a formatted report with the write function.) Read about it.

Tom asked why not use sprintf. John replied that the patch to the format code is simple and that the sprintf solution requires a lot more Perl code than just using a zero-filled format picture.
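
For comparison, the sprintf route for a single field is short enough:

        printf "%08.2f\n", 3.14159;    # "00003.14": zero-filled, two decimals

although, as John says, the extra code piles up once you are wiring many such fields into a multi-column report.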

Matt Sergeant asked why formats hadn't been spun off into an external module. Damian Conway apparently has a paper on that. Dan Sugalski replied that they probably need some access to the lexer. I suspect that the real reason is that nobody really cared very much. Dan did add that formats need an overhaul:

Dan Sugalski: Formats could certainly use an overhaul--a footer capability'd be nice, as would the ability to use lexicals. Between that and a rewrite of BigInt and BigFloat for speed would allow you to grab an awful lot of the Cobol Cattle Crowd...

Forbidden Declarations

I pointed out that if you get the error

        In string, @X::Y must now be written as \@X::Y...

you cannot fix it by declaring the array with our as the manual suggests, because there is an arbitrary restriction on our. I supplied a patch to remove this, but I expect it will not go in, because the arbitrary restriction was put there on purpose several months ago.

Tom replied that

        { package X; our @Y; }

will work.

Port to Windows/CE

Jarkko Hietaniemi forwarded an article that Ned Konz had posted on comp.lang.perl.moderated, saying that he was working on a port of Perl to Windows/CE, which I gather is a version of Windows that runs on a palmtop. Ned mentioned a number of potential problems he foresaw. Ned's Article

Simon Cozens referred him to 'microperl', which he said had been created with the intention of porting Perl to small operating systems.

Simon: It does remove a lot of the Unixisms, plus it takes the size of the core down to the bare essentials. I'd start there.

Nat Torkington mentioned in passing that a Windows/CE port would probably go some way toward making a Palm Pilot port possible, and Simon protested that he would probably have one ready `in a couple of weeks'. He says that the hardest problem was getting the lexer to fit into the Palm's small code segments.

chat2.pl is Still There

Peter Scott mentioned chat2.pl in connection with something in the FAQ, and Randal Schwartz, the original author, replied that it should probably never have been included in the first place; he wrote it for a particular purpose back in the mists of time, posted it to comp.lang.perl a couple of times in response to people who seemed to want to do something similar, and Larry included it into the Perl 4 distribution without asking him if it was OK. Randal says that if he had been planning to release it, he would have written it differently, and that that's why it's never been really complete. He also said he was glad it had been dropped from the distribution. Then he discovered that it hadn't been dropped, and seemed rather shocked.

Summary: Don't use chat2.pl.

pod2latex

Tim Jenness suggested replacing the old pod2latex with his new version based on POD::LaTeX. There was no discussion.

Read about it.

Long Regexes

Michael Shields demonstrated a sample email message that made the Mail::Header module take an unusually long time to parse the header. He then supplied a patch, suggested by Ronald Kimball, that made the relevant regex finish more quickly. (Ronald's original suggestion wasn't quite correct; he supplied a second alternative later.)

Mike Guy pointed out that even without the patch the regex completed quickly under 5.6.0, because the regex optimizations are better. However, Ilya says that this particular optimization is buggy: It only works under certain circumstances, and he did not implement a check for those circumstances. But he could not make a test case where the optimization fails. He asked for help doing this.

Ilya's request for a test case.

Ben Tilly and Ilya had a long and extremely interesting discussion of next-character-peeking optimizations that is required reading for anyone who is interested in the guts of the regex engine or who might someday want to be the Regex Engine Pumpking.

The root of this discussion.

There was a sidetrack: John Macdonald suggested that there could be a standard module that would provide efficient, correct regexes for common cases such as matching a string that contains balanced parentheses. Damian Conway pointed out that his Text::Balanced module does precisely this.

UTF8 Hash Keys

M.J.T. Guy asked what the intentions were regarding UTF8 hash keys. Nick Ing-Simmons replied that he recalled that they would be enabled per-hash, and that each hash would either have all UTF8 keys or no UTF8 keys. Andreas pointed out that there had been an extensive discussion of this in the past.

Extensive Discussion

UTF8 String Patches

Simon discovered that single-quoted strings were not having their UTF8-ness handled properly, and provided a patch. Short summary: The SV has a UTF8 flag on it, and if you extract the C string part of the UTF8 SV and make a new SV with the same string, you also have to set the UTF8 flag on the new SV or Perl won't treat it the same. Simon also provided a patch to perlguts that discusses this.

mktables.pl has Been Addressed

James Bence stepped up to answer Larry's request that mktables.pl be refurbished.

Earlier Summary

More Environmental Problems

Jonathan Leffler forwarded a problem that had come up on the DBI mailing list relating to the way Perl uses the environment; if some other library also tries to modify the environment, this messes up Perl and can cause core dumps. Jonathan forwarded an archive of some discussion of this same problem in 1997, and I remember it also came up last November.

Jonathan's message.

Summary of November discussion.

Several solutions of various types were proposed.

readonly Pragma

Mark Summerfield wants to make a pragma called readonly that declares a read-only scalar variable. This is different from what use constant does because the 'constants' generated by use constant have a special syntax that doesn't always work. For example:

        use constant PI => 3;
        $hash{PI} = 'pi';

The key here is PI, not 3. Several alternative suggestions were advanced; the nicest seemed to be William Setzer's Const module, which is a tiny XS module that just sets the READONLY flag on the SV. Unfortunately, it doesn't seem to be on CPAN. Tom did post the entire code, which is only a couple of paragraphs long.
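
(The standard workarounds force the subscript to be evaluated as an expression instead of auto-quoted:)

        use constant PI => 3;
        my %hash;
        $hash{PI()} = 'pi';    # parentheses force the subroutine call: key is 3
        $hash{+PI}  = 'pi';    # a leading + also defeats the auto-quoting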

Const.pm

Tutorials

Ken Rietz offered to write more tutorials or to coordinate the writing of tutorials, and asked for suggestions. See the message if you want to make a suggestion.

Brad Appleton replies to my remarks about Pod::Parser

Two weeks ago I posted a lot of stuff about Pod and the Pod-translating software, and Brad Appleton, the author of Pod::Parser, felt that some of it was unfair, and also pointed out a number of real errors that I made. Brad has been kind enough to provide an article correcting my mistakes.

Brad's reply.

Various

        p5p: B+++(--) F+ Q+++ A+++ F+ S+ J++ L+

Until next week I remain, your humble and obedient servant,


Mark-Jason Dominus

Pod::Parser Notes



Some of my co-workers noticed the p5p weekly summary discussing (among other things) Pod::Parser. They mentioned it to me, and said they thought it cast me in an unfavorable light. So I'd like to clear up a few things that may have been missing or misunderstood from the summary....

I freely admit Pod::Parser has had very little performance optimization attempted. (I've mentioned this before on p5p and pod-people and have asked for help.) I certainly agree with many of the issues of POD format that Mark raised. But I think it is very important to note that most of the slowness of Pod::Parser has less to do with POD format itself, and more to do with the cost of creating a flexible & extensible O-O framework to meet the needs of pod-parsing in general (not just tasks specific to translating POD to another output format).

Most of the overhead in Pod::Parser is from parsing line-by-line (to track line numbers and give better diagnostics) and from providing numerous places in the processing pipeline for callbacks and "hook" methods. Since Pod::Parser uses O-O methods to provide lots of pre/post processing hooks at line and file and command/sequence level granularity, the overhead from method-lookup resolution is quite high. (In fact I'd lay odds that at least a 10X performance speedup could be had by optimizing away the method lookups at run time and precomputing them once at the beginning.)

Regarding the "benchmark" of 185X as a prospective performance target for podselect (which uses Pod::Select which uses Pod::Parser :-), please realize that podselect's purpose in life is to do a whole lot more than what Tom's lean-and-mean POD dumper script does. It so happens that podselect will do this same task if given no arguments. But its real reason for existence is to select specific sections of the PODs to be spit out based on matching criteria specified by the user. This is what Pod::Usage employs in order to format only the usage-msg-related sections of a POD.

When I mentioned podselect in this thread on p5p, I was just pointing out that existing code - which is already designed to have hooks for reuse - can fulfill the same functional task; I didn't intend to claim it was comparable in performance. I don't think that the "185X" figure is reasonable to achieve for Pod::Parser. Not only is there more parsing that Pod::Parser has to do, but most of that overhead comes from enabling better diagnostics and extensibility for purposes above and beyond what Tom's script implements for one very specific and limited purpose.

185X may be a fair benchmark for something whose purpose is limited in scope to that particular thing, but I think not so for something with much broader scope and applicability and use like Pod::Parser. Not that speed improvement still isn't needed - but I think a 50X improvement would be a more reasonable benchmark, and IMHO within the realm of what's reasonably possible.

In another place - I think the summary may have missed a reply I made to p5p about the structure of Pod::Parser "output". It's not just spitting out tags or a linear stream, and it will spit out parse trees if you ask it to.

I agree that there is a need for a module to impose more structure, but the notion that Pod::Parser must somehow be the module that does this is a misconception IMHO. Pod::Parser was deliberately created to be a flexible framework to build on top of, and there is nothing to stop someone from creating a module on top of Pod::Parser to do all the nicer stuff.

But much of that "nicer" stuff will break a lot of existing code if it's added into Pod::Parser - because Pod::Parser is used for more than just translation to another output format. I've recommended many times on p5p and elsewhere that someone create a Pod::Compiler or Pod::Translator to impose this added structure (and the existing Pod::Checker module might be a good start).

Also, the summary suggests that Pod::Parser and Russ' POD modules have been "under development for years." I think maybe Mark meant to write that various POD-related parsing and translating/formatting modules have been under development for years. In particular, I believe Russ only just started on his "pod-lators" modules in the past year.

Pod::Parser development started years earlier, but its "gestation period" was only about 6 months before a useful and working version was available. Since then, I've done bugfixes and enhancements over the last 2-3 years. The main addition of significant functionality was adding the capability for parse-trees (and the development of a test-suite for pod "stuff", which was too long in coming :-). It didn't become part of the core perl distribution until v5.6 because it was necessary to wait until some kind folks (like Russ) took the time to re-write the most common pod2xxx modules to use the same base parsing code provided in Pod::Parser.

Now - I'm not claiming Pod::Parser is perfect - but I felt the summary left out some important points that add more balance to the discussion. Could Pod::Parser be faster? You betcha! Could it be lots faster? Sure. Is it unusable for its most common purpose? Not at all IMHO. Is it unusable for processing large numbers of PODs? Quite likely. But as I said, that's not because of POD, that's because of the need for designed-in flexibility.

At least now there is a common base of POD parsing code to focus our collective optimizing efforts upon instead of lots of parsing engines from disparate pod2xxx modules. Now that it's in the core, maybe it will encourage more people to focus on optimizing the common base parser for POD-stuff (which I've been wanting help with for years :-)

-- 
Brad Appleton <bradapp@enteract.com>  http://www.bradapp.net/
  "And miles to go before I sleep." -- Robert Frost

Perl Meets COBOL


A few weeks ago I went to do a four-day beginning Perl training at a local utility company. It was quite different from other training classes I'd given. Typically, my students have some Unix and C experience. These students, however, had never seen Unix before -- they were mostly COBOL programmers, working in the MVS operating system for IBM mainframe computers.

Fortunately, I have had some experience programming IBM mainframes, so I wasn't completely unprepared for this. When we contract classes, we always make sure that the clients understand that the classes are for experienced programmers. In this case that backfired a little bit -- we were asking the wrong question. Almost all my students were experienced programmers, but not in the way I expected. After several years of teaching Perl classes, I have some idea of what to expect, and the COBOL folks turned all my expectations upside down.

For example, when I teach a programming class to experienced programmers, I take for granted that everyone understands the notion of block structure, in C or Pascal perhaps. People familiar with this idea look at the code and see the blocks automatically. But the COBOL folks had not seen this before and had to learn it from scratch. Several times I was showing an example program, and one of the students would ask what part of the code was controlled by a while clause.

This tended to put me into I-am-talking-to-novices mode, and my automatic reaction was to slow down and explain every little thing, such as what a variable is. But this response was inappropriate and incorrect, and I had to consciously suppress it. The COBOL programmers were not novices; they were, as promised, experienced programmers, and it would have been patronizing (and a waste of time) to explain to them what a variable was; of course they already know about variables. In fact, they often surprised me with the extent of their knowledge. To explain the $| variable, I started to talk about I/O buffering, and suddenly I realized that everyone was bored. "Do you already know about this?" I asked. Yes, everyone was already familiar with buffering. That was a first for me; I've never had a class that knew about buffering already.

I didn't have to explain filehandles; they already knew about filehandles. But they used jargon to talk about them that wasn't the jargon I was familiar with. "Oh, you're establishing addressability on the file," someone said. They seemed pleased at how easy it was in Perl to establish addressability on a file.

That reminded me of a story about the mainframe people seeing Unix for the first time back in the 1970s. They asked how you allocate a file on Unix. Dennis Ritchie explained that you didn't have to allocate a file, you just create it, but they didn't get it. Finally he showed them:

        cat > file

and the mainframe people were astounded. "You mean that's all?" This was a little like that sometimes. They were impressed by things I had always taken for granted, and underwhelmed by things that often impress C programmers.

Some things they picked up on much better than other programmers I have taught. As soon as I explained the pattern /cat$/, someone pointed out that if you read in a record from a file, the record shouldn't match the pattern even if it does appear to end with cat, because the record will have a newline character on the end, and thus will end with cat\n, not cat. I had to explain what $ really does: It matches at the end of the string, or just before the newline if the string ends with a newline. I had never had to explain that before, because I had never met anyone before who had picked up on that so fast. Usually Perl programmers remain blissfully unaware of this problem for years until someone points it out to them. At Perl conferences, when I explain what $ really does, about half the audience is thunderstruck.
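
In Perl terms:

        my $rec = "cat\n";
        print "1\n" if $rec =~ /cat$/;     # matches: $ allows the final newline
        print "2\n" if $rec =~ /cat\z/;    # fails: \z is the true end of string
        chomp $rec;
        print "3\n" if $rec =~ /cat\z/;    # matches now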

I'm not sure why it was that the COBOL folks realized so quickly that there was something fishy about $. But I have an idea: The IBM mainframe file model is totally different from the Unix or NT model. Instead of being a stream of bytes, an IBM file is a sequence of fixed-size records, and there is good OS support for reading a single record. Any program can instantly and efficiently retrieve record #3497 from a file; doing this on a Unix system requires toil. On a mainframe, records aren't terminated with \n or anything else; they're just records, and when you ask the OS for a record, you get it, with no \n, or additional cruft. So to the COBOL programmers the idea of variable-length, \n-terminated records was new and strange, and they must have been constantly aware of that ever-present, irritating \n, the way you're aware of a stone in your boot when you're hiking. They couldn't forget about it the way seasoned Unix programmers do, so they saw how it sticks out of the explanation of $ in a way that a Unix programmer doesn't notice.

After I got back I talked to Clinton Pierce, who has a lot more experience training COBOL programmers than I do. Clinton says that his students also find the file-as-stream notion strange and new:

Reading lines of text (records) and taking time to parse them into fields seems redundant. This usually starts a long discussion about why Perl programs bother to construct formatted text files -- only to have to parse them apart again, instead of simply letting the record-reading library routines parse things for you.

This is an interesting contrast to the Unix point of view, which is that the OS needs to support record-oriented I/O about as much as it needs to support trigonometric functions -- that is, not at all.

The different file models led to some other surprises. One of the exercises in the class asks the students to make a copy of a file in which all the lines that begin with # have an extra # inserted on the front. The sample solution looks like this:

        while (<>) {
          print '#' if substr($_, 0, 1) eq '#';
          print;
        }

The COBOL programmers found this bizarre. One of them asked if a beginner could reasonably be expected to come up with something like that. At first I wasn't sure, and then I thought back and remembered that in the past, many students had come up with that solution. In fact, I had taught some classes in which every student decided to do it that way. Then I thought some more and realized that those were classes full of C programmers, and that it was a very natural way to solve the problem -- if you were already a C programmer.

Clinton points out that on the mainframe, there's an extra step between linking an application and running it: You have to explicitly load it into memory. The edit-compile-run cycle on System/370 is a lot longer than in the Unix world, partly because Unix has such a lightweight process creation model. Clinton says that the one thing mainframe programmers find strangest is the way Perl collapses the edit-compile-link-load-run-debug cycle down to only two steps.

Clinton also says that his students tend to have trouble with inter-process communication, such as backticks or system(). I don't have enough experience to understand the details here, but I can verify that it was hard to get the point of this across to people who didn't have prior experience in the Unix "tools" environment. Opening a pipe to a separate program doesn't make a lot of sense unless you live in a world like Unix where you have a lot of useful little programs lying around that you can open pipes to. To explain piped open, I had to concoct an example about a data gathering program that writes its output directly to the report generator program without using an intermediate temporary file. The students appreciated the benefit of avoiding the temporary file, but the example really misses the point, because it omits the entire Unix "tools" philosophy. But I was taken by surprise and wasn't prepared to evangelize Unix to folks who had never seen a pipe before. However, Clinton does report that his COBOL students take naturally to the idea of modules and of loading some other program's functionality into your own as a library routine.

It was an eye-opening experience, and it uncovered a lot of assumptions I didn't realize I had. I hope I get to do it again sometime.

This Week on p5p 2000/05/14



Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to mjd-perl-thisweek-YYYYMM@plover.com where YYYYMM is the current year and month.

It was tempting to just post `nothing happened this week'. Lots of small patches to fix various 5.6.0 misbehaviors, and reports of new 5.6.0 misbehaviors. An unusual number of people who should have known better posting requests for dumb trivial features. Low traffic. Few big discussions.

Regex Stress Testing

Jarkko Hietaniemi wrote a module called Regex::PreSuf which accepts a list of words and builds a regex that recognizes only those words. Mike Giroux suggested that the large regexes generated by this module could be used to stress-test the regex engine. (Avi Finkel's String::REPartition module might be used similarly.)
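
If I remember the interface correctly (presuf is the exported function), use looks something like this:

        use Regex::PreSuf;
        my $re = presuf(qw(foobar fooxar foozap));
        # $re is a single regex string matching exactly those words, with the
        # common prefix factored out: something like foo(?:[bx]ar|zap)
        print "yes\n" if "fooxar" =~ /^$re$/;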

Ilya said it would be better to add the trie code from Regex::PreSuf to the regex engine. Jarkko replied that he would prefer to eat live rattlesnakes.

Another Thread-Safing Patch

Dan Sugalski sent a replacement for his patch that makes the lock() function thread-safe. It also exposes lock() functionality to XS subroutines, and some other things.

See the patch.

Also in thread news: A user posted a message asking why threads weren't fully supported, and Dan replied at some length. (Summary: Because it's hard.)

Dan's reply.

Marek Rouchal: Bottom line: I'd be very happy about fully working threads in Perl 5.6.1

Dan Sugalski: You and me both. :) I wouldn't hold my breath.

Enormous perldoc discussion winds up

As quickly as it arrived, last week's discussion of documentation issues has ended. The only discussion of note this week was a message from Mark Fisher with a reference to a 1971 paper that compared automatic indexing with manual indexing. Would-be indexers should probably take a look at this.

Mark's message.

Gerald Salton's paper.

Build patches for OS/2

Rocco Caputo made some changes to fix the build process on OS/2.

Read about it.

Regex Engine

In the course of trying to investigate a bug in the regex engine, Hugo van der Sanden critiqued the code style and comments. This led to a brief but interesting discussion about the code there.

Original bug report

Hugo's critique message

Method calls on unblessed references

John Tobey submitted a patch to enable

        $r->method(arg, ...)

when $r is unblessed. If I understand correctly, the method is looked up in a package named HASH or ARRAY or whatever.
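
Under the patch, something like this would presumably work; it is hypothetical, and in stock Perl it dies with "Can't call method ... on unblessed reference":

        sub HASH::keycount { scalar keys %{ $_[0] } }
        my $r = { a => 1, b => 2 };
        print $r->keycount, "\n";    # would print 2, dispatching to HASH::keycount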

Randal said that this had come up some years ago, and the consensus was not to do it, since it would make erroneous method calls harder to catch.

I was not able to find this discussion. John found a 1996 message from Tim Bunce that referred to an even earlier discussion of this same idea. (Tim submitted a patch similar to John's.) If anyone can dig up a pointer to the original discussion, please let me know.

Tim's really old message.

Version Tuples Broken?

Ian Phillipps pointed out two problems with version tuples. But Ilya said that no, the problem was not with the tuples, but with the results produced by the unpack and print functions when extracting the result. Mike Guy then pointed out that

        (256.255.254 . 257.258.259) eq (256.255.254.257.258.259)

is false.

Sarathy said that this is because eq is broken; it'll be fixed in 5.6.1.

Negative Subscripts for Tied Arrays

Michael Schwern complained that if you have a tied array, and you do

        $array[-1]

Perl does not call $o->FETCH(-1), but rather instead it calls $o->FETCHSIZE() to find out how long the array is, say 80 elements, and then invokes $o->FETCH(79) to get the last element. Nick Ing-Simmons, the author of the tied array implementation, said that that was how it was supposed to work, and changing it will break existing code.
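
You can watch the translation happen with a noisy tie; a sketch using the Tie::StdArray convenience class from Tie::Array:

        package NoisyArray;
        use Tie::Array;
        our @ISA = ('Tie::StdArray');    # Tie::StdArray lives in Tie/Array.pm
        sub FETCHSIZE { print "FETCHSIZE\n"; $_[0]->SUPER::FETCHSIZE }
        sub FETCH     { print "FETCH($_[1])\n"; $_[0]->SUPER::FETCH($_[1]) }

        package main;
        tie my @a, 'NoisyArray';
        push @a, 1 .. 80;
        my $last = $a[-1];    # prints FETCHSIZE, then FETCH(79)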

Upcoming corrections

Brad Appleton, the author of the Pod::Parser module suite, felt that some of my comments from the previous report were unfair. In particular, he says that the current Pod::Parser implementation is incomplete, was designed for flexibility rather than speed, and he is sure that large speed gains could be easily had. Brad is preparing a long and informative reply, which I expect to include in next week's report.

Various

A large collection of bug reports, bug fixes, non-bug reports, questions, answers, and a small amount of flamage and spam.

Until next week I remain, your humble and obedient servant,


Mark-Jason Dominus

This Week on p5p 2000/05/07

Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to mjd-perl-thisweek-YYYYMM@plover.com where YYYYMM is the current year and month.

Moderation is Imminent

As I mentioned a few weeks ago, Sarathy suggested a light-handed moderation scheme back in March, in the wake of a number of gigantic flame wars that resulted in the departure of several important people from the list.

The discussion of Damian Conway's overloading module (see below) sparked a return to this topic.

This is very important, so be sure to read Sarathy's actual proposal for the details. The proposal.

Earlier summary

Sarathy said that last time he had mentioned it, he had received suggestions that the following people be referees:

  • Nathan Torkington
  • Kurt Starsinic
  • Chip Salzenberg
  • Mark-Jason Dominus
  • Simon Cozens
  • Damian Conway
  • Russ Allbery

He then asked these people if they wanted to be referees. As far as I know, only Simon and I have accepted so far.

There was some discussion of the technical mechanics of the refereeing, and it looks like it is going to happen.

Simon's Guide to p5p

In the course of this, Simon posted a document he had written about how to use p5p.

Simon's guide to p5p.

Big Discussion of perldoc and Indexing

Forty-eight percent of this week's 350 messages were related to perldoc in one way or another.

The discussion started when Johan Vromans suggested that perldoc be extended to do something reasonable with perldoc -f foreach, analogous to the way perldoc -f print is presently treated. Sarathy agreed with this; Tom Christiansen objected strenuously. I think that the point of Tom's objection was that it is very easy to just grep the entire manual for the term you want to find, that even on non-unix crippleware platforms, grep can be implemented as a one-line Perl program, and that people should be encouraged to understand their own power to search the manuals, rather than being encouraged to depend on a canned solution like perldoc. I think there's something to be said for this, but Tom did not seem to find much agreement among the other p5pers.

Root of the thread

Several interesting topics developed from this. First, Tom announced that he had been writing a new manual page, perlrtfm, which would explain how the manual was organized and how to use it effectively.

Draft version of perlrtfm.

Ilya said that the mapping from keywords to manual sections should not be hardwired into perldoc, but rather should be in an index file somewhere. Nick Ing-Simmons agreed, and I've believed this for a long time. grep is very nice, and works well much of the time, but as almost any user of a web search engine can tell you, sometimes the document you're looking for doesn't happen to contain the keyword it should. I went looking for an example of this and found one right away: If you want to find the remainder after division, and grep for remainder, you do not locate the section in perlop about the modulus operator. A well-constructed index would fix this.

Tom pointed out that to construct and maintain an index would be a tremendous amount of labor.

Ilya talked about the IBM-Book format documents for Perl on OS/2, which has a command that does a full-text search on the documentation and yields the best match.

Matthias Neeracher said that 'shuck', the Macintosh documentation browser, made use of an index, and would work better if the index data were better to begin with.

The indexing project needs a champion and an army of volunteers. If you're interested in applying for either position, drop me a note and I'll try to match up interested parties.

X<>

Pod has always documented an X<> tag 'for indexing'. But the documents never said how it worked or what the format of the contents should be, and it was never really used. In 5.6, it appears exactly twice in the entire documentation set, and the documentation for X<> itself says only:

         X<index>        An index entry

Tom said that if an index were constructed, it should use the X<> tag to mark up the pods with index entries. I pointed out that the big downside of that is that if there are many X<> tags, they render the pod text itself less readable. But I don't think anyone advanced a better suggestion. I sent some mail about possible semantics for X<>, based on my (limited) indexing experience. (Among other things, I am writing a book about Perl in a Pod-like markup language, and I am using X<> to indicate an index entry.)

Indexing notes.

More indexing notes.

Tom also pointed out that

        =for index

could be used to indicate that the following chunk of text was a sequence of index entries.

splitpod

An entry in the software-nobody-knows-about category: In the Perl pod directory is a program called splitpod which will break a single pod file into multiple files. For example,

        splitpod perlfunc.pod

creates many pod files named abs.pod, accept.pod, ..., y.pod. You can then pod2man these separately or whatever.

Aliases

Ilya suggested that Pod support an alias directive to define a new escape sequence that would be equivalent to some other escape sequence. For example:

        =for macro B <= Y

makes Y<some text> synonymous with B<some text>. That way you could use Y to indicate bold sans-serif text (for example) but the standard translators such as pod2text would still know to render it the same as B<...>.

perldoc Wishlist

Ben Tilly posted a wishlist for Perl's documentation, including that the output of perldoc should include the name of the file that it had found the documentation in. This would help remind people that the documentation is actually in a file somewhere and is not available only from the perldoc magic genie.

Ben's other wishes

Tom's Plan

While people were posting all sorts of ideas for enhancements to perldoc, Tom Christiansen posted his own ideas about how to solve some of the deeper underlying problems of Perl's documentation:

Tom: First, dramatically reorganize the documentation, more or less along the lines that Sarathy has alluded to: reference, tutorials, internals.

Second, throw the whole perldoc code out and start from scratch with a real design spec that does not psychotically deviate from its purpose.

Third, provide real tools for unrelated things, like identifying where Perl finds its standard module path. Modules need tools.

He also showed a demonstration of some simpler tools that might replace the large, bloated perldoc.

Here's the demo

Later demo

Tom also pointed out later that unlike perldoc, these tools don't have to parse Pod except in a very simple and rudimentary way. Brad Appleton put in that the Pod::Select module would do the same sort of parsing, and Tom replied that it was entirely useless for disgorging documentation, because it is between fifty and a hundred times slower than the naive approach, and nobody wants to use a documentation program that takes ten seconds to cough up the documentation.

Brad says that Pod::Parser could be made faster, but I wonder if it can really be made fast enough. On one of my tests, Tom's little cheap script outperformed Pod::Select by a factor of 185. Brad also asked for help on this; people interested in speeding up Pod::Parser should contact him.

Brad also asked for help in extending the test suite for these modules.

Brad's call for assistance.

Editorial Opinion Section

Discussions like this last one make me think there's something seriously wrong with Pod. It was designed as a simple, readable format. But if it takes that long to parse it fully, then the plan has failed, because the parsing should not be so difficult.

The new batch of Pod translating modules that Brad Appleton and Russ Allbery have been working on have been under development for years. That shouldn't have happened. And I don't think it's the fault of Brad and Russ; I think it's that Pod itself is badly designed, and turns out to be much harder to handle than it looks at first. I've tried on several occasions to write Pod and Pod-like translators, and that's what I've found.

Pod is very nice in some ways, but it has severe problems. The goal was to have a documentation system that was easy for people to read, easy for people to write, and easy for programs to handle and translate. It wins on the first two; Pod is much easier to read or write than anything comparable. But for algorithmic processing, it seems that there are two options: You can run quick and dirty and ignore most of the details, or you can get everything right at the expense of using a surprisingly large amount of code and running extremely slowly.

Ben: perldoc needs an overhaul.

Tom: It shall be completely obliterated, replaced by a façade.

Mark Fisher's man Replacement

Mark Fisher announced that he had written a man-like tool for Perl. Unfortunately, it is not available yet.

Mark's announcement

Randal Schwartz suggested that whatever becomes of perldoc, that perl -man invoke the documentation system. He points out that many corporate IT folks get the perl binary installed correctly, and omit all the support programs like perldoc.

Pod::Parser Output Model

Ton Hospel asked if Pod::Parser was doing all it needed to. At present, it just parses up the pod at the lowest level and gives you back a list of tags. Then it is up to the translator program to decide what to do with them. Ton asks if perhaps Pod::Parser should be more involved with policy. For example, consider this:

        =begin comment
     
        foo
        =head1
        =end comment

Everything between the =begin comment and =end comment directives is ignored. Now consider this:

        =begin comment
     
        foo
        =cut
        sub foo { ... }
        =head1
        =end comment

Does the comment section continue all the way to the =end comment directive, or does it stop at the =cut directive?

Ton suggests that Pod::Parser might generate a more abstract data tree representation of the document structure, and that the translators would work from that, instead of from the current lower-level representation, which is essentially a token stream. That way different translators would not make different policy decisions about these sorts of issues.

Larry said he had thought for some time that there should be a canonical Pod-to-XML translator, and said that people kept agreeing to write one eventually.

roffitall

A recent change to Pod::Man broke the roffitall program, which is another entry in the software-nobody-knows-about category. roffitall lives in the pod/ directory in the source tree, and when you run it, it takes all the pods and turns them into one monster 1,400-page postscript file with all the documentation in the world, including documentation for the standard modules, and a table of contents that it generates. Did you know that? I didn't.

That is the end of the report on the gigantic multithread about perldoc and related matters.

Patches to perlre

Tom Christiansen submitted a major update to perlre.

The patch is here.

mktables.PL Needs Work

unicode/lib/mktables.PL is the program that generates the code-number-to-name mapping tables for unicode, so that you can say \N{BLACK SMILING FACE} and get the black smiling face character; it also generates the property lists so that you can say \p{IsUpper} to indicate any uppercase unicode character. Larry identified a number of problems with this program that need to be addressed before 5.6.1.

Nobody replied, so if you're interested in helping, here's a chance to be a hero.

Larry: Anyway, anybody have any tuits this week? I don't, and this really needs to get straightened out soon. Besides it's Perl hacking, not C hacking, and that's supposed to be fun.

Read about it.

Jarkko is Still Trying to Give Away the Configure Pumpkin

Would you like to become an Important Person? Here's your opportunity.

Jarkko provides some details about what is required.

Pushing into Hashes

Brett Denner suggested that

        push %hash, key => value, key2 => value2;

be allowed.

Response was generally negative. Some points people brought up: There is already an easy way to do that:

        @hash{key1, key2} = (value1, value2)

It creates an unfulfilled expectation that pop, shift, and unshift will also work on hashes. It creates an unfulfilled expectation that

        %hash = (apple => 'red');
        push %hash, apple => 'green';

will not overwrite 'red', as in the array case. Whatever meaning you want it to have is easily provided by a subroutine:

        sub hashpush(\%@) {
          my $hash = shift;
          while (@_) {
            my $key = shift;
            $hash->{$key} = shift;
          }
        }

This has come up before, and Yitzchak Scott-Thoennes reminded us that the outcome then was that

        %hash = (%hash, key => value, ...);

should be optimized. Any volunteers?

Damian's Assignment Overloader

Damian Conway has a module that is supposed to provide a simple interface to overloading assignment semantics, and in particular to provide typed variables by preventing assignment of anything other than a certain kind of object to a particular variable. For example:

        use Class::ifiedVars;
        classify $var => 'ClassName';

Now if you try to assign a value to $var that is not an object of type ClassName or one of its subclasses, you get a run-time error.

There are a lot of other features also. Damian plans to change the name to something less abnormal.

Details here.

Ilya asked how this was different from the PREPARE method of

        my Dog $snoopy;

Damian replied that it was orthogonal to it. For example:

        my Dog $snoopy;
        $snoopy = Alligator->new();
        $snoopy->pat();      # Invokes Alligator::pat
        $snoopy->{age}++;    # Might update the wrong field

Someone asked where PREPARE was documented. Ilya replied that due to a bug in 5.6.0, it was unimplemented.

Previous discussion of PREPARE.

Damian and Ilya got into an extended exchange about whether or not the module was a good idea. This resulted in Peter Scott asking when the refereeing would be put in place.

Various

A large collection of bug reports, bug fixes, non-bug reports, questions, answers, and a small amount of flamage and spam.

Until next week I remain, your humble and obedient servant,


Mark-Jason Dominus

Program Repair Shop and Red Flags


Someone recently asked me to take a look at his report-generating program because he wasn't able to get the final report sorted the way he wanted.

The program turned out to require only a minor change to get the report in order, but it also turned out to be a trove of common mistakes--which is wonderful, because I can use one program to show how to identify and fix all the common mistakes at once!

First I'll show the program, and then I'll show how to make it better. Here's the original. (Note: Lines of code may be broken for display purposes.)

     1  #!/usr/bin/perl
     2  use Getopt::Std;
     3  getopt('dV');
     4  $xferlog="./xferlog";
     5  $\ = "\n";
     6  $i=0;
     7  open XFERLOG, $xferlog or die "Cant't find file $xferlog";
     8  
     9  foreach $line (<XFERLOG>) {
    10        chomp($line);
    11        if (( $line =~ /$opt_d/i) && ( $line !~ /_ o r/)) 
    12           {
    13           ($Fld1,$Fld2,$Fld3,$Fld4,$Fld5,$Fld6,$Fld7,$Fld8,
               $Fld9,$Fld10,$Fld11,$Fld12,$Fld13,$Fld14,$Fld15) = split(' ',$line);
 
    14            $uplist[$i] = join ' ',$Fld6, $Fld8, $Fld9, $Fld14, $Fld15;
    15            $time[$i]=$Fld6; $size[$i]=$Fld8; $file[$i]=$Fld9; 
               $user[$i]=$Fld14; $group[$i]=$Fld15;
      
    16            $username= join '@', $user[$i], $group[$i];
    17            push @{$table{$username}}, $uplist[$i];
    18            $i++;     
    19      }
    20  }
    21  close XFERLOG;
    22  
    23  undef %saw;
    24  # @newuser = grep(!$saw{$_}++, @user);
    25  $j=0;
    26  foreach  $username ( sort keys %table )
    27          {
    28          my @mylist = @{$table{$username}};
    29          $m=0;
    30          $totalsize=0;
    31          $totaltime=0;
    32          $gtotal=0;
    33          $x=0;
    34          $x=@mylist;
    35          for ($m = 0 ; $m < ($x); $m++)
    36          {
    37                  ( $seconds, $size, $file, $user, $group) = split(' ', $mylist[$m]);
    38                  $totaltime = ($totaltime + $seconds);
    39                  $totalsize = ($totalsize + $size);
    40          }
    41          if ($totaltime==0) { $totaltime=1; }
    42          if ($totalsize==0) { $totalsize=1; }
    43          $avgtr = (($totalsize/$totaltime)/1024);
    44          $gtotal=($totalsize+$gtotal);
    45          $finale[$j]= join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
    46  #       print $finale[$j];
    47          $j++;
    48  }
    49  @realfinal =  sort @finale;
    50  #print @finale;
    51  $p=0;
    52  $w=0;
    53  $w=@realfinal;
    54  #print $w;
    55  for ($p=($w-1) ; $p>=0; $p--)
    56  {
    57          ($Size, $User, $Files, $Time, $AvgSpeed)= split " ", $realfinal[$p];
    58          $position= ($w-$p);
    59          $percent=(($Size/($gtotal/(1024*1024)))*100);
    60          printf ("$position. $User $Files files ");
    61          printf("%.2fMB", $Size) ;
    62          printf " $Time(s) ";
    63          printf ("%.2f% ", $percent);
    64          printf("%.2fK/s", $AvgSpeed);
    65          print " ";
    66  }
    67          

Let's start at the top, with the argument and file handling.

     1  #!/usr/bin/perl
     2  use Getopt::Std;
     3  getopt('dV');
     4  $xferlog="./xferlog";
     5  $\ = "\n";
     6  $i=0;
     7  open XFERLOG, $xferlog or die "Cant't find file $xferlog";
     8  
     9  foreach $line (<XFERLOG>) {
        ...
    20  }
    21  close XFERLOG;

The name of the input file is hardwired on line 4. Getting the filename from the command line is more flexible. We can leave the old filename in place as a default, retaining compatibility with the original version. I've also added error handling to the getopt argument parsing.

 
       getopt('dV') or die "Usage: $0 [-d] [-V] [filename]\n";
        @ARGV = ('./xferlog') unless @ARGV;
        while (<>) {
          ...
        }

getopt removes the options from @ARGV, leaving only the filenames, if any. If there weren't any, we put the default filename, ./xferlog, into @ARGV as if the user had supplied it. Since the <> operator reads from the files named in @ARGV, it will read from ./xferlog if no file was specified on the command line. Using <> handles open errors for us automatically, and we can omit the close call because it's already taken care of.

Line 5 sets $\, the output record separator, so that every print ends its output with a newline; I've dropped it, and the final report loop will print its newlines explicitly instead, which is easier to see. Line 6 is superfluous, since $i would have been implicitly initialized to 0 anyway, but it doesn't matter, because we're going to get rid of $i entirely.

I replaced the foreach loop with a while loop. The foreach loaded the entire file into memory at once, then iterated over the list of lines. while reads one line at a time into $_, discarding each line after it has been examined. If the input file is large, this will save a huge amount of memory. If available memory is small, the original program might have run very slowly because of thrashing problems; the new program is unlikely to have the same trouble, and might run many times faster as a result.


     9  foreach $line (<XFERLOG>) {
    10        chomp($line);
    11        if (( $line =~ /$opt_d/i) && ( $line !~ /_ o r/)) 
    12           {
    13           ($Fld1,$Fld2,$Fld3,$Fld4,$Fld5,$Fld6,$Fld7,$Fld8,
               $Fld9,$Fld10,$Fld11,$Fld12,$Fld13,$Fld14,$Fld15) = split(' ',$line);
      
    14            $uplist[$i] = join ' ',$Fld6, $Fld8, $Fld9, $Fld14, $Fld15;
    15            $time[$i]=$Fld6; $size[$i]=$Fld8; $file[$i]=$Fld9; 
               $user[$i]=$Fld14; $group[$i]=$Fld15;
      
    16            $username= join '@', $user[$i], $group[$i];
    17            push @{$table{$username}}, $uplist[$i];
    18            $i++;     
    19      }
    20  }

Here's my replacement:

        while (<>) {
          chomp;
          if (/$opt_d/oi &&  ! /_ o r/) {
            my @Fld = split;
            my $uplist = {time => $Fld[5],      size => $Fld[7],
                          file => $Fld[8],      user => $Fld[13],
                          group => $Fld[14],
                         };
            my $username = "$Fld[13]\@$Fld[14]";
            push @{$table{$username}}, $uplist;
          }
        }

Because the current line is in $_ now instead of $line, we can use the argumentless versions of chomp and split and the unbound versions of the pattern-match operators, all of which apply to $_ by default. I added the /o option to the first pattern match to tell Perl that $opt_d will not change over the lifetime of the program, so the regex need only be compiled once.

Any time you have a series of variables named $Fld1, $Fld2, etc., it means you made a mistake, because they should have been in an array. I've replaced the $Fld1, $Fld2, ... family with a single array, @Fld.

The old @uplist was a problem. Each element was a large string containing several fields. Later on, the program had to split these strings to get at the various fields; that is a waste of time, because the fields were already split up right here, and there's no point in joining them just to split them up again later. Instead of turning the relevant fields into a string, I've put them into an anonymous hash, indexed by key, so that the filename is in $uplist->{file} instead of being the third section of a whitespace-separated string.

This way of doing things is not only faster, it's more robust. If the input file format changes so that a filename might contain space characters, we need only change the initial split that parses the input data itself. The original version of the program would have needed the join changed as well, along with the later split that re-separated the data. Storing the fields in a hash eliminates this problem entirely.

I also eliminated the superfluous @time, @size, @file, @user, @group, and @uplist arrays. They were never really used: line 16 read back $user[$i] and $group[$i], but it could just as well have used $Fld14 and $Fld15 directly. Packaging all the relevant data into a single hash obviates any possible use of these arrays anyway. Because all the arrays have gone away, we no longer need the index variable $i. Such a variable, which exists only to allow data to be added to the end of an array, is rarely needed in Perl; it is almost always preferable to use push. The push line itself is essentially the same.


This next section is way too long:

    23  undef %saw;
    24  # @newuser = grep(!$saw{$_}++, @user);
    25  $j=0;
    26  foreach  $username ( sort keys %table )
    27          {
    28          my @mylist = @{$table{$username}};
    29          $m=0;
    30          $totalsize=0;
    31          $totaltime=0;
    32          $gtotal=0;
    33          $x=0;
    34          $x=@mylist;
    35          for ($m = 0 ; $m < ($x); $m++)
    36          {
    37                  ( $seconds, $size, $file, $user, $group) = split(' ', $mylist[$m]);
    38                  $totaltime = ($totaltime + $seconds);
    39                  $totalsize = ($totalsize + $size);
    40          }
    41          if ($totaltime==0) { $totaltime=1; }
    42          if ($totalsize==0) { $totalsize=1; }
    43          $avgtr = (($totalsize/$totaltime)/1024);
    44          $gtotal=($totalsize+$gtotal);
    45          $finale[$j]= join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
    46  #       print $finale[$j];
    47          $j++;
    48  }

26 lines is too much for one block. A 26-line block should be rewritten if possible, and if not, its guts should be scooped out and made into a subroutine.

We can reduce this to about fifteen lines, so I won't use a subroutine here. A lot of that reduction is simply elimination of unnecessary code. We can scrap lines 23 and 24, which are never used. Line 25 is an unnecessary initialization of $j whose only purpose is to track the length of the @finale array; we can eliminate $j entirely by changing this:

    45          $finale[$j]= join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
    46  #       print $finale[$j];
    47          $j++;

to say this instead:

        push @finale, join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
        # print $finale[-1];

Now let's work on that inner loop:

    35          for ($m = 0 ; $m < ($x); $m++)
    36          {
    37                  ( $seconds, $size, $file, $user, $group) = split(' ', $mylist[$m]);
    38                  $totaltime = ($totaltime + $seconds);
    39                  $totalsize = ($totalsize + $size);
    40          }

Any time you have a C-like for loop that loops over the indices of an array, you're probably making a mistake. Perl has a foreach construction that iterates over an array in a much simpler way:

        for $item (@{$table{$username}}) {
          $totaltime += $item->{time};
          $totalsize += $item->{size};
        }

Here we reap the benefit of the anonymous hash introduced above. To get the time and size we need only extract the hash values time and size. In the original code, we had to do another split.

This allows us to eliminate the superfluous variables $m, $x and @mylist, so we can remove lines 28, 29, 33, and 34. We've also used the += operator here to save extra mentions of the variable names on the right-hand side of the =.

The modified code now looks like:

        foreach  $username ( sort keys %table ) {
          $totalsize=0;
          $totaltime=0;
          $gtotal=0;
          for $item (@{$table{$username}}) {
            $totaltime += $item->{time};
            $totalsize += $item->{size};
          }
          if ($totaltime==0) { $totaltime=1; }
          if ($totalsize==0) { $totalsize=1; }
          $avgtr = (($totalsize/$totaltime)/1024);
          $gtotal=($totalsize+$gtotal);

          push @finale, join ' ', ($totalsize/(1024*1024)), $username, scalar @{$table{$username}}, $totaltime, $avgtr;
          #     print $finale[-1];
        }

This is already only half as large. But we can make it smaller and cleaner yet. $totalsize and $totaltime are related, so they should go on the same line. $gtotal is incorrectly set to 0 here. It is a grand total size of all the files downloaded, and any initialization of it should be outside the loop. The check for $totaltime==0 is to prevent a divide-by-zero error in the computation of $avgtr, but no such computation is performed for $totalsize, so the corresponding check is wasted and should be eliminated:

        my  $gtotal=0;

        foreach  $username ( sort keys %table ) {
          my ($totalsize, $totaltime) = (0, 0);
          for $item (@{$table{$username}}) {
            $totaltime += $item->{time};
            $totalsize += $item->{size};
          }
          if ($totaltime==0) { $totaltime=1; }
          $avgtr = ($totalsize/$totaltime)/1024;
          $gtotal += $totalsize;
          push @finale, join ' ', ($totalsize/(1024*1024)), $username, scalar @{$table{$username}}, $totaltime, $avgtr;
          #     print $finale[-1];
        }

For the computation of $avgtr, we can do better. The line before it has a special case for when $totaltime is zero, because then the average is undefined. But in that case $avgtr ends up set to an arbitrary and bizarre value, $totalsize/1024; if we sort by $avgtr later, these arbitrary values will appear scattered throughout the rest of the data. It's better to handle the exceptional condition explicitly, by replacing

          if ($totaltime==0) { $totaltime=1; }
          $avgtr = ($totalsize/$totaltime)/1024;

with

          if ($totaltime==0) { $avgtr = '---' }
          else { $avgtr = ($totalsize/$totaltime)/1024 }

Finally, the join here is committing the same error as the one we eliminated before. There's no point in joining when we're just going to have to split it again later anyway; the data are separate now so we might as well keep them separate. The solution is similar; instead of joining the five data items into a string, we parcel them into an anonymous hash so that we can extract them by name when we need to. The final version of the loop looks like this:

        foreach  $username ( sort keys %table ) {
          my ($totalsize, $totaltime) = (0, 0);
          for $item (@{$table{$username}}) {
            $totaltime += $item->{time};
            $totalsize += $item->{size};
          }

          if ($totaltime==0) { $avgtr = '---' } 
          else { $avgtr = ($totalsize/$totaltime)/1024 }
          $gtotal += $totalsize;

          push @finale, {size => $totalsize/(1024*1024),
                         username => $username, 
                         num_items => scalar @{$table{$username}}, 
                         totaltime => $totaltime,
                         avgtr => $avgtr,
                        };
          #     print $finale[-1];
        }


Sort Order

Line 49 is the one that I was originally asked to change:

    49  @realfinal =  sort @finale;

Now that @finale contains structured records, the change is straightforward:

        @realfinal = sort {$a->{size} <=> $b->{size}} @finale;

If you want to sort the final report by username instead, it's equally straightforward:

        @realfinal = sort {$a->{username} cmp $b->{username}} @finale;


Printing the Report

Now we're into the home stretch:

    51  $p=0;
    52  $w=0;
    53  $w=@realfinal;
    54  #print $w;
    55  for ($p=($w-1) ; $p>=0; $p--)
    56  {
    57          ($Size, $User, $Files, $Time, $AvgSpeed)= split " ", $realfinal[$p];
    58          $position= ($w-$p);
    59          $percent=(($Size/($gtotal/(1024*1024)))*100);
    60          printf ("$position. $User $Files files ");
    61          printf("%.2fMB", $Size) ;
    62          printf " $Time(s) ";
    63          printf ("%.2f% ", $percent);
    64          printf("%.2fK/s", $AvgSpeed);
    65          print " ";
    66  }
    67          

Here we have another C-style for loop that should be replaced by a simple foreach loop; this allows us to eliminate $p and $w. We can loop over the reversed list if we want, or simply adjust the sort line above so that the items are sorted into the right (reversed) order to begin with, which is probably better.
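
Swapping $a and $b in the comparison sorts the list largest-first, so the report loop below can simply run forward:

        @realfinal = sort {$b->{size} <=> $a->{size}} @finale;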

The percentage here is the only place in the program that we use $gtotal, which needs to be converted to megabytes to match the size fields in @realfinal. We may as well do this conversion up front. Making these changes, and eliminating the split because the @realfinal data is structured, yields:

        $gtotal /= (1024*1024);  # In megabytes
        #print @finale;
        my $position = 1;
        for $user (@realfinal) {
                printf ("$position. $user->{username} $user->{num_items} files ");
                printf("%.2fMB", $user->{size}) ;
                printf " $user->{totaltime}(s) ";
                printf ("%.2f%% ", ($user->{size}/$gtotal)*100); # percentage
                printf("%.2fK/s", $user->{avgtr});
                print "\n";
                ++$position;
        }

It's probably a little cleaner to merge the many printfs into a single printf; another upside is that it's easier to see what the format of the output will be:

        $gtotal /= (1024*1024);  # In megabytes
        #print @finale;
        my $position = 1;
        for $user (@realfinal) {
          printf("%d. %s %s files %.2fMB %s(s) %.2f%% %.2fK/s\n",
            $position, $user->{username}, $user->{num_items},
            $user->{size}, $user->{totaltime},
            ($user->{size}/$gtotal)*100, $user->{avgtr},
          );
          ++$position;
        }

That's enough. The new program is 33 lines long, not counting comments, blank lines, and lines that have only a close brace. The original program was 51 lines, so we've reduced the length of the program by more than one-third. The original program had 41 scalar variables, 8 arrays, and 2 hashes, for a total of 51 named variables. The new program has 10 scalars, 3 arrays, and 1 hash, for a total of 14; we have eliminated more than two-thirds of the variables. 15 of the eliminated variables were the silly $Fld1, $Fld2, ... family, and another 22 weren't.


Red Flags

A red flag is a warning sign that something is wrong. When you see a red flag, you should immediately consider whether you have an opportunity to make the code cleaner. I liked this program because it raised many red flags:


Get Rid of Array Size Variables

A variable whose only purpose is to track the number of items in an array is a red flag; it should usually be eliminated. For example:

        while (...) {
          $array[$n] = SOMETHING;
          ++$n;
        }

should be replaced with

        while (...) {
          push @array, SOMETHING;
        }

eliminating $n.

Notice that although the $position variable in the final version of the program looks like it might be an index variable, it actually serves a more important purpose than that: It appears in the final output as a ranking.


Use Compound Data Structures Instead of Variable Families

A series of variables named $f1, $f2, etc., should always be replaced with an array. For example:

    ($Fld1,$Fld2,$Fld3,$Fld4,$Fld5,$Fld6) = split(' ',$line);

should be replaced with

    @Fld = split(' ',$line);

A similar statement can be made about hashes. If you have $user_name, $user_weight, and $user_login_date, consider using one structure called %user with keys name, weight, and login_date.
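
For example (the values here are invented for illustration):

        my %user = (
          name       => 'fred',
          weight     => 185,
          login_date => '2000-05-28',
        );
        print $user{weight};   # instead of $user_weight

Besides cutting down on variable names, this lets you pass the whole record around as a single unit.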


Use foreach to Loop Over Arrays

C-style for loops should be avoided. In particular,

        for ($i=0; $i < @array; $i++) {
          SOMETHING($array[$i]);
        }

should be replaced with

        foreach $item (@array) {
          SOMETHING($item);
        }