Program Repair Shop and Red Flags

Someone recently asked me to take a look at his report-generating program because he wasn’t able to get the final report sorted the way he wanted.

The program turned out to require only a minor change to get the report in order, but it also turned out to be a trove of common mistakes, which is wonderful, because I can use one program to show how to identify and fix all the common mistakes at once!

First I’ll show the program, and then I’ll show how to make it better. Here’s the original. (Note: Lines of code may be broken for display purposes.)

     1  #!/usr/bin/perl
     2  use Getopt::Std;
     3  getopt('dV');
     4  $xferlog="./xferlog";
     5  $\ = "\n";
     6  $i=0;
     7  open XFERLOG, $xferlog or die "Cant't find file $xferlog";
     8
     9  foreach $line (<XFERLOG>) {
    10        chomp($line);
    11        if (( $line =~ /$opt_d/i) && ( $line !~ /_ o r/))
    12           {
    13           ($Fld1,$Fld2,$Fld3,$Fld4,$Fld5,$Fld6,$Fld7,$Fld8,
               $Fld9,$Fld10,$Fld11,$Fld12,$Fld13,$Fld14,$Fld15) = split(' ',$line);

    14            $uplist[$i] = join ' ',$Fld6, $Fld8, $Fld9, $Fld14, $Fld15;
    15            $time[$i]=$Fld6; $size[$i]=$Fld8; $file[$i]=$Fld9;
               $user[$i]=$Fld14; $group[$i]=$Fld15;

    16            $username= join '@', $user[$i], $group[$i];
    17            push @{$table{$username}}, $uplist[$i];
    18            $i++;
    19      }
    20  }
    21  close XFERLOG;
    22
    23  undef %saw;
    24  # @newuser = grep(!$saw{$_}++, @user);
    25  $j=0;
    26  foreach  $username ( sort keys %table )
    27          {
    28          my @mylist = @{$table{$username}};
    29          $m=0;
    30          $totalsize=0;
    31          $totaltime=0;
    32          $gtotal=0;
    33          $x=0;
    34          $x=@mylist;
    35          for ($m = 0 ; $m < ($x); $m++)
    36          {
    37                  ( $seconds, $size, $file, $user, $group) = split(' ', $mylist[$m]);
    38                  $totaltime = ($totaltime + $seconds);
    39                  $totalsize = ($totalsize + $size);
    40          }
    41          if ($totaltime==0) { $totaltime=1; }
    42          if ($totalsize==0) { $totalsize=1; }
    43          $avgtr = (($totalsize/$totaltime)/1024);
    44          $gtotal=($totalsize+$gtotal);
    45          $finale[$j]= join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
    46  #       print $finale[$j];
    47          $j++;
    48  }
    49  @realfinal =  sort @finale;
    50  #print @finale;
    51  $p=0;
    52  $w=0;
    53  $w=@realfinal;
    54  #print $w;
    55  for ($p=($w-1) ; $p>=0; $p--)
    56  {
    57          ($Size, $User, $Files, $Time, $AvgSpeed)= split " ", $realfinal[$p];
    58          $position= ($w-$p);
    59          $percent=(($Size/($gtotal/(1024*1024)))*100);
    60          printf ("$position. $User $Files files ");
    61          printf("%.2fMB", $Size) ;
    62          printf " $Time(s) ";
    63          printf ("%.2f% ", $percent);
    64          printf("%.2fK/s", $AvgSpeed);
    65          print " ";
    66  }
    67

Let’s start at the top, with the argument and file handling.

     1  #!/usr/bin/perl
     2  use Getopt::Std;
     3  getopt('dV');
     4  $xferlog="./xferlog";
     5  $\ = "\n";
     6  $i=0;
     7  open XFERLOG, $xferlog or die "Cant't find file $xferlog";
     8
     9  foreach $line (<XFERLOG>) {
        ...
    20  }
    21  close XFERLOG;

The name of the input file is hardwired on line 4. Getting the filename from the command line is more flexible. We can leave the old filename in place as a default, retaining compatibility with the original version. I’ve also added error handling to the getopt argument parsing.

        getopt('dV') or die "Usage: $0 [-d] [-V] [filename]\n";
        @ARGV = ('./xferlog') unless @ARGV;
        while (<>) {
          ...
        }

getopt removes the options from @ARGV, leaving only the filenames, if any. If there weren’t any, we put the default filename, ./xferlog, into @ARGV as if the user had supplied it themselves. Since the <> operator reads from the files named in @ARGV, it will read from ./xferlog if no file was specified on the command line. Using <> reports open errors for us automatically, and we can omit the close call because it’s already taken care of for us. Line 5 sets $\, the output record separator, so that every print ends with a newline; the rewritten report code prints its newlines explicitly, so we can drop it. Line 6 is superfluous, since $i would be implicitly initialized to 0, but it won’t matter because we’re going to get rid of $i anyway.

I replaced the foreach loop with a while loop. The foreach loaded the entire file into memory at once, then iterated over the list of lines. while reads one line at a time into $_, discarding each line after it has been examined. If the input file is large, this will save a huge amount of memory. If available memory is small, the original program might have run very slowly because of thrashing problems; the new program is unlikely to have the same trouble, and might run many times faster as a result.
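
To make the difference concrete, here is a minimal sketch of the two reading styles, using the program’s default ./xferlog as the input file; the loop bodies are placeholders:

        open my $log, '<', './xferlog' or die "Can't open ./xferlog: $!";

        # Slurping: <$log> in list context reads the whole file into a list
        # before the loop starts, so memory use grows with the file size.
        foreach my $line (<$log>) {
            # ... examine $line ...
        }

        # Streaming: <$log> in scalar context reads one line at a time, and
        # each line is discarded before the next is read.
        seek $log, 0, 0;    # rewind so the second loop sees the same data
        while (my $line = <$log>) {
            # ... examine $line ...
        }
        close $log;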


     9  foreach $line (<XFERLOG>) {
    10        chomp($line);
    11        if (( $line =~ /$opt_d/i) && ( $line !~ /_ o r/))
    12           {
    13           ($Fld1,$Fld2,$Fld3,$Fld4,$Fld5,$Fld6,$Fld7,$Fld8,
               $Fld9,$Fld10,$Fld11,$Fld12,$Fld13,$Fld14,$Fld15) = split(' ',$line);

    14            $uplist[$i] = join ' ',$Fld6, $Fld8, $Fld9, $Fld14, $Fld15;
    15            $time[$i]=$Fld6; $size[$i]=$Fld8; $file[$i]=$Fld9;
               $user[$i]=$Fld14; $group[$i]=$Fld15;

    16            $username= join '@', $user[$i], $group[$i];
    17            push @{$table{$username}}, $uplist[$i];
    18            $i++;
    19      }
    20  }

Here’s my replacement:

        while (<>) {
          chomp;
          if (/$opt_d/oi &&  ! /_ o r/) {
            my @Fld = split;
            my $uplist = {time => $Fld[5],      size => $Fld[7],
                          file => $Fld[8],      user => $Fld[13],
                          group => $Fld[14],
                         };
            my $username = "$Fld[13]\@$Fld[14]";
            push @{$table{$username}}, $uplist;
          }
        }

Because the current line is in $_ now instead of $line, we can use the argumentless versions of chomp and split and the unbound forms of the pattern-match operators, all of which apply to $_ by default. I added the /o option on the first pattern match to tell Perl that $opt_d will not change over the lifetime of the program.
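
For readers less used to these defaults, this sketch just annotates the skeleton of the loop above; it assumes $opt_d has been set by getopt, as in the program:

        while (<>) {                  # each input line lands in $_
            chomp;                    # same as chomp($_)
            if (/$opt_d/oi) {         # a bare match binds to $_; /o tells Perl to
                                      # compile $opt_d into the pattern only once
                my @Fld = split;      # same as split(' ', $_)
            }
        }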

Any time you have a series of variables named $Fld1, $Fld2, etc., it means you made a mistake, because they should have been in an array. I’ve replaced the $Fld1, $Fld2, … family with a single array, @Fld.

$uplist was a problem before. It’s a large string containing several fields. Later on, the program would have had to split this string to get at the various fields; this is a waste of time, because we have the fields already split up right here, and there’s no point in joining them just to split them up again later. Instead of turning the relevant fields into a string, I’ve put them into an anonymous hash, indexed by key, so that the filename is in $uplist->{file} instead of being the third section of a whitespace-separated string.

This way of doing things is not only faster, it’s also more robust. If the input file format changes so that a filename might contain space characters, we only need to change the initial split that parses the input data. The original version of the program would have needed the join changed as well, and also the later split that re-separated the data. Storing the fields in a hash eliminates this problem entirely.
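
To see the robustness point concretely, compare the two ways of getting the filename back out later. This is only an illustration; the field values are made up:

        # Hypothetical field values, for illustration only:
        my ($time, $size, $file, $user, $group) =
            (12, 34567, 'report.txt', 'alice', 'staff');

        # Old approach: join now, split again later, and remember that the
        # filename happens to be the third whitespace-separated field.
        my $uplist_string = join ' ', $time, $size, $file, $user, $group;
        my $filename      = (split ' ', $uplist_string)[2];

        # New approach: store the fields under names and ask for the one you want.
        # If $file ever contains a space, only this version still works.
        my $uplist = { time => $time, size => $size, file => $file,
                       user => $user, group => $group };
        my $also_filename = $uplist->{file};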

I also eliminated the superfluous @time, @size, @file, @user, @group, and @uplist arrays. Their elements were never used again after the loop, and packaging all the relevant data into a single hash obviates any possible use of them anyway. Because all the arrays have gone away, we no longer need the index variable $i. Such a variable, which exists only to allow data to be added to the end of an array, is rarely needed in Perl; it is almost always preferable to use push. The push line itself is essentially the same.


This next section is way too long:

    23  undef %saw;
    24  # @newuser = grep(!$saw{$_}++, @user);
    25  $j=0;
    26  foreach  $username ( sort keys %table )
    27          {
    28          my @mylist = @{$table{$username}};
    29          $m=0;
    30          $totalsize=0;
    31          $totaltime=0;
    32          $gtotal=0;
    33          $x=0;
    34          $x=@mylist;
    35          for ($m = 0 ; $m < ($x); $m++)
    36          {
    37                  ( $seconds, $size, $file, $user, $group) = split(' ', $mylist[$m]);
    38                  $totaltime = ($totaltime + $seconds);
    39                  $totalsize = ($totalsize + $size);
    40          }
    41          if ($totaltime==0) { $totaltime=1; }
    42          if ($totalsize==0) { $totalsize=1; }
    43          $avgtr = (($totalsize/$totaltime)/1024);
    44          $gtotal=($totalsize+$gtotal);
    45          $finale[$j]= join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
    46  #       print $finale[$j];
    47          $j++;
    48  }

26 lines is too much for one block. A 26-line block should be rewritten if possible, and if not, its guts should be scooped out and made into a subroutine.

We can reduce this to about fifteen lines, so I won’t use a subroutine here. A lot of that reduction is simply elimination of unnecessary code. We can scrap lines 23 and 24, which are never used. Line 25 is an unnecessary initialization of $j whose only purpose is to track the length of the @finale array; we can eliminate $j entirely by changing this:

    45          $finale[$j]= join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
    46  #       print $finale[$j];
    47          $j++;

to say this instead:

        push @finale, join ' ', ($totalsize/(1024*1024)), $username, ($x), $totaltime, $avgtr;
        # print $finale[-1];

Now let’s work on that inner loop:

    35          for ($m = 0 ; $m < ($x); $m++)
    36          {
    37                  ( $seconds, $size, $file, $user, $group) = split(' ', $mylist[$m]);
    38                  $totaltime = ($totaltime + $seconds);
    39                  $totalsize = ($totalsize + $size);
    40          }

Any time you have a C-like for loop that loops over the indices of an array, you’re probably making a mistake. Perl has a foreach construction that iterates over an array in a much simpler way:

        for $item (@{$table{$username}}) {
          $totaltime += $item->{time};
          $totalsize += $item->{size};
        }

Here we reap the benefit of the anonymous hash introduced above. To get the time and size we need only extract the hash values time and size. In the original code, we had to do another split.

This allows us to eliminate the superfluous variables $m, $x, and @mylist, so we can remove lines 28, 29, 33, and 34; where the push line needs the number of items, it can take it directly from scalar @{$table{$username}}. We’ve also used the += operator here to save extra mentions of the variable names on the right-hand side of the =.

The modified code now looks like:

        foreach  $username ( sort keys %table ) {
          $totalsize=0;
          $totaltime=0;
          $gtotal=0;
          for $item (@{$table{$username}}) {
            $totaltime += $item->{time};
            $totalsize += $item->{size};
          }
          if ($totaltime==0) { $totaltime=1; }
          if ($totalsize==0) { $totalsize=1; }
          $avgtr = (($totalsize/$totaltime)/1024);
          $gtotal=($totalsize+$gtotal);

          push @finale, join ' ', ($totalsize/(1024*1024)), $username, scalar @{$table{$username}}, $totaltime, $avgtr;
          #     print $finale[-1];
        }

This is already only half as large, but we can make it smaller and cleaner yet. $totalsize and $totaltime are related, so they should go on the same line. $gtotal is incorrectly set to 0 here: it is the grand total size of all the files transferred, so its initialization belongs outside the loop. The check for $totaltime==0 is there to prevent a divide-by-zero error in the computation of $avgtr, but $totalsize is never used as a divisor, so the corresponding check is wasted and should be eliminated:

        my  $gtotal=0;

        foreach  $username ( sort keys %table ) {
          my ($totalsize, $totaltime) = (0, 0);
          for $item (@{$table{$username}}) {
            $totaltime += $item->{time};
            $totalsize += $item->{size};
          }
          if ($totaltime==0) { $totaltime=1; }
          $avgtr = ($totalsize/$totaltime)/1024;
          $gtotal += $totalsize;
          push @finale, join ' ', ($totalsize/(1024*1024)), $username, scalar @{$table{$username}}, $totaltime, $avgtr;
          #     print $finale[-1];
        }

For the computation of $avgtr, we can do better. The line before it special-cases the situation where $totaltime is zero and the average is undefined; it papers over the problem by pretending the transfer took one second, so $avgtr gets an arbitrary and bizarre value. If we sort by $avgtr later, these arbitrary values will be scattered throughout the rest of the data. It’s better to handle this exceptional condition explicitly, by replacing

          if ($totaltime==0) { $totaltime=1; }
          $avgtr = ($totalsize/$totaltime)/1024;

with

          if ($totaltime==0) { $avgtr = '---' }
          else { $avgtr = ($totalsize/$totaltime)/1024 }

Finally, the join here is committing the same error as the one we eliminated before. There’s no point in joining when we’re just going to have to split it again later anyway; the data are separate now so we might as well keep them separate. The solution is similar; instead of joining the five data items into a string, we parcel them into an anonymous hash so that we can extract them by name when we need to. The final version of the loop looks like this:

        foreach  $username ( sort keys %table ) {
          my ($totalsize, $totaltime) = (0, 0);
          for $item (@{$table{$username}}) {
            $totaltime += $item->{time};
            $totalsize += $item->{size};
          }

          if ($totaltime==0) { $avgtr = '---' }
          else { $avgtr = ($totalsize/$totaltime)/1024 }
          $gtotal += $totalsize;

          push @finale, {size => $totalsize/(1024*1024),
                         username => $username,
                         num_items => scalar @{$table{$username}},
                         totaltime => $totaltime,
                         avgtr => $avgtr,
                        };
          #     print $finale[-1];
        }

Sort Order

Line 49 is the one that I was originally asked to change:

    49  @realfinal =  sort @finale;

Now that @finale contains structured records, the change is straightforward:

        @realfinal = sort {$a->{size} <=> $b->{size}} @finale;

If you want to sort the final report by username instead, it’s equally straightforward:

        @realfinal = sort {$a->{username} cmp $b->{username}} @finale;


Printing the Report

Now we’re into the home stretch:

    51  $p=0;
    52  $w=0;
    53  $w=@realfinal;
    54  #print $w;
    55  for ($p=($w-1) ; $p>=0; $p--)
    56  {
    57          ($Size, $User, $Files, $Time, $AvgSpeed)= split " ", $realfinal[$p];
    58          $position= ($w-$p);
    59          $percent=(($Size/($gtotal/(1024*1024)))*100);
    60          printf ("$position. $User $Files files ");
    61          printf("%.2fMB", $Size) ;
    62          printf " $Time(s) ";
    63          printf ("%.2f% ", $percent);
    64          printf("%.2fK/s", $AvgSpeed);
    65          print " ";
    66  }
    67

Here we have another C-style for loop that should be replaced by a simple foreach loop; this allows us to eliminate $p and $w. We can loop over the reversed list if we want, or simply adjust the sort line above so that the items are sorted into the right (reversed) order to begin with, which is probably better.
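
If we take the second approach, the sort from the previous section might be written like this (a sketch, with $b before $a so that the largest totals come first and the printing loop can walk @realfinal from front to back):

        @realfinal = sort { $b->{size} <=> $a->{size} } @finale;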

The percentage here is the only place in the program that we use $gtotal, which needs to be converted to megabytes to match the size fields in @realfinal. We may as well do this conversion up front. Making these changes, and eliminating the split because the @realfinal data is structured, yields:

        $gtotal /= (1024*1024);  # In megabytes
        #print @finale;
        my $position = 1;
        for $user (@realfinal) {
                printf ("$position. $user->{username} $user->{num_items} files ");
                printf("%.2fMB", $user->{size}) ;
                printf " $user->{totaltime}(s) ";
                printf ("%.2f% ", ($user->{size}/$gtotal)*100); # percentage
                printf("%.2fK/s", $user->{avgtr});
                print "\n";
                ++$position;
        }

It’s probably a little cleaner to merge the many printfs into a single printf; another benefit is that it’s easier to see what the format of the output will be:

        $gtotal /= (1024*1024);  # In megabytes
        #print @finale;
        my $position = 1;
        for $user (@realfinal) {
          printf ("%d. %s %s files %.2fMB %.2f%% %.2fK/s\n" ,
            $position, $user->{username}, $user->{num_items}
            $user->{size}, ($user->{size}/$gtotal)*100, $user->{avgtr}
          );
          ++$position;
        }

That’s enough. The new program is 33 lines long, not counting comments, blank lines, and lines that have only a close brace. The original program was 51 lines, so we’ve reduced the length of the program by more than one-third. The original program had 41 scalar variables, 8 arrays, and 2 hashes, for a total of 51 named variables. The new program has 10 scalars, 3 arrays, and 1 hash, for a total of 14; we have eliminated more than two-thirds of the variables. Fifteen of these were the silly $Fld1, $Fld2, … variables, and the other 22 weren’t.


Red Flags

A red flag is a warning sign that something is wrong. When you see a red flag, you should immediately consider whether you have an opportunity to make the code cleaner. I liked this program because it raised many red flags:


Get Rid of Array Size Variables

A variable whose only purpose is to track the number of items in an array is a red flag; it should usually be eliminated. For example:

        while (...) {
          $array[$n] = SOMETHING;
          ++$n;
        }

should be replaced with

        while (...) {
          push @array, SOMETHING;
        }

eliminating $n.

Notice that although the $position variable in the final version of the program looks like it might be an index variable, it actually serves a more important purpose than that: It appears in the final output as a ranking.


Use Compound Data Structures Instead of Variable Families

A series of variables named $f1, $f2, etc., should always be replaced with an array. For example:

    ($Fld1,$Fld2,$Fld3,$Fld4,$Fld5,$Fld6) = split(' ',$line);

should be replaced with

    @Fld = split(' ',$line);

A similar statement can be made about hashes. If you have $user_name, $user_weight, and $user_login_date, consider using one structure called %user with keys name, weight, and login_date.
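
Here is a sketch of that transformation, with made-up values:

        # A family of related scalars...
        my $user_name       = 'alice';
        my $user_weight     = 62;
        my $user_login_date = '2001-04-01';

        # ...becomes a single hash with named fields:
        my %user = (
            name       => 'alice',
            weight     => 62,
            login_date => '2001-04-01',
        );
        print "$user{name} logged in on $user{login_date}\n";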


Use foreach to Loop Over Arrays

C-style for loops should be avoided. In particular,

        for ($i=0; $i < @array; $i++) {
          SOMETHING($array[$i]);
        }

should be replaced with

        foreach $item (@array) {
          SOMETHING($item);
        }
