PDF Processing with Perl
by Detlef Groth
CombinePDFs Package
The application itself mainly performs error checking. If everything is fine, it calls the CombinePDFs::createPDF subroutine, passing the array of input files, the array of page ranges, and the bookmarks information. The bookmarks scalar is optional.
Page ranges can be comma-separated ranges (1-11,14,17-23), single pages, or the all token. You can include the same page several times in the same document.
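One way to expand such a page specification into a flat list of page numbers is sketched below. The helper name expand_page_ranges and its $total argument (the page count of the input file) are assumptions made for illustration; this is not the code used in CombinePDFs.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a page specification such as "1-11,14,17-23"
# or "all" into a list of page numbers. $total is the number of pages in
# the input file. Repeated pages are kept, so a page can appear twice.
sub expand_page_ranges {
    my ($spec, $total) = @_;
    return (1 .. $total) if $spec eq 'all';

    my @pages;
    for my $part (split /,/, $spec) {
        if ($part =~ /^(\d+)-(\d+)$/) {
            my ($from, $to) = ($1, $2);
            die "Descending range '$part'\n" if $from > $to;
            push @pages, $from .. $to;
        }
        elsif ($part =~ /^\d+$/) {
            push @pages, $part;
        }
        else {
            die "Invalid page range '$part'\n";
        }
    }

    # Reject pages that do not exist in the input file.
    for my $page (@pages) {
        die "Page $page is outside 1..$total\n"
            if $page < 1 || $page > $total;
    }
    return @pages;
}
```

Called as expand_page_ranges('1-4,7,9-10', 10), the sketch returns the pages 1, 2, 3, 4, 7, 9, and 10.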
The file-checking code verifies read permission and uses the CombinePDFs::isPDF($filename) subroutine to test whether the file is a PDF document. The PDF package, by Antonio Rosella, provides a similar method, but it was not developed under the use strict pragma and emits a lot of warnings. It is also not actively maintained, so a fix in the near future seems unlikely. Implementing the isPDF subroutine is quite simple: it reads the first line of the file and checks it for the magic string %PDF-1.[0-9].
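A header check of this kind might look like the following sketch. It is written from the description above rather than copied from the CombinePDFs source, and the name is_pdf is an assumption.

```perl
use strict;
use warnings;

# Sketch of a PDF header check: read the first line of the file and look
# for the magic string %PDF-1.x at its start. Returns 1 or 0.
sub is_pdf {
    my ($filename) = @_;

    # An unreadable or missing file is simply reported as "not a PDF".
    open my $fh, '<', $filename or return 0;
    binmode $fh;
    my $first_line = <$fh>;
    close $fh;

    return 0 unless defined $first_line;    # empty file
    return $first_line =~ /^%PDF-1\.[0-9]/ ? 1 : 0;
}
```

The check only looks at the header, so it says nothing about whether the rest of the file is a valid PDF document.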
Please note that PDF::Reuse is not an object oriented package. Therefore the CombinePDFs package is not object oriented, either. A user of this package could create several instances, but all instances work on the same PDF file.
Submitting complex data structures via the command line is a difficult issue, so I decided that bookmarks should come from a text file. This file has a simple markup to reflect a tree structure, where each line resembles:
<level> "bookmarks text" <page>
The level starts with 0 for root bookmarks. Children of the root bookmarks have a level of 1, their children a level of 2, and so on. Currently, the system supports bookmarks up to three levels of nesting:
0 "Folder File 1 - Page 1" 1
1 "File 1 - Page 2" 2
1 "Subfolder File 1 - Page 3" 3
2 "File 1 - Page 4" 4
0 "Folder File 2 - Page 7 " 7
1 "File 2 - Page 7" 7
1 "File 2 - Page 9" 9
The parsing subroutine for the bookmarks file CombinePDFs::addBookmarks($filename) should be easy to understand, though that's not necessarily true of the complex data structure created inside this subroutine.
Bookmarks are an array of hashes. addBookmarks uses three attributes: text is the title of the entry in the bookmarks panel; act is the action to trigger when someone clicks the entry, here the page number to open; and kids contains a reference to the children of the bookmark entry. While looping over the file content, the code keeps the most recent entry of each level in a variable and pushes each new child onto the last entry of the level above. The root bookmarks are collected in an array, their children are attached as a reference to an array, and likewise for the grandchildren. The result is a nested data structure that stores all children in the kids attribute of the parent's bookmark hash: an array of hashes containing other arrays of hashes, and so on.
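The bookkeeping described here can be sketched as follows. This is not the addBookmarks source: the name parse_bookmarks is hypothetical, it takes lines rather than a file name, and it stores the bare page number under act as a placeholder for whatever action the real package builds.

```perl
use strict;
use warnings;

# Hypothetical parser for lines of the form:  <level> "text" <page>
# Returns a reference to an array of root bookmarks; children hang off
# each parent's kids key.
sub parse_bookmarks {
    my (@lines) = @_;
    my @roots;    # level-0 bookmarks
    my @last;     # $last[$level] is the most recent entry at that level

    for my $line (@lines) {
        # Skip blank or malformed lines.
        next unless $line =~ /^\s*(\d+)\s+"(.*)"\s+(\d+)\s*$/;
        my ($level, $text, $page) = ($1, $2, $3);
        my $entry = { text => $text, act => $page };

        if ($level == 0) {
            push @roots, $entry;
        }
        else {
            my $parent = $last[$level - 1]
                or die "No parent for level $level entry '$text'\n";
            push @{ $parent->{kids} }, $entry;
        }

        $last[$level] = $entry;
        $#last = $level;    # forget deeper entries from earlier branches
    }
    return \@roots;
}
```

Fed the seven sample lines shown earlier, the sketch returns two root entries; the first has two children, and the second of those has one child of its own.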
At the end of CombinePDFs::addBookmarks($filename), the collected array of hashes is added to the document with prBookmark($reference). All of this means that you can supply a bookmarks file along with the PDF files, using a command line like:
$ perl bin/app-combine-pdfs.pl \
--infile out/file-1.pdf --pages 1-6 \
--infile out/file-2.pdf --pages 1-4,7,9-10 \
--bookmarks out/bookmarks.cnt \
--outfile file-all.pdf --overwrite
Currently, you must open the document's navigation panel manually, because PDF::Reuse does not yet allow you to declare a default view, whether full screen or panel view. This is easy to fix, and the author, Lars Lundberg, has promised to do so in an upcoming release of PDF::Reuse. To enable this feature until that release appears, I included a modified version of PDF::Reuse in the examples zip file that accompanies this article.
Furthermore, the bookmarks use JavaScript functions. To make the bookmarks usable in PDF viewers other than Acrobat Reader, my patched PDF::Reuse package replaces JavaScript bookmarks with bookmarks that comply with the PDF specification. To do that, replace the act key with a page key using the appropriate page number and scroll options:
$bookmarks = { text => 'Document',
               page => '0,40,50;',
               kids => [ { text => 'Chapter 1',
                           page => '1, 40, 600'
                         },
                         { text => 'Chapter 2',
                           page => '10, 40, 600'
                         }
                       ]
             };
Then print the bookmarks to the PDF document as usual with prBookmark($bookmarks);.
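If a bookmark tree already exists in the act form, the replacement could be automated along the lines of the sketch below. The helper act_to_page is hypothetical and assumes that act holds a bare page number; the offsets 40 and 600 are copied from the example above, and whether the page numbers need any adjustment is left open.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a bookmark entry in which the act
# key (assumed to be a bare page number) is replaced by a page key of the
# form "page, x, y". Children under kids are converted recursively.
sub act_to_page {
    my ($entry, $x, $y) = @_;
    my %copy = (text => $entry->{text});

    $copy{page} = join ', ', $entry->{act}, $x, $y
        if defined $entry->{act};
    $copy{kids} = [ map { act_to_page($_, $x, $y) } @{ $entry->{kids} } ]
        if $entry->{kids};

    return \%copy;
}
```

The original structure is left untouched, so the JavaScript-style and page-style trees can be kept side by side while testing in different viewers.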
Tk Application to Combine PDF Documents
Console applications are fine for experienced users, but you can't expect all users to belong to this category, so it can be worthwhile to write a GUI for combining PDF documents. The Perl/Tk toolkit, founded on the old Tix widgets for Tcl/Tk, is not very modern (although this might change with Tcl/Tk release 8.5 and the Tile widgets), but it is very portable. That's why I used it for the GUI example. Because the CombinePDFs package puts a layer between the PDF::Reuse package and the command-line application, it was easy to reuse those parts in the Tk application app-combine-tk-pdfs.pl.
With the Tk application, the user visually selects PDF files, orders the files in a Tk::Tree widget, and changes the page ranges and the bookmarks text in Tk::Entry fields. Furthermore, the application can store the resulting tree structure in a session file and restore it later. It's also possible to copy and paste entries inside the tree, which makes it easy to create a bookmarks panel for single files without using bookmark files. The Tk application can be found in the download at the end of this article.
Besides the final PDF file, the application creates a file with the same basename and the .cnt extension. This file contains the bookmarks for the PDF. It is also useful for continuing to process the combined PDF file without reassembling all the source files again. The menu entry for this feature is File->Load Bookmarks-File.
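The naming convention is easy to reproduce with File::Basename. The sketch below is an illustration only; the function name bookmarks_file_for is an assumption and does not come from the application.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: derive the bookmarks file name that sits next to a
# PDF file by swapping the .pdf extension for .cnt.
sub bookmarks_file_for {
    my ($pdf) = @_;

    # fileparse returns the base name without the matched suffix, and the
    # directory part including its trailing separator.
    my ($name, $dir) = fileparse($pdf, qr/\.pdf$/i);
    return $dir . $name . '.cnt';
}
```

For out/file-all.pdf the sketch yields out/file-all.cnt.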
When loading a bookmarks file, the same extension convention is in place.
Other PDF Packages on CPAN
I like PDF::Reuse, but there are several other options for PDF creation and manipulation on the CPAN.
- PDF::API2, by Alfred Reibenschuh, is actively maintained. It is the package of choice for creating new PDF documents from scratch.
- PDF::API2::Simple, by Red Tree Systems, is a wrapper over the PDF::API2 module for users who find the PDF::API2 module too difficult to use.
- Text::PDF, by Martin Hosken, can work on more than one PDF file at the same time and has TrueType font support.
- CAM::PDF, by Clotho Advanced Media, is, like PDF::Reuse, more focused on reading and manipulating existing PDF documents. However, it can work on multiple files at the same time. Use it if you need more features than PDF::Reuse provides.
Conclusions
PDF::Reuse is a well-written and well-documented package that makes it easy to create, combine, and change existing PDF documents. The two sample applications show some of its capabilities. Two limitations should be mentioned, however: PDF::Reuse can't reuse existing bookmarks, and after combining different PDF documents, some of the hyperlinks inside a document might stop working properly. The example source code for the applications, packages, and the modified PDF::Reuse is available.



