Felix Crux

Technology & Miscellanea


I recently remembered that I hadn't yet cleaned up and done an “official” public release of my pdfmunge utility. It's a little Python script that I wrote about a month ago to help me deal with PDFs more effectively on my eBook reader. If you're lucky enough to own a big-screened Kindle DX, you can stop reading now. The rest of us have to deal with reflowing the text of PDFs in order to bring them up to a legible size on tiny screens. Of course, most PDFs don't take kindly to being reflowed, and if they contain any kind of technical diagrams, source code, or images, you're pretty much out of luck. That's where pdfmunge comes in.

The script can do two main things: remove extraneous borders from pages, and rotate and slice pages so as to simulate a half-page landscape mode on your eBook reader. It can also do minor things like removing particular pages, or excluding certain pages from the other processing steps. After using it for about a month, I can happily report that it makes my eBook reader a vastly more useful device, fully capable of handling complex technical PDFs with ease. Here's a shot of it displaying half of a processed page with equations from the awesome Elements of Statistical Learning:

[Image: eBook reader showing a page with equations]
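To make the slicing idea a bit more concrete, here is a rough sketch of the geometry involved. It is not pdfmunge's actual code: the `Box` type, the `split_in_half` function, and the `overlap` parameter are all made up for illustration, and the sketch assumes a page region is described as (left, bottom, right, top) in PDF points. Each half would then be rotated by 90 degrees to fill the reader's portrait screen.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A rectangular page region (assumed order: left, bottom, right, top)."""
    left: float
    bottom: float
    right: float
    top: float


def split_in_half(box: Box, overlap: float = 0.0) -> Tuple[Box, Box]:
    """Return the (upper, lower) halves of `box`.

    Each half is extended by `overlap` units past the horizontal midline,
    so a line of text falling on the cut appears in both halves.
    `overlap` is a hypothetical parameter, not a pdfmunge option.
    """
    middle = (box.bottom + box.top) / 2
    upper = Box(box.left, middle - overlap, box.right, box.top)
    lower = Box(box.left, box.bottom, box.right, middle + overlap)
    return upper, lower


if __name__ == "__main__":
    upper, lower = split_in_half(Box(125, 110, 485, 670), overlap=3)
    print("upper half:", upper)
    print("lower half:", lower)
```

Keeping the split as pure arithmetic on rectangles means it can be checked without opening a PDF at all; a PDF library would only be needed to apply the resulting boxes to real pages.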

The script requires Python and the pyPdf package, but is otherwise self-contained. If you are technically inclined, you can get it from the GitHub repository, or you can just download the specific file you need here. I'd be very happy to accept suggestions for improvement, especially if they come with code! See the Readme file for information on how to use the script and on how to contribute enhancements.

Usage Examples

Some examples might help to illustrate how to use it. For a simple example, we can use the PDF version of John Hughes' paper Why Functional Programming Matters, which is available from his website. We want to strip away the large margins, and exclude the last page of references:

./pdfmunge.py --exclude "23" --bounds "125,110,485,670" WhyFP.pdf Test.pdf
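If you are wondering how much of the page a bounds string like the one above keeps, the sketch below works it out. It assumes the four numbers are left, bottom, right, and top coordinates in PDF points (check the Readme for the authoritative order); `parse_bounds` and `cropped_size` are hypothetical helper names, not functions from the script.

```python
def parse_bounds(spec):
    """Parse a "left,bottom,right,top" string into four floats.

    The coordinate order is an assumption made for this sketch.
    """
    parts = [float(p) for p in spec.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated numbers: %r" % spec)
    left, bottom, right, top = parts
    if right <= left or top <= bottom:
        raise ValueError("bounds describe an empty region: %r" % spec)
    return left, bottom, right, top


def cropped_size(spec):
    """Return the (width, height) of the region described by `spec`."""
    left, bottom, right, top = parse_bounds(spec)
    return right - left, top - bottom


if __name__ == "__main__":
    width, height = cropped_size("125,110,485,670")
    print("cropped page is %g x %g points" % (width, height))
```

Under that assumption, the bounds in the command above keep a 360 by 560 point region.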

For a complicated example that really shows off all the features, let's look at the (free) PDF edition of Mark Dominus' Higher-Order Perl (Amazon.com, Amazon.ca). This book needs to have the large margins and printers' guides stripped off, but its margins differ between odd and even pages, so each set of pages needs its own bounds. Let's also get rid of some of the “extraneous” material like the dedication page and index (buy a paper copy if you want that kind of convenience!). To make it more readable, we'll slice each page in half and rotate the halves, creating an imitation of landscape mode on the eBook reader. We'll also want to refrain from processing page 5, which is the title page:

./pdfmunge.py --intact "5" --exclude "1-4,6-8,582-592" \
              --bounds "215,240,557,778" --oddbounds "153,240,495,778" \
              --rotate --margin 3 HigherOrderPerl.pdf Test.pdf
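The page lists passed to --exclude and --intact above mix single pages with ranges. As a minimal sketch of how such a string can be expanded into individual page numbers (the function name `parse_page_ranges` is made up for this example and is not necessarily how the script does it):

```python
def parse_page_ranges(spec):
    """Expand a string like "1-4,6-8,23" into a set of page numbers.

    Ranges are inclusive at both ends; empty entries are ignored.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            pages.update(range(int(first), int(last) + 1))
        else:
            pages.add(int(part))
    return pages


if __name__ == "__main__":
    excluded = parse_page_ranges("1-4,6-8,582-592")
    print(len(excluded), "pages excluded")
    print("page 5 excluded?", 5 in excluded)
```

Using a set makes the "is this page excluded?" check cheap, and page 5 falls in the gap between the first two ranges, which is why it can be listed separately as the page to leave intact.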

I hope this is as useful to you as it was to me!
