I recently remembered that I hadn't yet cleaned up and done an official
public release of my pdfmunge utility. It's a little Python script that I wrote about a month ago to help me deal with PDFs more effectively on my eBook reader. If you're lucky enough to own a big-screened Kindle DX, you can stop reading now. The rest of us have to deal with reflowing the text of PDFs in order to bring them up to a legible size on tiny screens. Of course, most PDFs don't take kindly to being reflowed, and if they contain any kind of technical diagrams, source code, or images, you're pretty much out of luck. That's where pdfmunge comes in.
The script can do two main things: remove extraneous borders from pages, and rotate and slice them in such a way as to simulate a half-page landscape mode on your eBook reader. It can also do minor things like removing particular pages, or excluding certain pages from the other processing steps. After using it for about a month, I can happily report that it basically transforms my eBook reader into a vastly more useful device, fully capable of handling complex technical PDFs with ease. Here's a shot of it displaying a half a processed page with equations from the awesome Elements of Statistical Learning:

The script requires Python and the pyPdf package, but is otherwise self-contained. If you are technically inclined, you can get it from the GitHub repository, or you can just download the specific file you need here. I'd be very happy to accept suggestions for improvement, especially if they come with code! See the Readme file for information on how to use the script and on how to contribute enhancements.
Usage Examples
Some examples might help to illustrate how to use it. For a simple example, we can use the PDF version of John Hughes' paper Why Functional Programming Matters, which is available from his website. We want to strip away the large margins, and exclude the last page of references:
./pdfmunge.py --exclude "23" --bounds "125,110,485,670" WhyFP.pdf Test.pdf
For a complicated example that really shows off all the features, let's look at the (free) PDF edition of Mark Dominus' Higher-Order Perl (Amazon.com, Amazon.ca). This book needs to have the large margins and printers' guides stripped off, but uses margins of alternating size on alternating pages. Let's also get rid of some of the extraneous
material like the dedication page and index (buy a paper copy if you want that kind of convenience!). To make it more readable, we'll slice each page in half and rotate them, creating an imitation of landscape mode on the eBook reader. We'll also want to refrain from processing page 5, which is the title page:
./pdfmunge.py --intact "5" --exclude "1-4,6-8,582-592" \
--bounds "215,240,557,778" --oddbounds "153,240,495,778" \
--rotate --margin 3 HigherOrderPerl.pdf Test.pdf
I hope this is as useful to you as it was to me!
Doug Crockford gave a talk at the University of Waterloo last night as part
of the Yahoo!
Hack-U (University Hack Day), on the same topic as his new book, JavaScript: The Good Parts. I was lucky enough to get a seat, and have tried to
condense my nine pages of notes into an overview of the highlights of his talk. Though I've
rewritten parts, all of the content and wisdom is Doug's -- I'm just the
humble scribe. Of course, any errors of transcription are mine.
Introduction
Believe it or not, JavaScript has good parts! It's an odd language, because
it contains some of the best and some of the worst ideas in programming
language design, and has managed to become both the most popular and most
reviled programming language out there.
Of all languages, JavaScript probably has the broadest range of skills
among its users. It appeals to both computer scientists and cut'n'paste
beginners with no clear idea of what they are doing. It's pretty much the only
language that people use without ever learning! That's both the cause of a lot
of the awful code out there, and an astonishing testament to the fact that
it's actually possible to do that.
JavaScript is not what makes in-browser programming awful -- it's the DOM.
Any language would be painful if you had to use it to interact with the DOM.
It's also what makes things slow and inefficient.
The history of the language is incredibly diverse: it's been influenced by
Scheme (lambdas, loose typing), Self (prototypal inheritance, dynamic
objects), Java (syntax), and Perl (regular expressions).
The Bad Parts
Global variables: This is bad for all the same reasons as
in other languages, but in addition, JavaScript will implicitly declare your
typo-ed identifiers, and silently carry on.
+ both adds and concatenates: You can get away with this
in Java because of the type restrictions, but in JavaScript you'll blithely
try to add a number to, say, a form input which looks like a number but is
actually a string.
Semicolon insertion: This seems like a nice,
beginner-friendly feature, at first. It's implemented by the parser running
along until it hits a syntax error, at which point it rewinds a little,
inserts a semicolon in a likely place, and tries again. This should scare
you.
typeof: What's the typeof an
object? Object. Of an array? Object. Of null? Object.
with and eval: Security
implications are bad. eval is probably the single worst misused
feature. If you find that you want to use it, step away from the
computer
.
Fake arrays: They're actually hash tables, which is OK if
that's what you want, and if that's what you call them.
== and != do type coercion. Unfortunately nobody can
figure out exactly when or how. For example, 0 == '0' is true,
but false == 'false' is false, and '' == '0' is
false, but 0 == '' is true. Thankfully, you can just always use
===.
false, null, undefined, NaN: These are all almost the
same, but not quite.
Bad Genetics
There is also a good deal of bad behaviour that's inherited and shared with
other languages: block-less statements (e.g. one-line if and so
on); expression statements (lone expressions on a line that will be evaluated
and discarded); IEEE floating point (0.1 + 0.2 != 0.3); ++ and -- (leads
developers into clever
behaviour); and fall-through of
switch blocks.
Doug had an amusing anecdote about an episode in the development of JSLint:
A user contacted him suggesting that fall-through of cases be flagged as bad
behaviour. Dough replied with an explanation of the elegance of nicely
structured, intentional fall-through, which convinced the user to retract his
feature request. In the user's response, in addition to withdrawing the
feature request, he reported another bug. When Doug investigated it, it turned
out to be... yup, unwanted switch fall-through. At that moment, Doug says,
I was enlightened
.
Good Parts
JavaScript was the first really mainstream language with lambda and
first-class functions, which other languages are now adopting. This makes JS
an influential language!
Dynamic objects are simple containers that can grow or shrink, and since
they are based on prototypal inheritance, they aren't limited to just being
instances of a class. This is a strictly more powerful object model, but it
takes some getting used to for most people.
Loose typing is one of the controversial parts, which some people would
consider one of the bad things. However, Doug's conclusion is that the added
expressiveness and ease of use is well worth it, since the kind of bugs
avoided by strict typing are usually easy to fix anyway.
Gotchas
Globals
Consider the code:
var names = ['zero', 'one, 'two', ...]
var digit_name = function (n) {
return names[n];
}
Though it works, it makes use of the nasty, global, names, which
could lead to all kinds of nonsense. We could move the variable into the scope
of the function instead, but that would be rather inefficient. Instead, try
this:
var digit_name = (function() {
var names = ['zero', 'one', 'two', ...];
return function (n) {
return names[n];
};
})();
This is an example of one of the good parts: closures. We can define the
variable just once, and then have our function close over it, preserving
state. The trailing () at the very end cause the anonymous function to be
executed right away, binding the returned function to our variable. This is
awesome.
Style Isn't Subjective
Brace positioning is more or less a holy war without any right
answer -- except in JavaScript, where same-line braces are
right and you should always use them. Here's why:
return
{
ok: false;
};
return {
ok: true;
};
What's the difference between these two snippets? Well, in the first one,
you silently get something completely different than what you wanted. The lone
return gets mangled by the semicolon insertion process (remember that from the
list of Bad Parts?), becomes return; and returns nothing. The rest
of the code becomes a plain old block statement, with ok: becoming
a label (of all things)! Having a label there might make
sense in C, where you can goto, but in JavaScript, it makes no
sense in this context. And what happens to false? It becomes one of
those expression statements mentioned in the Bad Parts: it gets evaluated and
completely ignored. Finally, the trailing semicolon -- what about that? Do we
at least get a syntax error there? Nope: empty statement, like in C.
Use same-line braces, folks.
JSLint
JSLint defines a professional subset of JavaScript, and imposes programming
discipline. You should do everything it tells you, even if it hurts your
feelings. Doug says JSLint is smarter about JavaScript than I am, and
probably smarter than you are too
.
History and Future
AJAX and the resurgence in popularity of JavaScript could have happened way
earlier, but Netscape 4 and the other browsers of the time we so
awful. Netscape 4 was a crime against humanity
. IE 6 was the
best browser in the world -- and think of just how bad it is.
However, all that may have been good for JavaScript: had anything happened,
it would have been thrown out and replaced with something much better!
JavaScript would have died with Netscape if not for Microsoft diligently
duplicating it, bugs and all.
Perhaps the very best part of JavaScript: stability! No new design errors
since 1999! Also, no new versions.
Thankfully, ECMAScript Fifth Edition is in the works (and is actually
readable), with nice features like support for object hardening and a strict
mode (invoked with "use strict";, which is an expression
statement under older versions).
Unfortunately, we're still waiting on implementations. Microsoft will
likely have the first working version, but they won't ship it until whenever
IE 9 comes out. Mozilla seem to just be waiting to see what Microsoft does,
and they'll react to that. Apple can't comment
on future products.
Google will just do whatever Apple does (UPDATE: Now that I am an intern at
Google, I should probably clarify that that's Doug Crockford's statement; I
know nothing about any Google plans on the matter).
The Really Good Parts
If you use JavaScript, you have a potential audience of billions. It's the
most widespread -- and despite the bugs, the most cross-platform -- system you
can use.
It is possible to write really good code. In fact, it is mandatory
if you want to maintain sanity.
If you avoid the bad parts, it works really well. It's not just usable and
pleasant; there is brilliance in it.
Misc. Q&A
At the Q&A afterwards, there were a few interesting gems:
- The people in charge of the language (ECMA), and the people in charge of
the DOM (the W3C), have never had a joint session or meeting, but he's
trying to change that.
- He thinks the DOM is awful, and HTML 5 is taking it in
exactly the
wrong direction
.
The Book
I'm going to wrap this up the same way he did: with a plug for his new
book, JavaScript: The Good Parts (Amazon.com, Amazon.ca).
If you do any JavaScript development, get a copy! It contains all of the above
wisdom, and much more.
Now excuse me; I'm off to do some JavaScript.