I recently remembered that I hadn't yet cleaned up and done an official
public release of my pdfmunge utility. It's a little Python script that I wrote about a month ago to help me deal with PDFs more effectively on my eBook reader. If you're lucky enough to own a big-screened Kindle DX, you can stop reading now. The rest of us have to deal with reflowing the text of PDFs in order to bring them up to a legible size on tiny screens. Of course, most PDFs don't take kindly to being reflowed, and if they contain any kind of technical diagrams, source code, or images, you're pretty much out of luck. That's where pdfmunge comes in.
The script can do two main things: remove extraneous borders from pages, and rotate and slice them in such a way as to simulate a half-page landscape mode on your eBook reader. It can also do minor things like removing particular pages, or excluding certain pages from the other processing steps. After using it for about a month, I can happily report that it basically transforms my eBook reader into a vastly more useful device, fully capable of handling complex technical PDFs with ease. Here's a shot of it displaying a half a processed page with equations from the awesome Elements of Statistical Learning:

The script requires Python and the pyPdf package, but is otherwise self-contained. If you are technically inclined, you can get it from the GitHub repository, or you can just download the specific file you need here. I'd be very happy to accept suggestions for improvement, especially if they come with code! See the Readme file for information on how to use the script and on how to contribute enhancements.
Usage Examples
Some examples might help to illustrate how to use it. For a simple example, we can use the PDF version of John Hughes' paper Why Functional Programming Matters, which is available from his website. We want to strip away the large margins, and exclude the last page of references:
./pdfmunge.py --exclude "23" --bounds "125,110,485,670" WhyFP.pdf Test.pdf
For a complicated example that really shows off all the features, let's look at the (free) PDF edition of Mark Dominus' Higher-Order Perl (Amazon.com, Amazon.ca). This book needs to have the large margins and printers' guides stripped off, but uses margins of alternating size on alternating pages. Let's also get rid of some of the extraneous
material like the dedication page and index (buy a paper copy if you want that kind of convenience!). To make it more readable, we'll slice each page in half and rotate them, creating an imitation of landscape mode on the eBook reader. We'll also want to refrain from processing page 5, which is the title page:
./pdfmunge.py --intact "5" --exclude "1-4,6-8,582-592" \
--bounds "215,240,557,778" --oddbounds "153,240,495,778" \
--rotate --margin 3 HigherOrderPerl.pdf Test.pdf
I hope this is as useful to you as it was to me!
Hector is a Literate Haskell implementation of Weizenbaum's ELIZA program.
ELIZA is more or less the original chatterbot
. Implemented by Weizenbaum in the mid-1960s to demonstrate how seemingly intelligent
behaviour could be mimicked by very simple pattern-matching, it shocked even him with how much ordinary people were taken in by the illusion. It is probably best known in conjunction with its DOCTOR script, which allows it to act like a Rogerian psychotherapist (a version of this can be invoked in the Emacs text editor with M-x doctor
). One user was reportedly so affected by the program that he requested that Weizenbaum give him some privacy to discuss his issues with the doctor!
The program works by simple pattern matching. The user's input is checked against a list of pre-defined patterns, and if there's a match, a corresponding output is returned. Eliza also has some facilities for extracting content from the input and using it in the reply (e.g. I need a vacation
might result in How would you feel if you got a vacation?
). The final bit of finesse is a rudimentary memory, so that it can bring up past topics again if it doesn't know what to say (Hector does not do this).
Hector is a Haskell reimplementation of a very similar system. It is written in a literate
style, and should be simple to follow along with. Hector is not a doctor like Eliza -- instead, he's more of a grizzled old programming veteran, having got his start feeding punchcards into a mainframe in his undergraduate years. He can't help you with relationships or inferiority complexes, but he'll be happy to discuss your programming problems with you.
A few sample conversations (My responses prefixed with >):
Hello there. What's the matter?
> I don't know how to write programs in Haskell.
Does anyone have a reference on how to write programs in Haskell?
> Yes, there are lots on the internet.
Oh Internet... a thousand monkeys and associated typewriters, eh?
> More or less, yeah.
At least we agree on something!
Hello there. What's the matter?
> I have the flu.
How long have you had the flu?
> Four days.
I'm pretty sure Dijkstra has a paper that would help you.
> With the flu? I doubt it.
Hello there. What's the matter?
> My regular expressions are wrong.
Regular Expressions... Now you have two problems.
> That's not very helpful.
So, have you read Knuth's books?
> No.
Well, I don't know what to say, then.
I am by no means knowledgeable about Haskell: this program is just the best efforts of a beginner, and I'd be very happy to hear feedback on how to make it more idiomatic or expressive. In particular, the PRNG seed passing strikes me as a kludge, and I suspect there's a nicer way to handle the IO aspects, for example with interact and lines.
The code below covers the interesting parts of the program, but the actual input/response pairs have been left out for conciseness (and because they're largely boring). A link to the full source file is here.
A short list of references about Haskell and AI history/programming can be found at the end of the post. More can be found by looking at the related items
on the linked-to pages.
The Code
We're going to use regular expressions for our pattern matching, and select responses randomly from the set of available responses to that input, so we'll need some imports.
import Data.List
import System.Random
import Text.Regex
The main loop of the program is quite simple: we just keep prompting the user for input and outputting a response until we get an EOF. Before we enter this loop, we'll just print a short greeting. Each iteration of the loop we'll take advantage of the system random number generator to create a new generator for use in picking our output.
main :: IO ()
main = do
putStrLn "Hello there. What's the matter?"
mainloop
mainloop :: IO ()
mainloop = do
input <- getLine
seed <- randomIO
putStrLn $ respond input (mkStdGen seed)
mainloop
The format we'll use for our canned input/response pairs is simple. Each possible input is a regular expression, originally entered as a string. Before use, we'll do some gentle massaging to get it into the Regex type our libraries want. Every input/response pair actually consists of many different inputs and responses: any input in the pair can cause any of the responses to be output. In other words, if any of the inputs in a pair match, one of the outputs in the pair will be returned.
type RawIRPair = ([String], [String])
type IRPair = ([Regex], [String])
Our actual input/response pairs are defined at the end of the program, since they're very bulky and largely uninteresting. They're grouped into categories, so we'll need to put all of them together into one big list.
combinedResponses :: [RawIRPair]
combinedResponses = foldl' (++) [] [general, tech, nonSequitur]
While we're at it, let's convert our input strings into Regexes -- this is the final form that we'll actually use. Note that we're making the regular expressions case-insensitive.
responses :: [IRPair]
responses = regexify combinedResponses
where regexify :: [RawIRPair] -> [IRPair]
regexify [] = []
regexify (x:xs) =
[(map (\i -> mkRegexWithOpts i True False) (fst x), snd x)] ++
regexify xs
Here's where the meat of the program happens: the process of selecting a response. Essentially, we just find the list of responses that match our input, and pick one at random.
respond :: RandomGen g => String -> g -> String
respond input randGen = pickResponse (findMatchingResponses input) randGen
The process of picking the response at random is also straightforward. Once
it's chosen at random, we run it through a function that substitutes any
captured subgroups in place of patterns of the form !0, !1, etc.
pickResponse :: RandomGen g => ([String], [String]) -> g -> String
pickResponse (responses, substrings) randGen =
fillIn (responses !! chosen) substrings
where chosen = (fst $ randomR (0, length responses - 1) randGen :: Int)
fillIn :: String -> [String] -> String
fillIn response substrings = fillInCount response substrings 0
fillInCount :: String -> [String] -> Int -> String
fillInCount response (x:xs) n =
fillInCount (subRegex (mkRegex ("!" ++ show n)) response x) xs (n+1)
fillInCount response [] _ = response
Finding the list of matching responses, however, is a little trickier, largely due to the deep nesting that we need to flatten out before we get a chance to actually try to match our regular expressions. Along with the list of matching potential responses, we return the list of matching substrings (if any).
findMatchingResponses :: String -> ([String], [String])
findMatchingResponses input = checkListOfTuples input responses
where checkListOfTuples input (x:xs) =
case (checkTuple input x) of
([], []) -> checkListOfTuples input xs
(responses, substrings) -> (responses, substrings)
checkTuple :: String -> IRPair -> ([String], [String])
checkTuple input (i:is, r) =
case (matchRegex i input) of
Just a -> (r, a)
Nothing -> checkTuple input (is, r)
checkTuple _ _ = ([], [])
Finally, we need to give Hector something to say. We do this by creating the list of input/response pairs references in our list-flattening function above. I've left this out of the post because it's largely uninteresting, but it looks something like this:
general :: [RawIRPair]
general = [
(["hello"], ["Hi. What's going on?", "Yes, hi. Tell me your problem."]),
(["I need ([a-zA-Z ]*)", "I want ([a-zA-Z ]*)"],
["Well, do you think it's at all possible to get !0?",
"What would change if you got !0?"])]
References
Paradigms of Artificial Intelligence Programming (Amazon.com
, Amazon.ca
)
Real World Haskell (Amazon.com
, Amazon.ca
, free online)
Programming in Haskell (Amazon.com
, Amazon.ca
)
Extensions
There are a few niceties that Hector doesn't implement, that can be considered the infamous exercises left to the reader
. These include: memory of already-used phrases, memory of previous topics, a much more extensive and useful set of input/response pairs, and actual sentience and intelligence.
ELIZA implements the first three.
Doug Crockford gave a talk at the University of Waterloo last night as part
of the Yahoo!
Hack-U (University Hack Day), on the same topic as his new book, JavaScript: The Good Parts. I was lucky enough to get a seat, and have tried to
condense my nine pages of notes into an overview of the highlights of his talk. Though I've
rewritten parts, all of the content and wisdom is Doug's -- I'm just the
humble scribe. Of course, any errors of transcription are mine.
Introduction
Believe it or not, JavaScript has good parts! It's an odd language, because
it contains some of the best and some of the worst ideas in programming
language design, and has managed to become both the most popular and most
reviled programming language out there.
Of all languages, JavaScript probably has the broadest range of skills
among its users. It appeals to both computer scientists and cut'n'paste
beginners with no clear idea of what they are doing. It's pretty much the only
language that people use without ever learning! That's both the cause of a lot
of the awful code out there, and an astonishing testament to the fact that
it's actually possible to do that.
JavaScript is not what makes in-browser programming awful -- it's the DOM.
Any language would be painful if you had to use it to interact with the DOM.
It's also what makes things slow and inefficient.
The history of the language is incredibly diverse: it's been influenced by
Scheme (lambdas, loose typing), Self (prototypal inheritance, dynamic
objects), Java (syntax), and Perl (regular expressions).
The Bad Parts
Global variables: This is bad for all the same reasons as
in other languages, but in addition, JavaScript will implicitly declare your
typo-ed identifiers, and silently carry on.
+ both adds and concatenates: You can get away with this
in Java because of the type restrictions, but in JavaScript you'll blithely
try to add a number to, say, a form input which looks like a number but is
actually a string.
Semicolon insertion: This seems like a nice,
beginner-friendly feature, at first. It's implemented by the parser running
along until it hits a syntax error, at which point it rewinds a little,
inserts a semicolon in a likely place, and tries again. This should scare
you.
typeof: What's the typeof an
object? Object. Of an array? Object. Of null? Object.
with and eval: Security
implications are bad. eval is probably the single worst misused
feature. If you find that you want to use it, step away from the
computer
.
Fake arrays: They're actually hash tables, which is OK if
that's what you want, and if that's what you call them.
== and != do type coercion. Unfortunately nobody can
figure out exactly when or how. For example, 0 == '0' is true,
but false == 'false' is false, and '' == '0' is
false, but 0 == '' is true. Thankfully, you can just always use
===.
false, null, undefined, NaN: These are all almost the
same, but not quite.
Bad Genetics
There is also a good deal of bad behaviour that's inherited and shared with
other languages: block-less statements (e.g. one-line if and so
on); expression statements (lone expressions on a line that will be evaluated
and discarded); IEEE floating point (0.1 + 0.2 != 0.3); ++ and -- (leads
developers into clever
behaviour); and fall-through of
switch blocks.
Doug had an amusing anecdote about an episode in the development of JSLint:
A user contacted him suggesting that fall-through of cases be flagged as bad
behaviour. Dough replied with an explanation of the elegance of nicely
structured, intentional fall-through, which convinced the user to retract his
feature request. In the user's response, in addition to withdrawing the
feature request, he reported another bug. When Doug investigated it, it turned
out to be... yup, unwanted switch fall-through. At that moment, Doug says,
I was enlightened
.
Good Parts
JavaScript was the first really mainstream language with lambda and
first-class functions, which other languages are now adopting. This makes JS
an influential language!
Dynamic objects are simple containers that can grow or shrink, and since
they are based on prototypal inheritance, they aren't limited to just being
instances of a class. This is a strictly more powerful object model, but it
takes some getting used to for most people.
Loose typing is one of the controversial parts, which some people would
consider one of the bad things. However, Doug's conclusion is that the added
expressiveness and ease of use is well worth it, since the kind of bugs
avoided by strict typing are usually easy to fix anyway.
Gotchas
Globals
Consider the code:
var names = ['zero', 'one, 'two', ...]
var digit_name = function (n) {
return names[n];
}
Though it works, it makes use of the nasty, global, names, which
could lead to all kinds of nonsense. We could move the variable into the scope
of the function instead, but that would be rather inefficient. Instead, try
this:
var digit_name = (function() {
var names = ['zero', 'one', 'two', ...];
return function (n) {
return names[n];
};
})();
This is an example of one of the good parts: closures. We can define the
variable just once, and then have our function close over it, preserving
state. The trailing () at the very end cause the anonymous function to be
executed right away, binding the returned function to our variable. This is
awesome.
Style Isn't Subjective
Brace positioning is more or less a holy war without any right
answer -- except in JavaScript, where same-line braces are
right and you should always use them. Here's why:
return
{
ok: false;
};
return {
ok: true;
};
What's the difference between these two snippets? Well, in the first one,
you silently get something completely different than what you wanted. The lone
return gets mangled by the semicolon insertion process (remember that from the
list of Bad Parts?), becomes return; and returns nothing. The rest
of the code becomes a plain old block statement, with ok: becoming
a label (of all things)! Having a label there might make
sense in C, where you can goto, but in JavaScript, it makes no
sense in this context. And what happens to false? It becomes one of
those expression statements mentioned in the Bad Parts: it gets evaluated and
completely ignored. Finally, the trailing semicolon -- what about that? Do we
at least get a syntax error there? Nope: empty statement, like in C.
Use same-line braces, folks.
JSLint
JSLint defines a professional subset of JavaScript, and imposes programming
discipline. You should do everything it tells you, even if it hurts your
feelings. Doug says JSLint is smarter about JavaScript than I am, and
probably smarter than you are too
.
History and Future
AJAX and the resurgence in popularity of JavaScript could have happened way
earlier, but Netscape 4 and the other browsers of the time we so
awful. Netscape 4 was a crime against humanity
. IE 6 was the
best browser in the world -- and think of just how bad it is.
However, all that may have been good for JavaScript: had anything happened,
it would have been thrown out and replaced with something much better!
JavaScript would have died with Netscape if not for Microsoft diligently
duplicating it, bugs and all.
Perhaps the very best part of JavaScript: stability! No new design errors
since 1999! Also, no new versions.
Thankfully, ECMAScript Fifth Edition is in the works (and is actually
readable), with nice features like support for object hardening and a strict
mode (invoked with "use strict";, which is an expression
statement under older versions).
Unfortunately, we're still waiting on implementations. Microsoft will
likely have the first working version, but they won't ship it until whenever
IE 9 comes out. Mozilla seem to just be waiting to see what Microsoft does,
and they'll react to that. Apple can't comment
on future products.
Google will just do whatever Apple does.
The Really Good Parts
If you use JavaScript, you have a potential audience of billions. It's the
most widespread -- and despite the bugs, the most cross-platform -- system you
can use.
It is possible to write really good code. In fact, it is mandatory
if you want to maintain sanity.
If you avoid the bad parts, it works really well. It's not just usable and
pleasant; there is brilliance in it.
Misc. Q&A
At the Q&A afterwards, there were a few interesting gems:
- The people in charge of the language (ECMA), and the people in charge of
the DOM (the W3C), have never had a joint session or meeting, but he's
trying to change that.
- He thinks the DOM is awful, and HTML 5 is taking it in
exactly the
wrong direction
.
The Book
I'm going to wrap this up the same way he did: with a plug for his new
book, JavaScript: The Good Parts (Amazon.com
, Amazon.ca
).
If you do any JavaScript development, get a copy! It contains all of the above
wisdom, and much more.
Now excuse me; I'm off to do some JavaScript.