Felix Crux

Technology & Miscellanea


Tags: , , ,

PDFMunge -- Improve PDFs on eBook Readers

I recently remembered that I hadn't yet cleaned up and done an official public release of my pdfmunge utility. It's a little Python script that I wrote about a month ago to help me deal with PDFs more effectively on my eBook reader. If you're lucky enough to own a big-screened Kindle DX, you can stop reading now. The rest of us have to deal with reflowing the text of PDFs in order to bring them up to a legible size on tiny screens. Of course, most PDFs don't take kindly to being reflowed, and if they contain any kind of technical diagrams, source code, or images, you're pretty much out of luck. That's where pdfmunge comes in.

The script can do two main things: remove extraneous borders from pages, and rotate and slice them in such a way as to simulate a half-page landscape mode on your eBook reader. It can also do minor things like removing particular pages, or excluding certain pages from the other processing steps. After using it for about a month, I can happily report that it basically transforms my eBook reader into a vastly more useful device, fully capable of handling complex technical PDFs with ease. Here's a shot of it displaying a half a processed page with equations from the awesome Elements of Statistical Learning:

eBook reader showing a page with equations

The script requires Python and the pyPdf package, but is otherwise self-contained. If you are technically inclined, you can get it from the GitHub repository, or you can just download the specific file you need here. I'd be very happy to accept suggestions for improvement, especially if they come with code! See the Readme file for information on how to use the script and on how to contribute enhancements.

Usage Examples

Some examples might help to illustrate how to use it. For a simple example, we can use the PDF version of John Hughes' paper Why Functional Programming Matters, which is available from his website. We want to strip away the large margins, and exclude the last page of references:


./pdfmunge.py --exclude "23" --bounds "125,110,485,670" WhyFP.pdf Test.pdf

For a complicated example that really shows off all the features, let's look at the (free) PDF edition of Mark Dominus' Higher-Order Perl (Amazon.com, Amazon.ca). This book needs to have the large margins and printers' guides stripped off, but uses margins of alternating size on alternating pages. Let's also get rid of some of the extraneous material like the dedication page and index (buy a paper copy if you want that kind of convenience!). To make it more readable, we'll slice each page in half and rotate them, creating an imitation of landscape mode on the eBook reader. We'll also want to refrain from processing page 5, which is the title page:


./pdfmunge.py --intact "5" --exclude "1-4,6-8,582-592" \
              --bounds "215,240,557,778" --oddbounds "153,240,495,778" \
              --rotate --margin 3 HigherOrderPerl.pdf Test.pdf

I hope this is as useful to you as it was to me!



Tags: , , ,

Hector: A Literate Haskell Implementation of ELIZA

Hector is a Literate Haskell implementation of Weizenbaum's ELIZA program.

ELIZA is more or less the original chatterbot. Implemented by Weizenbaum in the mid-1960s to demonstrate how seemingly intelligent behaviour could be mimicked by very simple pattern-matching, it shocked even him with how much ordinary people were taken in by the illusion. It is probably best known in conjunction with its DOCTOR script, which allows it to act like a Rogerian psychotherapist (a version of this can be invoked in the Emacs text editor with M-x doctor). One user was reportedly so affected by the program that he requested that Weizenbaum give him some privacy to discuss his issues with the doctor!

The program works by simple pattern matching. The user's input is checked against a list of pre-defined patterns, and if there's a match, a corresponding output is returned. Eliza also has some facilities for extracting content from the input and using it in the reply (e.g. I need a vacation might result in How would you feel if you got a vacation?). The final bit of finesse is a rudimentary memory, so that it can bring up past topics again if it doesn't know what to say (Hector does not do this).

Hector is a Haskell reimplementation of a very similar system. It is written in a literate style, and should be simple to follow along with. Hector is not a doctor like Eliza -- instead, he's more of a grizzled old programming veteran, having got his start feeding punchcards into a mainframe in his undergraduate years. He can't help you with relationships or inferiority complexes, but he'll be happy to discuss your programming problems with you.

A few sample conversations (My responses prefixed with >):

Hello there. What's the matter?
> I don't know how to write programs in Haskell.
Does anyone have a reference on how to write programs in Haskell?
> Yes, there are lots on the internet.
Oh Internet... a thousand monkeys and associated typewriters, eh?
> More or less, yeah.
At least we agree on something!

Hello there. What's the matter?
> I have the flu.
How long have you had the flu?
> Four days.
I'm pretty sure Dijkstra has a paper that would help you.
> With the flu? I doubt it.

Hello there. What's the matter?
> My regular expressions are wrong.
Regular Expressions... Now you have two problems.
> That's not very helpful.
So, have you read Knuth's books?
> No.
Well, I don't know what to say, then.

I am by no means knowledgeable about Haskell: this program is just the best efforts of a beginner, and I'd be very happy to hear feedback on how to make it more idiomatic or expressive. In particular, the PRNG seed passing strikes me as a kludge, and I suspect there's a nicer way to handle the IO aspects, for example with interact and lines.

The code below covers the interesting parts of the program, but the actual input/response pairs have been left out for conciseness (and because they're largely boring). A link to the full source file is here.

A short list of references about Haskell and AI history/programming can be found at the end of the post. More can be found by looking at the related items on the linked-to pages.

The Code

We're going to use regular expressions for our pattern matching, and select responses randomly from the set of available responses to that input, so we'll need some imports.


import Data.List
import System.Random
import Text.Regex

The main loop of the program is quite simple: we just keep prompting the user for input and outputting a response until we get an EOF. Before we enter this loop, we'll just print a short greeting. Each iteration of the loop we'll take advantage of the system random number generator to create a new generator for use in picking our output.


main :: IO ()
main = do
  putStrLn "Hello there. What's the matter?"
  mainloop

mainloop :: IO ()
mainloop = do
  input <- getLine
  seed <- randomIO
  putStrLn $ respond input (mkStdGen seed)
  mainloop

The format we'll use for our canned input/response pairs is simple. Each possible input is a regular expression, originally entered as a string. Before use, we'll do some gentle massaging to get it into the Regex type our libraries want. Every input/response pair actually consists of many different inputs and responses: any input in the pair can cause any of the responses to be output. In other words, if any of the inputs in a pair match, one of the outputs in the pair will be returned.


type RawIRPair = ([String], [String])
type IRPair = ([Regex], [String])

Our actual input/response pairs are defined at the end of the program, since they're very bulky and largely uninteresting. They're grouped into categories, so we'll need to put all of them together into one big list.


combinedResponses :: [RawIRPair]
combinedResponses = foldl' (++) [] [general, tech, nonSequitur]

While we're at it, let's convert our input strings into Regexes -- this is the final form that we'll actually use. Note that we're making the regular expressions case-insensitive.


responses :: [IRPair]
responses = regexify combinedResponses
  where regexify :: [RawIRPair] -> [IRPair]
        regexify []     = []
        regexify (x:xs) =
          [(map (\i -> mkRegexWithOpts i True False) (fst x), snd x)] ++
          regexify xs

Here's where the meat of the program happens: the process of selecting a response. Essentially, we just find the list of responses that match our input, and pick one at random.


respond :: RandomGen g => String -> g -> String
respond input randGen = pickResponse (findMatchingResponses input) randGen

The process of picking the response at random is also straightforward. Once it's chosen at random, we run it through a function that substitutes any captured subgroups in place of patterns of the form !0, !1, etc.


pickResponse :: RandomGen g => ([String], [String]) -> g -> String
pickResponse (responses, substrings) randGen =
  fillIn (responses !! chosen) substrings
  where chosen = (fst $ randomR (0, length responses - 1) randGen :: Int)

fillIn :: String -> [String] -> String
fillIn response substrings = fillInCount response substrings 0

fillInCount :: String -> [String] -> Int -> String
fillInCount response (x:xs) n =
  fillInCount (subRegex (mkRegex ("!" ++ show n)) response x) xs (n+1)
fillInCount response [] _ = response

Finding the list of matching responses, however, is a little trickier, largely due to the deep nesting that we need to flatten out before we get a chance to actually try to match our regular expressions. Along with the list of matching potential responses, we return the list of matching substrings (if any).


findMatchingResponses :: String -> ([String], [String])
findMatchingResponses input = checkListOfTuples input responses
  where checkListOfTuples input (x:xs) =
           case (checkTuple input x) of
             ([], []) -> checkListOfTuples input xs
             (responses, substrings) -> (responses, substrings)

checkTuple :: String -> IRPair -> ([String], [String])
checkTuple input (i:is, r) =
  case (matchRegex i input) of
    Just a -> (r, a)
    Nothing -> checkTuple input (is, r)
checkTuple _ _ = ([], [])

Finally, we need to give Hector something to say. We do this by creating the list of input/response pairs references in our list-flattening function above. I've left this out of the post because it's largely uninteresting, but it looks something like this:


general :: [RawIRPair]
general = [
  (["hello"], ["Hi. What's going on?", "Yes, hi. Tell me your problem."]),
  (["I need ([a-zA-Z ]*)", "I want ([a-zA-Z ]*)"],
   ["Well, do you think it's at all possible to get !0?",
    "What would change if you got !0?"])]

References

Paradigms of Artificial Intelligence Programming (Amazon.com, Amazon.ca)
Real World Haskell (Amazon.com, Amazon.ca, free online)
Programming in Haskell (Amazon.com, Amazon.ca)

Extensions

There are a few niceties that Hector doesn't implement, that can be considered the infamous exercises left to the reader. These include: memory of already-used phrases, memory of previous topics, a much more extensive and useful set of input/response pairs, and actual sentience and intelligence.

ELIZA implements the first three.



Tags: , , ,

Doug Crockford Talk on JavaScript

Doug Crockford gave a talk at the University of Waterloo last night as part of the Yahoo! Hack-U (University Hack Day), on the same topic as his new book, JavaScript: The Good Parts. I was lucky enough to get a seat, and have tried to condense my nine pages of notes into an overview of the highlights of his talk. Though I've rewritten parts, all of the content and wisdom is Doug's -- I'm just the humble scribe. Of course, any errors of transcription are mine.

Introduction

Believe it or not, JavaScript has good parts! It's an odd language, because it contains some of the best and some of the worst ideas in programming language design, and has managed to become both the most popular and most reviled programming language out there.

Of all languages, JavaScript probably has the broadest range of skills among its users. It appeals to both computer scientists and cut'n'paste beginners with no clear idea of what they are doing. It's pretty much the only language that people use without ever learning! That's both the cause of a lot of the awful code out there, and an astonishing testament to the fact that it's actually possible to do that.

JavaScript is not what makes in-browser programming awful -- it's the DOM. Any language would be painful if you had to use it to interact with the DOM. It's also what makes things slow and inefficient.

The history of the language is incredibly diverse: it's been influenced by Scheme (lambdas, loose typing), Self (prototypal inheritance, dynamic objects), Java (syntax), and Perl (regular expressions).

The Bad Parts

Global variables: This is bad for all the same reasons as in other languages, but in addition, JavaScript will implicitly declare your typo-ed identifiers, and silently carry on.

+ both adds and concatenates: You can get away with this in Java because of the type restrictions, but in JavaScript you'll blithely try to add a number to, say, a form input which looks like a number but is actually a string.

Semicolon insertion: This seems like a nice, beginner-friendly feature, at first. It's implemented by the parser running along until it hits a syntax error, at which point it rewinds a little, inserts a semicolon in a likely place, and tries again. This should scare you.

typeof: What's the typeof an object? Object. Of an array? Object. Of null? Object.

with and eval: Security implications are bad. eval is probably the single worst misused feature. If you find that you want to use it, step away from the computer.

Fake arrays: They're actually hash tables, which is OK if that's what you want, and if that's what you call them.

== and != do type coercion. Unfortunately nobody can figure out exactly when or how. For example, 0 == '0' is true, but false == 'false' is false, and '' == '0' is false, but 0 == '' is true. Thankfully, you can just always use ===.

false, null, undefined, NaN: These are all almost the same, but not quite.

Bad Genetics

There is also a good deal of bad behaviour that's inherited and shared with other languages: block-less statements (e.g. one-line if and so on); expression statements (lone expressions on a line that will be evaluated and discarded); IEEE floating point (0.1 + 0.2 != 0.3); ++ and -- (leads developers into clever behaviour); and fall-through of switch blocks.

Doug had an amusing anecdote about an episode in the development of JSLint: A user contacted him suggesting that fall-through of cases be flagged as bad behaviour. Dough replied with an explanation of the elegance of nicely structured, intentional fall-through, which convinced the user to retract his feature request. In the user's response, in addition to withdrawing the feature request, he reported another bug. When Doug investigated it, it turned out to be... yup, unwanted switch fall-through. At that moment, Doug says, I was enlightened.

Good Parts

JavaScript was the first really mainstream language with lambda and first-class functions, which other languages are now adopting. This makes JS an influential language!

Dynamic objects are simple containers that can grow or shrink, and since they are based on prototypal inheritance, they aren't limited to just being instances of a class. This is a strictly more powerful object model, but it takes some getting used to for most people.

Loose typing is one of the controversial parts, which some people would consider one of the bad things. However, Doug's conclusion is that the added expressiveness and ease of use is well worth it, since the kind of bugs avoided by strict typing are usually easy to fix anyway.

Gotchas

Globals

Consider the code:


var names = ['zero', 'one, 'two', ...]
var digit_name = function (n) {
  return names[n];
}

Though it works, it makes use of the nasty, global, names, which could lead to all kinds of nonsense. We could move the variable into the scope of the function instead, but that would be rather inefficient. Instead, try this:


var digit_name = (function() {
  var names = ['zero', 'one', 'two', ...];
  return function (n) {
    return names[n];
  };
})();

This is an example of one of the good parts: closures. We can define the variable just once, and then have our function close over it, preserving state. The trailing () at the very end cause the anonymous function to be executed right away, binding the returned function to our variable. This is awesome.

Style Isn't Subjective

Brace positioning is more or less a holy war without any right answer -- except in JavaScript, where same-line braces are right and you should always use them. Here's why:


return
{
  ok: false;
};


return {
  ok: true;
};

What's the difference between these two snippets? Well, in the first one, you silently get something completely different than what you wanted. The lone return gets mangled by the semicolon insertion process (remember that from the list of Bad Parts?), becomes return; and returns nothing. The rest of the code becomes a plain old block statement, with ok: becoming a label (of all things)! Having a label there might make sense in C, where you can goto, but in JavaScript, it makes no sense in this context. And what happens to false? It becomes one of those expression statements mentioned in the Bad Parts: it gets evaluated and completely ignored. Finally, the trailing semicolon -- what about that? Do we at least get a syntax error there? Nope: empty statement, like in C.

Use same-line braces, folks.

JSLint

JSLint defines a professional subset of JavaScript, and imposes programming discipline. You should do everything it tells you, even if it hurts your feelings. Doug says JSLint is smarter about JavaScript than I am, and probably smarter than you are too.

History and Future

AJAX and the resurgence in popularity of JavaScript could have happened way earlier, but Netscape 4 and the other browsers of the time we so awful. Netscape 4 was a crime against humanity. IE 6 was the best browser in the world -- and think of just how bad it is.

However, all that may have been good for JavaScript: had anything happened, it would have been thrown out and replaced with something much better! JavaScript would have died with Netscape if not for Microsoft diligently duplicating it, bugs and all.

Perhaps the very best part of JavaScript: stability! No new design errors since 1999! Also, no new versions.

Thankfully, ECMAScript Fifth Edition is in the works (and is actually readable), with nice features like support for object hardening and a strict mode (invoked with "use strict";, which is an expression statement under older versions).

Unfortunately, we're still waiting on implementations. Microsoft will likely have the first working version, but they won't ship it until whenever IE 9 comes out. Mozilla seem to just be waiting to see what Microsoft does, and they'll react to that. Apple can't comment on future products. Google will just do whatever Apple does.

The Really Good Parts

If you use JavaScript, you have a potential audience of billions. It's the most widespread -- and despite the bugs, the most cross-platform -- system you can use.

It is possible to write really good code. In fact, it is mandatory if you want to maintain sanity.

If you avoid the bad parts, it works really well. It's not just usable and pleasant; there is brilliance in it.

Misc. Q&A

At the Q&A afterwards, there were a few interesting gems:

  • The people in charge of the language (ECMA), and the people in charge of the DOM (the W3C), have never had a joint session or meeting, but he's trying to change that.
  • He thinks the DOM is awful, and HTML 5 is taking it in exactly the wrong direction.

The Book

I'm going to wrap this up the same way he did: with a plug for his new book, JavaScript: The Good Parts (Amazon.com, Amazon.ca). If you do any JavaScript development, get a copy! It contains all of the above wisdom, and much more.

Now excuse me; I'm off to do some JavaScript.