Apr 30, 2008

'Hansard online'

by Cliopatria

[Cross-posted at Airminded.]

I stumbled across this by accident: a pilot digitisation of Hansard, funded and operated by Parliament. What an excellent thing! It's functional, but based only on a subset of 20th-century Hansard material:

What's on this site? This site is generated from a sample of information from Hansard, the Official Report of Parliament. It is not a complete nor an official record. Material from this site should not be used as a reference to or cited as Hansard. The material on this site cannot be held to be authoritative.

This warning should be heeded -- it's only a prototype and should not be relied upon for any purpose. It's easy to find omissions, such as Baldwin's 'the bomber will always get through' speech, even though there are quite a number of entries for the day in question. The text itself appears remarkably uncorrupt, given the volume of data that's been OCRed: I've only found a few errors (most amusing one: the Marquees of Londonderry -- I guess it must rain there a lot). There are certainly a few minor problems -- for example, I once managed to get the search engine to tell me that a debate in 1958 happened earlier than one in 1944. At present there's no disambiguation between different people with the same name -- so the earliest utterance recorded for Mr. Winston Churchill is on 19 March 1941, and the latest on 11 March 1997 -- nor any merging of what is (possibly) the same person recorded under different names -- such as Churchill, Mr. Churchill, Mr. Churchill (by private notice), Mr. Churchill (Stretford) and so on. It's all experimental at this stage, so these issues will presumably be addressed in future. (LibraryThing lets its users do a lot of the work for similar problems, but I doubt a HansardThing would ever reach the critical mass needed for that to work.)
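To make the naming problem concrete, here is a minimal sketch in Python of the crudest possible fix: strip titles and bracketed qualifiers, then group whatever is left. Everything in it is my own illustration, not anything taken from the Hansard prototype: the `HONORIFICS` list, the `speaker_key` and `group_labels` helpers, and the sample labels (which are just the ones mentioned above) are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical list of titles to ignore; a real one would be far longer.
HONORIFICS = ("Mr.", "Mrs.", "Miss", "Sir", "Lord")


def speaker_key(label):
    """Reduce a speaker label to a crude, lower-case grouping key."""
    # Drop bracketed qualifiers such as "(Stretford)" or "(by private notice)".
    label = re.sub(r"\s*\([^)]*\)", "", label)
    # Drop titles, keeping whatever name parts remain.
    words = [w for w in label.split() if w not in HONORIFICS]
    return " ".join(words).lower()


def group_labels(labels):
    """Group raw speaker labels that share the same crude key."""
    groups = defaultdict(list)
    for label in labels:
        groups[speaker_key(label)].append(label)
    return dict(groups)


labels = [
    "Churchill",
    "Mr. Churchill",
    "Mr. Churchill (by private notice)",
    "Mr. Churchill (Stretford)",
    "Mr. Winston Churchill",
]
groups = group_labels(labels)
for key, members in groups.items():
    print(key, "<-", members)
```

This lumps the first four labels together and leaves "Mr. Winston Churchill" on its own, which shows both halves of the problem. Grouping by surname alone probably merges too much (a bracketed constituency may be the very clue needed to tell two people apart), and no amount of string tidying can separate the 1941 Winston Churchill from the 1997 one. That would need dates or constituencies, or human help of the LibraryThing sort.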

But in its current form, it's easy to use and is laid out in an admirably clear and uncluttered fashion. The little histograms showing the frequency of search results are a nice touch, and you can quickly drill down to a specific timeframe of interest. I LOVE human-readable URLs, ones you could easily read out to somebody (as opposed to ones which end in a 64-character hex hash or some combination of [0-9]s, ?s, =s, and &s); these ones are human-guessable and human-hackable too. The code will be open sourced; the bleeding-edge version is already available. There's even an OS X dashboard widget if you're into that kind of thing. A low-traffic discussion group gives a flavour of plans for future features, such as linking mentions of place names in Hansard to Google Maps. (Actually, that feature was apparently in the prototype at one point, but doesn't seem to be there now -- presumably it will be back.) Eventually there'll be a lot more cross-referencing and so on. Even now you can get useful things like all the Air Ministers listed in one place.

It's nice to see the 20th century get some digitisation love, even if it's only for a pilot. (Some new data is coming soon, apparently, which will go back to 1804.) All of these wonderfully ambitious scanning projects are great, except they tend to stop in 1900. (Hey, people with scads of money for scanning old books and things! Some moderately interesting events took place in the 20th century too, you know.)

What would be really nice, though, would be something similar for all the parliamentary papers -- these are much harder to find in academic libraries than Hansard, in my experience. Actually, this has been done, but the digitisation rights have apparently been sold off to a third party. Hopefully, this won't happen here. And hopefully 'Hansard online' won't just disappear without being replaced by something even better, like the BBC's Infax did one day ...