Researching, Blogspace
My father sent me most of his copy of the latest issue of the Communications of the ACM (that's the "Association for Computing Machinery," the AHA of the computer world), which includes a whole raft of articles on blogging.
The most historical is Rebecca Blood's article on the relationship between software features and the popularity and spread of blogging, which also includes a link to her chronicle of the early days of blogging, in which she was an active participant (her ten tips for bloggers really should be called the "ten commandments").
But as an historian, the one that made my ears prick up was "Structure and evolution of blogspace" by Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. Apparently these four are part of a whole cadre of scholars studying "bursty" data systems; in other words, complex systems in which activity comes in irregular clusters.
Their research on the communities of blogspace (this article was about livejournal.com blogs, mostly) has produced some intriguing sociological findings (see tables 1 and 2, p. 37). For example, I just turned 37, and entered the group interested in "SCA [I have friends there], Babylon 5 [best SF TV ever], pagan [friends], gardening [not really], Star Trek [yeah, yeah], Hogwarts [it's ok], Macintosh [not really], Kate Bush [I have 1 CD], Zen [well, I'm an Asianist], tarot [friends]." Even better, the group I left on my birthday is supposed to be interested in "Cross stitch, Thelema, Tivo, parenting, role-playing, bicycling, shamanism, Burning Man," which is pretty poor as a description of my life. How's that for freaky? Actually, I fit the 46-57 profile best, because it includes "science fiction, politics, history, poetry, writing, reading" and a few other things. Also, there's a surprisingly large number of under-three-year-old bloggers, both human and feline.
This is fine, but like a lot of interesting sociological and anthropological research, it's a snapshot of a moving target, an historical document as soon as it's published. I recently read an interview with James "Woody" Watson [PDF], who spearheaded the Golden Arches East research, in which he admitted that the findings published less than a decade ago were no longer reflective of today's eating/living habits. (He also said that industrial monoculture production of meat is going to make eating hamburger the risk equivalent of eating fugu pufferfish [poisonous unless prepared properly] within a few decades.)
The algorithms and software tools Kumar, et al. use to track activity are what actually interest me. Quantity is not the issue with modern computing power; filtering is. They are using algorithms that can pick out groups of blogs tightly knit enough to qualify as communities and, within and between those communities, pick out "bursts of activity around certain words or expressions" or hyperlinks (p. 39). They found, even controlling for number of blogs and number of communities, that "The magnitude of burstiness in communities appears to be increasing, suggesting that local community structure and community-level interactions are being reinforced as blogspace grows." (p. 39) These tools could then be used, and I imagine that they will be, to do more focused sociological and discourse analysis on particular communities, clusters, slices or topics (there's a whole new realm -- digital anthropology -- waiting for a bold Ph.D. advisor and their intrepid grad students).
What excites me, though, is the thought that "bursty" algorithms could be used on other properly digitized primary sources. So much of what we do as historians is about finding critical masses of interrelated documents, sussing out the connections between people and their ideas and their institutions. In a decade or so, as the digitization of sources increases, I very much hope to see research which starts not with keywords in an alphanumeric index, but with "bursty" searches of full-text meta-archives. The algorithms don't even need the data to be indexed, particularly, since they seem to function on full text, and computing power is no barrier anymore.
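To make the idea a little more concrete, here is a minimal sketch in Python of what a crude "bursty" search over dated full-text sources might look like. It is emphatically not the method Kumar, et al. use; it is a stand-in that simply counts how often a term appears in each time period and flags periods that sit well above the average. The function name find_bursts, the (period, text) input shape, and the threshold of 2.0 standard deviations are all assumptions made for illustration.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def find_bursts(documents, term, threshold=2.0):
    """Return the periods in which `term` is unusually frequent.

    documents: iterable of (period, text) pairs, e.g. (1905, "full text ...").
    threshold: how many standard deviations above the mean counts as a
               burst (an arbitrary, illustrative default).
    """
    term = term.lower()
    counts = Counter()
    periods = set()
    for period, text in documents:
        periods.add(period)
        # Work directly on the raw full text: no prebuilt index required.
        words = re.findall(r"\w+", text.lower())
        counts[period] += words.count(term)

    ordered = sorted(periods)
    series = [counts[p] for p in ordered]
    if len(series) < 2:
        return []

    average = mean(series)
    spread = pstdev(series)
    if spread == 0:
        # The term is equally common (or absent) everywhere: no bursts.
        return []

    return [p for p, c in zip(ordered, series)
            if (c - average) / spread >= threshold]
```

Fed a decade of documents in which a word turns up once a year except for one year with ten mentions, this would single out that one year. A real tool would need to handle spelling variants, phrases, and uneven archive sizes, which is exactly where the more sophisticated algorithms earn their keep.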
There'll still be a great deal of work for us human historians to do, picking the interesting questions (though bursty analysis could also reveal questions we hadn't thought to ask yet), interpreting the data, correlating and weighing contradictory and conflicting claims, and, of course, presenting meaningful findings (though that also should be dramatically altered in a decade or so, by the lower cost of internet publishing and the greater flexibility of hypertext [I know, people have been saying this for at least a decade now, but that doesn't mean I'm wrong now]). But the technology will allow us to ask more interesting questions and answer them more quickly: we should embrace it.