
Historians at Columbia University are using big data to draw historical conclusions

Related Links

 Columbia University’s History Lab

 Science Relevant to History (Big Data)

“Can A Computer Algorithm Do The Job Of A Historian?” is the second article in a BuzzFeed series written with help from Columbia University’s History Lab. This team of historians and data scientists is developing a “Declassification Engine” that turns documents into data and mines it for insights about the history and future of official secrecy. The stories draw from the lab’s searchable database of over 2 million declassified government documents.

How do we decide what counts as history? Well, there’s the first draft, journalism — the stories the media tells about the events of the day. And then there are the endless subsequent iterations, mined from primary sources and dusted off and polished by historians into arguments and narratives that shape our understanding of the world. 

Then there’s a third option, one made possible by the deluge of electronic records kept in the second half of the 20th century and by the tools of modern data science: automatic event detection. That’s the idea that software can read historical data and try to pick out patterns — discrete events that stand out from an ocean of data as significant.

In the early 1970s, the State Department began keeping electronic records of the thousands of cables its employees sent about American interests throughout the world. Researchers at Columbia’s Declassification Engine project believe it’s possible to automatically distinguish periods of increased activity in these cables that correspond to historically important events.

Three Columbia University statisticians — Rahul Mazumder, Yuanjun Gao, and Jonathan Goetz — developed an advanced statistical model that allowed them to sift through 1.7 million diplomatic cables from the years 1973–1977, including 330,000-odd cables in which only the metadata has been declassified. The model, with the help of the 2,600 cores in Columbia’s High Performance Computer Cluster, isolated 500 “bursts” — periods of heightened activity where more cables were being sent. And from those 500, the team investigated the top 10, what you might call the most active areas of American diplomacy in a four-year span that included the end of the Vietnam War, roiling conflict in the Middle East, and the OPEC oil embargo. ...
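To give a concrete sense of what “burst” detection means, here is a minimal sketch in Python of the general idea: flag days whose cable traffic rises well above a trailing baseline. The daily counts, window size, and threshold below are hypothetical, and this is only an illustration of the concept, not the History Lab’s method, which the article describes only as an advanced statistical model run on a 2,600-core cluster.

```python
# Toy illustration of "burst" detection on a time series of daily cable counts.
# This is NOT the History Lab's model; it simply flags days whose count exceeds
# a trailing rolling baseline by several standard deviations.
import numpy as np

def find_bursts(daily_counts, window=30, threshold=3.0):
    """Return indices of days whose count is unusually high
    relative to the mean of the preceding `window` days."""
    counts = np.asarray(daily_counts, dtype=float)
    bursts = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mean, std = baseline.mean(), baseline.std()
        if std > 0 and (counts[i] - mean) / std > threshold:
            bursts.append(i)
    return bursts

# Example with synthetic data: steady background traffic plus two surges.
rng = np.random.default_rng(0)
traffic = rng.poisson(lam=1000, size=365)   # roughly 1,000 cables per day
traffic[200:205] += 400                     # a simulated week-long crisis
traffic[300] += 600                         # a one-day surge
print(find_bursts(traffic))                 # flags days around 200 and day 300
```

A real system would have to cope with long-term trends, weekly cycles, and missing or still-classified records, which is part of why the lab built a far more sophisticated statistical model.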

...

So how did the Declassification Engine do? The head of the History Lab team, Matthew Connelly, asked another historian — Daniel Sargent, who recently published a history of American foreign policy in the ’70s — to rank his own 10 most important events from 1973–1977. While the computer and the historian broadly agreed on the importance of Middle Eastern politics in the period, Sargent ranked some events highly, including China’s post-Mao transition, that the Declassification Engine hardly picked up and that became obviously significant only in hindsight.

In the History Lab blog, Sargent writes, “Comparing and contrasting my top ten with the results that the History Lab generated, I feel a certain relief. For all the differences, which are substantive, our conclusions are not so far removed as I’d feared.” You can read the rest of his list there.

Read entire article at BuzzFeed