What Can the Humanities Teach Us About Big Data?
tags: technology, digital history, humanities, data science
Jo Guldi, formerly Hans Rothfels Professor of History at Brown University and a Member of the Harvard Society of Fellows, is currently Associate Professor of History at SMU Dallas, where she directs “Think-Play-Hack,” SMU’s interdisciplinary hackathon series for big ideas. She is the author of “Roads to Power: Britain Invents the Infrastructure State” (Harvard 2012) and co-author, with David Armitage, of “The History Manifesto” (Cambridge 2014), which the Chronicle of Higher Education recently named one of the 20 most influential books of the past 20 years in any field.
Like many Americans, I have a love-hate relationship with technology: I inwardly cringe when my preschooler clamors for screen-time with our iPad instead of storytime with a book. Our municipalities, our government, our insurers, and even the vendors of books are awash with technology as well. At a recent hackathon, the expert from the local transit authority confessed that, between logs of accidents and data about the wealth and race of inhabitants, the agency has more data than it can put to use in decision-making.
Understanding how the humanities have traditionally approached big problems can inform how experts in data science draw meaningful conclusions with the same skillful concern for answering questions through serious inquiry. Humanists, after all, are experts at probing the largest questions of our species. One example might be mastering what philosophers have said about topics like justice or gender since Aristotle, unpacking the values behind those concepts, and coming to a new understanding of how those ideas are changing in our own day. The traditional role of the humanities is to elevate the ambitions of human beings, asking what it means to be a citizen, an heir to the legacies of learning on many continents, or an individual with the capacity for dissent.
Now more than ever, it is important for those who work with big data to train in the questions of the humanities – just as it is important for those in the humanities to make clear the relevance of their tools of critical thinking to data scientists. The values of the humanities are the values of treating those questions – and many smaller ones – through skillful scholarship.
The particular skills of humanities scholarship take many forms, but they all agree in emphasizing serious engagement with texts and their contexts. They ask about the nature of the evidence at hand, the values that govern the inquiry, and the many ways of modeling those concepts. These skills, among other things, allow scholars to produce a strong consensus about truth where it is found, while simultaneously making room for dissent about issues of interpretation, identity, and meaning. Skillful interpretation of the data allows scholars to agree about the facts (for example, which manuscripts are the authentic production of a particular medieval scribe), while establishing room for dissent about the interpretation of those facts (for example, characterizing the perspective of Biblical literalism versus historical interpretation).
I recently proposed the concept of “Critical Search” as a general model for how humanistic values translate into the world of data. Critical Search has three major components that mirror how traditional humanists have approached big questions in the past: seeding a query, winnowing, and guided reading.
Traditionally, humanists begin to unpack a category like “justice” by consulting the canons of the past (which is not to say uncritically accepting the values of the past). They “seed” their research by beginning with a review of learned writing on the topic, carefully choosing particular texts whose categories resonate with them, just as a gardener carefully selects the seeds to plant in the ground.
Modeling this process has a lot to offer studies in big data. Much like the gardener, critical thinkers need to carefully choose their keywords, categories, and sets of documents to “seed” research in the field of big data. When working with data, defining “justice” or even “gender” requires being clear about which definition one uses. The choices need to be made explicit and self-reflexive because they have strong effects downstream. They need to be documented in order to make the query replicable.
In the era of big data, much of the work of “seeding” is done through the choice of algorithm – whether machine learning, divergence measures, or topic modeling, for example, is used to distill the findings of the data. From the humanities perspective, it isn’t enough to simply perform a search based on an algorithm; the algorithm itself has biases, which will redound through the search process. Only by comparing the insights produced by different algorithms do we get insight into how a particular tool biases the result.
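To make the point about algorithmic bias concrete, here is a minimal Python sketch of how the choice of distilling algorithm changes the "seed" terms a researcher sees. The toy corpus and the two ranking schemes compared here (raw frequency versus a simple TF-IDF weighting) are illustrative assumptions, not drawn from the article itself:

```python
import math
from collections import Counter

# Toy corpus: three hypothetical "documents" about justice, for illustration only.
docs = [
    "justice law court judge law",
    "justice fairness equity fairness",
    "court judge law trial",
]

tokenized = [d.split() for d in docs]

# Algorithm 1: rank terms by raw frequency across the whole corpus.
raw_counts = Counter(w for doc in tokenized for w in doc)

# Algorithm 2: rank terms by TF-IDF, which downweights words that
# appear in many documents and so foregrounds distinctive vocabulary.
def tf_idf(term):
    tf = raw_counts[term]
    df = sum(1 for doc in tokenized if term in doc)
    return tf * math.log(len(tokenized) / df)

by_count = sorted(raw_counts, key=raw_counts.get, reverse=True)
by_tfidf = sorted(raw_counts, key=tf_idf, reverse=True)

print("top terms by raw count:", by_count[:3])
print("top terms by tf-idf:   ", by_tfidf[:3])
```

Even on this tiny corpus the two algorithms surface different leading terms, which is precisely why comparing them, rather than trusting any single one, reveals how a tool biases the result.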
A second step in the model, “winnowing,” describes the work typically done by scholars as they read widely, gain information about context, and follow the insights of pattern recognition, discourse, or critical theory to foreground particular test cases. This step is usually interpretive, which means that there is no objectively “right” answer about the “best” theory, but that scholarship progresses by scholars engaging with each other’s insights.
In the case of big data, “winnowing” means a researcher reviews the results of any particular algorithm to ask how the data and algorithm fit her question. It might mean, for example, discussing how the same algorithm produces different answers at different scales, or how using a different measurement produces different results. For example, in one digital history experiment, three different commonly-accepted equations for divergence produced three radically different answers from the data. Comparing the results of different algorithms means foregrounding the bias inherent in a particular algorithm, equation, or choice of scale.
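The claim that different divergence measures yield different answers can be illustrated with a short sketch. The two word distributions below are hypothetical stand-ins for frequency profiles of two subcorpora; the three measures (Kullback-Leibler in each direction, Jensen-Shannon, and total variation) are standard choices, not necessarily the ones used in the experiment the article mentions:

```python
import math

# Hypothetical word-frequency distributions for two subcorpora
# over a shared three-word vocabulary (illustrative numbers).
P = [0.70, 0.20, 0.10]   # e.g., decade A
Q = [0.40, 0.35, 0.25]   # e.g., decade B

def kl(p, q):
    """Kullback-Leibler divergence: asymmetric, so direction matters."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: a symmetrized variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tv(p, q):
    """Total variation distance: half the L1 difference."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print("KL(P||Q):", round(kl(P, Q), 3))
print("KL(Q||P):", round(kl(Q, P), 3))
print("JS(P,Q): ", round(js(P, Q), 3))
print("TV(P,Q): ", round(tv(P, Q), 3))
```

The four numbers disagree, and KL even changes with the direction of comparison, so a ranking of "most divergent" decades can flip depending on which equation a researcher adopts.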
In data science work, as in problems traditionally addressed by the humanities, the right answer affords room for debate and interpretation. The point is that engineers, even when working with big data, take care to transparently document the choice of a particular algorithm and the ways that it can be seen to bias the results. Iterative seeding and winnowing provide a safety barrier against naïvely embracing the results of a computational algorithm. At present, it is unclear how dependable most of our best tools for modeling text are, and where careful limits need to be drawn. For instance, computer scientists who work with topic models have themselves called for more studies of whether, why, and how the topic model aligns with insights gained in traditional approaches. Eric Baumer and his colleagues have warned that there is "little reason to expect that the word distributions in topic models would align in any meaningful way with human interpretations." Iterative winnowing and reading offer insurance against embracing foolhardy conclusions from digital processes. A truly critical search requires human supervision wherever the fit between algorithms and humanistic questions is unclear.
The next step in the process is “guided reading,” which mirrors how a gardener picks through the harvest, setting aside the moldy and damaged fruit and sorting the rest into those good for eating and those good for pie. Presented with an archive, traditional scholars in the humanities actively choose passages for study.
Digital scholars too must reckon with the choice of which findings to present. At this stage in the process, the scholar carefully inspects the results returned by a search process, sometimes sampling them, sometimes generalizing about them (for instance by counting keywords again or topic modeling), before iterating the process again. Making sure that there’s a human step of inspecting the data – or “guided reading” – is important to making sure that the research process is producing meaningful findings. The process of continuously "checking" the work of the computer allows the expert to judge better whether and how the resulting subcorpus fits the scholarly questions at hand. Sampling the results in a structured, regular process allows the scholar to assess the results of a search confidently.
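One way to make the sampling step "structured and regular," as described above, is stratified sampling across the ranking, so the human reader inspects not just the top hits but also the marginal ones the algorithm nearly excluded. The result list, strata, and sample sizes below are all hypothetical choices for illustration:

```python
import random

# Hypothetical search results: (document id, relevance score) pairs,
# already sorted from most to least relevant.
results = [(f"doc{i:03d}", round(1.0 - i * 0.01, 2)) for i in range(100)]

def sample_for_reading(hits, n_per_stratum=3, seed=42):
    """Draw a structured sample for human inspection: a few hits each
    from the top, middle, and bottom of the ranking.  A fixed seed keeps
    the sample replicable, supporting transparent documentation."""
    rng = random.Random(seed)
    third = len(hits) // 3
    strata = [hits[:third], hits[third:2 * third], hits[2 * third:]]
    return [rng.sample(stratum, n_per_stratum) for stratum in strata]

top, middle, bottom = sample_for_reading(results)
for label, batch in zip(("top", "middle", "bottom"), (top, middle, bottom)):
    print(label, [doc for doc, score in batch])
```

Reading a fixed, reproducible sample from each stratum is what lets the scholar judge with some confidence whether the subcorpus as a whole fits the question at hand.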
Critical search in itself attunes the scholar's sensitivity to the bias and perspectival nature of particular algorithms. In many cases, however, one pass through the algorithms is not enough. Keyword search, topic models, and divergence measures may all be used to narrow a corpus down to a smaller body of texts, for example identifying a particular decade of interest. In order to precisely "tune" the algorithms to the researcher's question, successive rounds of the critical search process may be necessary.
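The overall loop of seeding, winnowing, and guided reading can be sketched as control flow. Every function body here is a deliberately trivial stand-in (keyword matching, a length filter, taking the first passages), chosen only to show how the stages chain together and repeat; a real pass would substitute the algorithms discussed above:

```python
# A minimal sketch of one pass through the critical-search loop.
# All corpus contents and filters are hypothetical stand-ins.

def seed(corpus, keywords):
    """Seeding: keep documents matching the chosen keywords."""
    return [doc for doc in corpus if any(k in doc for k in keywords)]

def winnow(hits, min_len=3):
    """Winnowing: a trivial filter standing in for an algorithmic pass
    (topic model, divergence measure, machine-learned classifier)."""
    return [doc for doc in hits if len(doc.split()) >= min_len]

def guided_reading(hits):
    """Guided reading: a human inspects a sample; here, the first passages."""
    return hits[:2]

corpus = [
    "justice and the court of appeals",
    "justice denied",
    "the harvest of the orchard",
    "fairness in the law of property",
]

hits = seed(corpus, keywords=["justice", "fairness"])
hits = winnow(hits)           # in practice: compare algorithms, scales, measures
reading_list = guided_reading(hits)
print(reading_list)
```

In successive rounds, the reading stage would feed back into a new seeding choice, "tuning" the process to the researcher's question, with each choice of seed, filter, and cut-off documented along the way.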
Critical search means adapting algorithms to the research agendas we already have—feminist, subaltern, environmental, diplomatic, and so on—and searching out those tools and parameters that will enhance our prosthetic sensitivity to the multiple dimensions of the archive. Documenting the choice of seed, algorithm, cut-offs, and iteration can go a long way towards a disciplinary practice of transparency about how we understand the canon, how we develop a sensitivity to new research agendas, and how we as a field pursue the refinement of our understanding of the past.
By emulating the humanities and embracing the skills of critical thought, individuals who engage with the critical search process can make visible and transparent their choices about how they dealt with the data they were presented. Like traditional humanists, they will compare and combine insights from secondary sources and canonical texts as they decide which categories will be extracted and what those categories mean. In explaining any given approach to data, they will fully document the choices they made around different algorithms and their results, thus helping the community as a whole to make room for consensus about facts where they exist and dissent around different interpretive approaches.
Editor's note: The title was corrected on February 27th at 10:30 A.M.