The Hermeneutics of Data and Historical Writing (Fall 2011 version)
The ongoing digitization of primary sources and the proliferation of born-digital documents are making it easier for historians to engage with vast amounts of research material. As a result, historical scholarship increasingly depends on our interactions with data, from battling the hidden algorithms of Google Book Search to text mining a hand-curated set of full-text documents. However, no standard protocols or procedures for either interacting with or writing about data have emerged. This chapter discusses some new ways in which historians might rethink the nature of historical writing as both a product and a process of understanding.
We argue that the new methods used to explore and interpret historical data require a new kind of methodological transparency in history writing. Examples of such transparency include discussions of data queries, reformatting techniques, workflows with particular tools, or the production and interpretation of data visualizations. At a minimum, historians need to de-emphasize the traditional historical narrative in favor of explicating the process of interfacing with, exploring, and then making sense of historical sources in a fundamentally digital form: that is, the hermeneutics of data. We call upon historians to publicly experiment with ways of writing about their methodologies, procedures, and experiences with historical data as a kind of text, as they engage in a cyclical process of contextualization and interpretation. This essay hopes to encourage more dialog about why historical writing must foreground methodological transparency and free itself from the epistemological jitters that make many historians wary of moving away from close readings or using large amounts of data.
Data in History
Use of data in the humanities has recently attracted considerable attention, and no project more so than Culturomics, a quantitative study of culture using Google Books.1 Of course the idea of using data for historical research is hardly new, whether in the context of quantitative history, early work from the Annales school, or work done under the rubric of “humanities computing.” Even if techniques for data interpretation and manipulation are not considered routine training for historians, they are a hallmark of many award-winning history books. Two relatively recent Bancroft Prize-winning books, Richard L. Bushman’s The Refinement of America: Persons, Houses, Cities (Alfred A. Knopf, 1992) and William Cronon’s Nature’s Metropolis: Chicago and the Great West (W.W. Norton, 1991), spend a considerable amount of space interpreting historical data.
In Nature’s Metropolis, Cronon explores the relationships of debt and commodities sales, providing maps of the commercial relationships around these commodities to illustrate the relationship between Chicago and its hinterland. The data illustrates Cronon’s point well, even when his maps become artifacts that invite their own historical interpretation and analysis. Similarly, in The Refinement of America, Bushman draws from Delaware estate inventories from the 1770s to the 1840s to show increased use of articles of refinement. Yet such articles were not necessarily luxurious; some of the “carpets” were described as little more than rag rugs. Going beyond the data, Bushman argues that their inclusion suggests less about rugs per se and more about a change in sensibility about bringing dirt inside.
Although data is central to these excellent studies, it is important to note how data was used. In both cases, data is presented principally as complementary to a narrative argument. It is obviously part of the story, but almost as a footnote, albeit a nicely illustrated one. Also, the sheer quantity of data did not threaten to bottleneck the project. That is, the methodologies used to investigate the data did not need to handle an extraordinary amount of it, nor did the data require extensive manipulation to be usable. To point out that these two excellent studies relied on a humanly manageable number of sources and amount of data is certainly no criticism. But it stands in contrast to the situation that many historians will find themselves in as data becomes more easily findable and usable, and perhaps as they become obliged to use and represent vast quantities of disparate kinds of data.
Other scholars who work within the domain of the so-called digital humanities have begun to think and write more explicitly about data and its potential for new kinds of research. For example, some Shakespeare scholars recently used statistical procedures to test historical and categorical hypotheses about sets of materials and detect nuances in change over time.2 The Stanford Literary Lab has provided a research locus for rethinking the nature of genre. Yet most projects, including these, continue to be largely confirmatory, for instance reinforcing the periodization of Shakespeare’s works or confirming the codified family of literary genres. Again, this is not a criticism of these projects and their outcomes. They are in fact a crucial step forward. As humanists continue to prove that data manipulation and machine learning can confirm existing knowledge, such methods come closer to telling us something we don’t already know. Other large-scale research projects, like several funded through the Digging Into Data initiative, have begun to explore the transformative potential of data in humanities research as well.
However, even these projects focus on research (or research potential) rather than making their methodology accessible to a broader humanities audience. In many ways, this might be the result of scholars attempting to legitimize their digital work by appealing to the traditional values (and forms) of the non-digital humanities. That is, they foreground narrative and research results, and minimize the new kinds of methodologies that reach beyond a highly specialized audience. But how can digital historians expect others to take their new methodologies seriously when new ways of working with data (even those that do not involve sophisticated mathematics) remain too much like an impenetrable and mysterious black box? The processes for working with vast, easily accessible, and diverse sets of data suggest a need for historians to formulate, articulate, and propagate ideas about how data should be approached in historical research.
Towards a Hermeneutics of Data
What does it mean to “use” data in historical work? Obviously historians have been using and writing about data for well over a century. But having vastly greater quantities of data and tools for exploring it means that “using” data means something very different now than it has previously. The rapid rate of data production and technological change means that we must continue to teach each other how we are using and making sense of data.
We should be clear about what using data does not imply. For one, it does not refer only to historical analysis via complex statistical methods to create knowledge. Even as data becomes more readily available and as historians begin to acquire data manipulation skills as part of their training, rigorous mathematics is not necessarily essential for using data efficiently and effectively. In particular, work with data can be playful and exploratory and deliberately without the mathematical rigor that social scientists must use to support their epistemological claims. Using data in this way is fundamentally different from using data to quantify, compute, and create knowledge in the manner of quantitative history.
Similarly, historians need not treat and interpret data only for rigorous hypothesis testing. This is another crucial difference between our approach and the approaches of the cliometricians of the 1960s and 1970s.3 Perhaps such talk of numbers left a bad taste in the mouths of non-numerical historians because of an embrace of the cultural turn, the importance of subjectivity, and a general epistemological stance against the kind of positivism that underpins much of the hypothesis testing baked into the design of statistical procedures or analytical software.
In other words, data does not always have to be used as evidence; it can simply be used to discover and frame research questions. It took many months of work and considerable resources for Cronon and Bushman to gather primary sources, extract relevant data, and create useful maps and tables. Such a substantial investment of time and energy is only feasible when one has confidence that the data will be useful to illustrate an insight gained elsewhere. It is unlikely anyone would undertake such work if uncertain that the data would yield interesting results. In contrast, as more and more historical data is provided via, or can be viewed in, tools like Google’s N-gram viewer (to take a simple example), playing with data, in all its formats and forms, is more important than ever.
As the investment of time and energy to acquire data decreases, rapidly working with data can now be a part of historians’ early development of a research question as opposed to being used simply to strengthen an argument. It can also reveal lines of scholarly research that are potentially interesting but ultimately dead ends. These ‘negative results’ (for lack of a better phrase) should not be discarded as they likely would be for a typical scholarly book or journal article. Again, using large amounts of data for research should not be considered to be in opposition to more traditional use of historical sources. As historical data becomes more ubiquitous, humanists will find it useful to pivot between distant and close readings. More often than not, the distant reading will involve creative and reusable techniques to re-imagine and represent the past, at least more so than do uses of more traditional humanist texts. For this very reason, it becomes insufficient to simply write about the work in traditional forms that minimize methodology.
Furthermore, datasets (like the National Archives’ Access to Archival Databases) and interfaces to data (like Google Fusion Tables) are making it easier than ever for historians to pursue a combinatorial approach, mixing different kinds of data sets, and thus provide an exciting new way to triangulate our historical knowledge. Stephen Ramsay has suggested that there is a new kind of role for searching to play in the hermeneutic process of understanding, especially in the value of ‘screwing around’ and embracing the serendipitous discovery that our recent abundance of data makes possible.4 This could result, for example, in noticing within the context of London’s central criminal court, the Old Bailey, that trials about poisoning tend to refer to coffee more than to other beverages, and very rarely to food.5 Thus, our methodologies might not be as deliberate or as linear as they have been in the past. And this means we need more explicit and careful (perhaps even more playful) ways of writing about them.
Methodology in Writing
Despite some recent methodological experimentation with data, historians have not been nearly as innovative in terms of writing about it. Even as scholars (at least in certain fields) have embraced communication with new media, historical writing has been largely confined to linear narratives, usually in the form of journal articles and monographs. The insistence on creating a narrative in static form, even if online, is particularly troubling, especially as data becomes more important for historical research, because it obscures the methods for discovery that underlie the hermeneutic research process.
Regardless of form, we need history writing that explicates the research process as much as the research conclusions. We need history writing that interfaces with, explains, and makes accessible the data that historians use. We need history writing that will foreground the new historical methods for manipulating the text and data coming online, including data queries and manipulation, and the production and interpretation of visualizations. There is no question that humanists can be, and in fact are trained to be, skeptical of data manipulation. This is perhaps the preeminent reason why the methodology needs to be, at least for now, clearly explained. With new digital tools, we are still groping to understand how to identify the best methods for very messy circumstances of data.
The reasons why many historians remain skeptical about data are not all that different from the reasons they can be skeptical about text. Historians have long reflected on the theoretical advantages and practical limitations of various methodologies and approaches to research. Critical theorists and historians alike have commented on the slippery notion of a text; some excellent theoretical work on cybertext and hypertext has muddied the waters further. The last few years have complicated such a notion even more as many traditional texts have come to be seen as data that can be quickly searched, manipulated, viewed from a variety of perspectives, and combined with other data to create entirely new research corpora. It is clear that a new relationship between text and data has begun to unfold.6 This relationship must inform our approach to writing as well as research.
One way of reducing hostility to data and its manipulation is to lay bare the technical work that produces what we might call “datatext,” a convenient (although perhaps unnecessary) neologism to describe a new object of analysis that requires a new rhetoric and aesthetics for describing it. Methodological tutorials, for example, would not only help legitimate the knowledge claims that rely on such methods, but also make the methodology more accessible to anyone who might recognize that the same or a slightly modified approach could be of value in their own work. Beyond explicit tutorials, there are several key advantages in foregrounding our work with data: 1) It allows others to follow up and verify our claims; 2) It is instructive as part of teaching and exposing historical research practices; 3) It allows us to keep pace with changing tools and ways of using them. Furthermore, openness has long been part of the ethos of the humanities, and humanists continually argue that we should embrace more public modes of writing and thinking as a way to challenge the kind of work that scholars do. For example, Dave Perry in his blog post “Be Online or Be Irrelevant” suggests the potential that blogging has to create “a digital humanism which takes down those walls and claims a new space for scholarship and public intellectualism.”7 This cannot happen unless our methodologies with data remain transparent.
Case Study: Becoming Users and Communities of Data
Our theoretical and prescriptive remarks thus far will benefit from a concrete example: in this case, one that explores the history of the user. The notion of the user has become ubiquitous. We live in an era of usernames, user experiences, and user-centered design; we tacitly sign end-user license agreements when we install software; we read user guides to figure out how to get our software to do what we want. But our omnipresent conception of ourselves as users obscures the history of the term.
As previously discussed, it now takes only seconds to take this seed of inquiry (the history of a term) and see the relationship between the presence of that term and any other similar terms in Google’s N-gram viewer.
Figure 1: user, producer, consumer, customer from the Google Ngram viewer
Needless to say, this is not historical evidence of sufficient (if any) rigor to support historical knowledge claims. For one, Google’s data is proprietary, and exactly what is or is not included in it is unclear. Perhaps more importantly, this graph does not indicate anything interesting about why the term “user” spiked as it did, which is the real question that historians want to answer. But these are not reasons to discard the tool or to avoid writing about it. Historians might well start framing research questions this way, with quick uses of the N-gram viewer or other tools. Conventionally, this work would remain invisible, and only “real” data would appear in published work, and then only in support of an argument of influence or causation. But foregrounding such preliminary work will help readers to understand the genesis of the question, flag any possible framework errors, identify any category mistakes, and perhaps inspire them to think about how to apply such techniques in their own work.
To investigate the user in more detail, one can use other online corpora to generate a series of radically different interpretive views. For example, searching in the Time Magazine Corpus allows one to see all of the collocates (words that appear within a specified number of words from the search term) and display counts by decade.
Figure 2: What are users using?
Column B in Figure 2 lists most words within four words of “user”. For example, “drug” appears within four words of “user” 32 times. To better make sense of these results, the collocates were coded into two categories (column A): those that have to do with drugs and those that have to do with technology. There are a few at the bottom that remain uncategorized, but which most likely would be considered technology uses of the term. To draw attention to the patterns in the data, cells in the sheet with two hits have been coded dark green; those with more than two hits are marked separately.
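The collocate counts and the two-category coding described above can be reproduced on any small set of dated texts. What follows is only a minimal sketch in Python, not the Time Magazine Corpus interface itself: the input shape of `(year, text)` pairs, the function names, the `CATEGORIES` table, and the sample sentences are all assumptions invented for illustration.

```python
import re
from collections import Counter, defaultdict


def collocates_by_decade(docs, term, window=4):
    """Count words within `window` tokens of `term`, grouped by decade.

    `docs` is an iterable of (year, text) pairs. That shape is an
    assumption made for this sketch, not the format of a real corpus.
    Only exact matches count, so "users" is not treated as "user".
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        tokens = re.findall(r"[a-z']+", text.lower())
        decade = (year // 10) * 10
        for i, token in enumerate(tokens):
            if token != term:
                continue
            start = max(0, i - window)
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts[decade].update(neighbors)
    return counts


# Hypothetical hand-coding of collocates, in the spirit of column A.
CATEGORIES = {
    "drug": "drugs",
    "heroin": "drugs",
    "telephone": "technology",
    "computer": "technology",
}


def category_totals(counts):
    """Sum collocate counts per decade under each hand-coded category."""
    totals = defaultdict(Counter)
    for decade, counter in counts.items():
        for word, n in counter.items():
            totals[decade][CATEGORIES.get(word, "uncategorized")] += n
    return totals


# Invented toy sentences, used only to show how the functions are called.
sample = [
    (1968, "The drug user was arrested."),
    (1972, "Every telephone user pays a fee; the heavy drug user does not."),
    (1985, "A computer user needs a manual."),
]
result = collocates_by_decade(sample, "user")
totals = category_totals(result)
```

Keeping the category table as plain data rather than burying it in the counting code lets a reader see, and dispute, each coding decision.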
On the whole, the chart “users” lends itself to some quick observations. As far as the Time Magazine corpus suggests, the growth around the term “user” happened for both drugs and technology around the same time. The first technology term to appear is telephony, which perhaps suggests that the rise of the “user” may have less to do with the rise of computing (our typical conception of it now) than the rise of networks.
But going beyond the data, that is, making sense of it, can be facilitated by additional expertise in ways that our usually much more naturally circumscribed historical data has generally not required. Owens blogged about this research while it was in progress, describing what he was interested in, how he got his data, and how he was working with it. Over the next week the post was viewed over two hundred times; twenty-two researchers and librarians tweeted about the post; some left comments. For example, Rob Townsend, Research Director for the American Historical Association, commented that the post was “a fantastic use (and contextualization) of Google n-gram data.”
Most importantly, Owens received several substantive comments from scholars and researchers. These ranged from encouragement to explore technical guides, to advice to learn from scholarship on the notion of the reader in the context of the history of the book, to suggestions for different prepositions that could further elucidate semantic relationships about “users.” This discussion resulted from Owens having foregrounded his initial forays into data online, where it was easy to give different views of his data. Sharing preliminary representations of data, providing some preliminary interpretations of them, and inviting others to consider how best to make sense of the data at hand quickly sparked a substantive scholarly conversation. This is not to say we should expect everyone to help with our own research, but that because the raw data now so easily available ranges so widely over typical disciplinary boundaries, a community approach is even more essential. And it benefits everyone involved, as the discussants are able to learn about data and methodologies that they can apply in their own work.
In addition to accelerating research, foregrounding methodology and (access to) data gives rise to a constellation of questions that are becoming increasingly relevant for historians. How far, for example, can expressions of data like Google’s N-gram viewer be used in historical work? Although a chart cannot be used as evidence, it certainly can be used to identify curious phenomena that are unlikely to be artifacts of the data or viewer alone. How does one cite data without black-boxy mathematical reductions? We do not refer to how one ought to format the reference in print, but how one can bring the data itself into the realm of scholarly discourse. How does one show, for example, that references to “sinful” in the nineteenth century appear predominantly in sermons and other exegetical literature in the early part of the century, but become overshadowed by more secular references later in the century? Typically, this would be illustrated with pithy, anecdotal examples taken to be representative of the phenomenon. But does this adequately represent the research methodology? Does it allow anyone to investigate for themselves? Or learn from the methodology?
Far better would be to explain the steps used to collect and reformat the data, and, ideally, to make the data available for download. The plain text file that has been organized to make the above linguistic shift in “sinful” easy to detect would be very useful for other researchers, who in turn will certainly make other observations and draw new conclusions. Exposed data allows us to approach interesting questions from multiple and interdisciplinary points of view in a way that citations to textual sources do not. Again, this is not to argue for a whole-cloth replacement of close readings and textual analysis in historical research, but rather for using data as a complement to them. As it becomes easier and easier for historians to explore and play with this kind of data, it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data (as obtained from its sources) and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?
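As one illustration of what a shareable plain text file might look like, the sketch below tallies documents that mention a term by half-century and genre, then writes the tallies as tab-separated text that others could download and re-analyze. It is a hedged example only: the record layout of `(year, genre, text)`, the genre labels, the function names, and the sample records are all hypothetical and do not reproduce any dataset discussed in this essay.

```python
import csv
import io
import re
from collections import Counter


def tally_by_period_and_genre(records, term, span=50):
    """Count documents mentioning `term`, keyed by (period, genre).

    `records` is an iterable of (year, genre, text) tuples; this layout
    is an assumption made for the sketch.
    """
    counts = Counter()
    for year, genre, text in records:
        words = re.findall(r"[a-z]+", text.lower())
        if term in words:
            start = year // span * span
            counts[(f"{start}-{start + span - 1}", genre)] += 1
    return counts


def write_counts(counts, handle):
    """Write the tallies as tab-separated text with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["period", "genre", "documents"])
    for (period, genre), n in sorted(counts.items()):
        writer.writerow([period, genre, n])


# Invented records, used only to show the shape of the output.
records = [
    (1812, "sermon", "A sinful nation repents."),
    (1820, "sermon", "The sinful heart."),
    (1875, "novel", "A sinful pleasure, she said."),
    (1880, "sermon", "Charity above all."),
]
buffer = io.StringIO()
write_counts(tally_by_period_and_genre(records, "sinful"), buffer)
shared_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file handle instead would produce the downloadable file. A flat, tab-separated layout is one reasonable choice because it opens in a spreadsheet as easily as in a script.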
Historians have never been averse to using data in their research. But historians are beginning to use data on new scales, and to combine different kinds of data that range widely over typical disciplinary boundaries. The ease and increasing presence of data, in terms of both digitized and increasingly born-digital research materials, means that, irrespective of historical field, the historian faces new methodological challenges. Approaching these materials in a context-sensitive way requires substantial amounts of time and energy devoted to how exactly we can interpret data, and how it should be incorporated into our writing. We have argued that historians should deliberately and explicitly share examples of how they are finding and manipulating data in their research with greater methodological transparency in order to promote the spirit of humanistic inquiry and interpretation.
We have also argued that working with and writing about data does not mean that historians need to take on the kinds of epistemological burdens that underpin many of the tools that statisticians or quantitative historians have developed. Much useful work with data in history requires little more than simple frequency counts, descriptive statistics, and reformatting to become essential to the historian looking for anomalies, trends, or unusual coincidences. To argue against the necessity of mathematical complexity is also to suggest that it is a mistake to treat data as self-evident or to assume that data constitutes historical argument or proof. Historians must treat data as text, which needs to be approached from multiple points of view, and as openly as possible. Working with data can be playful and exploratory, and useful techniques should be shared as readily as research discoveries. While typical history scholarship has largely kept methodology and data manipulation in the background, new approaches to writing can complement more traditional methods and venues to avoid some of their well-documented limitations, especially as they enable sharing data in a variety of forms.
Gathering data, working with and representing it, and of course writing about it, should be required of all historians in training (not just those in digital history or new media courses) to best use the new kinds of historical sources and data that have opened up new avenues of inquiry for virtually every historical specialty. Of course not all research projects will require facility with data. But just as historians learn to find, collect, organize, and make sense of the traditional sources, they also need to learn to acquire, manipulate, analyze, and represent data. Access to historical sources makes the historical record in the twenty-first century look rather different than it ever has before. Writing about history needs to evolve as well.
About the authors: Fred Gibbs is assistant professor of history at George Mason University and director of digital scholarship at the Roy Rosenzweig Center for History and New Media. Trevor Owens is a digital archivist at the Library of Congress. He also teaches digital history at American University.
1. Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331, no. 6014 (January 14, 2011): 176–182.
2. Witmore, Michael, and Jonathan Hope. “Shakespeare by the Numbers: On the Linguistic Texture of the Late Plays.” In Early Modern Tragicomedy, edited by Subha Mukherji and Raphael Lyne. D.S. Brewer, 2007.
3. For a more detailed history of cliometrics and its impact on the digital humanities see Thomas III, William G. “Computing and the Historical Imagination.” In A Companion to Digital Humanities, edited by Susan Schreibman, Raymond George Siemens, and John M. Unsworth. Wiley-Blackwell, 2004.
4. Ramsay, Steve. “The Hermeneutics of Screwing Around; or What You Do with a Million Books.” http://www.playingwithhistory.com/wp-content/uploads/2010/04/hermeneutics.pdf.
5. Gibbs, Frederick. “Beware the Coffee.” http://criminalintent.org/2011/03/beware-the-coffee/.
6. Flanders, Julia. “Data and Wisdom: Electronic Editing and the Quantification of Knowledge.” Literary and Linguistic Computing 24, no. 1 (April 1, 2009): 53–62.
7. Perry, Dave. “Be Online or Be Irrelevant: Thoughts on Emerging Media and Higher Education.” AcademHack, January 11, 2010. http://academhack.outsidethetext.com/home/2010/be-online-or-be-irrelevant/.