Hermeneutics of Data and Historical Writing (Gibbs & Owens) Fall 2011
The Hermeneutics of Data and Historical Writing (Fall 2011 version)
Introduction
The ongoing digitization of primary sources and the proliferation of born-digital documents are making it easier for historians to engage with vast amounts of research material. As a result, historical scholarship increasingly depends on our interactions with data, from battling the hidden algorithms of Google Book Search to text mining a hand-curated set of full-text documents. However, no standard protocols or procedures for either interacting with or writing about data have emerged. This chapter discusses some new ways in which historians might rethink the nature of historical writing as both a product and a process of understanding.
We argue that the new methods used to explore and interpret historical data require a new kind of methodological transparency in history writing. Examples include discussions of data queries, reformatting techniques, workflows with particular tools, or the production and interpretation of data visualizations. At a minimum, historians need to de-emphasize the traditional historical narrative in favor of explicating the process of interfacing with, exploring, and then making sense of historical sources in a fundamentally digital form: that is, the hermeneutics of data. We call upon historians to publicly experiment with ways of writing about their methodologies, procedures, and experiences with historical data as a kind of text, as they engage in a cyclical process of contextualization and interpretation. This essay hopes to encourage more dialog about why historical writing must foreground methodological transparency and free itself from the epistemological jitters that make many historians wary of moving away from close readings or using large amounts of data.
Data in History
Use of data in the humanities has recently attracted considerable attention, and no project more so than Culturomics, a quantitative study of culture using Google Books.1 Of course the idea of using data for historical research is hardly new, whether in the context of quantitative history, early work from the Annales school, or work done under the rubric of “humanities computing.” Even if techniques for data interpretation and manipulation are not considered routine training for historians, their use is a hallmark of many award-winning history books. Two relatively recent Bancroft Prize-winning books, Richard L. Bushman’s The Refinement of America: Persons, Houses, Cities (Alfred A. Knopf, 1992) and William Cronon’s Nature’s Metropolis: Chicago and the Great West (W.W. Norton, 1991), spend a considerable amount of space interpreting historical data.
In Nature’s Metropolis, Cronon explores the relationships of debt and commodities sales, providing maps of the commercial relationships around these commodities to illustrate the relationship between Chicago and its hinterland. The data illustrates Cronon’s point well, even when his maps become artifacts that invite their own historical interpretation and analysis. Similarly, in The Refinement of America, Bushman draws from Delaware estate inventories from the 1770s to 1840s to show increased use of articles of refinement. Yet such articles were not necessarily luxurious; some of the “carpets” were described as little more than rag rugs. Going beyond the data, Bushman argues that their inclusion suggests less about rugs per se and more about a change in sensibility about bringing dirt inside.
Although data is central to these excellent studies, it is important to note how data was used. In both cases, data is presented principally as complementary to a narrative argument. It is obviously part of the story, but almost as a footnote, albeit a nicely illustrated one. Also, the sheer quantity of data did not threaten to bottleneck the project. That is, the methodologies used to investigate the data did not need to handle an extraordinary amount of it, nor did the data require extensive manipulation to be usable. To point out that these two excellent studies relied on a humanly manageable number of sources and amount of data is certainly no criticism. But it stands in contrast to the situation that many historians will find themselves in as data becomes more easily findable and usable, and perhaps as they become obliged to use and represent vast quantities of disparate kinds of data.
Other scholars who work within the domain of the so-called digital humanities have begun to think and write more explicitly about data and its potential for new kinds of research. For example, some Shakespeare scholars recently used statistical procedures to test historical and categorical hypotheses about sets of materials and detect nuances in change over time.2 The Stanford Literary Lab has provided a research locus for rethinking the nature of genre. Yet most projects, including these, continue to be largely confirmatory, like reinforcing the periodization of Shakespeare’s works or confirming the codified family of literary genres. Again, this is not a criticism of these projects and their outcomes. They are in fact a crucial step forward. As humanists continue to prove that data manipulation and machine learning can confirm existing knowledge, these methods come closer to telling us something we don’t already know. Other large-scale research projects, like several funded through the Digging Into Data initiative, have begun to explore the transformative potential of data in humanities research as well.
However, even these projects focus on research (or research potential) rather than making their methodology accessible to a broader humanities audience. In many ways, this might be the result of scholars attempting to legitimize their digital work by appealing to the traditional values (and forms) of the non-digital humanities. That is, they foreground narrative and research results, and minimize the new kinds of methodologies that reach beyond a highly specialized audience. But how can digital historians expect others to take their new methodologies seriously when new ways of working with data (even when not with sophisticated mathematics) remain too much like an impenetrable and mysterious black box? The processes for working with the vast amounts of easily accessible and diverse large sets of data suggest a need for historians to formulate, articulate, and propagate ideas about how data should be approached in historical research.
Towards a Hermeneutics of Data
What does it mean to “use” data in historical work? Obviously historians have been using and writing about data for well over a century. But having vastly greater quantities of data and tools for exploring it means that “using” data means something very different now than it has previously. The rapid rate of data production and technological change means that we must continue to teach each other how we are using and making sense of data.
We should be clear about what using data does not imply. For one, it does not refer only to historical analysis via complex statistical methods to create knowledge. Even as data becomes more readily available and as historians begin to acquire data manipulation skills as part of their training, rigorous mathematics is not necessarily essential for using data efficiently and effectively. In particular, work with data can be playful and exploratory and deliberately without the mathematical rigor that social scientists must use to support their epistemological claims. Using data in this way is fundamentally different from using data for quantifying and computing and creating knowledge as per quantitative history.
Similarly, historians need not treat and interpret data only for rigorous hypothesis testing. This is another crucial difference between our approach and the approaches of the cliometricians of the 1960s and 70s.3 Perhaps such talk of numbers left a bad taste in the mouths of non-numerical historians because of an embrace of the cultural turn, the importance of subjectivity, and a general epistemological stance against the kind of positivism that underpins much of the hypothesis testing baked into the design of statistical procedures or analytical software.
In other words, data does not always have to be used as evidence, but can be used simply for discovering and framing research questions. It took many months of work and considerable resources for Cronon and Bushman to gather primary sources, extract relevant data, and create useful maps and tables. Such a substantial investment of time and energy is only feasible when one has confidence that the data will be useful to illustrate an insight gained elsewhere. It is unlikely anyone would undertake such work if uncertain that the data would yield interesting results. In contrast, as more and more historical data is provided via, or can be viewed in, tools like Google’s N-gram viewer (to take a simple example), playing with data, in all its formats and forms, is more important than ever.
As the investment of time and energy to acquire data decreases, rapidly working with data can now be a part of historians’ early development of a research question as opposed to being used simply to strengthen an argument. It can also illustrate potentially interesting but ultimately dead ends of scholarly research. These ‘negative results’ (for lack of a better phrase) should not be discarded as they likely would be for a typical scholarly book or journal article. Again, using large amounts of data for research should not be considered to be in opposition to more traditional use of historical sources. As historical data become more ubiquitous, humanists will find it useful to pivot between distant and close readings. More often than not, the distant reading will involve creative and reusable techniques to re-imagine and represent the past, at least more so than does the use of more traditional humanist texts. For this very reason, it becomes insufficient to simply write about the work in traditional forms that minimize methodology.
Furthermore, datasets (like the National Archives’ Access to Archival Databases) and interfaces to data (like Google Fusion Tables) are making it easier than ever for historians to pursue a combinatorial approach (mixing different kinds of data sets) and thus provide an exciting new way to triangulate our historical knowledge. Stephen Ramsay has suggested that there is a new kind of role for searching to play in the hermeneutic process of understanding, especially in the value of ‘screwing around’ and embracing the serendipitous discovery that our recent abundance of data makes possible.4 This could result, for example, in noticing within the context of London’s central criminal court, the Old Bailey, that trials about poisoning tend to refer to coffee more than to other beverages, and very rarely to food.5 Thus, our methodologies might not be as deliberate or as linear as they have been in the past. And this means we need more explicit and careful (perhaps even more playful) ways of writing about them.
Methodology in Writing
Despite some recent methodological experimentation with data, historians have not been nearly as innovative in terms of writing about it. Even as scholars (at least in certain fields) have embraced communication with new media, historical writing has been largely confined by linear narratives, usually in the form of journal articles and monographs. The insistence on creating a narrative in static form, even if online, is particularly troubling, especially as data becomes more important for historical research, because it obscures the methods for discovery that underlie the hermeneutic research process.
Historical work has needed to tell a good story, and methodology has not made for a very good story or the kind of historical writing that is likely to be published in traditional venues. Relatively simple text searches or charts that aid in our historical analysis are perhaps not worth including in a book, but our searches and work with data have grown increasingly complex, as has the data available to us. While these can present new perspectives on the past, they can only do so to the extent that other historians feel comfortable with the methodologies that are used. This means using appropriate platforms to explain our methods. Does it make sense to explain new research methods that are wholly dependent on large datasets and their manipulation and visualization in a static book that distances the reader from the tools and techniques being described? Of course the realities of the profession restrict publishing freedoms (no one has gotten tenure for a really good website version of their dissertation), but our work need not be restrained by a false dichotomy between new media and old media. We suggest that exploratory methodological work can exist online in a perfectly complementary way to more traditional publication venues, and that the symbiotic pair will make both elements the better for it.
Regardless of form, we need history writing that explicates the research process as much as the research conclusions. We need history writing that interfaces with, explains, and makes accessible the data that historians use. We need history writing that will foreground the new historical methods to manipulate text/data coming online, including data queries and manipulation, and the production and interpretation of visualizations. There is no question that humanists can be (and in fact are trained to be) skeptical of data manipulation. This is perhaps the preeminent reason why the methodology needs to be, at least for now, clearly explained. With new digital tools, we are still groping to understand how to identify the best methods for very messy circumstances of data.
The reasons why many historians remain skeptical about data are not all that different from the reasons they can be skeptical about text. Historians have long reflected on the theoretical advantages and practical limitations of various methodologies and approaches to research. Critical theorists and historians alike have commented on the slippery notion of a text; some excellent theoretical work on cybertext and hypertext has muddied the waters further. The last few years have complicated such a notion even more as many traditional texts have come to be seen as data that can be quickly searched, manipulated, viewed from a variety of perspectives, and combined with other data to create entirely new research corpora. It is clear that a new relationship between text and data has begun to unfold.6 This relationship must inform our approach to writing as well as research.
One way of reducing hostility to data and its manipulation is to lay bare the technical work that produces what we might call “datatext,” a convenient (although perhaps unnecessary) neologism to describe a new object of analysis that requires a new rhetoric and aesthetics for describing it. Methodological tutorials, for example, would not only help legitimate knowledge claims that employ them, but also make the methodology more accessible to anyone who might recognize that the same or a slightly modified approach could be of value in their own work. Beyond explicit tutorials, there are several key advantages in foregrounding our work with data: 1) It allows others to follow up and verify our claims; 2) It is instructive as part of teaching and exposing historical research practices; 3) It allows us to keep pace with changing tools and ways of using them. Furthermore, openness has long been part of the ethos of the humanities, and humanists continually argue that we should embrace more public modes of writing and thinking as a way to challenge the kind of work that scholars do. For example, Dave Perry in his blog post “Be Online or Be Irrelevant” suggests the potential that blogging has to create “a digital humanism which takes down those walls and claims a new space for scholarship and public intellectualism.”7 This cannot happen unless our methodologies with data remain transparent.
Case Study: Becoming Users and Communities of Data
Our theoretical and prescriptive remarks thus far will benefit from a concrete example: in this case, one that explores the history of the user. The notion of the user has become ubiquitous. We live in an era of usernames, user experiences, and user-centered design; we tacitly sign end-user license agreements when we install software; we read user guides to figure out how to get our software to do what we want. But our omnipresent conception of ourselves as users obscures the history of the term.
As previously discussed, it now takes only seconds to take this seed of inquiry (the history of a term) and see the relationship between the presence of that term and any other similar terms in Google’s N-gram viewer.
[Figure 1: Google N-gram viewer graph for the term “user”]
Needless to say, this is not historical evidence of sufficient (if any) rigor to support historical knowledge claims. For one, Google’s data is proprietary and exactly what is or is not included in it is unclear. Perhaps more importantly, this graph does not indicate anything interesting about why the term “user” spiked as it did, which is the real question that historians want to answer. But these are not reasons to discard the tool or to avoid writing about it. Historians might well start framing research questions this way, with quick uses of the N-gram viewer or other tools. Conventionally, this work would remain invisible, and “real” data would appear in published work only in support of an argument of influence or causation. But foregrounding such preliminary work will help readers to understand the genesis of the question, flag any possible framework errors, identify any category mistakes, and perhaps inspire them to think about how to apply such techniques in their own work.
To investigate the user in more detail, one can use other online corpora to generate a series of radically different interpretive views. For example, searching in the Time Magazine Corpus allows one to see all of the collocates (words that appear within a specified number of words from the search term) and display counts by decade.
[Figure 2: Collocates of “user” in the Time Magazine Corpus, coded by category, with counts by decade]
Column B in Figure 2 lists most words within four words of “user.” It shows, for example, that “drug” appears within four words of “user” 32 times. To better make sense of these results, the collocates were coded into two categories (column A): those that have to do with drugs and those that have to do with technology. There are a few at the bottom that remain uncategorized, but which most likely would be considered technology uses of the term. To draw attention to the patterns in the data, cells in the sheet with two hits have been coded dark green; those with more than two hits.
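For readers who want to try this kind of collocate count on their own plain-text sources, the following is a minimal sketch in Python. It is not the Time Magazine Corpus interface and not the spreadsheet workflow described above: the function names, the sample documents, and the hand-coded category list are all hypothetical, invented only to illustrate counting words within four tokens of “user” and grouping the counts by decade.

```python
import re
from collections import Counter, defaultdict


def collocates_by_decade(docs, target="user", window=4):
    """Count words within `window` tokens of `target`, grouped by decade.

    `docs` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        decade = (year // 10) * 10
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts[decade].update(w for w in neighbors if w != target)
    return counts


def category_totals(table, categories):
    """Sum collocate counts per hand-assigned category, per decade."""
    totals = defaultdict(Counter)
    for decade, counter in table.items():
        for word, n in counter.items():
            if word in categories:
                totals[decade][categories[word]] += n
    return totals


# Hypothetical toy documents and category coding, for illustration only.
SAMPLE_DOCS = [
    (1968, "The drug user was arrested downtown."),
    (1972, "Each telephone user pays a monthly fee."),
    (1984, "The computer user typed a command. A drug user testified."),
]
CATEGORIES = {"drug": "drugs", "telephone": "technology", "computer": "technology"}

table = collocates_by_decade(SAMPLE_DOCS)
for decade in sorted(table):
    print(decade, table[decade].most_common(3))
```

The category coding is deliberately a plain dictionary maintained by hand, mirroring the interpretive step of sorting collocates into “drugs” and “technology”; the counting is mechanical, but the categories remain a historian's judgment.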
On the whole, the chart “users” lends itself to some quick observations. As far as the Time Magazine corpus suggests, the growth around the term “user” happened for both drugs and technology around the same time. The first technology term to appear is telephony, which perhaps suggests that the rise of the “user” may have less to do with the rise of computing (our typical conception of it now) than the rise of networks.
But going beyond the data (making sense of it) can be facilitated by additional expertise in ways that our usually much more naturally circumscribed historical data has generally not required. Owens blogged about this research while it was in progress, describing what he was interested in, how he got his data, and how he was working with it. Over the next week the post was viewed over two hundred times; twenty-two researchers and librarians tweeted about the post; some left comments. For example, Rob Townsend, Research Director for the American Historical Association, commented that the post was “a fantastic use (and contextualization) of Google n-gram data.”
Most importantly, Owens received several substantive comments from scholars and researchers. These ranged from encouragement to explore technical guides, to suggestions to learn from scholarship on the notion of the reader in the context of the history of the book, to proposals for different prepositions that could further elucidate semantic relationships about “users.” This discussion resulted from Owens having foregrounded his initial forays into data online, where it was easy to give different views of his data. Sharing preliminary representations of data, providing some preliminary interpretations of them, and inviting others to consider how best to make sense of the data at hand quickly sparked a substantive scholarly conversation. This is not to say we should expect everyone to help with our own research, but because the raw data now so easily available ranges so widely over typical disciplinary boundaries, a community approach is even more essential. And it benefits everyone involved, as the discussants are able to learn about data and methodologies that they can apply in their own work.
In addition to accelerating research, foregrounding methodology and (access to) data gives rise to a constellation of questions that are becoming increasingly relevant for historians. How far, for example, can expressions of data like Google’s N-gram viewer be used in historical work? Although a chart cannot be used as evidence, it certainly can be used to identify curious phenomena that are unlikely to be artifacts of the data or viewer alone. How does one cite data without black-boxy mathematical reductions? We do not refer to how one ought to format the reference in print, but how one can bring the data itself into the realm of scholarly discourse. How does one show, for example, that references to “sinful” in the nineteenth century appear predominantly in sermon and other exegetical literature in the early part of the century, but become overshadowed by more secular references later in the century? Typically, this would be illustrated with pithy, anecdotal examples taken to be representative of the phenomenon. But does this adequately represent the research methodology? Does it allow anyone to investigate for themselves? Or learn from the methodology?
Far better would be to explain the steps used to collect and reformat the data, and, ideally, to make the data available for download. The plain text file that has been organized to make the above linguistic shift in “sinful” easy to detect would be very useful for other researchers, who in turn will certainly make other observations and draw new conclusions. Exposed data allows us to approach interesting questions from multiple and interdisciplinary points of view in the way that citations to textual sources do not. Again, this is not to argue for a whole-cloth replacement of close readings and textual analysis in historical research, but rather for using data as a complement to them. As it becomes easier and easier for historians to explore and play with this kind of data, it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data (as obtained from its sources) and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?
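As one hedged illustration of what “explaining the steps” and sharing a reformatted plain text file might look like, here is a short Python sketch. The records, field names, and fifty-year grouping are hypothetical, invented for this example; they are not the data behind the “sinful” observation above. The sketch writes search hits to a tab-separated text that others could download, then tallies genres per half-century so the shift from sermons to secular sources would be easy to see.

```python
import csv
import io
from collections import Counter

# Hypothetical records (year, genre, snippet), invented for illustration.
HITS = [
    (1812, "sermon", "the sinful heart of man"),
    (1825, "sermon", "a sinful and perishing world"),
    (1871, "novel", "a sinful amount of cake"),
    (1888, "newspaper", "sinful waste of public money"),
]


def to_tsv(hits):
    """Return the hits as tab-separated text, sorted by year, with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["year", "genre", "snippet"])
    writer.writerows(sorted(hits))
    return buffer.getvalue()


def genre_counts_by_half_century(hits):
    """Tally genres separately for each fifty-year span (1800, 1850, ...)."""
    tallies = {}
    for year, genre, _snippet in hits:
        span = (year // 50) * 50
        tallies.setdefault(span, Counter())[genre] += 1
    return tallies


print(to_tsv(HITS))
for span, tally in sorted(genre_counts_by_half_century(HITS).items()):
    print(span, dict(tally))
```

Publishing both the reformatted file and the few lines that produced it lets a reader rerun the tally, change the grouping, or recode the genres, which is the kind of verification a prose citation alone does not allow.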
Conclusions
There has been no aversion to using data for historical research. But historians are beginning to use data on new scales, and to combine different kinds of data that range widely over typical disciplinary boundaries. The ease and increasing presence of data, in terms of both digitized and increasingly born-digital research materials, means that, irrespective of historical field, the historian faces new methodological challenges. Approaching these materials in a context-sensitive way requires substantial amounts of time and energy devoted to how exactly we can interpret data, and how it should be incorporated into our writing. We have argued that historians should deliberately and explicitly share examples of how they are finding and manipulating data in their research, with greater methodological transparency, in order to promote the spirit of humanistic inquiry and interpretation.
We have also argued that working with and writing about data does not mean that historians need to take on the kinds of epistemological burdens that underpin many of the tools that statisticians or quantitative historians have developed. Much useful work with data in history requires little more than simple frequency counts, descriptive statistics, and reformatting to become essential to the historian looking for anomalies, trends, or unusual coincidences. To argue against the necessity of mathematical complexity is also to suggest that it is a mistake to treat data as self-evident or to assume that data constitutes historical argument or proof. Historians must treat data as text, which needs to be approached from multiple points of view, and as openly as possible. Working with data can be playful and exploratory, and useful techniques should be shared as readily as research discoveries. While typical history scholarship has largely kept methodology and data manipulation in the background, new approaches to writing can complement more traditional methods and venues to avoid some of their well-documented limitations, especially as they enable sharing data in a variety of forms.
Gathering data, working with and representing it, and of course writing about it should be required of all historians in training (not just those in digital history or new media courses) to best use the new kinds of historical sources/data that have opened up new avenues of inquiry for virtually every historical specialty. Of course not all research projects will require facility with data. But just as historians learn to find, collect, organize, and make sense of the traditional sources, they also need to learn to acquire, manipulate, analyze, and represent data. Access to historical sources makes the historical record in the twenty-first century look rather different than it ever has before. Writing about history needs to evolve as well.
About the authors: Fred Gibbs is assistant professor of history at George Mason University and director of digital scholarship at the Roy Rosenzweig Center for History and New Media. Trevor Owens is a digital archivist at the Library of Congress. He also teaches digital history at American University.
1. Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331, no. 6014 (January 14, 2011): 176-182.
2. Witmore, Michael, and Jonathan Hope. “Shakespeare by the Numbers: On the Linguistic Texture of the Late Plays.” In Early Modern Tragicomedy, edited by Subha Mukherji and Raphael Lyne. D. S. Brewer, 2007.
3. For a more detailed history of cliometrics and its impact on the digital humanities see Thomas III, William G. “Computing and the Historical Imagination.” In A Companion to Digital Humanities, edited by Susan Schreibman, Raymond George Siemens, and John M. Unsworth. Wiley-Blackwell, 2004.
4. Ramsay, Steve. “The Hermeneutics of Screwing Around; or What You Do with a Million Books.” http://www.playingwithhistory.com/wp-content/uploads/2010/04/hermeneutics.pdf.
5. Gibbs, Frederick. “Beware the Coffee.” http://criminalintent.org/2011/03/beware-the-coffee/.
6. Flanders, Julia. “Data and Wisdom: Electronic Editing and the Quantification of Knowledge.” Literary and Linguistic Computing 24, no. 1 (April 1, 2009): 53-62.
7. Perry, Dave. “Be Online or Be Irrelevant: Thoughts on Emerging Media and Higher Education.” AcademHack, January 11, 2010. http://academhack.outsidethetext.com/home/2010/be-online-or-be-irrelevant/.
# Writing in the Digital Age Reflection
2011-10-26
Fred Gibbs and Trevor Owens take aim at an important topic, namely the ways that historians should approach writing about their methods in an age of data processing and visualizations. They argue that “new methods used to explore and interpret historical data require a new kind of methodological transparency in history writing,” one that privileges historical narrative not only as a product but also as a process. Methodological transparency includes discussing data queries, workflows with tools, and the production and interpretation of visualizations, as well as a de-emphasis on the traditional historical narrative in favor of explaining a process of inference with historical data that is principally digital.
Especially important is their call for digital humanists to foreground their methodological techniques and explain these in a way that allows broad accessibility for a wide array of scholars curious about methods for approaching data. They rightfully point out that historians likely place their digital methods in the background in order to emphasize the traditional approach to historical narrative and give their projects legitimacy. Yet when methodological explanations are ignored, digital techniques remain “an impenetrable and mysterious black box” peered into only by those with the know-how and technical wherewithal to make things happen. Like scholars in most other fields, historians usually explain their methodologies and frameworks in journal articles and book introductions, including the sorts of primary source material used in constructing scholarly conclusions. Digital methods are different in that they open a much broader swath of historical data to analysis and make such analysis relatively fast.
Also important is their explanation of what data is and is not. Perhaps drawing on Jerome McGann and Johanna Drucker, the authors describe the process of working with data as “playful” and “exploratory.” Data does not need to reach solid conclusions through statistical methods. Simply doing basic queries in Google ngrams can raise new questions about topics, although the data present in the graph would not constitute a solid form of historical evidence. Nor does data need to take on a sort of scientific quality that relies on rigorous hypothesis testing, a crucial difference, they remind readers, from cliometrics. Tied to both of these realizations is their insistence that data is not only evidence, but also a broader framework for discovering research questions. The example provided of Owens’s querying the rise of the user illustrates this very well by demonstrating how queries raised questions for his research. By digging deeper into texts he uncovered new possibilities about the term and its rising use in the later decades of the twentieth century. The tandem approach of distant reading and close reading, enabled only by digital techniques and clearly explained as a method, gave historical significance to the trends Owens was seeing.
Connected to Owens’s approach was the publicness of his techniques. His research question was immediately public, appearing as a thoughtful blog post on his website; it soon spread through Twitter and received thoughts and comments from hundreds of other scholars. In no area of print publication is such a method of publicly testing an idea possible to this extent. By making preliminary investigations and interpretations public and inviting public comment and consideration, historians have an ability to engage in substantial scholarly debates early on in the research process. The value here is a richer, more thorough discussion of an early idea that will only serve to assist Owens in thinking about his research in new ways.
The authors argue against the mathematical complexity so often found in the social sciences. They stand against any claim that data can serve as self-evident proof. Historical interpretation remains a key component of the analysis of historical data. Indeed, they suggest “datatext” as a label for treating data *as* a text. In this way, data is analyzed like any other historical source, from multiple perspectives and considerations.
Towards the end the authors write that historians do not need to take on the “epistemological burdens” of statisticians or quantitative historians. That is why this piece is so important. Precisely because historians are making inferences from large datasets without necessarily subjecting that data to mathematical rigor, we must explain not only the conclusions drawn from the data but also the process of reaching those conclusions. There is a serious disconnect between reader and writer if it remains unexplained how conclusions were reached through data visualization. Even if such tools and visualizations are built by an individual scholar for a specific research topic, an explanation of how they work and what they reveal in terms of historical questions is as important as the conclusions drawn from them.
Perhaps there is a level of statistical inference that *must* happen when working with large sets of data. We may not need the precise toolkit of statisticians (in topic modeling, for example, knowing something about the Dirichlet parameter will not likely be of much concern), but there is a danger in cherry-picked phrases or words appearing as a significant correlation when, statistically, that may not be the case. Words with multiple meanings may well require a different sort of analysis somewhere in between distant and close reading: “character” may refer to the moral qualities of an individual, but it might also refer to imaginative creatures or a jab (“he’s a real character”). In other words, historians may have to carry an “epistemological burden” of their own. Ours may not center on statistical or mathematical formulas, but it certainly will include the very process of working with data. Explaining why we chose a word, or how we arrived at a historical inference, is essential to the workings of data hermeneutics.
It may be beyond the scope of the essay, but the authors might say a little more about the training of historians in assessing, collecting, interpreting, manipulating, and disseminating data. In their conclusion they argue that gathering, working with, and representing data should be required training, no different than learning about the various frameworks and techniques used in historical writing and interpretation. But perhaps they could expand this slightly and make a fuller case as to the importance of such a change in training. Their explanation would not only foreground data hermeneutics as part of the very work we should do as historians, but also perhaps give graduate students some ideas for thought and discussion among their colleagues, advisors, and mentors. The case is still being made that digital history has value as a method. Integrating the process and product of data into the broader argument of digital humanism only helps to strengthen the case.
A note from the editors: For additional commentary on this essay, please see the page for general comments on the book.
For what it’s worth, despite the occasional niggles I’ve added below, I think this is an important piece that deserves an audience more than most other things I’ve read about digital humanities scholarship.
What troubles me about this article is its implicit assumption that “data” is a kind of thing different from “evidence.” The word evidence, incidentally, does not appear in this essay until paragraph 11.
What do the authors mean by data? How is it different from evidence? It seems to me that if an essay argues for transparency in historical method, then explaining what these two basic terms mean, and examining the assumption that data and evidence are mutually exclusive things, is foundational to the exercise.
I like this essay very much. Though I’m not someone who tends to find collocation observations or the Google Ngram particularly compelling, I do appreciate the ideas the authors have about how digital tools can change historians’ methodologies and in fact must change the ways in which we talk about them. And I particularly like the notion that learning how to play with data has the potential to open up some new lines of inquiry in our discipline.
Great article! Inspired by this and memories of maths lessons past, I definitely think we need to start ‘showing our working out’. While I agree we should start doing it now (through blogs in particular), it does make you start thinking about formats. How can we bring together these different forms of historical writing (along with the sources and data) and publish them in a meaningful way that itself encourages exploration and re-use?
I’m also wondering about how we communicate the mistakes, dead ends and dumb luck that all go into research. Even in a methodological article, the tendency will be to work towards a particular end point. Perhaps we need something that combines elements of the artist’s sketchbook with the scientist’s lab notebook. Would this sort of transparency be a bit too scary?
I really enjoyed this article.
Reading this alongside Bauer’s raises a question in my mind. Bauer writes, “you have to be careful while designing your database to ensure that you accurately model your field of study without feeding your own preconceptions back into your analysis” (Bauer paragraph 2). To what extent do you think using larger web-based databases overcomes this issue? Do you think that geographical, political, gendered or cultural biases exist within web-based databases with vast collections of certain source types and under-sampled selections of others?
Gibbs and Owens point out that historical writing has been “largely confined by linear narratives” and they argue that historians have generally hidden their methods from view. They call for historical writing that “explicates the research process” and “interfaces with, explains, and makes accessible the data that historians use.” Their main concern is that the field is transitioning, and will continue to transition, to large-data arguments and perhaps away from “humanly manageable” forms of historical research. Methodological transparency will clearly be essential in this environment. There have been many recent posts about humanities scholars in the digital age awash in data. I would like these authors to explain more about “data”: both what they mean by the term and how “data” is presently changing form and quality. We can see that Google offers a particular form of data in its N-gram viewer, but these data are quite distinct from the data that Cronon or Bushman assembled. Further work in this essay might lay out a typology for data in the digital age, and thereby explain why a hermeneutics of data might be so important now.
you’re right. we need to be clearer about our definitions of data and evidence, especially since they are two different things.
exactly right: we’re not nearly clear enough on these points, especially the notion about changing form and quality of data.
This is a very important essay on a crucial topic. To channel Jo Guldi, one of the interesting questions on the horizon is “What can historians do with sloppy answers to big questions?” I particularly like the emphasis on play, especially because I’m not sure there is a better approach for most kinds of historical data. Keeping statistical methods in mind is hardly a bad idea, but what we have is almost always too fragmented for statistical analysis — What is a statistically viable sample of inscriptions from Athens? Well, what’s your sample set? All the inscriptions there ever were, or just the ones that have survived?
Using play as your model could also force historians to be explicit about where they got their data and how they did (and did not) analyze it. If you aren’t even pretending that your methods are exhaustive, then the burden is on you to show why what you did do is of real value.
In our invitation to revise & resubmit your essay, we wrote:
We support the thoughtful public comments your essay has already received, as well as your own proposed revisions in response to them. For example, Fred Gibbs writes:
“thanks, bethany. you’re right on both counts: we need to fix the “remain” typo, and drucker should appear here. my worry was that ‘capta’ doesn’t really capture (sorry) our point here, though similar, and i didn’t want them to be conflated. i’m generally against neologisms, but wonder if the notions of text and data are too firmly established to refer to something else (or how they are to be used), even if not wholly different. probably, the idea of datatext should be expanded here and the word itself excised.”
We would encourage you to take up in the essay (as Trevor Owens has already done in a comment) Ted Underwood’s warning about the problems concomitant with expansion of scale and about the cherry-picking that historians have been practicing long before data mining became possible (but which may certainly be intensified by it). Underwood writes:
“the expansion of scale is not a trivial problem. If you sufficiently expand the volume of data you’re considering, ordinary historical intuition about the significance of an example starts to become unreliable, and you need to think statistically. But if the truth is to be told, this started to become a problem as soon as we got keyword-searchable databases. A lot of us are already, in practice, doing a kind of cherry-picking with those tools.” (paragraph 32)
We would also like to underscore Amanda Seligman’s suggestion that the relationship between data and evidence (or that of facts and evidence, in more traditional narrative-based historiography) requires elucidation here. It would highlight an important point of contact between this essay and Stefan Tanaka’s elsewhere in the volume. As Charlotte Rochez writes in a comment on paragraph 30, it would be beneficial if you were to identify and refer to connections between your claims and examples in the volume’s other essays, such as John Theibault’s on visualizations and historical arguments.
Typographical and grammatical errors need to be corrected throughout; some of these have been highlighted by public commenters but others have not, e.g. paragraph 20: “… it is now takes only seconds to…”
Please do your best to incorporate these recommendations into your revised essay. According to the word count at the bottom of the WordPress editing window, your current essay is 4,340 words. In order to meet our obligations to the Press, your final resubmission must not exceed 4,400 words.