Fielding History (Bauer) Fall 2011
Fielding History: Relational Databases and Prose
It wasn’t until I started writing the introduction to my dissertation, “Revolution-Mongers: Launching the U.S. Foreign Service, 1775-1825,” that I realized how much building The Early American Foreign Service Database and its underlying open source software package, Project Quincy, influenced how I understand and explain my research. At that point the EAFSD had been live for four months, and I had been telling people that the two projects have a symbiotic relationship. My dissertation contains the stories, quirky situations, and historiographical analysis necessary to bring the past to scrutiny and life. The database provides the backdrop which showcases particular moments as quintessential or unusual, but it is also a standalone secondary resource, a separate publication in its own right. All of this was and remains true, but as I started that introduction, I became conscious of another way the two projects inform each other. As I described the nature of late eighteenth-century diplomacy (the difference between diplomats and consuls, the geopolitical realities of empire, the personal and commercial connections between Foreign Service officers), I found my description replicating the data structure I had built into the EAFSD, because that structure was the best way to get my background knowledge of my topic on paper.
When I realized this overlap I gave a little cheer, because I knew I had designed the EAFSD properly. Databases are normative statements about reality. If all data are theory-laden, then data structures are theories in and of themselves. When you design a database you are making proclamations about what matters (and, by implication, what can be safely ignored), and because relational databases are particularly constricting in how you can represent and link data, you are forced to be very explicit and systematic in your choices. This constriction has led some historians to abandon relational databases for more flexible data structures, like XML or semantic linking. Some of this rejection is fueled by the fact that databases and statistical packages were adopted by historians before the technology was sufficiently advanced to handle historical sources with the nuance they require.1 We should remember that eighty-eight hole punch cards frustrated the cliometricians themselves, as well as their readers. In my opinion, much of the reaction against relational databases is simply another symptom of the split among historians that goes back to the very beginning. As a rule of thumb, if you prefer Herodotus to Thucydides you probably want XML. It all depends on your sources and temperament. Relational databases are powerful tools, but they work best when the data you want to record and analyze consists of discrete pieces of information with clear connections between them. However, you have to be careful while designing your database to ensure that you accurately model your field of study without feeding your own preconceptions back into your analysis.
Designing a Database
Good decision support database design involves breaking the metadata description of a data set (and therefore its logical organization) into the smallest viable components and then linking those components back together to facilitate complex analysis. This process, known as normalization, helps keep the data set free of duplicates and protects the data from being unintentionally deleted or unevenly updated.2 These components are known as entities, and the links are called relationships. Each entity represents something in the “real world” which is modeled in the database. Entities contain fields, discrete pieces of data, each with a designated name and datatype (e.g., “start_year,” “integer”). Entities are sometimes referred to as tables and fields are also called attributes.3 Entities and relationships only make sense when discussed together, because they take their form from each other. Relationships connect entities, and entities are constructed based on how they relate to each other. But while the analytic power and stability of relational databases come from their basis in relational algebra, the concepts can be hard to grasp in the abstract. So, let us turn to a concrete example: The Early American Foreign Service Database.
The heart of my dissertation is concerned with tracing written information flows to and among American Foreign Service officers who served from 1775 to 1825. The database was created to help me track these flows, which are preserved in the historical record as letters. This brings up another crucial part of designing databases for historical projects: you need to think long and hard about the nature of the sources you are using and what data you need to analyze. For the network/prosopographical4 analysis I am doing, I do not want to record the full text of the letters, although I do use the database to determine which letters should be read in full. The best databases point you back to the original sources for more information. So the database structure had to begin with the information that can be extracted from a letter.
[Figure 1: the fielded data typically contained in a letter]
Figure 1 illustrates the fielded data typically contained in a letter. Letters have the names of the sender and recipient. Letter writers usually indicate where they are writing and where they want to send the letter (whether the recipient is there when the letter arrives is, of course, another issue entirely). Letters also have a number of dates associated with them. There is the date the letter was begun, the date the letter was finished (with additional dates for addenda and enclosures), and, if you are very lucky, the date when the letter was received and then another date for when it was entered into an archive. So, if we are to model the data extracted from a letter, the resulting entity might look something like the second graphic.
[Figure: the Letters entity, visualized with DAVILA]
Letters can be sent to and from individuals or organizations (two or more people acting together). They are sent to and from locations on particular dates (more on this later). Letters are given titles for when you need to cite them, and in case the same letter is sent to more than one person, you can mark it as a “circular,” with the term ‘boolean’ meaning that the field can only have the values ‘true’ or ‘false.’ The Letters entity also has the ever-useful “Notes” field for any information that does not fit nicely into one of the pre-chosen fields. Notice also how many of the fields are marked as “foreign keys.” A foreign key means that the field in question is in fact one end of a relationship with another entity.
This means that in order to accurately trace a correspondence network the database needs to have entities for Individuals, Organizations, and Locations. Everything else is specific to a particular letter, including the title, the notes field, and the dates. How you choose to record information about people, places, and groups depends on what information you think you will be able to reliably gather about most of the members of each category. You want to strike a careful balance between the uneven richness of sources and a relatively uniform standard for comparison. Just because a person or organization left behind more surviving documentation does not automatically make them more important, just easier to study.
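To make the idea of foreign keys a little more tangible, here is a minimal sketch of a Letters entity linked to Individuals and Locations. It uses Python's built-in sqlite3 module as a stand-in for a full database server, and every table name, column name, and sample row is a hypothetical placeholder rather than the actual EAFSD or Project Quincy schema. Organizations are left out for brevity; they would be one more table referenced in the same way. SQLite has no native boolean type, so the "circular" flag is stored as 0 or 1.

```python
import sqlite3

# Hypothetical schema, loosely following the entities named in the essay.
# It is NOT the real EAFSD schema.
SCHEMA = """
CREATE TABLE individuals (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE locations (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE letters (
    id                 INTEGER PRIMARY KEY,
    title              TEXT NOT NULL,
    circular           INTEGER NOT NULL DEFAULT 0 CHECK (circular IN (0, 1)),
    notes              TEXT,
    -- Each *_id column below is a foreign key: one end of a relationship.
    from_individual_id INTEGER REFERENCES individuals(id),
    to_individual_id   INTEGER REFERENCES individuals(id),
    from_location_id   INTEGER REFERENCES locations(id),
    to_location_id     INTEGER REFERENCES locations(id)
);
"""


def build_demo():
    """Create an in-memory database holding one invented letter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.executescript(SCHEMA)
    # Placeholder rows only; no real correspondence is recorded here.
    conn.executemany(
        "INSERT INTO individuals (id, name) VALUES (?, ?)",
        [(1, "Sender A"), (2, "Recipient B")],
    )
    conn.executemany(
        "INSERT INTO locations (id, name) VALUES (?, ?)",
        [(1, "City X"), (2, "City Y")],
    )
    conn.execute(
        "INSERT INTO letters (title, circular, from_individual_id,"
        " to_individual_id, from_location_id, to_location_id)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        ("Sender A to Recipient B", 0, 1, 2, 1, 2),
    )
    conn.commit()
    return conn


def correspondence(conn):
    """Follow each letter's foreign keys back to readable names."""
    return conn.execute(
        """
        SELECT l.title, s.name, r.name, o.name, d.name
        FROM letters AS l
        JOIN individuals AS s ON s.id = l.from_individual_id
        JOIN individuals AS r ON r.id = l.to_individual_id
        JOIN locations   AS o ON o.id = l.from_location_id
        JOIN locations   AS d ON d.id = l.to_location_id
        """
    ).fetchall()
```

Calling `correspondence(build_demo())` returns one row per letter with the sender, recipient, origin, and destination filled in by name. The point of the design is that a person or place is described once, in its own entity, and every letter simply points to it.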
As you are designing entities to describe other parts of the database, it is often helpful to create tables that hold subject keywords you want to use for classifying and later searching. Pre-selected keywords often work best when a clearly defined set of people are in charge of marking up the content. They are great for searching, and if indexed in a hierarchical structure, can provide semantically powerful groupings (especially for geographical information). As a historian, however, I am wary of keywords that are imposed on a text. If someone calls himself a “justice,” I balk at calling him a “judge” even if it means a more efficient search.
Of course, it all depends on your data and what you want to do with it, but my preferred solution is to have, at minimum, two layers of keywords. The bottom layer reflects the language in the text (similar to tagging), but those terms are then grouped into pre-selected types. You can fake hierarchies with tags, but it requires a far more careful attention to tag choices than I typically associate with that methodology. For example, in the EAFSD I have an entity called AssignmentTitles that contains all the titles given to U.S. Foreign Service officers by the various American governments. However, there were forty-five distinct titles used between 1775 and 1825, and without highly specialized knowledge it is difficult to understand how they related to each other. So I created another entity, AssignmentTypes, which groups those titles into three distinct types: “diplomatic,” “consular,” and “support staff,” allowing for ease of searching among similar appointments without having to remember every term for consul, or those performing consular functions, used by the Continental Congress, the Congress of the Confederation, and the State Department. It was this three-part distinction that I unconsciously replicated in the introduction to my dissertation, which made me realize the two publications were more intimately linked than I had previously understood.
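The two-layer keyword idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three type names come from the description above, but the handful of titles, their assignment to types, and the function name are examples chosen for the sketch, not the EAFSD's actual list of forty-five titles.

```python
# Top layer: the three pre-selected types named in the essay.
ASSIGNMENT_TYPES = {"diplomatic", "consular", "support staff"}

# Bottom layer: titles recorded in the language of the sources, each mapped
# to exactly one type. Illustrative sample only, not the real EAFSD data.
ASSIGNMENT_TITLES = {
    "Minister Plenipotentiary": "diplomatic",
    "Consul": "consular",
    "Vice Consul": "consular",
}


def titles_of_type(assignment_type):
    """Return every source-language title grouped under one type."""
    if assignment_type not in ASSIGNMENT_TYPES:
        raise ValueError(f"unknown assignment type: {assignment_type!r}")
    return sorted(
        title
        for title, type_name in ASSIGNMENT_TITLES.items()
        if type_name == assignment_type
    )
```

A search for `titles_of_type("consular")` then finds both "Consul" and "Vice Consul" without the searcher having to remember either term, while each record keeps the wording its source actually used.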
Modeling Time
When designing databases for historical research and teaching it is crucial to remember that these databases are works of history. One of the great challenges of digital history, but also one of our field’s most important contributions to digital humanities in general, is the careful representation of time. Our sources do not exist in some eternal present, but are bound to the past in ways that computers find hard to understand. Computers record time in ways that are simply ridiculous when you are trying to bring the past alive. Who thinks in date-time stamps? True, someone’s life can change in the blink of an eye, but fractional seconds are not helpful in recording human experiences. In fact, they impose an anachronistic, hyper-precise gloss on events that creates an unnecessary barrier to comprehension. While building the EAFSD there was a harrowing week when I could not enter dates prior to 1999, and any date field left blank reverted to today’s date. I could not concentrate on anything else while the two historical dates I had entered into the database were wrong.
Even so, relational databases have powerful tools for analyzing dates and date ranges that can be very useful for historical purposes. The trick, therefore, is to massage the strict date-time formats to hold your data in ways that are properly formatted, but also intellectually honest. Interface design is your friend in this case, because you can set a whole range of options for how you want your dates to be displayed. However, it is still important to think long and hard about how you want to record dates in the database.
How you record dates will depend on what sorts of dates your sources provide. While PostgreSQL and other relational database packages do not know how to handle dates that are not in the Gregorian calendar, with the appropriate settings they can record dates back to the fifth millennium B.C.E.5 Figure out how you want to map your dates to the Gregorian calendar, and explain that process clearly on your site and any documentation you provide. Depending on the age and completeness of your sources, you may need to record partial or fuzzy dates. Partial dates are dates that are missing pieces of information (e.g., June 1922). Fuzzy dates are date ranges (e.g., January 5-7, 1789). Neither is officially supported, but both can (with some ease) be built into the data structure. For partial dates, you can choose to enter only the data you have (month and year) and leave day as 1. Then add a series of boolean flags called “day known,” “month known,” and “year known.” Depending on which of those fields are true, the system can display the dates appropriately. This means that on average you will have a fifteen-day margin of error on any of your partial dates, but can still use all the default date calculators. For date ranges, you can have start_date and end_date fields, or the fields can be labeled “no earlier than,” and “no later than,” which is how TEI (Text Encoding Initiative) handles date ranges. Keep in mind that the more elaborate the solution, the harder it will be to extract date information. The simplest solution that can be mapped to your sources is your best bet. Once the dates are in your system, you can decide how best to display them.
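The partial-date flags and the date-range fields described above can be sketched as follows. This is a minimal illustration, not the EAFSD's implementation: the class names, field names, and display rules are assumptions made for the example, and Python's `datetime.date` stands in for a database date column.

```python
from dataclasses import dataclass
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


@dataclass(frozen=True)
class PartialDate:
    """A full date plus flags recording which parts the source supplied."""

    value: date  # unknown parts hold a placeholder, e.g. day=1
    day_known: bool = True
    month_known: bool = True
    year_known: bool = True

    def display(self) -> str:
        """Show only the parts of the date that are actually known."""
        if not self.year_known:
            return "undated"
        if not self.month_known:
            return str(self.value.year)
        month = MONTHS[self.value.month - 1]
        if not self.day_known:
            return f"{month} {self.value.year}"
        return f"{month} {self.value.day}, {self.value.year}"


@dataclass(frozen=True)
class DateRange:
    """A fuzzy date stored as 'no earlier than' and 'no later than' bounds."""

    not_before: date
    not_after: date

    def contains(self, day: date) -> bool:
        return self.not_before <= day <= self.not_after


# The essay's own examples: "June 1922" and "January 5-7, 1789".
june_1922 = PartialDate(date(1922, 6, 1), day_known=False)
early_january_1789 = DateRange(date(1789, 1, 5), date(1789, 1, 7))
```

Because `june_1922.value` is still an ordinary date, it can be sorted and subtracted like any other, while `display()` keeps the invented day out of anything a reader sees.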
Historical Prose
So, how does all of this affect the writing of history? One answer is that standalone secondary source databases are already a major form of publishing historical research. While I am not submitting the EAFSD as my dissertation, it is a publication in its own right. As more and more history finds its way online, databases will structure future research in ways that we need to be very careful and thoughtful about. Making data structures (and the theoretical decisions that underlie them) transparent through good documentation is a first step toward educating our colleagues and students about the material they are likely to find available in digital formats. There are not nearly enough digital resources for historical sources that carefully explain the reasons why the designers built their databases the way they did.6
Databases can also be used for note taking, which, as Ansley Erickson has shown, is a powerful tool for research.7 But designing databases brings a whole new set of issues to the forefront of the researcher’s mind: What are the structural similarities of my sources? What are the most important elements of the world I study? What are the key relationships between those elements? How do I need to represent time? It is my belief that investigating these questions in a systematic way deepens the historian’s understanding of their own source material and analytic framework. How that is represented in their prose (if any is generated) will depend largely on the historian and the historical subject under investigation. At a bare minimum, finding the contours of your subjects’ reality will sharpen your own understanding of what is worth including in a narrative analysis, and what is best left aggregated in the database. Earlier uses of databases by cliometricians in the 1960s and 1970s focused on large-scale analysis to discover the average experience of people in different walks of life, whether in New England townships or the U.S. Army.8 In contrast, working with a database allows me to privilege the mistakes and missed communications of individual Foreign Service Officers. I have found that one of the greatest benefits of a data structure as constricting as a relational database is its ability to place the downright weird in historical context. While I was drawn to the topic because of the Foreign Service’s ability to function despite being run entirely by amateurs who, at best, learned while doing, the database has allowed me to see where the especially interesting gaps or overcompensations occurred. By making it easier to find the overall trends, I am free to explore, without overstating, any anomalies I find in the course of my research. For those of us who work on trans-Atlantic and even global topics, that freedom can prove invaluable as we sculpt arguments from an ever expanding set of potential sources.
About the author: Jean Bauer is the Digital Humanities Librarian at Brown University. She is finishing her dissertation, “Revolution-Mongers: Creating the U.S. Foreign Service, 1775-1825,” in the Corcoran Department of History at the University of Virginia. www.jeanbauer.com
1. William G. Thomas III, “Computing and the Historical Imagination,” in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, and John Unsworth (Oxford: Blackwell, 2004), http://www.digitalhumanities.org/companion/; and Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005). For an older, but still excellent discussion of digital history (past and present) see Edward L. Ayers, “The Pasts and Futures of Digital History,” 1999, http://www.vcdh.virginia.edu/PastsFutures.html.
2. For a solid overview of relational databases, see Stephen Ramsay, “Databases,” in A Companion to Digital Humanities.
3. For more technical readings on databases and relational algebra see E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the Association for Computing Machinery 13(6) (1970): 377–87; C.J. Date, The Database Relational Model: A Retrospective Review and Analysis (Reading: Addison-Wesley, 2001); and R. Elmasri and S. Navathe, Fundamentals of Database Systems (Redwood City: Benjamin/Cummings, 2004).
4. Prosopography, or group biography, consists of investigating common characteristics of a group of people, many of whose individual lives would be difficult to trace on their own. See Lawrence Stone, “Prosopography,” Daedalus 100.1 (1971), pp. 46-71.
5. “Date conventions before the 19th century make for interesting reading, but are not consistent enough to warrant coding into a date/time handler.” This is the final line of PostgreSQL’s Documentation on DateTime Datatypes, found online at http://www.postgresql.org/docs/8.4/static/datatype-datetime.html. Lines like that make me laugh, because the only other option is crying.
6. For a tool I have developed to make this easier, see http://www.jeanbauer.com/davila.html. DAVILA is an open source relational database schema visualization and annotation tool, and it generated the image of the Letters entity seen above.
7. See Ansley Erickson’s essay in this same volume as well as this earlier version: Ansley Erickson, “Historical Research and the Problem of Categories: Reflections on 10,000 Digital Notecards,” Writing History: How Historians Research, Write, and Publish in the Digital Age, October 6, 2010, http://writinghistory.wp.trincoll.edu/2010/10/06/erickson-research/.
8. Edward M. Cook, The Fathers of the Towns: Leadership and Community Structure in 18th Century New England (Baltimore: Johns Hopkins University Press, 1976); J.C.A. Stagg, “Enlisted Men in the United States Army, 1812-1815: A Preliminary Survey,” The William and Mary Quarterly, 3rd Series, Vol. 43, No. 4 (Oct., 1986), pp. 615-645.
This essay helped bring to life a different and more sophisticated use of database technology than I was familiar with. Your descriptions of the basic architecture of the database are clear to this novice reader.
My ongoing question as I was reading was less about the technology than the substance of your dissertation and database. I would have found it helpful to know earlier, possibly interwoven with your description of your database and the decisions it required you to take, what your core research questions were: why did you want to look at correspondence networks, and why was doing so in this way valuable? This is not only to situate the example better, but to allow those of us who haven’t imagined database variants of our own work to understand the relationship between the two more fully, and thus to be better able to think about possibilities for our own work.
I also wrote a blog post about writing this essay called “Am I even qualified?: Writing about digital history”. It is more about writing DH in general, than anything particular to what I write here. If anyone is interested it can be found on my blog, “Packets:” http://packets.jeanbauer.com/2011/10/16/am-i-even-qualified-writing-about-digital-history/
This essay is important in ways that I know I do not fully comprehend. Part of its importance is that Bauer is so far ahead technologically of people like me! I sent a note about it to my collaborator indicating that this is an important essay for us but I don’t know why yet!
In the current “spatial turn” it is refreshing to hear Jean Bauer’s sage and insightful reminder that one of the most important contributions to digital humanities we can make is “the careful representation of time.” Anyone who has attempted to build a database in, say, Access and transfer it to, say, an early version of MySQL will recognize Jean Bauer’s “harrowing” moment with date-time stamps: finding out that the database cannot capture dates before 1999 (remember Y2K?) or that the early web relational database operates on “unix time” and cannot easily convert dates from the eighteenth century. Bauer’s revitalization of relational databases, however, makes great sense and provides an inspiring and thoroughly grounded case study for historians to consider. She asserts carefully that historians moved away from the RDBMS model because many web database systems, such as mSQL, MySQL, and PostgreSQL, did not initially handle time and other important historical values very well or very usefully. She’s right in this assessment, I think. But there is another dimension to the move to XML and semantic linking that she might explore. Historians face large quantities of documentary materials rather than tabular data, and some decisions to abandon RDBMS approaches and favor XML encoding came from the recognition and concern that these materials were fundamentally textual and documentary rather than numerical. In addition, the energy in digital humanities around TEI and XML suggested other reasons for making this move. Perhaps Bauer might give a more robust exploration of how historians might conceive of textual sources and new database systems (MongoDB, for example) and what epistemological considerations, if any, might be at stake in this decision.
Bauer’s section on “modeling time” is important, but I’d like to see it developed in the way that Tanaka’s essay does. This section slips too quickly into “tips” on how to encode time in RDBMS when I’d like to know more about how her project “models” or might “model” temporal relationships.
Thanks so much for the great comments. One theme that emerges is a desire for more examples. I initially struggled with deciding how much time to devote to my own research topics, but clearly I need to flesh this out more.