This post was originally published on 10/10/2015 as part of the Clio I class at George Mason University.
People often perceive data as unbiased evidence. As Trevor Owens points out, this is a dangerous practice. Just like any other set of primary sources, data needs to be questioned. It had an author with an agenda who made decisions. Every decision, although not necessarily malicious, has consequences. What data is collected? What data is not collected? How is it organized? How was it collected? Who collected it? All of these questions need to be considered when evaluating data. This week our assignment was to locate and clean a set of historical data. I chose to clean data about the signers of the Declaration of Independence provided by the US National Archives & Records Administration. This set of data includes the following information for the signers: Name, State Rep., Date of Birth, Birthplace, Age in 1776, Occupation, Number of Marriages, Number of Children, Date of Death, & Age at Death. Although this information looks relatively standard and innocuous, it shows evidence of some curious decisions.
The first thing that caught my attention was the information that was included. I found the decision to count the signers’ “Number of Marriages” and “Number of Children” peculiar. What does this information tell us about the signers? At most, it tells us that most of the men were, or presented themselves as, heterosexual. It does suggest that there was a high birth rate and, judging by the number of marriages, a high mortality rate for mothers. Although this is interesting, I am not sure why this data was included. Secondly, I noticed what information was not included. If this data is intended to educate the public about the signers, why are the names of wives and children absent? Why are they only numbers? What about the signers’ education, religion, and political experience? Finally, I noticed the language of the information. The “State Rep.” column is anachronistic: although the signing of this document would eventually create the United States, the signers represented colonies, not states.
This data highlights some of the challenges that historians face when converting textual information into numerical data. In the process of creating data, authors often favor content that can be quantified. The number of marriages is easier to manipulate than the names of wives. Could the frequency of female names produce just as interesting a historical argument? If this is the only data we have to work with, we will never know. This is just a minor example of how the creation of data can silence history. As Michel-Rolph Trouillot argues in Silencing the Past, silences enter historical narratives at multiple points in the process, including the moment of fact creation (the making of sources) and the moment of fact assembly (the making of archives). The omission of the wives’ names touches both of these moments.
The format of data can also cause headaches. For example, the data about the signers puts their first and last names in a single column. Thankfully, some data can be easily cleaned or reformatted to make it more useful. Many resources on the web explain how spreadsheet formulas can be used to separate, combine, count, and sort information. Using some of these formulas, I was able to clean the data easily and combine it with additional data that I considered useful. These shortcuts for cleaning data saved me about an hour of work. However, using these shortcuts on data that includes thousands or tens of thousands of entries instead of 56 would save hundreds of hours of work.
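I did this work with spreadsheet formulas, but the same kind of cleanup can be sketched in a few lines of Python. The example below is only an illustration, not the method I used: the function name `split_name` and the short list of suffixes are my own assumptions, and it assumes each name is stored as a single string with the surname last.

```python
# Hypothetical helper: split a "First Last" string into two fields.
# Assumes the surname is the final word, optionally followed by a suffix.
SUFFIXES = {"Jr.", "Sr."}  # assumed list; extend it as the data requires


def split_name(full_name):
    """Return (first, last) for a full name stored in a single column."""
    # Treat commas as spaces so "Nelson, Jr." and "Nelson Jr." behave alike.
    parts = full_name.replace(",", " ").split()
    if not parts:
        return "", ""

    # Set a trailing suffix aside so it is not mistaken for the surname.
    suffix = ""
    if len(parts) > 2 and parts[-1] in SUFFIXES:
        suffix = parts.pop()

    if len(parts) == 1:
        return parts[0], ""

    first = " ".join(parts[:-1])  # everything before the surname
    last = parts[-1]
    if suffix:
        last = f"{last} {suffix}"
    return first, last


if __name__ == "__main__":
    for name in ["John Hancock", "Richard Henry Lee", "Thomas Nelson Jr."]:
        print(split_name(name))
```

Even a small rule like this involves a decision about the data: middle names are folded into the first name, and multi-word surnames would be split incorrectly, so the results still need to be checked by hand.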