
On first reading this title, you might ask why I use a pair of near-synonyms to describe opposing entities in a “from – to” clause. In fact, they do not carry the same implication, at least within the scope of this discussion.

Historical data is scattered all over the world, from some of the first Renaissance libraries of the 1300s to the deepest chambers of the Pharaohs’ pyramids in Egypt. Many projects have been established to collect historical information and transform it into more intuitive datasets that are useful to research scientists and historians. This work is no “piece of cake”; it is considered to be among the most complex Big Data campaigns in the world, in other words “a hard nut to crack”.

This piece gives an overview of the historical big data analytic process as a continuous series of stages outlined by the Collaborative for Historical Information and Analysis (CHIA). In particular, the last part, on data retrieval, compares the approaches taken by CHIA, by Slave Biographies: Atlantic Database Network, and by several projects that capture metadata of historical correspondence.


Image from CHIA Website [4]

Collecting information from treasure troves

Like any other kind of information, the world’s historical data is vast and cannot be collected by individuals, small teams, or even single groups of scientists. As described in the abstract “Center for Historical Information and Analysis: Big Data in History” [1], the CHIA project was conducted on a large scale, involving affiliated academic teams and researchers around the world.

The project relied on several essential activities for collecting data:

  • Exploiting the power of crowd-sourcing, which is capable of “opening the bottleneck of systematic study of human society at large scale” [1].
  • Establishing a collaborative infrastructure among social scientists to encourage the emergence of a global network of humanities and social-science researchers.
  • Peer-reviewing datasets to ensure the high quality of every dataset created and maintained.
  • Using supercomputers, state-of-the-art software, and application programming interfaces (APIs) to explore the parallels and connections between the social and natural sciences.

These activities, though they differ in many respects, all take advantage of modern technology. CHIA’s collaborative architecture is shown in more detail in the figure below.

CHIA architecture

Archiving and integrating information

Assembling a large number of datasets is not sufficient to produce global data—the data need to be merged into a single, uniform data repository

CHIA Website [4]

This observation holds for almost every big data project. The CHIA project does not simply store large quantities of collected data; it also defines and connects them in a central, interactive structure. By creating a global historical data resource, it can directly address two issues: synchronizing and relating inconsistent local datasets, and aggregating across regional and global levels, which involves additional metadata. The first issue is illustrated by “ocean sediment samples, ice cores, or dendrochronologies” that offer “centuries of information” in certain places [1]. The second is linked to “census and epidemiology datasets” covering millions of people at a regional spatial scale.
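As a rough illustration of what merging inconsistent local datasets into one uniform repository could involve, the Python sketch below maps two sources with different field names and date formats into a shared record shape, then aggregates the result by region. The `Record` type, the field names, and the sample values are all hypothetical; none of them is taken from CHIA’s actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical uniform record: where, when, how many."""
    region: str
    year: int
    count: int


# Two made-up local datasets with inconsistent field names and formats.
register_a = [
    {"region": "Region A", "year": 1755, "count": 310},
]
register_b = [
    {"place": "Region A", "date": "1756-03-12", "burials": "42"},
    {"place": "Region B", "date": "1755-06-30", "burials": "17"},
]


def from_register_a(row):
    return Record(row["region"], int(row["year"]), int(row["count"]))


def from_register_b(row):
    # Keep only the year so both sources share one time resolution.
    return Record(row["place"], int(row["date"][:4]), int(row["burials"]))


def merge(*sources):
    """Flatten (rows, converter) pairs into one list of uniform records."""
    merged = []
    for rows, convert in sources:
        merged.extend(convert(row) for row in rows)
    return merged


def totals_by_region(records):
    """Aggregate individual records up to the regional level."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec.region] += rec.count
    return dict(totals)


repository = merge((register_a, from_register_a), (register_b, from_register_b))
regional_totals = totals_by_region(repository)
```

Keeping one converter per source means a new local dataset only needs its own small mapping function; the aggregation code does not change.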

This process involved three crucial tasks:

  • Refining data, which Ruth Mostern referred to as “hovering data” [1], addresses issues of data structure and intellectual property. The latter is a key concern in the contemporary digital world.
  • Integrating data, which defines and tackles the problems of selecting appropriate standards and formal descriptions for historical datasets.
  • Evaluating data, which involves reviewing datasets and granting an imprimatur of publication as its primary value. CHIA reviewers recorded these reviews in the “Journal of World Historical Information”.

Finally, make information “eatable”

“Eatable”, in this context, means that historical data, once gathered and integrated, needs to be retrieved and presented to researchers and other parties in a suitable form. The projects below take different approaches to this step.

Never before has it been more important to the humanities to try to manage a deluge of data and turn bits of information into useful knowledge

Leach, Jim [3]

The CHIA project introduced a prototype archive and retrieval system based on “faceted search”. The idea is to let users select data along three defined dimensions: space, time, and topic. Drawing on an overarching, interactive data resource and on the Dataverse Network [5][6], the system adjusts the other dimensions whenever any one of them is modified. It then returns the results matching the search criteria and lets users explore them interactively through a visualized geographical distribution on a map. In the near future, the CHIA researchers expect to extend this functionality toward an “ultimate objective”: allowing users to envision the full breadth of a world-historical analytic system.
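To make the idea of faceted search over space, time, and topic more concrete, here is a minimal Python sketch. It is not CHIA’s implementation and does not use the Dataverse Network; the `Item` type, the catalogue entries, and the function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    title: str
    region: str  # space facet
    year: int    # time facet
    topic: str   # topic facet


# Made-up catalogue entries, for illustration only.
CATALOGUE = [
    Item("Dataset 1", "Asia", 1650, "climate"),
    Item("Dataset 2", "Asia", 1820, "population"),
    Item("Dataset 3", "Europe", 1650, "population"),
    Item("Dataset 4", "Europe", 1900, "climate"),
]


def search(items, region=None, years=None, topic=None):
    """Return items matching every facet that is set; None means 'any'."""
    hits = []
    for item in items:
        if region is not None and item.region != region:
            continue
        if years is not None and not (years[0] <= item.year <= years[1]):
            continue
        if topic is not None and item.topic != topic:
            continue
        hits.append(item)
    return hits


def facet_counts(hits):
    """Count the options still available per facet, so the other
    dimensions can be redrawn after one of them changes."""
    counts = {"region": {}, "topic": {}}
    for item in hits:
        counts["region"][item.region] = counts["region"].get(item.region, 0) + 1
        counts["topic"][item.topic] = counts["topic"].get(item.topic, 0) + 1
    return counts


hits = search(CATALOGUE, region="Asia")
remaining = facet_counts(hits)
```

After the user picks a region, `remaining` tells the interface which topics are still worth offering, which is one simple way to keep the other dimensions in step with the selection.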

By contrast, several projects focusing on the correspondence and social networks of scholars, described in “Optimized platform for capturing metadata of historical correspondences” [2], rely on a powerful indexing scheme for retrieving data. Because their work requires entering and retrieving unstructured data at lightning speed, simplicity and accuracy of the retrieved data are always prioritized (even though the two goals fundamentally conflict with each other). Indexing suits this requirement well and enables auto-completion based on external catalogs. Specifically, when users type a few letters of a scholar’s name, or a date related to a person, the program looks up all possible matches and returns them. The concept is not new from the perspective of the giant search engines (Google, Bing, Yahoo!, etc.), but it is a real breakthrough in historical research and has great potential to grow.
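The auto-completion behaviour described above can be sketched with a sorted index and a binary search. This is only one possible approach, written in Python under the assumption that the external catalog is available as a plain list of names; the `PrefixIndex` class and the short name list are hypothetical stand-ins, not the platform’s actual index.

```python
import bisect


class PrefixIndex:
    """Sorted list of lowercase keys supporting prefix lookup by binary search."""

    def __init__(self, names):
        # Store (lowercased key, original name) pairs in sorted order.
        self._entries = sorted((name.lower(), name) for name in names)
        self._keys = [key for key, _ in self._entries]

    def complete(self, prefix, limit=5):
        """Return up to `limit` names whose lowercase form starts with `prefix`."""
        p = prefix.lower()
        # All keys that start with p sit in one contiguous run beginning here.
        start = bisect.bisect_left(self._keys, p)
        results = []
        for key, name in self._entries[start:]:
            if not key.startswith(p) or len(results) >= limit:
                break
            results.append(name)
        return results


# Hypothetical stand-in for an external catalog of correspondents.
index = PrefixIndex([
    "Leibniz, Gottfried Wilhelm",
    "Leeuwenhoek, Antoni van",
    "Linnaeus, Carl",
    "Newton, Isaac",
])
suggestions = index.complete("le")
```

Because the keys are sorted once up front, each keystroke costs a binary search plus a short scan, which is what keeps suggestions fast as the catalog grows.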

Data representation took another form in “Slave Biographies: Atlantic Database Network” [3], in which the individual biographies, relationships, and contexts of African slaves were discovered and integrated from various sources, including notarial documents, police reports, church books, and censuses. Data in the Slave Biographies project is not presented as discrete raw text but is visualized in graphs such as histograms, pie and line charts and, most interestingly, complex webs of social and kinship networks. With this approach, the project has been quite successful in helping users see the broad patterns in large datasets while also identifying the small, single stories of individual slaves and their family units.
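The step from individual relationship records to a web of kinship can be sketched as a small graph problem. The Python example below assumes, purely for illustration, that relationships are stored as (person, relation, person) triples; the records, relation names, and functions are hypothetical and do not reflect the project’s real data model.

```python
from collections import defaultdict, deque

# Made-up relationship records of the form (person, relation, person).
RELATIONS = [
    ("Person A", "parent_of", "Person B"),
    ("Person A", "spouse_of", "Person C"),
    ("Person C", "parent_of", "Person B"),
    ("Person D", "sibling_of", "Person E"),
]


def build_network(relations):
    """Build an undirected adjacency map from relationship triples."""
    graph = defaultdict(set)
    for left, _relation, right in relations:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def family_units(graph):
    """Group people into connected components via breadth-first search."""
    seen, units = set(), []
    for person in sorted(graph):
        if person in seen:
            continue
        unit, queue = [], deque([person])
        seen.add(person)
        while queue:
            current = queue.popleft()
            unit.append(current)
            for neighbour in sorted(graph[current]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        units.append(sorted(unit))
    return units


network = build_network(RELATIONS)
units = family_units(network)
```

The adjacency map is what a network visualization would draw, while the connected components give a first approximation of family units that a user could then drill into.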


Images taken from the Slave Biographies website [7]


At this point, I hope the discussion above has explained how archives of data differ from raw storage, through the stages of gathering, integrating, and representing historical information. “Never before has it been more important to the humanities to try to manage a deluge of data and turn bits of information into useful knowledge” [3]. Although each of the projects above took its own path and applied different techniques to particular problems of historical data analysis, they have all taken their first steps into a new era of digital historical studies.


Abstracts from Digital Humanities 2013

[1] Center for Historical Information and Analysis: Big Data in History (http://dh2013.unl.edu/abstracts/ab-331.html)

[2] Optimized platform for capturing metadata of historical correspondences (http://dh2013.unl.edu/abstracts/ab-246.html)

[3] Slave Biographies: Atlantic Database Network (http://dh2013.unl.edu/abstracts/ab-194.html)

Additional resources

[4] CHIA’s Website (http://www.chia.pitt.edu/index.htm)

[5] The Dataverse Network Project’s Website (http://thedata.org/)

[6] World-Historical Dataverse – University of Pittsburgh (http://www.dataverse.pitt.edu/index.php)

[7] Slave Biographies’ Website (http://slavebiographies.org/main.php)