There is a tendency in the digital humanities to connect data from diverse sources in order to uncover underlying patterns. Some aspects are indeed only visible when looking at the bigger picture and at scale. Data from disparate sources, stored in a structured way, can help analysts extract hidden properties, which adds value to the work of humanities scholars. The problem, however, is that it is usually hard to connect data from different data-sets. Interoperability is a major issue in this field and makes it difficult for humanists to take advantage of the available data. The use of different formats and incompatible data structures hinders the ability to use the data in a meaningful way.

A good example is the work of Kalliopi Zervanou et al. [1]. Their work focuses on gathering data relating to the use of medicinal plants in the Low Countries* between 1550 and 1850. The goal was to gather data so that researchers could unveil patterns that were invisible when each source was considered separately. For instance, it was revealed that medicinal plants were used not only for medical purposes but for other reasons as well, such as public acceptance. The aggregated data not only included sources from different scientific fields (historical, etymological, pharmaceutical etc.) but also contained both structured (databases, Excel tables etc.) and unstructured (free-text transcripts, images etc.) data. This was actually the most challenging aspect of their work, because information coming from diverse sources usually has different formats. For instance, Excel tables have to be transformed into another format before they can be stored in a relational database. The obvious downside was that a lot of time and resources were spent on overcoming incompatibility. After the project was finalized, the data was released to the public. The problem is likely to propagate, since the data might be needed in future projects by other groups, who would face the same incompatibilities. It is, therefore, worthwhile for the data to be released in a format that makes its future use easier.

This is not the only project, however, that follows this direction. In the “Digitizing the Prosopography of the Roman Republic” (DPRR) project [2], the aim is to develop a database containing data on the career paths and connections of elite individuals in the ancient Roman Republic. The aristocratic elite arguably contributed greatly to the development of the empire. The project plans to draw data from numerous print and digital prosopographies** and link their content together. By exploring and connecting all this data, new properties are expected to emerge. Hopefully this will shed some light on the transformation of a city-state into a great empire. However, the project team explicitly comments on the difficulty of bringing the sources together and combining them into one structure. Handling disagreements between the sources is expected to be a challenging aspect. It is evident, then, that data interoperability is a recurring problem: it has happened before and continues to raise concerns when planning future projects.

The first steps toward a viable solution to this issue have been taken by the Canadian Writing Research Collaboratory (CWRC) [3], which has developed a model to link diverse data-sets together. They use an entity-based model to describe the data. In this context, four classes are defined: Person, Organization, Place and Title. Every entity must belong to one of these classes. Furthermore, for the purpose of consistency, entity schemas govern the creation of the entity records. The end goal is to connect data-sets via a set of linked data. The model is deliberately relaxed and minimalistic, so it can incorporate data from various sources. This way, different projects are not forced to follow the same strict specifications. Every scholar can work in their own setting, and project autonomy is preserved. Moreover, the model is compatible with the RDF standards [4], so the data can be readily translated into an open Semantic-Web data-set, which would make it easily accessible to other groups.
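
To make the idea more concrete, the following is a minimal sketch of what an entity record restricted to the four classes might look like, and how it could be written out as RDF triples in the N-Triples syntax. It is not CWRC's actual schema or code: the names `Entity`, `EntityClass`, `entity_to_ntriples` and `link_same_entity`, the `http://example.org/` URIs, and the choice of `owl:sameAs` to connect records from two data-sets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Standard RDF, RDFS and OWL property URIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical namespace for the class URIs; not CWRC's real vocabulary.
EXAMPLE_NS = "http://example.org/entity-model#"


class EntityClass(Enum):
    """The four entity classes named in the text."""
    PERSON = "Person"
    ORGANIZATION = "Organization"
    PLACE = "Place"
    TITLE = "Title"


@dataclass(frozen=True)
class Entity:
    """A minimal entity record: an identifier, a class and a label."""
    uri: str
    entity_class: EntityClass
    label: str


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entity_to_ntriples(entity: Entity) -> List[str]:
    """Return two N-Triples lines: the entity's class and its label."""
    class_uri = EXAMPLE_NS + entity.entity_class.value
    return [
        f"<{entity.uri}> <{RDF_TYPE}> <{class_uri}> .",
        f'<{entity.uri}> <{RDFS_LABEL}> "{escape_literal(entity.label)}" .',
    ]


def link_same_entity(a: Entity, b: Entity) -> str:
    """Return a triple stating that two records describe the same entity."""
    if a.entity_class is not b.entity_class:
        raise ValueError("only entities of the same class can be linked")
    return f"<{a.uri}> <{OWL_SAME_AS}> <{b.uri}> ."


if __name__ == "__main__":
    # Two hypothetical records for the same person in different projects.
    first = Entity("http://example.org/project-a/person/1",
                   EntityClass.PERSON, "Jane Doe")
    second = Entity("http://example.org/project-b/people/42",
                    EntityClass.PERSON, "Doe, Jane")
    for line in entity_to_ntriples(first) + entity_to_ntriples(second):
        print(line)
    print(link_same_entity(first, second))
```

In this sketch each project keeps its own identifiers and labels, and only the shared class vocabulary and the linking triple connect the two data-sets, which mirrors the relaxed, autonomy-preserving design described above.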

This is undoubtedly a step in the right direction. Once the obstacle of interoperability is overcome, seeking hidden properties in seemingly unrelated data will become much more straightforward. In turn, this may lead to considerable advances in many fields.


[1] Creating Time Capsules for Colonial Botanical Drugs in the Early Modern Low Countries

[2] Digitizing the Prosopography of the Roman Republic

[3] An entity-based approach to interoperability in the Canadian Writing Research Collaboratory

[4] RDF specification

* The Low Countries (low lands) are the region around the present-day Netherlands and Belgium.

** Prosopography is an investigation of the common characteristics of a historical group.