Written language has been a useful tool for communication since its development by the Sumerians around 3200 BCE. However, written language can also become a barrier to communication when you are not familiar with the alphabet, language, spelling or syntax. In the field of Digital Humanities, language presents a unique challenge to understanding and extracting meaningful information from historical documents. Not only do languages evolve and the meanings of words change over time, but languages were also not standardized historically as they are today, so spelling, syntax, and abbreviations were much more freeform in the past. It was also quite common for multiple languages to be used within the same text or document, sometimes for just a word here or there. These linguistic complexities present a challenge for digitizing and interpreting historical documents. At the recent 2014 Digital Humanities Conference held in Lausanne, Switzerland, several papers were presented on the challenges of multilinguality and evolving language within the field of Digital Humanities. This post explores this trend through the research presented at the conference.
In his long paper “Tracking Semantic Drift in Ancient Languages: the Bible as Exemplar and Test Case”, Matthew Munson looks at computational analysis tools for tracking the way that language changes over time. In particular, Munson focuses on analysing semantic drift, the way that the meanings of words change over time, in the hope that the methods could one day be applied to ancient languages for which native proficiency no longer exists. Using a statistical method called co-occurrence analysis, Munson analysed the Old and New Testaments in Greek to determine whether pairs of words formed a collocation set, that is, whether they appeared together more often than would be expected by chance. Munson was able to track how the words associated with God change from describing a “ruler who leads and makes war” to a “patron who offers and receives favors from clients”. The same method could be extended to other historical texts or languages in the future, with the intention of improving our understanding of semantic drift within language.
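To make the idea of co-occurrence analysis more concrete, here is a minimal sketch in Python. It is not Munson's actual method: it uses raw window counts instead of a statistical association measure, and the two token lists are invented English placeholders, not Biblical Greek. The function names `collocates` and `collocate_shift`, the window size and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter


def collocates(tokens, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            # Skip the target itself, including other occurrences of it.
            if j != i and tokens[j] != target:
                counts[tokens[j]] += 1
    return counts


def collocate_shift(earlier, later, target, window=2, min_count=2):
    """Return (lost, gained): collocates found only in the earlier corpus
    and collocates found only in the later corpus."""
    before = {w for w, c in collocates(earlier, target, window).items()
              if c >= min_count}
    after = {w for w, c in collocates(later, target, window).items()
             if c >= min_count}
    return before - after, after - before


# Invented toy corpora (assumption: already tokenised and lower-cased).
earlier = "lord leads army lord makes war lord leads people".split()
later = "lord offers favor lord receives favor lord offers grace".split()

lost, gained = collocate_shift(earlier, later, "lord")
print("lost:", sorted(lost))
print("gained:", sorted(gained))
```

A real study would need lemmatised text, a far larger corpus and a significance test before reading any meaning into the difference between the two sets.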
Cristina Vertan presented a workshop and a short paper on the challenges of multilinguality within historical documents. Now that important documents are more widely available online, many groups without the necessary language skills may want to consult or provide insights into these documents. In her workshop “Multilinguality in Historical Documents – Challenges and Solutions for Digital Humanities”, Vertan and her collaborators, Laurent Romary, Stefanie Dipper, and Noah Bubenhofer, explore some of the difficulties in making historical documents understandable to a wider array of user groups. Historical texts are often multilingual, with Latin passages interspersed in the language of the actual text. In certain linguistically rich areas such as the Balkans, texts can contain entire paragraphs in local languages. Furthermore, without a language standard, there is a lot more variation in syntax, spelling and semantics in historical texts. Vertan et al. posit that these challenges should be addressed by adapting existing language resources and tools, including character-level Machine Translation, the use of historical and modern data as comparable corpora, the use of historical texts in different languages as parallel or comparable corpora, word- and/or paragraph-level language identification, and crosslingual retrieval in historical documents. The purpose of the workshop was to discuss these resources, and perhaps identify others, in order to address the challenges of multilinguality in historical texts.
In her short paper “Less Explored Multilingual Issues in the Automatic Processing of Historical Texts – A Case Study”, Vertan explores the use of language technology tools for making historical texts more understandable to the common user through an analysis of the works of Dimitrie Cantemir, the prince of Moldavia at the end of the XVII century. The purpose of the case study was to explore the challenges posed by the multilingual aspects of Cantemir’s works, but also to demonstrate how these aspects can be used to enhance the knowledge base of the texts. Cantemir wrote a history of the Ottoman Empire and a history of Moldavia, which until the end of the XIX century remained the only accepted references on these topics. The texts were initially written in Latin, although they contained passages in Romanian and Ottoman Turkish, and were later translated into German. The German version was in turn translated into English, French, Romanian and Russian. By first splitting the German, English and Romanian versions into sentences, and then aligning the sentences across the translations, Vertan was able to identify strings common to each translation. These common strings were used to identify the Old Romanian and Ottoman Turkish phrases found in the original texts. As Old Romanian was written in a Cyrillic (Church Slavonic) alphabet and Ottoman Turkish in an Arabic-based script, their transliterations were not always standardized. This method makes it possible to identify these phrases in each translation and to block them from further processing. Eventually the hope is to provide multilingual knowledge through a mouse-over function in the presentation interface of these historical documents.
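The split, align and compare steps described above can be sketched roughly as follows. This is only an illustration under strong assumptions, not Vertan's implementation: sentences are split with a naive regular expression, the translations are assumed to align one-to-one by position, and the three short example texts (including the embedded term "Beglerbeg") are invented, not taken from Cantemir. The helper names `split_sentences`, `shared_strings` and `shared_by_position` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokens(sentence):
    """Lower-cased word tokens of a sentence."""
    return re.findall(r"\w+", sentence.lower())


def shared_strings(aligned_sentences, min_length=4):
    """Tokens of at least `min_length` characters found in every sentence
    of one aligned group (one sentence per translation)."""
    common = set(tokens(aligned_sentences[0]))
    for sentence in aligned_sentences[1:]:
        common &= set(tokens(sentence))
    return {t for t in common if len(t) >= min_length}


def shared_by_position(versions):
    """Align sentences by position (assumes a one-to-one alignment) and
    return the shared strings for each aligned group."""
    split = [split_sentences(v) for v in versions]
    # zip stops at the shortest version, so unmatched sentences are ignored.
    return [shared_strings(group) for group in zip(*split)]


# Invented example texts standing in for three translations.
german = "Der Sultan ernannte einen Beglerbeg. Das Heer zog nach Norden."
english = "The sultan appointed a Beglerbeg. The army marched north."
romanian = "Sultanul a numit un Beglerbeg. Oastea a pornit spre nord."

for i, shared in enumerate(shared_by_position([german, english, romanian])):
    print(i, sorted(shared))
```

Strings that survive unchanged in every translation are candidates for embedded foreign phrases, but the same test also catches proper names and shared loanwords, so the candidates would still need manual or lexicon-based filtering. Variant transliterations of the same phrase would be missed by this exact-match comparison.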
Evolving languages and the use of multiple languages in ancient texts, combined with the lack of standardized language, pose a serious challenge to opening up the field of Digital Humanities to more common users. However, they also provide an excellent opportunity to broaden our knowledge of language evolution and even to gain further understanding of the historical texts themselves. Munson and Vertan et al. provide several different methods for analysing multilinguality or the evolution of meaning, including examining the collocation of words in texts and comparing different translations of the same text. The knowledge gained from these methods can be extended beyond the case studies presented at this conference. However, language will continue to present a challenge in the field of Digital Humanities as more and more texts and documents are made available to a wider range of user groups. As Vertan, Romary, Dipper and Bubenhofer explored in their workshop, this is an area within Digital Humanities that requires the collaboration of many different disciplines to develop and adapt tools that address the challenges of language. Munson and Vertan demonstrated that well-established texts, such as the Bible or Cantemir’s works, can provide excellent case studies for testing new methods.
1. Munson, Matthew. Tracking Semantic Drift in Ancient Languages: The Bible as Exemplar and Test Case. In Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-394.xml.
2. Romary, Laurent, Stefanie Dipper, Noah Bubenhofer, Cristina Vertan. Multilinguality in Historical Documents – Challenges and Solutions for Digital Humanities. In Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Workshops-911.xml.
3. Vertan, Cristina. Less Explored Multilingual Issues in the Automatic Processing of Historical Texts – A Case Study. In Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-733.xml.