It is common knowledge that languages evolve constantly over time, and identifying the mechanics of this phenomenon is a crucial problem. For modern languages, we can easily gain insight into this evolution by consulting native speakers or specialists. For extinct historical languages, however, we run into a lack of extractable information. For instance, how can we handle the huge amount of data resulting from 3000 years of Greek literature, from Homer’s era to today?

Matthew Munson takes up this challenge in [1] with a method for tracking what he calls “semantic drift”, based on co-occurrence patterns in two or more corpora of texts. The fundamental point to keep in mind is that an isolated word taken out of its context cannot be rigorously understood, whereas treating it as the central component of a larger sentence allows us to better grasp its meaning. Building on this idea, Munson develops a theory of word-sense variation around an experiment involving the Greek Old Testament and the Greek New Testament. This choice is justified mainly by the influence of these texts on the history of Western civilization, which suggests that the approach can be generalized to other works.

After the text was split into tokens, co-occurrence tables were computed for every word within an 8-word window. Using a statistical approach based on log-likelihood measures, Munson was able to measure the significance of the relationship between each pair of words. The cosine similarity statistic was then used to quantify two kinds of relationships: on the one hand, between each word in the Old Testament and its counterpart in the New Testament; on the other hand, between every word in the Old Testament and every other word in the Old Testament, and analogously for the New Testament. For example, for the word θεός (God), Munson established a list of the twenty words closest to it in the Old Testament and the twenty closest to it in the New Testament. The colors in Figure 1 group words that are closely related in meaning. It is interesting to notice that in the New Testament God is more closely related to evil rulers such as Satan and Pilate and to servants of God, namely Paul and Peter, whereas in the Old Testament God appears closer to the lexical fields of ruling, violence and agriculture. This analysis gives us insight into how the representation of God has evolved through time and how this evolution can be linked to a historical explanation.

Figure 1 : Results of the cosine similarity measure between θεός (God) and the listed words in the Old and New Testaments.
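The pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Munson's implementation: it uses raw co-occurrence counts where [1] weights pairs by log-likelihood, it assumes the window extends the same number of tokens to each side, and the function names and the toy English token streams are invented for the example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=8):
    """Count, for every word, the words found within `window` tokens of it.

    Assumption: the window extends `window` tokens to each side.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy token streams standing in for the two corpora.
old_tokens = "god king rules land god king judges land".split()
new_tokens = "god loves people god sends apostle".split()

old_vectors = cooccurrence_vectors(old_tokens, window=2)
new_vectors = cooccurrence_vectors(new_tokens, window=2)

# Same word, two corpora: a low similarity hints at a shift in usage.
across = cosine(old_vectors["god"], new_vectors["god"])
# Two words inside one corpus: how alike are their contexts?
within = cosine(old_vectors["god"], old_vectors["king"])
print(f"god (old) vs god (new): {across:.3f}")
print(f"god vs king (old):      {within:.3f}")
```

Comparing a word with its counterpart in the other corpus corresponds to the first kind of relationship mentioned above, and comparing two words inside one corpus corresponds to the second.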

Another approach to the semantic analysis of analogies in texts is developed in [2]. A semantic annotation methodology for exploring metaphorical content is built on the Historical Thesaurus of English database. Metaphors illustrate how human beings translate abstract concepts into concrete entities. Recent research has shown that on average, in English discourse, every seventh word is a metaphor. Hence, one must fully understand this subtle figure of speech in order to remove any potential ambiguities. The authors of [2] believe that semantic annotation is the best strategy for dealing with a large collection of textual data where imagery comes into play.

A closer look at the Historical Thesaurus of English highlights how complex an effective search on word forms can be. For instance, the word “strike” has 181 different meanings in English. To address this problem, the Glasgow-Lancaster Semantic Annotation System was developed to reach an unprecedented level of precision in corpus annotation. This system matches each word in a corpus to the Historical Thesaurus category to which it belongs. Once each word has been annotated with the meaning code of its category, the words are aggregated into a 600,000-word corpus. A log-likelihood test is then performed to compare the frequencies of the meaning codes in the corpus with those in a random concatenation of texts from Wikipedia that serves as a reference. This procedure identifies the key semantic domains of the corpus in question. Two popular science texts, which aim to make abstract scientific content accessible to a general audience, provide a testbed for the study of non-literary analogy. One of the examples considered in [2] is The Fabric of the Cosmos, a book about theoretical physics. While the four most frequent semantic domains (space, distance, photon, computation of time) are, as expected, related to physics, the next four (woven fabric, pattern/design, stringed instruments, spinning textiles) are not directly related to the main topic of the book and hence carry the information about its metaphorical content. In other words, the author discusses physics by using metaphors of fabric and strings. This annotation system allows us to discover where metaphorical content is located and how it clusters in a given corpus.
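To make the log-likelihood comparison more concrete, here is a minimal Python sketch. The paper [2] is not quoted here with an exact formula, so the code assumes the two-term log-likelihood statistic commonly used for keyness in corpus linguistics; the function names, the meaning codes and all the counts are invented for illustration.

```python
import math
from collections import Counter


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Two-term log-likelihood statistic for one meaning code.

    Compares the observed frequencies with those expected if the code
    were equally common in the study corpus and the reference corpus.
    """
    combined = freq_study + freq_ref
    total = size_study + size_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def key_domains(study_counts, ref_counts):
    """Rank the meaning codes that are over-represented in the study corpus."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for code, freq in study_counts.items():
        ref_freq = ref_counts.get(code, 0)
        # Keep only codes relatively more frequent in the study corpus.
        if freq / size_study > ref_freq / size_ref:
            score = log_likelihood(freq, ref_freq, size_study, size_ref)
            scored.append((code, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented counts of meaning codes (not data from the paper).
study = Counter({"space": 50, "woven fabric": 30, "other": 920})
reference = Counter({"space": 10, "woven fabric": 30, "other": 9960})

for code, score in key_domains(study, reference):
    print(f"{code}: {score:.1f}")
```

A code with the same relative frequency in both corpora scores zero, so only domains that stand out against the reference rise to the top of the ranking.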

Alexander and Anderson showcase in [3] the evolution of English semantics. The Historical Thesaurus of English (HT) makes it possible to visualize the evolution of each semantic category through time, taking into account the metaphorical structures that link the different categories. The HT classifies all English words from the Anglo-Saxon period to today into categories: more precisely, 739,742 entries are split into 225,131 categories. Semantically close categories are then grouped together to form clusters. This approach offers an efficient way to capture social and cultural advances through the evolution of the lexicon. One of the main advantages of the HT structure is that it allows the analysis of the appearance of synonyms, that is, the creation of numerous terms having the same meaning. For linguists, the birth of synonyms for a given word is explained by a difference in perception and awareness from the speaker’s point of view. Moreover, a situation where a large number of words carry the same meaning reveals the importance of the concept conveyed. The non-linearity of the graph in Figure 2 shows that there are some booming periods, for example the English Renaissance in the mid-1400s, as well as some phases of moderate growth.

Figure 2 : Growth of the English language across time.

In addition to the global evolution, three aggregate semantic categories are considered. As seen in Figure 3, this cluster not only reflects the global trend in the evolution of semantics but also reveals a correlation between its members, driven by significant historical or cultural events.

Figure 3 : The growth of three semantic fields. Square: 02.01.15 Attention and Judgement; Circle: 03.05.05 Moral Evil; Triangle: 03.10.13 Trade and Commerce.

Although computer-based techniques enable a better understanding of semantics in digital texts, much remains to be done to fully grasp metaphorical content. The Mapping Metaphor project aims to better understand this aspect by studying the overlaps between the HT semantic categories.

In conclusion, the review of these three papers highlights the progress made in understanding the semantic evolution of languages, and of English in particular. After considering the statistical approach developed by Munson in [1] to analyze the semantic drift in Ancient Greek between the Old and the New Testament, we underlined the power of the Historical Thesaurus of English to measure the semantic evolution of groups of words structured in appropriate clusters. We can interpret this evolution through time in terms of the major historical, cultural or social disruptions that shape the lifespan of languages. As mentioned in [3], this interpretation must be further investigated by diverse branches of the Digital Humanities, such as linguistics and literary studies.

References :

1. M. Munson, Tracking semantic drift in ancient languages (http://dharchive.org/paper/DH2014/Paper-394.xml)

2. M. Alexander, J. Anderson, A. Baron, F. Dallachy, C. Kay, S. Piao, P. Rayson, Metaphor, popular science and semantic tagging : distant reading with the Historical Thesaurus of English (http://dharchive.org/paper/DH2014/Paper-77.xml)

3. M. Alexander, W. Anderson, “Civilisation arranged in chronological strata” : A digital approach to the English semantic space (http://dharchive.org/paper/DH2014/Paper-448.xml)