In projects like the Venice Time Machine, there is a large amount of digitized text to analyze. Giving this task to a single human reader is not realistic, and even hiring a large number of readers would require complex logistics to share the knowledge they retrieve. There is therefore a great need for tools that extract information from text corpora, and computational methods have been developed for this purpose. In this blog post, we present several methods that aim to carry out this complex task.

A first way to retrieve information from a corpus is to classify its texts into several subsets. This lets the reader select more precisely which texts are worth reading. Classification can also highlight trends in the corpus, such as structural differences, time periods, or language complexity. A well-established method for text classification is authorship attribution. This field has produced strong statistical tools for analyzing corpora and retrieving information from them (originally to determine which author wrote which text). In their paper [1], H. Craig et al. extend these tools to more general use.

First, to classify texts by language complexity, F. Jannidis and H. Craig use two measures, the Shannon entropy and the Jensen-Shannon divergence, to determine the level of complexity of different novels of the nineteenth and twentieth centuries. They are able to separate the novels into two categories: low-brow and high-brow novels.
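To make the two measures concrete, here is a minimal sketch that computes them on word-frequency distributions. It is not the procedure used in [1]: the whitespace tokenization, the function names, and the two toy texts are assumptions made purely for illustration.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def shannon_entropy(dist):
    """H(P) = -sum(p * log2(p)), measured in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jensen_shannon_divergence(p, q):
    """JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2, where M is the average of P and Q."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


# Toy texts invented for this example (not taken from any corpus).
simple = "the cat sat on the mat the cat sat"
varied = "a quiet river carried pale light toward distant hills"

p_simple = word_distribution(simple)
p_varied = word_distribution(varied)

print("entropy (simple):", shannon_entropy(p_simple))
print("entropy (varied):", shannon_entropy(p_varied))
print("JS divergence:", jensen_shannon_divergence(p_simple, p_varied))
```

A text that repeats a few words has a lower entropy than one that spreads its tokens over many different words, and the divergence (between 0 and 1 with base-2 logarithms) grows as two texts share less of their vocabulary.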

In a second part, M. Eder and J. Rybicki use a bootstrap consensus network to highlight text similarity based on the most frequent words shared across the texts. From the resulting visualization, they are able to extract a chronological order as well as a stylistic drift, implying that “distant reading by most-frequent-words frequencies can mirror the evolution of literary style over hundreds of texts,[…] and open new perspectives for close reading.”[1]

Fig. 1: Bootstrap consensus network of over 500 English novels [1]
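The consensus idea behind Fig. 1 can be sketched roughly as follows: compare the texts several times, each time with a different number of most frequent words, and count how often two texts end up as nearest neighbours. This is a simplified illustration, not the authors' implementation. The Manhattan distance on relative frequencies, the single-nearest-neighbour rule, the function names, and the toy texts are all assumptions.

```python
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency in `tokens` of each word in `vocab`, in order."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def manhattan(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def consensus_links(texts, mfw_sizes):
    """Count how often each pair of texts is linked as nearest neighbours
    when the comparison is repeated with different most-frequent-word lists."""
    tokenized = {name: text.lower().split() for name, text in texts.items()}
    corpus_counts = Counter(tok for toks in tokenized.values() for tok in toks)
    ranked = [w for w, _ in corpus_counts.most_common()]
    links = Counter()
    for size in mfw_sizes:
        vocab = ranked[:size]
        profiles = {n: relative_freqs(t, vocab) for n, t in tokenized.items()}
        for name, profile in profiles.items():
            nearest = min(
                (other for other in profiles if other != name),
                key=lambda other: manhattan(profile, profiles[other]),
            )
            links[frozenset((name, nearest))] += 1
    return links


# Toy "novels" invented for this example.
toy_texts = {
    "A": "the the the the of of and sea",
    "B": "the the the the of of and ship",
    "C": "and and and and of a a garden",
    "D": "and and and and of a a house",
}

for pair, weight in consensus_links(toy_texts, [2, 3, 4]).items():
    print(sorted(pair), weight)
```

Pairs that stay linked across many word-list sizes get a high weight. Drawing those weighted links as a graph gives a network in the spirit of Fig. 1, where strongly connected texts are stylistically close.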

Although these methods appear to work well, we have to keep in mind that the question of calibration and validation is complex. The previous methods need to be validated “by traditional literary history, classification and interpretation”[1], whereas for authorship attribution validation is easier: it is sufficient to test the method on texts whose author is already known.

Text classification is not the only tool for retrieving information from corpora. G. Roe et al. [2] apply a topic modeling technique to the French Encyclopédie of Diderot and d’Alembert to find a distribution of recurrent topics in this particular corpus. Their results are interesting in that the algorithm can detect the same topic in various contexts. In this case, it shows the narrative of the Enlightenment spreading through many different articles of the encyclopedia: “many of these encyclopedic discourses were deployed subversively in order to move the narrative of Enlightenment forward” [2]. The fact that the context in which a topic appears does not matter suggests that the algorithm is highly sensitive to the underlying meaning of the texts.

In a perfect world, the easiest way of getting precise information about a corpus would be simply to put your question to someone who knows the corpus and has already analyzed it. Also in [1], M. Kestemont presents a method based on deep learning algorithms that allows the reader to ask a program complex questions. The questions are still limited, but ones like “What is to Warsaw, like France is to Paris?” can be answered, in this case with “Poland”. Sadly, the underlying data structure needed to achieve this is not clearly explained in the paper. It is nevertheless a promising tool for information retrieval on simple corpora, for example corpora of well-structured contracts in which one wants to find the relationship between two parties.

Fig. 2: Country and Capital Vectors Projected by PCA. (copyright: Mikolov et al. for Google Inc.) [1]
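Assuming the method relies on word vectors of the kind shown in Fig. 2 (the paper itself does not spell this out), an analogy question can be answered with simple vector arithmetic: France - Paris + Warsaw should land close to Poland. The sketch below uses tiny hand-made vectors, not embeddings learned from a corpus. The names `VECTORS`, `cosine`, and `analogy` are hypothetical.

```python
import math

# Hypothetical hand-made vectors: dimensions 0-2 encode the country a word
# belongs to, dimension 3 is 1 for a country name and 0 for a capital.
# Real embeddings would be learned from text and have hundreds of dimensions.
VECTORS = {
    "paris":   [1.0, 0.0, 0.0, 0.0],
    "france":  [1.0, 0.0, 0.0, 1.0],
    "warsaw":  [0.0, 1.0, 0.0, 0.0],
    "poland":  [0.0, 1.0, 0.0, 1.0],
    "berlin":  [0.0, 0.0, 1.0, 0.0],
    "germany": [0.0, 0.0, 1.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Answer 'what is to c as a is to b?' with the word closest to a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# "What is to Warsaw, like France is to Paris?"
print(analogy("france", "paris", "warsaw", VECTORS))  # prints "poland"
```

The three query words are excluded from the candidates, since otherwise the nearest vector is often one of the inputs.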

Although these methods provide powerful ways to retrieve information from texts, A. Plasek and D.L. Hoover [3] present a critical point of view on them. They strongly state that “the critic who endeavors to put forth a “reading” puts forth not the text, but a new text” [3]. It is important not to forget that the results of a program are still an interpretation of a text. We tend to forget this because all these tools are based on mathematics and look “science-made”. A. Plasek and D.L. Hoover claim in their paper that “if the algorithms that deforms the original text is to facilitate our interpretive insights, knowing what algorithm does seems crucially important”. This advice is highly relevant for text understanding and should be kept in mind when using tools such as those presented above.

To conclude, all these computational tools for extracting information, from measuring word complexity to highlighting main topics, are very useful for giving a first interpretation of large text corpora that no human could handle alone, and for providing insights into them. However, they should not be used blindly, and above all the reader should keep a critical point of view on the results they provide.


[1] H. Craig, M. Eder, F. Jannidis, M. Kestemont, J. Rybicki, C. Schöch. Validating Computational Stylistics in Literary Interpretation. 

[2] G. Roe, C. Gladstone, R. Morrissey. Discourses and Disciplines in the Enlightenment: Topic Modeling the French Encyclopédie.

[3] A. Plasek, D.L. Hoover. Starting the Conversation: Literary Studies, Algorithmic Opacity, and Computer-Assisted Literary Insight.