As Gregory Crane asked [1], what do you do with a million books? In other words, can we infer the knowledge we are looking for from a large corpus of written works in a reasonable amount of time? Sometimes one needs to step back from a painting, putting distance between viewer and work, to discover what it really represents. Distant reading, i.e. using mathematical tools to extract and analyze features from a large number of texts, is the digital humanists’ answer to this provocative question and has been a trending field of research in recent years. Its applications range from inferring the authorship, period or topics of historical texts to understanding trends in social networks. This blogpost summarizes and compares three abstracts presented at this year’s Annual Conference of the Alliance of Digital Humanities Organizations, in order to emphasize the importance and versatility of distant reading on large textual corpora.

A Distant Reading Visualization for Variant Graphs [2] is a perfect illustration of this paradigm. The authors investigated the differences and similarities between various translations of the Bible through variant graphs, i.e. visual representations of word-level dissimilarities between different editions of the same text. A single variant graph is not in itself suited to analyzing twenty-four distinct translations of the Bible, so the authors combined these graphs across the corpus in order to extract knowledge at a higher level. They took advantage of the strongly hierarchical structure of the Bible, which is divided into books, chapters and verses, to help visualize the results. They were able to highlight or confirm the influence of certain translations. For instance, the Darby Bible, which is thought to be the most exact rendering of the ancient languages, shows surprising similarities with other Bibles. This suggests that those Bibles were influenced not only by the King James Version, which is considered the most influential translation.


Illustration of a variant graph for 24 English translations of Genesis 1:1 [2]
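
To make the idea of word-level variation concrete, here is a minimal sketch in Python. It is not the authors’ method: it only aligns two wordings with the standard library’s difflib, whereas a real variant graph merges many editions at once. The function names and the two sample wordings are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def word_variants(text_a, text_b):
    """Align two wordings word by word and return (tag, words_a, words_b) segments.

    The tag is 'equal' where both wordings agree and 'replace', 'insert'
    or 'delete' where they differ.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (tag, words_a[a_start:a_end], words_b[b_start:b_end])
        for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes()
    ]


def shared_ratio(text_a, text_b):
    """Similarity between 0 and 1 based on the words the two wordings share."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(a=words_a, b=words_b, autojunk=False).ratio()


# Two illustrative wordings of one verse (sample data, not taken from the abstract).
version_a = "in the beginning god created the heaven and the earth"
version_b = "in the beginning god created the heavens and the earth"

segments = word_variants(version_a, version_b)
for tag, words_a, words_b in segments:
    print(tag, words_a, words_b)
print(shared_ratio(version_a, version_b))
```

Computing such a ratio for every pair of editions, verse by verse, would be one simple way to get a bird’s-eye view of which translations stay close to each other.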

The second abstract, Character Network Analysis of Émile Zola’s Les Rougon-Macquart [3], steps away from pure semantics and moves to a higher level of abstraction. Here Yannick Rochat analyzes Émile Zola’s iconic series of novels, a work well suited to this kind of approach, since Zola himself is known for his particularly scientific view of writing. The author used concepts from network analysis to describe the interactions between the protagonists of each book in Les Rougon-Macquart. He computed properties of the character network of each story, namely its density, coreness and centrality. Despite some discrepancies between the figures and the descriptive text, this work gives insight into Zola’s literary intentions, e.g. exploring intimate relations between closely related characters or, on the contrary, depicting deterministic chaos in a vast decentralized network of protagonists.
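
The three measures mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the character labels and edges are hypothetical, degree centrality stands in for whichever centrality measure the abstract actually used, and the helper names are invented here.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (character, character) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def density(adjacency):
    """Share of all possible character pairs that actually interact."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edge_count = sum(len(neighbours) for neighbours in adjacency.values()) / 2
    return edge_count / (n * (n - 1) / 2)


def degree_centrality(adjacency):
    """Fraction of the other characters each character interacts with."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nb) / (n - 1) for node, nb in adjacency.items()}


def coreness(adjacency):
    """Core number of each node, found by peeling off the lowest-degree node."""
    degrees = {node: len(nb) for node, nb in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda v: degrees[v])
        k = max(k, degrees[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degrees[neighbour] -= 1
    return core


# Hypothetical interactions: a tight trio (A, B, C) plus a loose chain (D, E).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
graph = build_graph(edges)

print(density(graph))
print(degree_centrality(graph))
print(coreness(graph))
```

In this toy network the trio forms the innermost core, while D and E sit on the periphery. That is the kind of contrast between a tight circle of characters and a loose, decentralized cast that these measures are meant to capture.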

There is already a noticeable shift between the first two abstracts in terms of what is investigated. While the first focuses, as noted, on semantics and dissimilarities between translations, the second tries to decipher something as complex as fictional relations within a closed set of protagonists. This suggests other applications for distant reading: the approach could be used to investigate even more abstract concepts, answering ontological or anthropological questions by again compiling and processing a tremendous amount of information about a given domain. That is what the authors of the third text [4], titled The Unspoken Word: Race and the New Language of Identity, set out to achieve, focusing on defining and characterizing the concept of race. They compiled articles containing keywords related to race and ethnicity from journals ranging from anthropology to genetics and forensic science. They extracted the vocabulary that frequently surrounded these keywords, thus creating a “lexical fingerprint” associated with the concept of race. Representing these results as topological graphs, one for the publications and one for the semantic fields, made them easier to visualize. This made it possible to highlight the intrinsic correlation between the period of publication, the orientation of the journal, and the lexicon used to define what “race” is.


Topological graph showing the lexicon associated with race or ethnicity, with colors indicating the group term membership. The proximity of two words indicates their frequency of co-occurrence [4]
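
The “lexical fingerprint” idea can be sketched as counting the words that appear within a few tokens of a keyword. The snippet below is a minimal sketch under assumptions, not the authors’ pipeline: the function name, the window size, the stopword list and the two sample sentences are all invented for illustration.

```python
from collections import Counter


def lexical_fingerprint(documents, keywords, window=2, stopwords=frozenset()):
    """Count the words found within `window` tokens of any keyword."""
    keywords = {keyword.lower() for keyword in keywords}
    counts = Counter()
    for text in documents:
        # Naive tokenization: split on whitespace, strip surrounding punctuation.
        tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
        tokens = [t for t in tokens if t]
        for i, token in enumerate(tokens):
            if token not in keywords:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                neighbour = tokens[j]
                if j == i or neighbour in keywords or neighbour in stopwords:
                    continue
                counts[neighbour] += 1
    return counts


# Invented sample sentences, standing in for journal articles.
documents = [
    "Forensic casework often links ancestry to race in reports.",
    "Ethnicity and ancestry are often treated as distinct concepts.",
]
fingerprint = lexical_fingerprint(
    documents,
    keywords={"race", "ethnicity"},
    window=2,
    stopwords=frozenset({"to", "in", "and", "are", "as", "often"}),
)
print(fingerprint.most_common())
```

Running the same count separately per journal or per decade would give fingerprints that can be compared, which is roughly the kind of contrast the graph above visualizes.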

All in all, these three abstracts shed light on the immense variety of possible applications for distant reading by showing a gradation in the degree of abstraction of the subject under investigation: we move from analyzing the semantic field of a text, its very essence, to applying mathematical methods to characterize interactions between (fictional) entities, and finally to proposing a definition of a concept tied to the essence of a human being. They also differ in the methods chosen to distance the humanist from the corpus: the techniques range from visual analysis of variant graphs to network mathematics and natural language processing. Despite these differences in rationale, the studies are united by the fact that, with the help of scientific thinking and computational power, they extract knowledge from a huge literary corpus, knowledge that would otherwise be out of reach because of time constraints and the impossibility of synthesizing so much material by hand.

To close this blogpost, the recent successes and promising applications in the humanities suggest that distant reading and natural language processing will have a crucial role to play, both in understanding gigantic historical sources and in addressing societal or ontological questions.


[1] Crane, Gregory. “What do you do with a million books?” D-Lib Magazine 12.3 (2006): 1.

[2] Jänicke, Stefan et al. “A Distant Reading Visualization for Variant Graphs”. DH2015. Sydney, 2015. Web.

[3] Rochat, Yannick. “Character Network Analysis of Émile Zola’s Les Rougon-Macquart”. DH2015. Sydney, 2015. Web.

[4] Algee-Hewitt, Bridget Frances Beatrice et al. “The Unspoken Word: Race and the New Language of Identity”. DH2015. Sydney, 2015. Web.