
Among the lectures given at the 2012 Digital Humanities conference in Hamburg, several dealt with the development of digital processes and methods for the analysis of literary datasets, with a specific interest in computational approaches to linguistic studies.

On this topic, one of the papers presented tackles an important issue in literary research: the large and heterogeneous data that one has to analyze. In particular, when quotations are not organized in a structured way, it becomes harder to identify the original source of a reference. A smart approach to this problem is applied to the Oxford English Dictionary (OED) and discussed in the paper The electronic “Oxford English Dictionary”, poetry, and intertextuality by David Williams from University of Waterloo (Canada).

The Oxford English Dictionary first appeared in instalments from 1884 and was completed in 1928; from many points of view it could be considered the world’s biggest crowdsourced work of reference. The peculiarity of this dictionary is that most of the definitions are based on the evidence provided by quotations from sources, and a great part of these come from literary texts and poems. In this first network of references, which starts from literature and poetry and ends in the OED, one can easily recognize a first order intertextuality.

Thanks to this, the OED very early became something of an artistic guide for poets and writers, who began to consider it a useful source of information and inspiration. Thus a second order intertextuality develops, this time linking the contents of the Dictionary with the works of new writers. Nevertheless, this second order intertextuality raises a fundamental issue for literary researchers: a poem may make use of words that are recorded in the OED without its author having read them there, in which case there is only a first order intertextuality.

Because the OED quotes by “sampling” and does not mark quotations for genre, it is very difficult to identify the reciprocal influences between dictionary and author, and it seems impracticable to quantify them with reference to specific literary genres. Indeed, a simple search by author reveals how messy and redundant the tags and notations are. For this reason, Williams’s project aims to develop a parallel OED, called OED2. This version addresses the lack of a strict quotation structure: poetic quotations are now marked for genre, to allow advanced search and comparison. This marking should make it faster to compare the reciprocal influences between OED compilers and writers, and help to tell first order from second order intertextuality.

Another talk on computational approaches for extracting information from literary works was given by Tomoji Tabata from University of Osaka in Japan. His paper, Approaching Dickens’ Style through Random Forest, proposes an alternative computational method, a machine-learning classification technique, to delineate Dickens’ writing style and to bring out the lexical items he favoured and those he avoided.

The usual computational method is key word analysis: the frequency of each word in the text under study is compared with its frequency in a reference text, and the log-likelihood ratio (LLR) is applied to assess the significance of the difference. Key word analysis has a known weakness, however: the LLR emphasizes high-frequency words, and this can sometimes bias the interpretation of the entire data set.
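The key word calculation described above can be sketched in a few lines of Python. This is only an illustration: it assumes the two-term log-likelihood formula commonly used in corpus linguistics, which may differ from the exact variant used in the paper, and the function name and the counts are invented for the example.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood ratio for one word in two corpora.

    freq_* are the word's counts, size_* the total number of tokens.
    (Assumed formula; the paper may use a different variant.)
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Invented counts: the word is twice as frequent in the study corpus in
# both cases, but the second case has ten times more data.
rare_word = log_likelihood(20, 1000, 10, 1000)
frequent_word = log_likelihood(200, 10000, 100, 10000)
print(rare_word, frequent_word)
```

With the same relative difference, the score grows with the raw counts, which is one way to see why high-frequency words tend to dominate a key word list.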

For this reason, Tabata proposes the application of Random Forests, a classification algorithm that builds a large number of classification trees, a forest, each grown from a random bootstrap sample of the dataset. Since the texts left out of each sample can be used to test the corresponding tree, this bootstrap procedure avoids the need for separate cross-validation, while achieving very high accuracy and coping with very large databases.
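To see why the bootstrap step makes a separate cross-validation unnecessary, the following sketch draws bootstrap samples and measures how many items each sample leaves out. It is not a Random Forest implementation, only an illustration of the sampling idea; the function name and the figure of 1000 texts are assumptions made for the example.

```python
import random

def bootstrap_split(n_items, rng):
    """Draw n_items indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_items) for _ in range(n_items)]
    # Items never drawn are "out of bag" for this sample.
    out_of_bag = sorted(set(range(n_items)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)   # fixed seed so the run is repeatable
n_texts = 1000           # hypothetical number of text samples
fractions = []
for _ in range(200):     # one bootstrap sample per (imaginary) tree
    _, out_of_bag = bootstrap_split(n_texts, rng)
    fractions.append(len(out_of_bag) / n_texts)

mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))  # expected to be close to 1/e, about 0.368
```

Each tree therefore has roughly a third of the data that it never saw during training, and that unseen part can be used to estimate its error.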

The contents and style of two authors who worked in the same period, in this case Dickens and Collins, can be mapped in the two-dimensional diagram below. The stylistic distance is clearly visible, with the exception of two unusual pieces by Collins: Antonina (1850), set in ancient Rome, and the travel book Rambles beyond Railways (1851).

Dickens vs Collins in a multi-dimensional scaling diagram based on the proximity matrix generated by RF.

A third interesting computational tool for literary studies is HyperMachiavel, a translation comparison tool presented by the Italian studies department at ENS de Lyon. It was developed to facilitate the analysis of several editions and translations of literary works, with a particular interest in French translations, and it offers an environment designed to assist researchers with many aspects of their job. Particular attention is paid to the comparison between different translations of the same book: for this purpose, the interface presents two or more panels, so that several editions can be handled at the same time. The alignment of the texts is automatic and performed by classical tools (statistical measures, distribution algorithms), with a strong focus on the user’s need to edit and check the correspondences.

HyperMachiavel: parallel view of the aligned texts (right panel) and ‘Corpus view’ emphasizing the text structure common to all versions (left panel)

HyperMachiavel also takes into account the problem of applying linguistic annotation to an old text: it offers a choice of tokenizer (recorded in an XML file) and lets the researcher correct the model or add personalized tags.

One of the most important features of this tool is the possibility to visualize and graphically compare the contents of the texts analyzed. This is done by computing frequencies and distributions and then mapping the equivalences chosen by one translator against those of another, as shown in the figure below.
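The kind of frequency count behind such a comparison can be sketched as follows. This is not HyperMachiavel’s actual implementation: the function name, the translator labels and the aligned pairs are all invented for the illustration.

```python
from collections import Counter, defaultdict

# Invented toy alignments: (translator, Italian term, French equivalent).
aligned_pairs = [
    ("translator_A", "ordine", "ordre"),
    ("translator_A", "ordine", "ordre"),
    ("translator_A", "ordine", "institution"),
    ("translator_B", "ordine", "ordre"),
    ("translator_B", "ordine", "loi"),
    ("translator_B", "stato", "état"),
]

def equivalent_shares(pairs, source_term):
    """For each translator, the share of each equivalent of source_term."""
    counts = defaultdict(Counter)
    for translator, source, target in pairs:
        if source == source_term:
            counts[translator][target] += 1
    shares = {}
    for translator, counter in counts.items():
        total = sum(counter.values())
        shares[translator] = {word: n / total for word, n in counter.items()}
    return shares

shares = equivalent_shares(aligned_pairs, "ordine")
for translator, distribution in sorted(shares.items()):
    print(translator, distribution)
```

Placing the resulting distributions side by side shows where two translators agree on an equivalent and where they diverge.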

Views of French equivalents for the Italian concept “ordine”

All the papers presented aim to provide new and useful methods to analyze an author’s style and influence: while OED2 can help to find the real source that has influenced an author, the Random Forests approach and the HyperMachiavel tool can highlight and compare the linguistic tendencies of different authors or translators.

Nevertheless, we hope that all these computational approaches to classical literary works won’t blunt the natural pleasure of picking up a book and reading it on the sofa: a book can’t be described just by the frequency of the words that are written in it.