

Investigating a text’s authorship, identifying a writer’s characteristic style, performing morphological analysis and pursuing historical studies through close reading have always been central topics of heated debate. These are just a few of the fields in which computational analysis can help scholars untangle genuinely knotty issues. Bringing numbers into a non-mathematical domain is a clever expedient for tracing stylistic and morphological elements that may bring the reader into closer contact with the writer’s mind. The potential of such a mechanism seems boundless: it is not only an analytical tool to be applied wherever human capabilities fall short, but also a starting point for lines of inquiry that would otherwise be ignored.


The basic principle of a computational process applied to a digital text is simpler than one might imagine. It measures the distribution of words, starting from their frequency spectrum and their standard deviation from the average frequency, in order to quantify word dispersion through the coefficient of variation (CoV), the ratio of the standard deviation to the mean. Limiting the analysis to the bare calculation of these indices, however, is misleading and yields few concrete results: it forces us to take into account very rare words that show an extremely high CoV (Most Variable Words, or MVW) but are actually useless for our purpose.
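A minimal sketch of this computation, written in Python with a toy three-text corpus standing in for real novels (the corpus and all names below are invented for illustration, not taken from any of the cited studies):

```python
from collections import Counter
import statistics

# Toy corpus: three short "texts" (in practice, full novels).
texts = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog met".split(),
]

# Relative frequency of each word in each text.
vocab = sorted({w for t in texts for w in t})
freqs = {w: [] for w in vocab}
for t in texts:
    counts = Counter(t)
    for w in vocab:
        freqs[w].append(counts[w] / len(t))

# Coefficient of variation: standard deviation divided by the mean.
cov = {}
for w, f in freqs.items():
    mean = statistics.mean(f)
    cov[w] = statistics.stdev(f) / mean if mean > 0 else 0.0

# Words appearing in only one text dominate the ranking: this is
# exactly the "Most Variable Words" problem described above.
most_variable = sorted(cov, key=cov.get, reverse=True)
print(most_variable[:3])
```

Even on this toy corpus, words occurring in a single text (such as "met") top the CoV ranking while function words like "the" sit far lower, which is why the raw indices alone are misleading.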

David L. Hoover suggests an efficient method called CoV Tuning to remove these weaknesses. It combines the approach of Eder and Rybicki (2011), which focuses attention on the most frequent words (MFW), with a refinement based on the CoV. Computational analysis does, of course, have limits arising from the user’s decisions, notably the threshold for retaining words according to the number of texts in which they appear. This threshold varies with each application of the algorithm, and finding the optimal value often proves challenging. Despite this difficulty, computational analysis remains a powerful tool capable of handling complex, otherwise unmanageable problems, failing in only a very small number of cases.
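The two-step combination can be sketched as follows: first retain only the most frequent words, then rank them by CoV while culling words that appear in too few texts. This is an illustrative reconstruction under stated assumptions (toy corpus, arbitrary thresholds), not Hoover’s published implementation:

```python
from collections import Counter

# Toy corpus; in real work these would be whole novels.
texts = [
    "the cat sat on the mat and the cat slept".split(),
    "the dog sat on the rug and the dog barked".split(),
    "a cat and a dog met by the old mill".split(),
]

def relative_freqs(tokens):
    n = len(tokens)
    c = Counter(tokens)
    return {w: c[w] / n for w in c}

profiles = [relative_freqs(t) for t in texts]

def cov(word, min_docs=2):
    """CoV of a word's relative frequency across texts, or None if
    the word appears in fewer than min_docs texts (the retention
    threshold the user must choose)."""
    f = [p.get(word, 0.0) for p in profiles]
    if sum(1 for x in f if x > 0) < min_docs:
        return None
    mean = sum(f) / len(f)
    var = sum((x - mean) ** 2 for x in f) / (len(f) - 1)
    return (var ** 0.5) / mean

# Step 1 (Eder & Rybicki): keep only the N most frequent words overall.
overall = Counter(w for t in texts for w in t)
mfw = [w for w, _ in overall.most_common(8)]

# Step 2 (CoV tuning): among those, rank by CoV, culling rare words.
tuned = sorted((w for w in mfw if cov(w) is not None),
               key=cov, reverse=True)
print(tuned)
```

Note how the `min_docs` parameter embodies exactly the user decision discussed above: words confined to a single text are dropped before the CoV ranking, rather than being allowed to dominate it.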

Hoover’s experiments show the considerable improvements computational analysis brings to authorship attribution. In multi-author problems involving a large number of texts to be attributed, CoV Tuning handles the task well, correctly grouping the texts by author in almost every case; even where it fails, it still improves on every other applicable method. The two figures below show the results Hoover obtained in a recent experiment applying two different computational text methods to classify 43 novels by 15 different authors. As the figures demonstrate, CoV Tuning (Fig.2) performs a remarkably effective analysis, making just one error on such a complicated problem and marking a great step forward compared with Standard Cluster Analysis (Fig.1).


Fig.1 Standard Cluster Analysis considering 700 MFW (Most Frequent Words).


Fig.2 CoV Tuned considering 700 MVW (Most Variable Words).


Computational analysis is not only a quantitative solver for questions with too many variables to be resolved by human effort alone; it is also a powerful approach to the qualitative analysis of texts. Used under the right conditions, it can uncover stylistic and morphological properties that deepen scholars’ understanding in literary studies.

Suzuki and Yamashita present a typical problem to which computational analysis can be applied with relevant results. They selected a popular Japanese novelist, Kotaro Isaka, well known for switching perspectives during the narration of his stories, and subjected his novels to computational analysis. The goal was to distinguish each perspective by the peculiarities of its text construction, as detected by a computational algorithm. Dividing the novels into sections, they found that each perspective was characterized by specific textual and stylistic properties that differed for each character in the story. By taking into account the token frequencies for each perspective and their coefficients of variation, it was possible to detect important features such as differences between characters, and to sharpen the focus on the author’s intent.
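As a rough illustration of the idea, not the authors’ actual method, one can compare token-frequency profiles across sections narrated from different perspectives; the two "narrators" and their sentences below are invented stand-ins:

```python
from collections import Counter

# Hypothetical sections of a novel, one token list per
# narrative perspective (stand-ins for real chapter text).
sections = {
    "narrator_a": "i thought the train was late again today".split(),
    "narrator_b": "she watched the train arrive exactly on time".split(),
}

# Frequency profile per perspective.
profiles = {name: Counter(toks) for name, toks in sections.items()}

# Words whose relative use differs most between the two perspectives
# act as candidate stylistic markers of each viewpoint.
vocab = set().union(*profiles.values())

def diff(w):
    a = profiles["narrator_a"][w] / len(sections["narrator_a"])
    b = profiles["narrator_b"][w] / len(sections["narrator_b"])
    return abs(a - b)

markers = sorted(vocab, key=diff, reverse=True)[:3]
print(markers)
```

Shared function words such as "the" score zero and drop out, while perspective-specific vocabulary rises to the top, which is the intuition behind separating viewpoints by text construction.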


The Macro-Etymological Analyzer is a useful computer program written by Jonathan Pearce Reeve and aimed at uncovering significant etymological characteristics. Drawing on a large database, the program looks up the words of an uploaded text to establish their etymological derivation. Combining the analyzer with a computational method makes it possible to detect significant, out-of-the-ordinary densification or thinning of specific words chosen in a prior etymological study. Proceeding with the usual computation of frequency, standard deviation and coefficient of variation for the selected words, distinctive patterns can easily be detected. These characteristics cannot be neglected, and in many cases they bring the reader into closer contact with the writer’s purpose. Reeve offers a telling example of the importance of such stylistic texture in James Joyce’s novel “A Portrait of the Artist as a Young Man”, in which the protagonist’s growth across the chapters is mirrored by a gradual linguistic evolution: an ever greater use of words derived from Latin and Ancient Greek. (Fig.3)
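The core idea can be sketched with a tiny, hand-made etymology table standing in for the analyzer’s real database (all words, labels and the two "chapters" here are illustrative assumptions, not Reeve’s data):

```python
# Minimal stand-in for the large etymological database the real
# Macro-Etymological Analyzer queries.
ETYMOLOGY = {
    "begin": "Germanic", "commence": "Latinate",
    "ask": "Germanic", "inquire": "Latinate",
    "end": "Germanic", "conclude": "Latinate",
}

def latinate_ratio(tokens):
    """Share of classifiable tokens with a Latinate origin."""
    known = [ETYMOLOGY[w] for w in tokens if w in ETYMOLOGY]
    if not known:
        return 0.0
    return known.count("Latinate") / len(known)

# Two invented "chapters": the second uses markedly more Latinate
# diction, mimicking the linguistic evolution described above.
early = "we begin to ask how things end".split()
late = "we commence to inquire how things conclude".split()

print(latinate_ratio(early), latinate_ratio(late))
```

Tracking this ratio chapter by chapter is what lets a rising density of Latin- and Greek-derived words become visible as a measurable trend rather than a reader’s impression.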



Uncovering stylistic aspects is not the only capacity of computational etymological analysis. Applying the same procedure to historical books can lead scholars to remarkable discoveries about a text’s sources, and therefore about its reliability. Reeve reports an early test performed on the King James Bible that revealed the same densification of Hellenic words in the synoptic gospels, with the exception of John’s, implying a different source of transcription for the latter gospel. (Fig.4) Such analysis identifies important features in textual studies, acting as a very effective tool for historical research.




The purpose of this article has been to demonstrate the value of computational analysis in improving our understanding of texts. It should be used as a tool for detecting characteristics that would otherwise remain unknown. For these reasons computational analysis is developing quickly, gradually reaching remarkable levels of efficiency.


  1. Hoover, David L. “Tuning the word frequency list”. http://dharchive.org/paper/DH2014/Paper-765.xml
  2. Suzuki, Takafumi and Yamashita, Natsumi. “Analysis of perspectives in contemporary Japanese novels using computational stylistic methods”. http://dharchive.org/paper/DH2014/Paper-820.xml
  3. Reeve, Jonathan Pearce. “Macro-Etymological textual analysis”. http://dharchive.org/paper/DH2014/Paper-732.xml