
Authorship attribution remains an open problem for certain historical texts. The problems vary, ranging from assigning authors to anonymous texts to determining the division of labour among collaborators. The methods used for authorship verification vary as well. Three possible ways of determining authorship are explained here: text-mining, quantitative analysis, and stylometry. Each method is illustrated by the abstract of one paper: A Text-Mining Approach to the Authorship Attribution Problem of Dream of the Red Chamber, Authorship Problem of Japanese early modern literatures in Seventeenth Century, and Stylometry and the Complex Authorship in Hildegard of Bingen’s Oeuvre, respectively.

In the first paper, the authors tackle the authorship of a well-known classic Chinese novel. In the past century, a scholar put forward evidence that the first 80 chapters were written by a single author, while the remaining 40 were attributed to someone else. There was also the question of the two chapters missing from the oldest edition. To verify the findings regarding the latter 40 chapters, researchers have used different statistical features, such as word length or the frequency of function words. The need for a fresh approach emerged because the earlier stylistic experiments produced contradictory results (e.g. as a result of different feature selections). The paper proposes a text-mining approach in which a mining function is defined to find terms that show the differences between the two corpora. The method was designed not to rely on pre-defined words (usually function words) but to generate candidate words itself. In addition, the number of chapters in which a term appears was taken into consideration, as well as the average frequency of the term's occurrence.

Unlike common textual analysis methods, which use function words to investigate the differences between two texts, the text-mining approach described in this paper lets the mining function itself generate the words.
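The idea of mining candidate terms from the two corpora can be sketched as follows. This is only an illustrative sketch and not the mining function from the paper: the function names, the scoring rule (absolute difference of average relative frequencies), and the `min_coverage` threshold are assumptions made for the example. It only reuses the two ingredients named above, namely the number of chapters in which a term appears and its average frequency.

```python
from collections import Counter


def term_stats(chapters):
    """Return {term: (chapter_count, average_relative_frequency)}.

    `chapters` is a list of token lists, one list per chapter.
    """
    chapter_count = Counter()
    freq_sum = Counter()
    for tokens in chapters:
        if not tokens:
            continue  # an empty chapter contributes nothing
        for term, n in Counter(tokens).items():
            chapter_count[term] += 1
            freq_sum[term] += n / len(tokens)  # relative frequency in this chapter
    return {t: (chapter_count[t], freq_sum[t] / len(chapters)) for t in chapter_count}


def mine_terms(corpus_a, corpus_b, min_coverage=0.5, top_n=5):
    """Rank candidate terms by how differently the two corpora use them.

    Hypothetical scoring: a term must occur in at least `min_coverage` of the
    chapters of one corpus; it is then scored by the absolute difference of
    its average relative frequency in the two corpora.
    """
    stats_a, stats_b = term_stats(corpus_a), term_stats(corpus_b)
    scored = []
    for term in set(stats_a) | set(stats_b):
        chapters_a, freq_a = stats_a.get(term, (0, 0.0))
        chapters_b, freq_b = stats_b.get(term, (0, 0.0))
        coverage = max(chapters_a / len(corpus_a), chapters_b / len(corpus_b))
        if coverage < min_coverage:
            continue  # too rare in both corpora to be a reliable marker
        scored.append((abs(freq_a - freq_b), term))
    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]


# Toy data: each inner list stands for one tokenised chapter.
first_part = [["a", "b", "a", "c"], ["a", "b", "b", "c"]]
second_part = [["d", "b", "d", "c"], ["d", "b", "c", "c"]]
markers = mine_terms(first_part, second_part, top_n=2)
```

No word list is supplied in advance: every term that occurs in either corpus is a candidate, and the chapter-coverage filter keeps terms that appear in only a few chapters from dominating the ranking.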

The second paper, on the other hand, focuses on Yorozu no humihougu and relies on quantitative analysis. After the writer’s death, his followers edited and published his works, which subsequently raised the question of authorship. The research set out to re-examine the text in question. Comparing it to another text known to be the writer's own, the researchers analysed the appearance rate of the six most common word classes: nouns, particles, main verbs, auxiliary verbs, adjectives, and adverbs. The appearance rates were analysed using principal component analysis with a correlation matrix.

Even though this method showed significant differences between the two works, other features, such as the content and date of each work, need to be taken into consideration as well.
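Principal component analysis with a correlation matrix means that every variable is standardised before the components are extracted, so that frequent word classes do not dominate rare ones. The sketch below illustrates this under stated assumptions: the numbers are invented, only three word classes are used for brevity, and the first component is obtained by simple power iteration in plain Python. It is not the researchers' actual procedure or data; in practice a numerical library would normally be used.

```python
import math


def standardise(rows):
    """Centre each column and scale it to unit variance.

    Assumes at least two rows and no constant column.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(m)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardised data."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)] for i in range(m)]


def first_component(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    m = len(matrix)
    # Non-uniform start vector, to avoid starting orthogonal to the answer.
    v = [1.0 / (i + 1) for i in range(m)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pc1_scores(rows):
    """Project each text onto the first principal component."""
    z = standardise(rows)
    v = first_component(correlation_matrix(z))
    return [sum(a * b for a, b in zip(row, v)) for row in z]


# Invented appearance rates (nouns, particles, auxiliary verbs) for four texts:
# the first two rows stand for one work, the last two for the other.
rates = [
    [0.40, 0.30, 0.10],
    [0.42, 0.28, 0.11],
    [0.30, 0.38, 0.15],
    [0.31, 0.40, 0.14],
]
scores = pc1_scores(rates)
```

With these invented rates, the two groups of texts fall on opposite sides of zero on the first component, which is the kind of separation such an analysis looks for. The sign of a component is arbitrary, so only the grouping is meaningful.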

The third paper discussed here presents yet another method of discriminating between two authors. When dictating texts in Latin to her scribes (she had not fully mastered the language), Hildegard of Bingen allowed them to correct her spelling and grammar under her supervision. Nevertheless, the extent to which these manuscripts were altered stylistically, especially by her last secretary, Guibert of Gembloux, remains unverified. The researchers therefore aimed to explore the fine line between her own manuscripts and those (significantly) altered by her secretary. To do so, they employed stylometry, a method that focuses on high-frequency function words rather than content words. Even though function words usually carry little lexical meaning, in this particular case it was precisely these words that her scribes corrected most often. To analyse the corpus, the researchers combined lemmatization (to normalise the orthography of the Latin texts) with principal component analysis, as in the aforementioned paper. The results were in accordance with the Synergy Hypothesis, which states that a style emerging from a collaboration between two authors can differ significantly from their individual styles. As with the previously mentioned methods, stylometry also proved generally valid for analysing corpora.
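The feature-extraction step behind this kind of stylometry can be sketched as a relative-frequency profile over a fixed list of function words. The example below is a minimal illustration under stated assumptions: the word list is a small arbitrary selection of Latin function words, the token list is made up, and the tokens are assumed to be lemmatised already. It does not reproduce the researchers' word list or corpus.

```python
from collections import Counter

# Illustrative selection only; a real study would use many more function words.
FUNCTION_WORDS = ["et", "in", "non", "ad", "ut"]


def function_word_profile(tokens, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `tokens`.

    `tokens` is assumed to be a list of lower-cased, lemmatised word forms.
    """
    if not tokens:
        raise ValueError("cannot build a profile from an empty text")
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in function_words]


# Made-up token sequence standing in for a lemmatised text sample.
sample = "et in terra pax et non in bello".split()
profile = function_word_profile(sample)
```

Each text sample yields one such vector, and a table of these vectors (one row per sample) is the kind of input on which principal component analysis operates.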

To sum up, the availability of these methods and their validity make them useful tools for determining authorship. On the one hand, the quantitative features of the second and third methods have proven their worth when it comes to the frequency of certain words and their appearance ratio. On the other hand, because there is no clear-cut set of features to be examined, different conclusions can emerge. Consequently, the need arose for a more fine-tuned approach, such as text-mining. Thanks to digitisation, old manuscripts and historical works can now be examined from different standpoints, and the trend of combining and upgrading various textual and stylistic methods seems to produce a plethora of new and sound data.


  1. Tu, Hsieh-Chang; Hsiang, Jieh (2013). A Text-Mining Approach to the Authorship Attribution Problem of Dream of the Red Chamber. July 18, 2013. (http://dh2013.unl.edu/abstracts/ab-162.html)
  2. Uesaka, Ayaka; Murakami, Masakatsu (2013). Authorship problem of Japanese early modern literatures in Seventeenth Century. July 19, 2013. (http://dh2013.unl.edu/abstracts/ab-384.html)
  3. Kestemont, Mike; Moens, Sara; Deploige, Jeroen (2013). Stylometry and the Complex Authorship in Hildegard of Bingen’s Oeuvre. July 18, 2013. (http://dh2013.unl.edu/abstracts/ab-126.html)