The issue of authorship attribution has deep historical roots. The traditional attribution of the Iliad and the Odyssey to Homer, and the analysis of multiple authorship in the Old Testament, are two examples. Nevertheless, authorship attribution is also relevant to modern questions. Juola (Department of Mathematics and Computer Science, Duquesne University), in his work Authorship Attribution [4], gives the example of the book «Imperial Hubris: Why the West is Losing the War on Terror», published in 2004 by Potomac Books, which recounts the experience of an anonymous author, presumed to be a senior US intelligence official, arguing that US policy on the war against terror was misguided. Juola pointed out the following: «According to the July 2, 2004 edition of the Boston Phoenix, the actual author was Michael Scheuer, a senior CIA officer and head of the CIA’s Osama bin Laden unit in the late 1990s. If true, this would lend substantial credibility to the author’s arguments.» [4]. Indeed, this is a case where a presumed attribution could be analyzed with stylometric methods in order to quantify uncertainty about the author.

The question of authorship, though simple to state, turns out to be a vast and extremely complicated one. Indeed, even if texts seem immutable, they are subject to change: texts have their own dynamics through transcription, commentary, added punctuation, digitization, and censorship. In this context, authorship recognition and stylometry (i.e., the study of linguistic style in texts) have become hot topics in the realm of digital humanities. I will try to convince you of this through three recent projects in the field. But first, some terminology.

  1. The closed problem: we start with a sample of texts by various known authors as a training set and try to assess, using machine learning algorithms, the authorship of a different sample of texts, called the test set (generally smaller than the training set).
  2. The open problem: we have an anonymous text and try to answer the question ‘Who was the author?’. This is a much harder task than the closed problem.
  3. Stylometry: a less ambitious task, though the right starting point, consisting of analyzing the variation of linguistic style within texts and detecting potential multiple authorship.
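To make the closed problem concrete, here is a minimal sketch. The author names, toy texts, and the deliberately crude feature set (plain word frequencies compared by cosine similarity) are my own illustrative assumptions; real studies use full texts and richer features such as function words, character n-grams, or Burrows' Delta.

```python
# Minimal sketch of the "closed problem": given training texts by known
# candidate authors, attribute a test text to the closest candidate.
from collections import Counter
import math

def profile(text):
    """Relative word frequencies: a crude stylistic fingerprint."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(test_text, training_set):
    """Pick the candidate author whose profile is closest to the test text."""
    test_profile = profile(test_text)
    return max(training_set,
               key=lambda author: cosine(test_profile, profile(training_set[author])))

# Invented toy corpus: two "authors" with distinct vocabularies.
training = {
    "Author A": "the cat sat on the mat and the cat purred softly on the mat",
    "Author B": "ships sail across stormy seas while sailors watch the waves",
}
print(attribute("the cat purred on the mat", training))  # → Author A
```

Of course, with two candidates and a handful of words this is trivial; the difficulty of the closed problem grows with the number of authors and the subtlety of their stylistic differences.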

Let me now introduce three projects on the subject, presented at the DH2015 conference, to give a better idea of the domain.

Hoover (New York University, United States) recently studied the question of Crane’s posthumous publications in his project Cora Crane’s Contribution to Stephen Crane’s Posthumous Fiction [1]. Stephen Crane (1871-1900), considered an innovative writer, died young (at 28), and considerable uncertainty remains about the authorship of his posthumously published texts. Indeed, his wife Cora Crane is thought to have contributed to some extent to these texts. This situation is an example of a closed problem: we have two possible authors (Stephen and Cora Crane) and the task of detecting possible multiple authorship in these texts. Though the study is not yet finished, Hoover’s current conclusion, based on clustering methods, is that Cora’s contribution, if any, was not a major one.

An example of stylometry can be found in Eder’s (Pedagogical University, Krakow, Poland) work Through the Magnifying Glass: Rolling Stylometry for Collaborative Authorship [2]. In this article, Eder presents a new method designed to analyze collaborative authorship, thus addressing the third type of problem. His simple idea is to split the whole text into equal-sized chunks and then apply similarity tests based on various machine learning algorithms. He applied his method to the Latin translation of the Bible known as the ‘Vulgate’, trying to detect multiple authorship through changes in style. His most interesting result is a significant change of style between the two Testaments. The text segmentation is shown in Figure 1.
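To illustrate the idea, here is a toy sketch, not Eder's actual implementation (which, to my knowledge, lives in the 'stylo' R package and uses rolling windows with classifiers such as Delta or SVM). It cuts a text into equal-sized chunks and flags boundaries where consecutive chunks look stylistically dissimilar; the chunk size, threshold, and sample phrases are invented for illustration.

```python
# Toy sketch in the spirit of rolling stylometry: slice the text into
# equal-sized chunks and compare each chunk with the next. A sharp drop in
# similarity hints at a stylistic break, e.g. a change of author or translator.
from collections import Counter
import math

def word_profile(words):
    """Relative word frequencies of a list of words."""
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def style_breaks(text, chunk_size=20, threshold=0.3):
    """Indices of chunk boundaries where consecutive chunks look dissimilar."""
    words = text.lower().split()
    chunks = [words[i:i + chunk_size]
              for i in range(0, len(words) - chunk_size + 1, chunk_size)]
    profiles = [word_profile(c) for c in chunks]
    sims = [cosine(profiles[i], profiles[i + 1]) for i in range(len(profiles) - 1)]
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Synthetic example: two 40-word halves with disjoint vocabularies.
text = ("the law and the covenant of the fathers " * 5 +
        "grace mercy truth came through word made flesh " * 5)
print(style_breaks(text))  # → [2]: a break between chunks 1 and 2, the midpoint
```

On real texts the signal is far noisier, which is why Eder's method relies on overlapping windows and proper classifiers rather than a single similarity cutoff.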


Figure 1: Segmentation of styles in the Vulgate. (black and red colors represent different styles.)

Finally, the question of the scalability of current methods arises, as Eder examined in his work Taking Stylometry to the Limits: Benchmark Study on 5,281 Texts from “Patrologia Latina” [3]. Indeed, at large scale the problem of authorship analysis becomes extremely complicated, and methodological questions naturally arise regarding the efficiency of current state-of-the-art algorithms. Eder performed a benchmark test on a final sample of about 1,500 texts from around 200 different authors. He pointed out the surprisingly poor overall accuracy of all the algorithms when faced with such a large number of authors. Interestingly, however, he concluded that punctuation often seems to play an important role in large-scale classification.

We can clearly see through these three works that much remains to be done in this field. Current algorithms are suited to problems where the task is guided (known authors, large training sets, stylometry on long texts) and simple (a small number of authors). But when facing complex dependencies across texts or a large number of writers, their accuracy is questionable. It is also evident that stylometry is very close to natural language processing, which is currently an extremely active area of research.

Finally, though still in its infancy, authorship attribution is probably one of the key future subjects in digital humanities. Indeed, understanding and detecting changes in linguistic style would be an important breakthrough: one would be able to better source texts, organize them, and perhaps uncover more complex and surprising dependencies among texts.