From a very general point of view, the authorship problem covers all the issues that arise when associating an author with a specific document.

Especially when dealing with ancient and old literature, it is of great importance to correctly attribute a text or a play (or any kind of document) to its real author (or authors). This process is particularly difficult for documents with no stated author or documents that exist in multiple editions, but we also want to be sure that whoever claims the authorship of a document is in fact its actual author.

For this task, authorship attribution techniques are a very versatile tool: they can help determine not only whether a person is the real author, but also whether a document has one or more authors and, in the latter case, which author wrote which part. It is also important to underline that, since these tools usually focus on only a few aspects of the text, a more complete analysis should be done to confirm the results; their output should not be taken as is or for granted.

The purpose of this article is to present different techniques applied to different authorship problems, and to show the strength of each technique and the flexibility of this field (since it is not limited to a single type of document).

Dream of the Red Chamber [1]
Dream of the Red Chamber is among the greatest Chinese classic novels. Its first edition, containing only the first 16 chapters, was published in 1754 by Cao Xueqin. The peculiarity of this book lies in its successive editions, each containing more chapters, up to the earliest existing version with all 120 chapters, published in 1791 and edited by other authors (Cheng Weiyuan and Gao E). Because the book kept growing in this way, it was not clear whether all the chapters were written by Cao Xueqin or whether some were written by someone else and, if so, by whom.
Research on this attribution problem started in the 20th century, with researchers split into two groups: one theorizing that Cao Xueqin was the only author, the other stating that Cao Xueqin wrote only the first 80 chapters while the last 40 were written by Gao E.
While the first techniques were all similarly based on comparing, between the different chapters, the frequencies of specific linguistic features chosen by the researchers (function words), a different approach was introduced in 2003 by Yang, who was trying to establish which of the two authorship theories was correct.
Yang’s method was more flexible, since it compared the frequencies of unigrams between 12 groups of 10 chapters each. As a result he found strong similarities among the first 2 groups, among the next 6 groups (chapters 21-80) and among the last 4 groups (chapters 81-120). The break after chapter 80 led him to conclude that the two-author theory was the correct one.
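The group-wise comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Yang’s actual procedure: the source does not say which similarity measure was used, so cosine similarity is swapped in here, and each non-space character is treated as a unigram. The function names (unigram_profile, similarity_matrix and so on) are hypothetical.

```python
from collections import Counter
from math import sqrt


def unigram_profile(text):
    """Relative frequency of each unigram (assumption: one non-space character)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (assumed measure)."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def group_chapters(chapters, size=10):
    """Concatenate consecutive chapters into groups of `size` chapters."""
    return ["".join(chapters[i:i + size]) for i in range(0, len(chapters), size)]


def similarity_matrix(chapters, size=10):
    """Pairwise similarity between the unigram profiles of all chapter groups."""
    profiles = [unigram_profile(group) for group in group_chapters(chapters, size)]
    return [[cosine_similarity(p, q) for q in profiles] for p in profiles]


if __name__ == "__main__":
    # Toy stand-in for real chapters: two "styles" of 20 chapters each.
    toy_chapters = ["aab"] * 20 + ["bbc"] * 20
    for row in similarity_matrix(toy_chapters):
        print([round(value, 2) for value in row])
```

With 120 real chapters the matrix would be 12 by 12, and blocks of high values along the diagonal would correspond to runs of chapter groups with similar unigram usage.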
Yet another technique was used a few years later to confirm this second theory. It was based on a text-mining function computed over the terms of documents (chapters) collected in two sets (in our case A, the first 80 chapters, and B, the last 40 chapters). The output of this function was an index comparing the frequency of each term in A with its frequency in B. Running the analysis on both unigrams and bigrams highlighted similar behaviour: many terms occurred mostly in one set or the other, showing an actual difference in writing style between A and B and supporting the two-author theory.
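To make the idea of a per-term index concrete, here is a minimal Python sketch. The exact function used in [1] is not given in this article, so the formula below, (fA - fB) / (fA + fB) on relative frequencies, is an assumed stand-in chosen only because it is easy to read: +1 means the term occurs only in A, -1 only in B, and values near 0 mean similar usage in both sets. The name preference_index and the pre-tokenized input format are likewise hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_freqs(documents, n):
    """Relative frequency of every n-gram over a set of tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def preference_index(set_a, set_b, n=1):
    """Assumed index in [-1, 1]: positive if a term is relatively more frequent in A."""
    freq_a = relative_freqs(set_a, n)
    freq_b = relative_freqs(set_b, n)
    index = {}
    for term in freq_a.keys() | freq_b.keys():
        a, b = freq_a.get(term, 0.0), freq_b.get(term, 0.0)
        index[term] = (a - b) / (a + b)  # a + b > 0: term occurs in A or B
    return index


if __name__ == "__main__":
    # Toy stand-ins for the two chapter sets, already split into tokens.
    set_a = [["x", "y", "x"]]
    set_b = [["y", "z"]]
    for n in (1, 2):  # unigrams, then bigrams
        ranked = sorted(preference_index(set_a, set_b, n).items(), key=lambda kv: kv[1])
        print(n, ranked)
```

If many terms end up close to +1 or -1 for both unigrams and bigrams, the two sets use noticeably different vocabulary, which is the kind of evidence the study relied on.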

Hildegard of Bingen [2]
In a more subtle way, different techniques can be used to study different writing styles in texts from the same author, as in the case of the work of Hildegard of Bingen, a famous prophetess who, because of her poor knowledge of Latin, always relied on scribes for her writings. While her other collaborators were only allowed to make superficial linguistic changes, her last collaborator, Guibert of Gembloux, was asked to render her language stylistically more elegant.
In such a case it is natural to wonder to what extent the authorship is attributable to Hildegard or to Guibert.
The analysis in this case focused on the differences in writing style between Hildegard and Guibert, comparing three groups of texts: letters from Guibert, letters from Hildegard, and letters from Hildegard written during Guibert’s secretaryship. The three groups displayed three distinct styles, supporting the Synergy Hypothesis, according to which a text resulting from collaboration displays a style markedly different from that of each collaborating author.
This comparison helped in evaluating the dubious writings Visio de sancto Martino and Visio ad Guibertum missa, and the results of the analysis were in fact more inclined to attribute their authorship to Guibert. Again, though, this type of technique only evaluates the texts at a superficial and purely stylistic level, and its outcome cannot (and should not) be confused with the authorship of the content.

Tools and research [3]
Moving from authorship attribution examples to a more technical point of view, we find stylometry tools: on one side stand-alone dedicated programs, on the other simpler existing software, each usually used at a different stage of the analysis. In between, as in the analysis of Hildegard’s work, we find powerful tools like R, a language and environment for statistical computing and graphics. The strength of this language lies in its broad range of uses and its flexibility, which combines a large set of predefined analysis functions with the possibility of building statistical applications from scratch. This versatility makes it a very helpful resource for researchers both with and without programming skills. Of course, the former will find further utility in the possibility of creating more sophisticated, dedicated scripts and designing new techniques.
Examples of predefined tools for stylometry analysis written in R are:
stylo, the main and most versatile tool;
classify, mostly derived from the stylo script;
rolling delta, which analyzes collaborative work and assigns an author to each fragment;
oppose test, a contrastive analysis between two given sets of texts;
keywords, based on word frequencies.

A more detailed explanation of these tools can be found in [4].
The three cases presented in this article are a clear example of the strength and flexibility of authorship attribution techniques, used in different situations to test different hypotheses. Even if it is not possible to give complete assurance of the correctness of the results, research in this field has made huge steps forward using more and more advanced techniques. Applying different techniques has allowed researchers to find the flaws in previous ones and to compare their outcomes to obtain more meaningful and more complete results.


1. A Text-Mining Approach to the Authorship Attribution Problem of Dream of the Red Chamber.

2. Stylometry and the Complex Authorship in Hildegard of Bingen’s Oeuvre.

3. Stylometry with R: a suite of tools.