Sometimes the author of a certain work is unknown. Authorship attribution algorithms help when physical evidence is not available due to missing originals or dictation. An example would be the three collaborative works of Joseph Conrad and Ford Madox Ford (The Inheritors (1901), Romance (1903) and The Nature of a Crime (1909, 1924)), where the exact division of labour is not known.

One way of analyzing such texts is described by Jan Rybicki, David Hoover and Mike Kestemont. They use a modified version of the well-known “Delta” algorithm (Burrows 2002), which uses the relative frequencies of the n most frequent words to calculate a so-called centroid. The modified algorithm, called “Rolling Delta”, analyzes a text in equal-sized, partially overlapping segments. Each of those segments is compared against the previously calculated centroid of the reference works. Sudden drops in the resulting curve indicate a change of author.
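The idea can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions, not the authors' implementation: tokenization is a plain whitespace split, word frequencies are z-scored across the reference texts as in Burrows's original Delta, an author's centroid is taken to be the mean z-score profile of that author's reference texts, and the function names and the default values for `n_words`, `window` and `step` are made up for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return text.lower().split()


def rel_freqs(tokens, vocab):
    # Relative frequency of each vocabulary word in a token list.
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]


def rolling_delta(target, references, n_words=100, window=5000, step=1000):
    """Compare overlapping windows of `target` with one centroid per author.

    `references` maps an author name to a list of reference texts.
    Returns a list of (window_start, {author: delta}) pairs.
    """
    ref_tokens = {a: [tokenize(t) for t in texts] for a, texts in references.items()}
    all_refs = [toks for docs in ref_tokens.values() for toks in docs]

    # Vocabulary: the n most frequent words over all reference texts.
    pooled = Counter(w for toks in all_refs for w in toks)
    vocab = [w for w, _ in pooled.most_common(n_words)]

    # Mean and standard deviation of each word's frequency across references.
    profiles = [rel_freqs(toks, vocab) for toks in all_refs]
    mus = [mean(col) for col in zip(*profiles)]
    sigmas = [pstdev(col) or 1.0 for col in zip(*profiles)]  # avoid division by 0

    def zscores(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, mus, sigmas)]

    # Centroid per author: mean z-score profile of that author's texts.
    centroids = {
        a: [mean(col) for col in zip(*(zscores(rel_freqs(t, vocab)) for t in docs))]
        for a, docs in ref_tokens.items()
    }

    tokens = tokenize(target)
    results = []
    # With step < window, consecutive windows partially overlap.
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        seg = zscores(rel_freqs(tokens[start:start + window], vocab))
        deltas = {
            a: mean(abs(x - c) for x, c in zip(seg, cen))
            for a, cen in centroids.items()
        }
        results.append((start, deltas))
    return results
```

In this sketch a lower Delta means that a window is closer to an author's centroid, so plotting the per-author values against the window start gives the kind of curve in which a change of author shows up as a sudden shift.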

[Figure ab-121.002: Rolling Delta plot for “The Inheritors”]

For example, chapters 16 and 17 of “The Inheritors” (between lines a and b in the graphic) can be identified as written by Conrad, whereas the rest of the novel is mainly in Ford’s style. Surprisingly, Ford’s writing style even survived Conrad’s editing, both in “The Inheritors” and in “The Nature of a Crime”. On the other hand, no indication of a contribution by Ford to Conrad’s “Nostromo” can be found, even though Ford claimed to have written some parts.

Another article, by David Hoover, describes a method under development that does not rely on the most frequent words but attempts to use the whole spectrum. Again based on Burrows’s work (Burrows 2007: 36), two categories of words (Zeta and Iota) are described. Zeta words are neither extremely rare nor very common, whereas Iota words are extremely rare. In the well-known version of the algorithm used by Hugh Craig, the known texts are first divided into equal-sized segments (Craig and Kinney 2009). For each word, one then calculates the percentage of one author’s reference segments that contain the word and the percentage of the other author’s segments that do not contain it. These two percentages are added together. The result is a list of marker words that are used by one author but avoided by the other. Based on how often those marker words occur, it is then possible to cluster the parts written by each author.
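The scoring step of Craig’s Zeta described above can be sketched as follows. This is a simplified illustration rather than the code used in the cited work: the whitespace tokenization, the function names and the segment size are assumptions made for this example, and any leftover tokens at the end of a text are simply dropped.

```python
def segments(text, size):
    # Split a text into equal-sized segments; each segment is reduced to
    # its set of word types, since only presence or absence matters here.
    tokens = text.lower().split()
    return [set(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]


def craig_zeta(texts_a, texts_b, size=1000):
    """Score every word: share of A's segments containing it
    plus share of B's segments lacking it (range 0.0 to 2.0)."""
    segs_a = [s for t in texts_a for s in segments(t, size)]
    segs_b = [s for t in texts_b for s in segments(t, size)]
    if not segs_a or not segs_b:
        raise ValueError("each author needs at least one full segment")

    vocab = set().union(*segs_a, *segs_b)
    scores = {}
    for word in vocab:
        in_a = sum(word in s for s in segs_a) / len(segs_a)
        not_in_b = sum(word not in s for s in segs_b) / len(segs_b)
        scores[word] = in_a + not_in_b
    return scores


def marker_words(scores, k):
    # Words with the highest scores are preferred by A and avoided by B;
    # ties are broken alphabetically to keep the result deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def marker_share(segment_types, markers):
    # Hypothetical helper: fraction of a segment's word types that are markers.
    if not segment_types:
        return 0.0
    return len(segment_types & set(markers)) / len(segment_types)
```

A score close to 2 marks a word that author A uses consistently and author B avoids, while a score close to 0 marks the opposite. Computing `marker_share` for each segment of a disputed text, once with A’s markers and once with B’s, gives two coordinates per segment that can be plotted to see how the segments cluster.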

[Figure ab-124.002: results for the works of Conrad and Ford]

The distinction obtained with Zeta words is acceptable but not very sharp. It becomes clearer if only Iota words are used. The author’s goal, however, is to use the whole spectrum of words. The results are shown in the graphic: Conrad’s work is in the lower right corner and well separated from Ford’s. Using all the words at once might help to avoid some of the problems that occur when different methods are used for different texts. The results are promising, but the method needs to be tested on more works and authors.

The two works describe different algorithms for the same goal, namely to determine who wrote which part of a collaborative work. The methods, however, are quite different. The first depends on the statistical distribution of the most common words, whereas the second is based on the presence or absence of so-called marker words. One problem with the second method is that those marker words have to be calculated anew for each pair of authors to be distinguished.

A common problem of both methods is that they are not available as easy-to-use tools. As a solution, Maciej Eder, Mike Kestemont and Jan Rybicki programmed a suite of tools in R with a graphical interface. The reason is that, in their experience, “humanists might be allergic to the raw command-line mode provided by R”. The goal was to put everything into a single script called “Stylo”, which is easy to use, accepts a variety of input formats (plain text, XML, HTML) and provides different analysis algorithms. All plots can be exported in several formats, as can the raw data, so that further statistical analysis is possible. Furthermore, preprocessing steps are integrated, such as removing personal pronouns, which are specific to a certain text and not representative of the author. Finally, the suite supports different classification algorithms, including “Rolling Delta” and Craig’s version of Zeta. The authors believe that one day out-of-the-box scripts will cover many of the known methods used in stylometry.

The availability of such an easy-to-use tool not only allows scholars without programming knowledge to analyse data. If reasonably documented, it is also well suited for use in class. The authors’ experience shows that the tool provides a workaround for the difficulties beginners face with R, without any loss of computing performance.

The scripts are available at
https://sites.google.com/site/computationalstylistics/

References

http://dh2013.unl.edu/abstracts/ab-121.html
http://dh2013.unl.edu/abstracts/ab-124.html
http://dh2013.unl.edu/abstracts/ab-136.html

Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22: 27-47.

Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17: 267-287.

Craig, H., and A. Kinney (eds) (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press.
