
Authorship verification is a hot research topic in the digital humanities community, as can be seen from the papers submitted to the Digital Humanities Conference in Hamburg this year. The problem is to assign an author to a work of unknown authorship. It can be a closed-set problem, where the author must be chosen from a given set of candidates, or an open-set problem, where it must also be possible to state that the work belongs to none of the candidates. Different methods are used to approach the problem, and three of them are described here.

The paper Distractorless Authorship Verification tackles the following question: given a document and a candidate author, how likely is it that the document was written by that author? The method first extracts a set of features from training data known to be written by the author. The features extracted from the document are then compared with those obtained from the author's training set. The similarity is calculated by a measure such as the cosine of the angle between the two feature vectors, and if it lies above a certain threshold, the document is accepted as written by the author. An important choice here is which features to compare. The paper uses character n-grams, that is, the frequencies of sequences of n consecutive characters occurring in the works.
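The comparison step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the n-gram length of 4, and the threshold of 0.5 are all hypothetical choices, and in practice the threshold would have to be tuned on held-out data.

```python
import math
from collections import Counter


def char_ngram_profile(text, n=4):
    """Relative frequencies of overlapping character n-grams in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def verify(known_texts, questioned_text, threshold=0.5, n=4):
    """Accept the candidate author if the similarity reaches the threshold.

    The threshold value is an assumption made for this sketch.
    """
    author_profile = char_ngram_profile(" ".join(known_texts), n)
    doc_profile = char_ngram_profile(questioned_text, n)
    score = cosine_similarity(author_profile, doc_profile)
    return score >= threshold, score
```

Calling `verify(known_texts, questioned_text)` returns a pair: whether the document is accepted, and the similarity score itself. No other authors (distractors) are needed, because the decision depends only on the candidate author's profile and the threshold.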

The paper Characterizing Authorship Style Using Linguistic Features proposes a verification method based on how an author refers to people. The method uses elements from two categories: semantic (personal names) and syntactic (how personal names are referred to within the text). Two example sentences given in the paper are “Robert Ryan is the prey he captures, along with the girl Janet Leigh” and “Lois Lane and Clark Kent are sent to cover a circus”. These sentences, which come from different authors, show different ways of referring to personal names. The first uses apposition, where elements are placed next to each other as in “the girl Janet Leigh”, and the second uses personal name grouping. The test data is taken from the IMDb62 collection of movie comments. The reason given in the paper for choosing this collection is that its authors often refer to personal names when summarizing movie plots.

Another interesting paper is Evaluating Unmasking for Cross-Genre Author Verification. Here authorship is verified across genres, since an author may produce work in several of them, such as Victor Hugo, who wrote novels and plays (and much more). The method tested for this purpose is called unmasking. Unmasking is an iterative process that compares two texts, one of unknown authorship and one of known authorship. In each iteration, the most discriminative features between the two texts are removed. If the two works are by the same author, there are fewer discriminative features than when the authors are different. Thus, when the discriminative features are removed from two texts by the same author, the remaining texts are much more similar than before, which shows up as a big drop in dissimilarity. When the same operation is applied to texts by different authors, there are more discriminative features, so although the remaining texts also become more similar, the drop in dissimilarity is not as large as in the same-author case. New texts are classified by feeding the dissimilarity curves recorded over the iterations to a machine learning algorithm (such as a Support Vector Machine).
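The iterative removal can be illustrated with a deliberately simplified sketch. It is not the procedure evaluated in the paper: here dissimilarity is taken to be the Manhattan distance between word-frequency profiles, and the features removed in each round are simply the words with the largest frequency gap. The function names and the parameters `vocab_size`, `drop_per_round`, and `rounds` are hypothetical, and the final step of training a classifier on the curves is left out.

```python
from collections import Counter


def word_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return {w: counts[w] / total for w in vocabulary}


def unmasking_curve(known, unknown, vocab_size=50, drop_per_round=3, rounds=5):
    """Dissimilarity before each round of removing the top discriminators."""
    # Start from the most frequent words of the two texts combined.
    combined = Counter((known + " " + unknown).lower().split())
    features = [w for w, _ in combined.most_common(vocab_size)]
    curve = []
    for _ in range(rounds):
        if not features:
            break
        fk = word_frequencies(known, features)
        fu = word_frequencies(unknown, features)
        gaps = {w: abs(fk[w] - fu[w]) for w in features}
        # Manhattan distance over the features that are still in play.
        curve.append(sum(gaps.values()))
        # Remove the features that currently separate the two texts the most.
        worst = sorted(features, key=gaps.get, reverse=True)[:drop_per_round]
        features = [w for w in features if w not in worst]
    return curve
```

The returned list is the kind of dissimilarity curve the paragraph above refers to. The shape of such curves, in particular how steeply they fall, is what a classifier would then be trained on to separate same-author pairs from different-author pairs.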

To summarize the authorship verification trend in the digital humanities community: new methods are being tried out, such as making use of syntactic information around personal names, and previously developed methods are being tested with different descriptors or author/text sets, such as unmasking applied to cross-genre author verification, or distractorless verification. Looking at the performance of the methods studied, accuracy can be high with a finite author-text set or with certain assumptions imposed on the texts, but authorship verification in the general case remains an open problem.
