Every time we read a text, we consider its author. Authorship is not only important to the reputation, professional advancement, and financial support of individuals; it also allows us to judge the information a text contains in light of what we know about the personality and wisdom of its author, and can even lead to a deeper comprehension of the text. But sometimes the author of a text is unknown. This is the problem of authorship attribution: how can we discover the author of an anonymous text, especially when external evidence is absent and a direct reading of the text does not reveal who wrote it? Fortunately, language is variable. Two individuals of the same generation and locality, speaking the same dialect and moving in the same social circles, are never absolutely at one in their speech habits, which makes it possible for us to obtain hints and evidence by mining the content of the text.
Text mining, as a branch of content analysis, has long been a traditional topic in the digital humanities, covering gender bias, readability, content similarity, reader preferences, and even mood. In the field of authorship attribution, text mining works mainly by comparing the values of textual measurements in the disputed text with the corresponding values in each candidate author’s writing sample to determine which sample is the best match. Here we give a brief explanation of how text mining methods are applied in authorship attribution by summarizing and comparing three articles.
The first article [1] compares Saikaku’s works with three works by Dansui using principal component analysis (PCA). PCA reduces the dimensionality of a dataset consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the dataset. When applied to the frequencies of high-frequency items in texts, PCA often successfully reveals the authorial structure in a dataset. The article applies PCA to the appearance rates of seven principal grammatical categories and of eight principal particles. By plotting the first principal component (x-axis) against the second (y-axis), the article shows that Saikaku’s works and Dansui’s works fall in distinct parts of the plane, indicating that the two authors differ in their use of grammatical categories and particles.
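To make the PCA step concrete, here is a minimal sketch in plain Python. It is not the article’s procedure or data: the feature rates below are invented for two hypothetical authors A and B, and the function names (standardize, power_iteration, pca) are illustrative. The sketch standardizes the rates, extracts the first two principal components by power iteration with deflation, and projects each text onto them, which gives the kind of coordinates one would plot on the x- and y-axes.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(m)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def power_iteration(mat, iters=1000):
    """Leading eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    m = len(mat)
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * mat[i][j] * v[j] for i in range(m) for j in range(m))
    return eigval, v

def pca(rows, k=2):
    """Return (scores, explained-variance ratios, components) for k components."""
    z = standardize(rows)
    n, m = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)]
           for i in range(m)]
    total = sum(cov[i][i] for i in range(m))
    components, ratios = [], []
    for _ in range(k):
        eigval, v = power_iteration(cov)
        components.append(v)
        ratios.append(eigval / total)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(m)]
               for i in range(m)]
    scores = [[sum(r[j] * c[j] for j in range(m)) for c in components]
              for r in z]
    return scores, ratios, components

# Invented appearance rates of three features; rows 0-2 are texts by
# hypothetical author A, rows 3-5 are texts by hypothetical author B.
rates = [
    [0.30, 0.10, 0.20],
    [0.32, 0.11, 0.19],
    [0.29, 0.09, 0.22],
    [0.20, 0.18, 0.21],
    [0.19, 0.20, 0.20],
    [0.21, 0.17, 0.23],
]
scores, ratios, components = pca(rates)
for label, (pc1, pc2) in zip(["A1", "A2", "A3", "B1", "B2", "B3"], scores):
    print(f"{label}: PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```

In this toy dataset the two authors’ texts land on opposite sides of the first component, which is the pattern the article reports for its real data. In practice a library routine would replace the hand-written eigenvector search.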
Cluster analysis, another useful method for text mining, groups similar objects into categories using any of a number of grouping algorithms. The second article [2] uses it to measure Cora’s participation in Stephen’s late fiction. Given that Cora’s and Stephen’s fiction is known to differ in theme and setting, a preliminary cluster analysis shows that Cora’s stories all cluster separately from Stephen’s late stories. The essay then analyzes the questionable stories and states that the presence of Cora’s cluster in Stephen’s stories demonstrates that Cora was a contributor to Stephen’s late stories, but a minor one.
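The idea of grouping texts by similarity can be sketched as follows. This is only an illustration under stated assumptions: the article’s actual clustering algorithm and features are not given here, so the sketch uses agglomerative clustering with average linkage and Euclidean distance, and the frequency profiles for hypothetical authors X and Y and a disputed text are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Invented word-frequency profiles (per 100 words) for illustration only.
profiles = {
    "X1": [4.1, 2.0, 1.1],
    "X2": [4.3, 1.8, 1.0],
    "X3": [4.0, 2.1, 1.2],
    "Y1": [2.9, 3.1, 0.4],
    "Y2": [3.0, 3.3, 0.5],
    "disputed": [3.2, 3.0, 0.6],
}
clusters = agglomerate(profiles, n_clusters=2)
print(clusters)
```

With these made-up numbers the disputed text joins author Y’s cluster. A real study would cluster many more texts on many more word frequencies and inspect the full merge tree rather than a fixed number of clusters.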
The third article [3] conducts an empirical evaluation of improved versions of Burrows’s “Delta”. The main emphasis of this essay is not on a new procedure but on refinements to the existing Burrows’s Delta. Since John Burrows first proposed Delta as a new stylometric measure, it has become one of the most robust distance measures for authorship attribution and has been shown to yield very useful results across different text genres. The essay first shows the distribution of the three most frequent words in its English and German corpora, and then reports the performance of the distance measures on texts in three languages. Together these results lead to the conclusion that the modified Delta performs better.
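For reference, the classic form of Delta can be sketched in a few lines: the relative frequencies of the most frequent words are converted to z-scores across the reference profiles, and Delta is the mean absolute difference between the z-scores of the disputed text and those of a candidate. The sketch below assumes one frequency profile per candidate author and uses invented frequencies; the modified variants evaluated in the article are not reproduced here, and the names (burrows_delta, reference, disputed) are illustrative.

```python
import math

def burrows_delta(reference, disputed):
    """Classic Delta between a disputed profile and each reference profile.

    reference: {author: {word: relative frequency}}
    disputed:  {word: relative frequency}
    """
    words = sorted(disputed)
    n = len(reference)
    means, stds = {}, {}
    for w in words:
        values = [profile[w] for profile in reference.values()]
        means[w] = sum(values) / n
        var = sum((v - means[w]) ** 2 for v in values) / (n - 1)
        stds[w] = math.sqrt(var) or 1.0  # guard against zero spread

    def z(profile):
        return [(profile[w] - means[w]) / stds[w] for w in words]

    z_disputed = z(disputed)
    return {
        author: sum(abs(a - d) for a, d in zip(z(profile), z_disputed)) / len(words)
        for author, profile in reference.items()
    }

# Invented relative frequencies of three very frequent words.
reference = {
    "author_A": {"the": 0.060, "and": 0.030, "of": 0.025},
    "author_B": {"the": 0.045, "and": 0.040, "of": 0.020},
    "author_C": {"the": 0.050, "and": 0.025, "of": 0.035},
}
disputed = {"the": 0.058, "and": 0.031, "of": 0.026}

deltas = burrows_delta(reference, disputed)
best = min(deltas, key=deltas.get)
print(deltas, best)
```

The candidate with the smallest Delta is taken as the most likely author, here the hypothetical author_A. Real applications use far more than three words, typically a list of the most frequent words in the whole corpus.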
Besides the three methods described above, more and more new methods are emerging in authorship attribution, such as machine learning and sequential analysis. Although so many methods exist, the crucial problem is that we do not know which is the best indicator of authorship. To compare them, we need to test them on the same dataset. In fact, all of these methods have drawbacks. For example, PCA can be used to tell different authors apart, but it does little to measure the participation of one author in another’s work. For cluster analysis, the toughest task is organizing the observed data into meaningful structures, which makes it applicable only in certain specific fields. Burrows’s Delta is comparatively general across genres and languages, but it has not been tested on Asian languages. None of these methods is as accurate as we would like, and when we are asked to distinguish among 20 or 30 authors, none of the individual methods is even moderately successful. Combining the methods so as to draw on the advantages of each may be the best solution.
4: J. F. Burrows, “Delta: a measure of stylistic difference and a guide to likely authorship,” Literary and Linguistic Computing 17, pp. 267–287.