Much of history can be reconstructed from written documents of the past. To see the whole picture, however, a human would have to spend lifetimes reading these documents, remembering events, names and places, and putting them together in a meaningful way. This is an impossible task: there is simply too much data to process. Computational approaches are therefore used to extract the important information from this vast data set and to link people with other people, with institutions and with events.
Text analysis lies at the heart of understanding the past through the written medium. Once a text has been digitized, it can be processed by a computer. Text analysis tackles the problem of how an algorithm can expose the meaning of a text, classify it, associate it with other documents and create the desired connections.
One of the simplest approaches to text analysis is the word frequency list. It can hint at the breadth of an author’s vocabulary and thus allows cautious conclusions about his or her educational background. But it doesn’t stop there. Combined with statistical tests, the word frequency list can also be used to determine whether two texts were written by the same author or by different authors, since authors tend to reuse the same words and expressions. And because it can differentiate one author from another, it can also be used for grouping texts and for authorship attribution.
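As a minimal sketch of what a word frequency list looks like in practice, the following Python snippet counts words and normalizes the counts by text length. The function names and the very simple tokenization rule (lowercase letters, optional apostrophe) are assumptions made for illustration; real studies usually make more careful decisions about what counts as a word.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (illustrative rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequency_list(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))

def relative_frequencies(text):
    """Divide each count by the text length so that texts of different sizes can be compared."""
    counts = word_frequency_list(text)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}
```

Calling `word_frequency_list(text).most_common(100)` would then give the 100 most frequent words of a text, which is the kind of list that the statistical comparisons start from.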
Before word frequency lists can be compared, they must first be made comparable. Depending on the application, some words should be removed from the lists because they cloud an objective analysis. In his paper “Tuning the word frequency list”, David L. Hoover describes various approaches that are currently in use. These include testing different ranges of MFW (Most Frequent Words); removing words that are highly frequent in only one text, or that are absent from some or many texts; and removing function words such as personal pronouns, whose frequencies differ significantly depending on the gender of the main characters. There are also Burrows’s two categories of words: Zeta words, which are neither highly frequent nor rare, and Iota words, which are extremely rare.
Every time a word is removed from the word frequency list, a hidden assumption is made. These assumptions are not trivial to interpret and are often overlooked when they should be highlighted and discussed.
In his paper, Hoover proposes another method based on the coefficient of variation (CoV), that is, the standard deviation of a word’s frequency across the texts divided by its mean frequency. Only a fixed number of the words with the highest CoV that appear in more than a certain number of texts are retained. Comparing his method with a standard method, Hoover finds that it performs better on a set of texts by only two authors, but states that the t-test is still more effective there. On a data set with many authors and multiple texts each, its performance is comparable to that of the standard method. CoV tuning turns out to be a promising method for authorship attribution, but more research should go into finding the optimal range of MFW and the optimal minimum number of texts in which a word has to appear.
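The selection step can be sketched as follows. This is an illustration of the general idea only, not Hoover’s actual procedure: the function names, the use of the population standard deviation, the "at least `min_texts`" threshold and the parameters `min_texts` and `keep` are all assumptions made for the example.

```python
import statistics
from collections import Counter

def relative_frequencies(tokens):
    """Relative frequency of each word in one tokenized text."""
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}

def cov_tuned_words(texts, min_texts, keep):
    """Return the `keep` words with the highest coefficient of variation
    among the words that occur in at least `min_texts` of the texts."""
    freqs = [relative_frequencies(tokens) for tokens in texts]
    vocabulary = set().union(*freqs)
    scores = {}
    for word in vocabulary:
        values = [f.get(word, 0.0) for f in freqs]
        if sum(1 for v in values if v > 0) < min_texts:
            continue  # word is present in too few texts
        # CoV = standard deviation divided by mean frequency
        scores[word] = statistics.pstdev(values) / statistics.mean(values)
    # Highest CoV first; ties are broken alphabetically for reproducibility
    return sorted(scores, key=lambda w: (-scores[w], w))[:keep]
```

A word that is used at exactly the same rate in every text gets a CoV of zero and drops to the bottom of the ranking, while a word whose rate varies strongly between texts rises to the top. Both `min_texts` and `keep` are exactly the kind of tuning values for which the optimal setting is still an open question.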
Another method for text analysis is explained in Jonathan Pearce Reeve’s paper on “Macro-etymological textual analysis”. Reeve shows that analyzing the origins of loanwords can help quantify the level of discourse, determine the context and reveal stylistic properties. Latinate words (words of Latin origin) are most frequent in learned texts and government documents, but not preponderant in romance and adventure stories. Hellenic words (words of ancient Greek origin) are used more frequently than Latinate words in religious texts. Differences such as these could provide enough information to group documents by category of writing or by context. During experiments with the King James Bible, the algorithm “revealed the gospels Matthew, Mark, Luke, and John to have much higher proportions of Hellenic words than other books” (Reeve, 2014). At first sight, this seems to be because these books were translated from Greek rather than from Hebrew, like most books of the Old Testament, and therefore contain more Hellenic loanwords. On closer inspection of the results, however, Reeve discovered that most of these Hellenic words were in fact names such as “Jesus”. It is important to note that, with all results, the assumptions built into the algorithm have to be taken into consideration.
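The principle behind such an analysis can be illustrated with a toy example. The lookup table `TOY_ORIGINS` below is a tiny, hand-made assumption that stands in for a real etymological dictionary, and the function name and the `ignore` parameter are likewise hypothetical; nothing here reproduces Reeve’s actual tool.

```python
from collections import Counter

# Tiny illustrative table; a real analysis would need a full etymological dictionary.
TOY_ORIGINS = {
    "justice": "Latinate", "liberty": "Latinate", "legal": "Latinate",
    "angel": "Hellenic", "apostle": "Hellenic", "prophet": "Hellenic",
    "king": "Germanic", "house": "Germanic", "water": "Germanic",
}

def origin_proportions(tokens, origins=TOY_ORIGINS, ignore=frozenset()):
    """Share of each language family among the tokens found in the table.
    Tokens listed in `ignore` (for example proper names) are skipped."""
    counts = Counter(
        origins[t] for t in tokens if t in origins and t not in ignore
    )
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}
```

Comparing the result with and without an `ignore` set shows how strongly a single decision, such as whether names count as loanwords, can shift the proportions.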
When trying to reproduce an experiment by Ramsay on the men-only and women-only words in Virginia Woolf’s The Waves, Aaron Plasek and David L. Hoover arrive at a different outcome than Ramsay, even though the task is a simple one. The difference arises from the omission of a final retrospective chapter, which significantly alters the result. Plasek and Hoover stress the importance of reporting such interpretative decisions: firstly, of course, so that results can be reproduced, but more importantly because choosing the assumptions behind an algorithm is an interpretation in itself.
The rarity of these only-words calls the method itself into question, as each of them appears at most four times. With such a small data set, noise and randomness are so high that results can easily be overinterpreted. Highlighting these algorithmic difficulties is almost as important as the result itself and should be part of the critical conversation.
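Why the task is simple, and why it is nevertheless so sensitive to what is included, can be seen in a short sketch. The function name and the idea of passing each group’s speech as a list of tokens are assumptions for illustration and do not reproduce the setup used by Ramsay or by Plasek and Hoover.

```python
from collections import Counter

def only_words(group_a_tokens, group_b_tokens):
    """Words used by group A but never by group B, with their counts in A."""
    counts_a = Counter(group_a_tokens)
    vocabulary_b = set(group_b_tokens)
    return {word: n for word, n in counts_a.items() if word not in vocabulary_b}
```

Because a word leaves the result as soon as it occurs even once in the other group’s text, adding or omitting a single chapter can change the list considerably. Returning the counts alongside the words makes it visible how rare the remaining words are.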
Literary critics at first feared that computational approaches would make the study of literature mechanical, but both Ramsay and McGann state that criticizing a text means putting forth a new text that depends heavily on idiosyncrasy. They further say that computers should help critics take a more objective point of view and will expose information that would otherwise remain unknown. Additionally, Ramsay insists that literary criticism should strive for conclusions that are based on “the nature and depth of the discussions” of the result and not on what the data itself suggests; these algorithms exist to provoke thought and allow insight.
While the search for knowledge nowadays mostly happens on the internet, books and physical documents have not lost their value. They still represent an accepted, reviewed base of expert knowledge that unreliable, random web pages cannot replace. What is happening, though, is that books are being digitized and put on the internet, where they are first analyzed and then linked with other resources. In addition, search engines already use text analysis on all kinds of available documents in order to deliver better search results, and will continue to do so.
In literary research, text analysis presents an opportunity to see what was previously unknown in a more objective manner. In fact, by carefully deciding on the assumptions used in an algorithm, literary critics will realize what kinds of interpretations they have already made.
- David L. Hoover. “TUNING THE WORD FREQUENCY LIST”
- Jonathan Pearce Reeve. “MACRO-ETYMOLOGICAL TEXTUAL ANALYSIS”
- Aaron Plasek and David L. Hoover. “STARTING THE CONVERSATION: LITERARY STUDIES, ALGORITHMIC OPACITY, AND COMPUTER-ASSISTED LITERARY INSIGHT”