Text analysis is a huge topic in the field of Digital Humanities. Various ingenious approaches have been developed to study and preserve old manuscripts, poems, sonnets, dramas, and the like. All of them share the same goal: how do we make the best possible copy of human thinking?
First, we travel to the era of Newton and dive into his corpus, the record of an endless alchemical quest (for more about Newton’s alchemy). The corpus is multilingual: Newton very often switches from English to Latin and French, which makes it a hard nut to crack for language identification. The usual approaches assume that a text is monolingual, so they cannot be applied to the Newton corpus. In this study, King Levi et al. developed a new method that identifies the language of single words rather than of the complete text. The approach assumes that texts are multilingual: it automatically segments the text into words and determines the language of each word from the frequency of its n-grams (for more about n-grams). The method is fairly successful, with an accuracy of ~90%, although it leaves room for further improvement.
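To make the word-level idea concrete, here is a minimal sketch in Python. It is not the method from the study itself: the tiny word lists, the function names, the choice of character trigrams, and the add-one smoothing are all assumptions made purely for illustration. Each word is scored against a per-language n-gram frequency profile and tagged with the best-scoring language.

```python
import math
from collections import Counter

# Hypothetical toy training data; a real system would use large word lists.
TOY_SAMPLES = {
    "english": ["the", "and", "with", "water", "fire", "gold", "stone"],
    "latin": ["aqua", "ignis", "aurum", "lapis", "et", "cum", "terra"],
}


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with boundary markers."""
    padded = f"^{word.lower()}$"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Count n-gram frequencies separately for each language."""
    profiles = {}
    for lang, words in samples.items():
        counts = Counter()
        for word in words:
            counts.update(char_ngrams(word, n))
        profiles[lang] = counts
    return profiles


def identify_word(word, profiles, n=3):
    """Pick the language whose profile gives the word the highest score."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    best_lang, best_score = None, float("-inf")
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Sum of log-probabilities with add-one smoothing (an assumption here).
        score = sum(
            math.log((counts[gram] + 1) / (total + len(vocab)))
            for gram in char_ngrams(word, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


def tag_text(text, profiles, n=3):
    """Segment a text on whitespace and tag every word with a language."""
    return [(word, identify_word(word, profiles, n)) for word in text.split()]


profiles = train_profiles(TOY_SAMPLES)
print(tag_text("aqua and ignis", profiles))
```

Because the decision is made per word, a sentence that switches language midway simply gets mixed tags, which is the behaviour a multilingual corpus needs. With such small profiles the sketch will misjudge many real words; it only shows the shape of the approach.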
The next story takes us to the time of Shakespeare. A rather bloody and poisonous atmosphere has not stopped Shakespeare’s dramas from becoming a never-ending resource for humanities researchers. Muralidharan Aditi presents a new “sensemaking” environment for literature analysis, WordSeer (for more about WordSeer). The program treats text analysis as a cyclic rather than a linear process. With WordSeer you can read a text, find relationships between words and phrases, study grammatical relationships, and examine the heat map and tree visualizations it produces. WordSeer was tested in a study conducted at the University of Calgary, where undergraduate students were assigned a project to analyze a topic within an act of Hamlet (you can find more about the students’ results here). Several features turned out to be poorly supported, such as investigating a group of words together or comparing two or more visualizations side by side. Muralidharan then presents a new version of WordSeer in which these flaws have been addressed.
Our final stop brings us to the field of Latin poetry. The Tesserae project (more about Tesserae project) at Buffalo University tries to digitally sniff out literary allusions in classical Latin poetry. Since literary allusions are based on text reuse, the first step was to collect a large set of textual parallels. A model was then built to assess which instances of text reuse are meaningful allusions and which are not, as judged by a group of readers. The parallels in the data set were categorized by a group of students and professors, and modern machine learning techniques were applied to this categorized data. The results showed that low-rank and high-rank parallels can be told apart with an accuracy of ~80%. Although these results are rather modest, the research shows great potential for further improvement.
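The first step, collecting candidate parallels, can be sketched in a few lines of Python. This is only one simple way to do it, assumed here for illustration rather than taken from the project: two lines count as a candidate parallel when they share at least `min_shared` word forms. The function names, the threshold, and the short lines used as input are invented for the example and are not real quotations.

```python
import re
from itertools import product

# Invented toy lines for illustration only, not real quotations.
SOURCE = ["arma virumque cano", "litora multum ille"]
TARGET = ["cano arma nova", "nox erat"]


def tokens(line):
    """Lowercase a line and return its set of word forms."""
    return set(re.findall(r"[a-z]+", line.lower()))


def candidate_parallels(source_lines, target_lines, min_shared=2):
    """Return (source index, target index, shared words) for every pair
    of lines that has at least `min_shared` words in common."""
    hits = []
    pairs = product(enumerate(source_lines), enumerate(target_lines))
    for (i, src), (j, tgt) in pairs:
        shared = tokens(src) & tokens(tgt)
        if len(shared) >= min_shared:
            hits.append((i, j, sorted(shared)))
    return hits


print(candidate_parallels(SOURCE, TARGET))
```

Matching exact word forms misses inflected variants, which matters a great deal in Latin, so a realistic version would compare dictionary forms instead. The candidates found this way are only raw material: deciding which of them are meaningful allusions is the part that needs human ranking and a trained model.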
Each of the presented papers takes its own peculiar approach to the same topic, text analysis. The first has a rather narrow application, since its use cases are few. The second and third present new approaches to dealing digitally with words in literature. All three are a step toward a better understanding of how words can be digitized and analyzed, and all of them have to deal with the ambiguity of human writing, which sometimes cannot be captured in a programming language. One cannot help asking: can we build a machine that functions like a brain?