One very useful approach in statistics is exploratory data analysis. This approach, promoted by John Tukey, explores the data in order to suggest hypotheses for the analyst to test. Using various visual methods (histograms, box plots, PCA, etc.), it efficiently summarizes the main characteristics of the data at hand, such as trends, patterns, or outliers that could merit further study. With the recent rise of Big Data, such techniques have become increasingly important, as the complexity and scale of the data to process have made classical manual database management impossible. This need for extracting and reshaping information from large databases gave birth to a new interdisciplinary field called data mining, at the intersection of artificial intelligence, machine learning, and statistics.
While these techniques are now widely used in science and engineering, they have recently extended their scope to traditionally non-technical fields such as literature, driven by the so-called digital humanities. In this article, we investigate computer-assisted text analysis tools inspired by data mining techniques and show, through a series of real case studies, how they can help the analyst handle large texts.
The first tool we investigate naturally extends histograms from statistics into the text analysis setting. This tool, called Voyant, is one of many online applications providing a frequency-based analysis of a given text. It proposes a visual interpretation of the text as a word cloud (see figure 1), in which words are arranged and weighted (via font size or color) according to their relative importance in the document (number of occurrences).
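As a rough illustration of the frequency analysis behind such a word cloud (the function name and font-size range below are invented for this sketch; this is not Voyant's actual code), one can count word occurrences and map each count onto a font size:

```python
# Hypothetical sketch of a word cloud's frequency analysis: count word
# occurrences, then scale each word's font size by relative frequency.
from collections import Counter
import re

def word_weights(text, min_font=10, max_font=48, top_n=20):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    hi, lo = counts[0][1], counts[-1][1]
    span = max(hi - lo, 1)
    # Linearly map each word's count onto the font-size range.
    return {w: min_font + (c - lo) * (max_font - min_font) // span
            for w, c in counts}

sizes = word_weights("war war war slavery slavery history")
# 'war' (3 occurrences) gets the largest font, 'history' the smallest.
```

Real word cloud renderers add layout and color choices on top, but the weighting step is essentially this mapping from raw counts to visual prominence.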
This visualization not only makes patterns of style, form, and theme easily detectable and interpretable by the analyst, but can also help explore new humanities questions. For example, Simon Appleford and Jason Thatcher set out to understand a widely shared social phenomenon: the American public's perception of the causes of the Civil War. Even though a consensus exists among historians, Americans remain divided on the subject, as revealed by an April 2011 CNN poll in which 42% of Americans said that slavery was not the main cause of the Civil War.
This divided attitude of the population served the two scientists as a testbed to show the potential of visualization tools such as Voyant in traditional humanistic inquiries. For this study, they collected various types of content from across the web (Twitter and Facebook posts, website and blog articles, forum discussions, etc.) referring to the subject through a set of predetermined keywords. They then produced a word cloud of the major terms associated with the Civil War (see figure 2), which revealed several potential avenues for further investigation (for example, it is interesting to see how prominently the battle of Gettysburg appears).
However, one must be very vigilant when using word clouds, as they easily divorce words from their context. They rest on the fundamental assumption of communicative equivalence among words: each instance of a particular word retains the same meaning, force, and valence as all other instances of that word. This is obviously not always the case, so one must always keep track of the context in order to interpret the results of a word cloud. In the study above, Appleford and Thatcher claimed to address this issue by exploring the appearance of a specific word in the cloud: looking at the word cloud generated from that word reveals its connections with the rest of the subject (see figure 2 for the example of the word "history"). But this solution does not seem very convenient, and one should prefer more elaborate tools, inspired by statistical clustering techniques, over word clouds. For example, one could use text networks (see figure 3), which, like word clouds, visualize the preponderant words in the text, but also organize the words into semantic clusters (groups formed according to the density of their connections with the rest of the network), giving the analyst some insight into the context in which each word is used.
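The idea behind such text networks can be sketched in a few lines (this is an illustrative toy, not any particular tool's implementation; the window size and stopword list are assumptions made for the example): link words that co-occur within a small window, and read a word's context off its most strongly connected neighbours.

```python
# Toy co-occurrence network: words appearing within the same window are
# linked, so a word's neighbours hint at the context it is used in.
from collections import defaultdict
import re

STOPWORDS = {"the", "a", "an", "and", "of", "over", "his"}

def cooccurrence_network(text, window=3):
    """Link every pair of content words within `window` positions."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    edges = defaultdict(int)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window]:
            if v != w:
                edges[tuple(sorted((w, v)))] += 1
    return edges

def neighbours(edges, word):
    # A word's most strongly connected neighbours approximate its
    # semantic cluster in the network.
    linked = {(a if b == word else b): n
              for (a, b), n in edges.items() if word in (a, b)}
    return sorted(linked, key=linked.get, reverse=True)

net = cooccurrence_network(
    "the civil war divided the nation and the war over slavery")
```

Tools that draw full text networks additionally run a community-detection step over such an edge list to form the visual clusters, but the underlying signal is this co-occurrence density.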
However, while such techniques can reveal differences between homonyms, they remain unable to detect the nuance of metaphors, which is a critical issue in the quantitative analysis of highly figurative texts such as poetry. To address this challenge, Mark Algee-Hewitt and Ryan Hauser proposed a new approach combining the digital analysis of both the formal structure and the semantic fields within poems in order to reveal the presence of tropological genres such as satire or allegory. To do so, they gathered a corpus of over 1,500 poems, written between 1700 and 1800, that had all been identified as belonging to one of these two genres. They then treated a mismatch between the formal structure and the semantic fields as evidence of a tropological genre. Roughly speaking, they looked for patterns emerging when a topic is mediated through a form for which it is traditionally unsuited, which is a sign of allegory or satire. This pilot project showed how figurative language could be efficiently detected through quantitative analysis methods, helping the analyst avoid misinterpreting the meaning of a word.
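To make the mismatch heuristic concrete, here is a heavily simplified toy sketch; the form labels, semantic lexicons, and convention table below are all invented for this illustration and are not taken from the authors' corpus or pipeline.

```python
# Toy form/semantics mismatch detector: tag a poem's dominant semantic
# field, then flag pairings that violate an assumed convention table.
EXPECTED_FIELDS = {              # form -> topics it traditionally carries
    "heroic couplet": {"war"},
    "pastoral": {"nature"},
}
LEXICON = {                      # semantic field -> characteristic words
    "war": {"battle", "sword", "cannon", "siege"},
    "nature": {"meadow", "brook", "shepherd", "grove"},
}

def dominant_field(words):
    # Score each semantic field by its lexicon's overlap with the poem.
    scores = {f: len(v & set(words)) for f, v in LEXICON.items()}
    return max(scores, key=scores.get)

def mismatch(form, words):
    # A topic mediated through a form it is traditionally unsuited to
    # is read as a sign of satire or allegory.
    return dominant_field(words) not in EXPECTED_FIELDS[form]

flagged = mismatch("pastoral",
                   ["the", "shepherd", "drew", "his", "sword", "and", "cannon"])
# A pastoral form carrying war vocabulary would be flagged here.
```

The actual study works with far richer representations of both form and semantics, but the core logic is this comparison of two independently derived labels.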
To conclude our study, we present the WordSeer project, an environment for text analysis that aims to fill the gap between interpretation and understanding of the text. This software offers various kinds of search and visualization tools, encapsulating most of the traditional techniques of text analysis. However, the emphasis here is placed on the analyst's interaction with the proposed tools (see figure 4), rather than on algorithmic and automatic processing of the text.
In this perspective, the user is placed at the center of the analysis, and the whole software is designed to conveniently accompany the analyst in their task. The analyst can collect and reorganize information, investigate groups of words together, compare two or more visualizations side by side, and annotate or tag items. In development since 2010, the software was recently used by a class of students in the analysis of a Shakespeare play.
In conclusion, we have gained a better understanding of how statistical tools can be successfully extended into the text analysis framework, through a quick overview of the state-of-the-art tools of this field. We also saw what kinds of problems arise in the analysis of highly figurative texts and how those issues can be addressed. Finally, we presented a more user-oriented piece of software that puts the analyst at the center of the procedure. This is a way of reminding the reader that algorithms and data-mining-inspired procedures should remain tools and never replace the expertise of the analyst, who alone is capable of truly exploiting their capabilities through wise usage.
- Appleford, Simon and Thatcher, Jason. Using the Social Web to Explore the Online Discourse and Memory of the Civil War. Short paper, Embassy Regents F, July 19, 2013.
- Algee-Hewitt, Mark Andrew and Hauser, Ryan. Tropes, Context and Computation: An Approach to Digital Poetics. Long paper, July 19, 2013.
- Muralidharan, Aditi. WordSeer: An Integrated Environment for Literary Text Analysis. July 17, 2013.