
In this century, we have accumulated an enormous amount of information, which raises a serious question: what are we humans supposed to do with this colossal amount of data? Google has estimated that there are around 129,864,880 books in the world. There is far too much information for any human to read in an entire lifetime. Franco Moretti argued that it is better to understand and analyse large amounts of data than to read only particular texts and books. Moretti termed this process of understanding and analysing literature at large scale “distant reading”. According to him, even if a human “closely” read many books, the number of books read would still be very small in comparison to the amount of data available in the world.

Nowadays, DH researchers and scientists follow Moretti’s principle of “distant reading” and try to develop computationally efficient and accurate approaches to extract important information from vast amounts of data. This process of extracting important information from data is termed text mining or text analysis. Text analysis includes information visualization, efficient annotation of texts, interactive visual analysis, recognition of patterns in texts, and link and association analysis. Recent progress in the digitization of books has made it easier to “distant read” documents with text-mining tools.

One of the most common approaches to extracting important information from a document is to annotate it. However, manually annotating a document is time-consuming because it requires discussion as well as agreement among specialized researchers. To facilitate collaborative annotation, [1] developed a web-based interface that automatically suggests annotations and allows multiple users to annotate the same document simultaneously. They deployed it on the Japanese historical document “Todaiji Yoroku” and achieved a precision of 75%. The tool proved effective because the model suggested good annotations that the researchers had actually missed. The model extracts key features from the document and then suggests relevant annotations: a support vector machine (SVM) learns annotation patterns from an existing annotated text, using estimated word segmentation as features, and the learnt patterns are then used to extract key phrases and suggest annotations for a non-annotated text. Since the Japanese writing system has no spaces, the authors cleverly used the boundaries between Chinese (kanji) characters and hiragana characters to estimate word segmentation.
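The kanji/hiragana boundary heuristic can be sketched roughly as follows. This is a minimal illustration, not the authors’ implementation: the Unicode ranges and function names are my own assumptions, and the sketch simply splits a string wherever the script class changes.

```python
def char_class(ch):
    """Classify a character as hiragana, kanji, or other (assumed Unicode ranges)."""
    if '\u3040' <= ch <= '\u309f':
        return 'hiragana'
    if '\u4e00' <= ch <= '\u9fff':
        return 'kanji'
    return 'other'

def estimate_segments(text):
    """Split unspaced Japanese text wherever the script class changes,
    as a rough proxy for word boundaries."""
    segments = []
    current = ''
    prev = None
    for ch in text:
        cls = char_class(ch)
        if prev is not None and cls != prev:
            segments.append(current)
            current = ''
        current += ch
        prev = cls
    if current:
        segments.append(current)
    return segments
```

Each resulting segment (a run of kanji or a run of hiragana) could then serve as a candidate feature for the SVM described above.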

Information visualization is another elegant data-interpretation technique. Visualization abstractions include network visualizations, trees, histograms, pie charts, and other kinds of diagrams and charts. If a visualization is made interactive, so that users can navigate from the visualization back to its data sources, it can bridge the gap between “distant reading” and “close reading”. In [2], the authors developed two approaches to interactively visualize text documents. In the first approach, users analyse a document through its inherent hierarchical structure and can navigate from one level to another: a multi-level visual abstraction (chapters → subchapters → passages → lines) of a single work is provided, and at each level various static information-visualization abstractions are attached. The second approach aims mainly at speeding up the analysis and comparison of different texts: multiple texts can be opened in parallel for comparison, and with each text two levels of abstraction are provided, covering the annotations of the text to be compared and the positions of the selected annotations.
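The multi-level abstraction of the first approach can be sketched as a tree that users drill down through, from a whole work to a single line. This is a hypothetical miniature under my own assumptions, not the system of [2]:

```python
class DocNode:
    """One level of a document hierarchy: work, chapter, passage, or line."""
    def __init__(self, level, title, children=None):
        self.level = level
        self.title = title
        self.children = children or []

    def navigate(self, path):
        """Descend from this node along a list of child indices,
        mimicking a user clicking down through abstraction levels."""
        node = self
        for i in path:
            node = node.children[i]
        return node

# hypothetical miniature document
doc = DocNode('work', 'Poem', [
    DocNode('chapter', 'Chapter 1', [
        DocNode('passage', 'Passage 1.1', [
            DocNode('line', 'First line of verse'),
        ]),
    ]),
])
```

An interactive front end would attach a static visualization (e.g. a histogram or chart) to each node and render `navigate` as a click path from distant to close reading.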

Although DH researchers have devised various methods to analyse vast bodies of literature, in this world of digitization the outburst of digitized books has been accompanied by an outburst of readers. The authors of [3] posed the question “What Do You Do With A Million Readers?” and tried to extract important information about particular works from readers’ comments, using comments on 16 different works of fiction as their target data. From this data they tried to recover four types of information about each work: (1) the main dramatis personae of the novel, (2) the events in which those characters played a role, (3) the relationships between characters, and (4) a visualization that captures these relationships. LDA proved successful in separating the works. To discover relationships, sentences containing a pair of entities are identified, and POS tagging is used to extract the verbs between those entities. Directed multi-coloured graphs with edges of varying widths are used to represent the relationships.
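The relationship-extraction step can be sketched as follows. This is a simplified stand-in under my own assumptions: a hand-listed verb set replaces the POS tagger used in [3], and the function name, verb list, and example sentence are all hypothetical.

```python
import re

# hypothetical stand-in for POS tagging: a small hand-listed verb set
VERBS = {'loves', 'meets', 'betrays', 'marries'}

def relation_verbs(sentence, entity_a, entity_b):
    """Return the verbs appearing between two entity mentions in a sentence."""
    m = re.search(re.escape(entity_a) + '(.*?)' + re.escape(entity_b), sentence)
    if not m:
        return []
    between = re.findall(r'[a-z]+', m.group(1).lower())
    return [w for w in between if w in VERBS]
```

Each extracted verb would become the label of a coloured, directed edge from `entity_a` to `entity_b` in the relationship graph, with edge width reflecting how often readers mention the pair.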

Thus we have discussed three approaches to data analysis and interpretation. Article [1] proposes an automatic annotation-suggestion technique. Article [2] bridges the gap between distant reading and close reading by providing an “interactive” information-visualization technique. Compared with Article [1], Article [2] offers a more accurate form of “distant reading”: automatic data-processing steps (e.g. the key-feature extraction of Article [1]), annotations, and static visual representations can contain errors and uncertainties, leading to misinterpretation of data, whereas “interactive” visual representations [2] let users control and validate the accuracy of the representation, making it a more transparent and robust interpretation technique. The authors of [3] developed techniques to analyse and understand literature through readers’ comments. As they themselves note, “the readers comments lack completeness”; this method is therefore more error-prone than the other two approaches, because there is no direct connection between the literature and the interpretation. At the same time, it can be an effective tool for extracting important information from crowdsourced response sites. In sum, Digital Humanities researchers have developed various techniques to “distant read” literature at scale in order to boost the work efficiency of scholars.


  1. Takafumi Sato, Makoto Goto, Fuminori Kimura, Akira Maeda. “Extracting Key Phrases for Suggesting Annotation Candidates from Japanese Historical Document”. http://dh2015.org/abstract1
  2. Markus John, Steffen Koch, Florian Heimerl, Andreas Müller, Thomas Ertl, Jonas Kuhn. “Interactive Visual Analysis Of German Poetics”. http://dh2015.org/abstract2
  3. Roja Bandari, Timothy Roland Tangherlini, Vwani Roychowdhury. “What Do You Do With A Million Readers?” http://dh2015.org/abstract3
  4. Moretti, Franco. “Conjectures on World Literature.” New Left Review (2000): 54-68. http://newleftreview.org/II/1/franco-moretti-conjectures-on-world-literature
  5. Schulz, Kathryn. “What Is Distant Reading?” The New York Times 24 (2011). http://www.nytimes.com/2011/06/26.