
The massive quantity of documents and archives surviving from past historical periods, such as those gathered by the Venice Time Machine project, makes it very difficult for scholars to explore these documents and analyze their contents, even when they are fully transcribed and digitized. In most cases, a lifetime of reading through a corpus is not even enough [1]. One of the distant reading tools that helps scholars explore large amounts of digital text is topic modeling, which automatically classifies documents based on their topics. The process is usually divided into two stages: constructing the topic model, and visualizing the results. In this blog post, we discuss innovative techniques to enhance both stages of topic modeling.

According to S. Wittek et al. [1], one of the major challenges for current topic modeling strategies is adapting to non-standardized spelling, a problem that appears very often in historical corpora spanning long time periods. The authors analyzed 8 gigabytes of XML-encoded Early Modern English text from 1475 to 1700, available on Early English Books Online (EEBO), in which spelling variance was particularly high. Their approach to this issue is to convert the words in these historical texts to their modern spelling variants – a process they call “normalization” – carried out with the VARD 2 tool. Normalization can be done either automatically or semi-automatically, using a subset of the documents as a training set, and manual intervention can then fine-tune the replaced words. A feature was also developed to let EEBO users select batches of texts by keyword, author, publisher, or year, instead of picking documents manually. The output is then fed to the visualization tool Voyant.
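To make the normalization idea concrete, here is a minimal sketch of dictionary-based spelling replacement. This is not VARD 2's actual algorithm (VARD 2 also uses trained rules and fuzzy matching); the mapping below is a toy example of the modern-variant substitution described above.

```python
# Toy mapping from Early Modern English spellings to modern variants.
# A real system such as VARD 2 learns these replacements from training data.
NORMALIZATION_MAP = {
    "loue": "love",
    "vnto": "unto",
    "ioy": "joy",
    "haue": "have",
}

def normalize(tokens):
    """Replace each non-standardized spelling with its modern variant, if known."""
    return [NORMALIZATION_MAP.get(tok.lower(), tok) for tok in tokens]

print(normalize(["I", "haue", "great", "loue", "and", "ioy"]))
# → ['I', 'have', 'great', 'love', 'and', 'joy']
```

Feeding the topic modeller the normalized tokens (while keeping the originals aside) is what collapses spelling variants into a single vocabulary entry.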

In [2], several drawbacks are identified in the classical approach of exploring a corpus through simple keyword search: it requires the scholar to have prior knowledge of the data, which is not always the case, and it offers no way to browse related documents and topics. The proposed approach first defines a topic as a set of words, each with an associated probability; the words with the largest probabilities describe the topic. Then, for each document, its probability of belonging to each topic is determined and assigned. Instead of being labeled with a single topic, each document thus carries a full distribution of topic probabilities. This method enables scholars to get an overview of the topics, to list documents by a chosen topic, and to browse topics or documents related to a chosen document.
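The per-document topic distribution described above is exactly what probabilistic topic models output. As a hedged illustration (the authors of [2] do not prescribe this library, and the three mini-documents are invented), here is how scikit-learn's LDA implementation produces such distributions:

```python
# Sketch: per-document topic probabilities with LDA via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "venice trade ships merchants harbor",
    "church faith prayer scripture sermon",
    "ships harbor cargo merchants trade",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is a probability distribution over topics, not a single label.
doc_topics = lda.transform(counts)
for i, dist in enumerate(doc_topics):
    print(f"doc {i}: {dist.round(2)}")
```

Because each row sums to one, a document can sit partly in several topics, which is what enables the browsing features the authors describe.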

However, visualizing the results of this technique is challenging, as presenting complete lists of topic and word probabilities is inconvenient. A simple threshold is not enough: some documents may be well described by only a few words, while other, more complex topics require more words to be properly defined. The authors suggest sorting the probability density in descending order, identifying pivot points, and assigning a decreasing “importance” to the entries. Figure 1 shows the use of size and opacity to distinguish the significant data in the visualization of the Eighteenth Century Collections Online Text Creation Partnership dataset (ECCO-TCP).
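The pivot idea can be sketched as follows. This is an illustrative assumption rather than the exact procedure in [2]: here the pivot is taken to be the steepest drop between consecutive sorted probabilities, and importance fades linearly after it.

```python
# Hedged sketch of the sort-and-pivot scheme: sort probabilities in
# descending order, place a pivot at the steepest drop, keep full
# importance before it and decrease importance after it.
def importance_weights(probs):
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    values = [p for _, p in ranked]
    # pivot = position just after the largest drop between consecutive entries
    drops = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    pivot = drops.index(max(drops)) + 1
    weights = {}
    for rank, (idx, _) in enumerate(ranked):
        if rank < pivot:
            weights[idx] = 1.0  # entries before the pivot stay fully visible
        else:
            # linearly decreasing importance after the pivot
            weights[idx] = max(0.0, 1.0 - (rank - pivot + 1) / len(values))
    return weights

print(importance_weights([0.05, 0.6, 0.25, 0.1]))
```

In a rendering such as Figure 1, these weights would drive the size and opacity of each word or topic, so the display adapts to how many entries each topic genuinely needs.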

Topic visualization of the ECCO-TCP dataset

Figure 1: Topic visualization of the ECCO-TCP dataset

J. Montague et al. focus on the challenges of the visualization stage of topic modeling in [3], while using the standard Latent Dirichlet Allocation (LDA) algorithm as the topic modeller. In their view, even though many methods exist to visualize data, none is suitable for large datasets from corpora covering a wide range of topics. They analyzed many tools, such as charts, network graphs, zoomable tools, and 2D matrices; most target a specific application, which leads to poor performance on a large, diverse corpus. As an attempt to build a more robust tool for visualizing such corpora, they developed a 3D visualization tool using Famo.us, which makes use of the third dimension to provide more useful information while remaining user-friendly. The tool is customizable, offering the ability to alter the sensitivity of labeling or clustering and to filter out irrelevant data. As can be seen in Figure 2, the user starts with a high-level view of the data and can zoom in to analyze a subset of interest more closely.
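The authors built their tool in JavaScript with Famo.us; as a stand-in sketch of the underlying idea, the snippet below uses matplotlib to place synthetic topic clusters in 3D, where the third axis can carry extra information (e.g. a topic weight). The cluster names and coordinates are invented for illustration.

```python
# Illustrative only: 3D scatter of invented topic clusters, where the
# z-axis could encode an additional quantity such as topic weight.
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
centers = {"trade": (0, 0, 0), "religion": (2, 2, 1), "law": (0, 2, 2)}
# 30 jittered points per synthetic topic cluster
points = {t: rng.normal(loc=c, scale=0.3, size=(30, 3))
          for t, c in centers.items()}

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for topic, xyz in points.items():
    ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], label=topic, alpha=0.7)
ax.legend()
fig.savefig("topics_3d.png")
```

An interactive version of this view, with zooming and adjustable clustering, is essentially what the Famo.us tool in [3] provides in the browser.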

Zooming into the data in 3D visualization

Figure 2: Zooming into the data in 3D visualization

From these articles, we see that there is room for improvement in both stages: designing the algorithm for the topic modeller, and visualizing the large amount of information the modeller produces. The visualization improvements in [3] provided a more in-depth exploration of the results without affecting the user experience or other parts of the system. Improvements at the model level, however, create complications at the visualization level. The normalization of non-standardized spellings in [1] prevents scholars from working with the original spellings in the graph, and the more complex algorithm in [2], while providing more features, makes the visualization step much more challenging. Enhancements in model construction must therefore be complemented with enhancements in visualization, as in [2], where the authors reduce the impact on the visualization by locating pivots in the probability distribution and using an incremental approach.

Furthermore, the various methods discussed throughout these articles can be combined to improve both stages while tackling all of the challenges discussed. In such a design, the topic modeling algorithm from [2] supports selecting documents by topic and browsing documents and topics related to a given document. The VARD 2 tool is used as described in [1] to normalize the documents, thus improving topic labeling. Finally, to reduce the impact of these optimizations on the visualization, the 3D visualization of [3] is combined with the pivoting technique proposed in [2]. The third dimension can also be used to display clusters of matched non-standardized spellings when the user zooms in sufficiently on a word.


[1] S. Wittek, S. Sinclair, and M. Milner, “DREaM: Distant Reading Early Modernity”, 2015. [Online]

[2] P. Jähnichen, P. Österling, T. Liebmann, G. Heyer, C. Kuras, and G. Scheuermann, “Exploratory Search Through Interactive Visualization of Topic Models”, 2015. [Online]

[3] J. Montague, J. Simpson, G. Rockwell, S. Ruecker, and S. Brown, “Exploring Large Datasets with Topic Model Visualizations”, 2015. [Online]