
Advances in data digitization and the fast-growing amount of accessible data on the Internet provide digital humanists with ever-larger databases to analyze. Although databases are the favorite format of programs and algorithms, they are very difficult for humans to navigate. In response, a trend in the digital humanities is to build computer programs that produce visual representations of database contents, opening access even to neophytes.

The transcription of data into a visual form happens in (at least) two steps. First, the data is processed by algorithms that extract, depending on its nature and the humanists' goal, networks, topics, or any other means of connecting and regrouping sets of data into coherent, larger groups. Then the result is arranged in a visual structure that lets the user navigate the data easily and recognize new patterns in it (knowledge discovery). I will discuss three abstracts on the subject from the 2015 annual international digital humanities conference.

In the first article, "Exploratory Search Through Interactive Visualization of Topic Models" [1], the authors' main goal is to give the user a global view of the database, automatically classified into relevant sets or networks, and then to let the user explore and refine the displayed data gradually, down to its individual elements. To achieve this, the database is run through a topic-generation algorithm based on a Bayesian hierarchical probabilistic model (the family of mathematical models topic modeling falls into), which identifies possible topics for each text. The algorithm scans the words of each text and determines topics by identifying, using probabilistic methods, words that often occur together, grouping them as a topic. As a result, each topic is a set of words, each word ranked by its probability of belonging to the topic. Moreover, for each text the algorithm also generates a probability distribution over the identified topics, indicating which topics are probably involved depending on the position in the text.
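The topic-extraction step described above can be sketched in miniature. The following is a toy collapsed Gibbs sampler for Latent Dirichlet Allocation, the kind of Bayesian probabilistic topic model the article refers to; it is my own illustrative sketch, not the authors' implementation, and all function and variable names are assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=300, seed=0):
    """Toy collapsed Gibbs sampler for LDA.
    docs: list of token lists. Returns (topics, doc_topics), where each
    topic is a word list ranked by relevance to the topic, and each
    doc_topics row is a document's probability distribution over topics."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for di, d in enumerate(docs):                      # random initialization
        zd = []
        for w in d:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]                          # remove current assignment
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample proportionally to P(topic | doc) * P(word | topic):
                # words that co-occur in documents end up in the same topic
                weights = [(ndk[di][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = k
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    topics = [sorted(nkw[k], key=nkw[k].get, reverse=True) for k in range(n_topics)]
    doc_topics = [[(c + alpha) / (len(docs[di]) + n_topics * alpha) for c in row]
                  for di, row in enumerate(ndk)]
    return topics, doc_topics
```

Run on a handful of short documents about two distinct themes, the sampler typically separates the vocabulary into two word clusters, with each document's distribution concentrated on one topic, mirroring the two outputs the article describes: ranked word sets per topic and a topic distribution per text.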

Fig. 1: Resolving polysemes and homonyms. Each cloud contains the connected terms. [1]

Since such a profusion of topic information is difficult to display, the authors created a tool based on OpenWalnut, an open-source program initially used in the medical field, to render a user-friendly visualization that lets one "examine topics, have an overview of the topics, [find] different semantics of polysems and homonyms (fig. 1), identify documents covering a topic, [and] find related topics of a document" [1].

Fig. 2: Example of the interactive zoomable interface [2]

In the second article, "Exploring Large Datasets with Topic Model Visualizations" [2], the authors wanted to display the detailed networks underlying 27,536 philosophy journal articles from 1876 to 2008. As in the previous article, the data was processed by a topic-modeling tool, here MALLET (McCallum, 2002), which uses the Latent Dirichlet Allocation algorithm and produces raw topic data on the principle described above. The authors then went further and built an interactive visualization on Famo.us, a flexible open-source JavaScript framework; the inherent advantage of JavaScript is the instant portability of the visualization tool across all web-browser-enabled devices. The aim is to bring the topic-modeled data into an interactive visual form that can be three-dimensional, touch-screen capable, and zoomable (fig. 2), enabling the average Internet user to explore large, intricate databases quickly and naturally.

Fig. 3: “A section of the topic model visualization of the ‘memcons’. In this force-directed diagram, the documents are distributed according to their weight in the topic model and colored according to their former classification status. Formerly ‘Top Secret’ documents are colored blue, ‘Secret’ documents are colored yellow, and ‘Confidential’ are colored magenta.”[3]

The third article, "'Everything on Paper Will Be Used Against Me': Quantifying Kissinger" [3], deals with the data gathered around Henry A. Kissinger, the US national security advisor (1969–1975) and US secretary of state (1973–1977), who was actively involved in the secret diplomacy of the Cold War. The interest of this article is that the author has made extensive use of visual representations of the data, as well as topic modeling, in his study. Indeed, the amount of data is so large that, until now, it has been impossible to study the historical information it holds. Even though the study of Kissinger is only in its infancy, some interesting information already emerges from the graphical view of the topic modeling of the secret diplomatic documents (fig. 3), such as, according to the author, a triangular secret diplomacy between the US, China, and the USSR concerning Indochina.
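A force-directed diagram like the one in fig. 3 places documents with similar topic mixtures near each other, which is what makes the color-coded clusters readable at a glance. The sketch below is a minimal spring embedder of my own devising, not the tool Kaufman used: each pair of documents attracts in proportion to the similarity of their topic-weight vectors and weakly repels otherwise.

```python
import math
import random

def force_layout(doc_topics, iters=300, step=0.05, seed=0):
    """Toy force-directed layout. doc_topics: one topic-weight vector per
    document. Pairs attract in proportion to the dot product of their
    vectors and repel with a weak 1/distance term. Returns one (x, y)
    position per document."""
    rng = random.Random(seed)
    n = len(doc_topics)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    # pairwise similarity of topic mixtures (dot product)
    sim = [[sum(a * b for a, b in zip(doc_topics[i], doc_topics[j]))
            for j in range(n)] for i in range(n)]
    for _ in range(iters):
        for i in range(n):                     # update positions in place
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = math.hypot(dx, dy) or 1e-9
                f = sim[i][j] * d - 0.01 / d   # spring pull minus repulsion
                fx += f * dx / d
                fy += f * dy / d
            # clamp each move so very close pairs cannot blow up numerically
            pos[i][0] += max(-0.1, min(0.1, step * fx))
            pos[i][1] += max(-0.1, min(0.1, step * fy))
    return pos
```

Fed the per-document topic distributions from a topic model, documents dominated by the same topic settle into tight clusters while unrelated ones drift apart, approximating the distribution "according to their weight in the topic model" that the caption of fig. 3 describes.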

In conclusion, we can affirm that research on visual representations of Big Data through topic modeling is trending among digital humanists, and that it is the logical response to the rapid increase in available, unsorted, digitized data, which is impossible for humans to analyze directly.


[1] Jähnichen P., Österling P., Liebmann T., Heyer G., Kuras C., Scheuermann G., "Exploratory Search Through Interactive Visualization of Topic Models", DH2015, Sydney.

[2] Montague J., Simpson J., Rockwell G., Ruecker S., Brown S., "Exploring Large Datasets with Topic Model Visualizations", DH2015, Sydney.

[3] Kaufman M., "'Everything on Paper Will Be Used Against Me': Quantifying Kissinger", DH2015, Sydney.