
Computational text analysis has spread quickly since its inception. Huge amounts of text have been and are being processed every day, allowing researchers to make rich inferences about the text: authorship attribution, sentiment analysis, and more. Today text analysis is shifting from extracting simple statistics such as word frequencies to discovering high-level models and relationships.

As we read a novel, for example, we construct a mental model of the relationships between its characters. While this is an easy task for our brain, it is much more complicated for a computer. Given the same piece of text, a human expert can always extract a more precise model than a computer can. Why bother with these text analysis methods, then? Because no human expert can process the amount of text that computers deal with, and that is why this approach is so fruitful.

One example of such a computer-built model comes from a Stanford-based team that tried to extract and interpret character relationships by analyzing movie scripts and plays. For this purpose they created a weighted graph with nodes representing characters and edges representing character interactions: the bigger an edge's weight, the stronger the connection between the two characters it joins.
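As a minimal sketch of how such a graph might be built, assuming hypothetical scene data where each scene lists the characters appearing together (the real pipeline parses scripts and dialogue, which is considerably more involved):

```python
from collections import defaultdict

# Hypothetical scene data: each scene lists the characters who interact in it.
scenes = [
    {"Romeo", "Juliet"},
    {"Romeo", "Mercutio"},
    {"Romeo", "Juliet"},
    {"Juliet", "Nurse"},
]

# Weighted interaction graph: nodes are characters, and an edge's weight
# counts the scenes a pair of characters share.
weights = defaultdict(int)
for scene in scenes:
    chars = sorted(scene)
    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            weights[(chars[i], chars[j])] += 1

print(weights[("Juliet", "Romeo")])  # Romeo and Juliet share two scenes: 2
```

Co-occurrence counting is only one plausible weighting scheme; the team could equally well weight edges by lines of dialogue exchanged or scene length.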


For each such graph the authors computed several properties describing how many storylines there are in the play or movie, the importance of the main character, the overall number of characters, and so on. These numbers were used to find patterns in the works of one author or of a similar genre.
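A toy illustration of two such properties, using a hypothetical adjacency map: connected components as a rough proxy for storyline count, and total interaction weight as a proxy for a character's importance. These are stand-ins for the measures the authors actually used, which the text does not specify.

```python
# Hypothetical character-interaction graph as an adjacency map; edge weights
# count shared scenes. "Ghost"/"Hamlet" form a disconnected second component.
graph = {
    "Romeo":    {"Juliet": 2, "Mercutio": 2},
    "Juliet":   {"Romeo": 2, "Nurse": 1},
    "Mercutio": {"Romeo": 2},
    "Nurse":    {"Juliet": 1},
    "Ghost":    {"Hamlet": 1},
    "Hamlet":   {"Ghost": 1},
}

def storylines(graph):
    """Count connected components, a rough proxy for parallel storylines."""
    seen, count = set(), 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
    return count

def main_character(graph):
    """The character with the largest total interaction weight."""
    return max(graph, key=lambda c: sum(graph[c].values()))

print(storylines(graph))      # 2
print(main_character(graph))  # Romeo
```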

For example, the authors found strong support for the stereotype that horror movies mainly feature a single simple storyline. Furthermore, these numbers were used for training classifiers. When presented with new texts, the classifiers showed mixed results: very good at telling Shakespeare's plays from Shaw's and Galsworthy's, and plays from movies in general, but poor at classifying high-rated vs. low-rated movies. This indicates that Shakespeare's plays have a distinct pattern in terms of this model, while the model is too simple to capture the difference between good and bad movies.
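The text does not say which classifier was used; as one hedged illustration, a nearest-centroid classifier over such graph-derived feature vectors (with entirely made-up numbers) could look like this:

```python
import math

# Made-up feature vectors (storyline count, main-character centrality,
# character count) for works with known labels; numbers are illustrative only.
train = {
    "play":  [(1.0, 0.8, 20.0), (2.0, 0.7, 25.0)],
    "movie": [(3.0, 0.4, 40.0), (4.0, 0.3, 55.0)],
}

def centroid(vectors):
    """Per-dimension mean of a list of equal-length feature vectors."""
    return tuple(sum(vals) / len(vals) for vals in zip(*vectors))

centroids = {label: centroid(vs) for label, vs in train.items()}

def classify(features):
    """Assign the label whose centroid is nearest in feature space."""
    return min(centroids, key=lambda lbl: math.dist(features, centroids[lbl]))

print(classify((1.5, 0.75, 22.0)))  # play
```

Any standard classifier would fit the same mold: a handful of graph statistics per work as input, a genre or author label as output.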

Another interesting example of extracting a complex model from text comes from the analysis of a collection of 2,094 ancient Assyrian merchants' letters. The goal of this project was to create a social rank model of Assyrian society. An interesting detail that made this work possible is that Assyrians wrote the "from: … to: …" part of a letter very formally, stating the senders and recipients in order of decreasing importance. By analyzing many of these "epistolary formulas" from different letters, it would have been possible to build a social rank tree if the relation "individual A has a higher rank than individual B" were transitive, i.e. if one letter states "A>B" and another "B>C", we could conclude "A>C". Unfortunately, that is not generally true, since social rank can be perceived differently by different people.

Two more levels of complexity are introduced by the fact that the Assyrians had a practice of naming a son after his grandfather (papponymy), and the fact that these letters are scattered over a 200-year period. As a result, the same name in different letters often refers to different people (the authors point out that additional insight into identity may be found by analyzing the letters' contents).

Because of all this, researchers building the Assyrian social rank tree have to deal with a great deal of uncertainty, and a probabilistic model is needed. That, of course, complicates everything considerably compared to the first example of analyzing movie scripts and plays, where one can hardly find two different characters with the same name. On the other hand, because of the highly formalized structure of the Assyrian letters, parsing is straightforward, while defining character interactions in scripts is a nontrivial task.

The third project of this type extracts information on victims and perpetrators from archives of human rights abuses. Such archives can contain, for example, witness interviews or officials' reports. The problem with this kind of task is that information on one individual can appear in multiple documents, and in each of them the details of an incident can differ significantly. The same individual may be reported by name in one document and remain unnamed in another. These intricacies can lead to overstating the number of victims or to loss of evidence, so careful analysis is required.

The authors worked with data from the WTC Task Force interviews conducted after the attacks of 9/11. The first step was parsing the full texts and classifying important phrases into several categories such as Person, Location, Time, and Event. Then different phrases were analyzed to determine whether they belong to the same event, and trigraph diagrams were created:


The numbers on the edges represent the model's confidence that these blocks are linked to each other and to a single event.
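A minimal sketch of this linking idea, with invented blocks and confidence scores standing in for whatever the real model computes, and an assumed cut-off below which a link is discarded:

```python
# Invented extracted "blocks": (category, phrase) pairs, indexed by position.
blocks = [
    ("Person",   "a firefighter"),
    ("Location", "the North Tower lobby"),
    ("Time",     "shortly before 10 a.m."),
]

# Confidence that two blocks belong to the same event, keyed by block indices;
# these toy scores stand in for the real model's output.
links = {
    (0, 1): 0.9,   # Person and Location: strong evidence
    (1, 2): 0.7,   # Location and Time
    (0, 2): 0.3,   # Person and Time: weak evidence
}

THRESHOLD = 0.5  # an assumed cut-off, not taken from the project

# Keep only confident links to form one candidate event; a human expert
# could later add links or re-weight them by hand.
event_links = {pair: conf for pair, conf in links.items() if conf >= THRESHOLD}
for i, j in sorted(event_links):
    print(blocks[i][0], "<->", blocks[j][0])
```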

In this work the researchers deal with even more uncertainty than in the Assyrian letters example. There is a similar problem with multiple texts describing one person, but since the texts analyzed are narrative transcripts, a person can appear under different names or with no name at all. Time and place can be stated differently, and some witnesses may simply be mistaken. Another complication arising from the data's narrative nature is that time, date, location, etc. are stated informally, so recognizing the phrases that refer to these categories is a task in its own right. In contrast, movie scripts and plays give a clear indication of which characters are acting, and Assyrian letters have an even more formal structure. Because of all this complexity, the model has additional functionality that lets a human expert manually connect blocks into events and adjust the confidence level of a link between blocks.

Models created from text mining are getting more and more complex, and it is becoming possible to capture genuinely high-level concepts. Combined with the scale that modern computers can achieve, this allows researchers to find dependencies in the data that are impossible for a human brain to assess. Given these insights, a human expert can fine-tune the model to get an even better result.


1. http://dh2013.unl.edu/abstracts/ab-251.html

2. http://dh2013.unl.edu/abstracts/ab-249.html

3. http://dh2013.unl.edu/abstracts/ab-368.html