Methods for text analysis are evolving all the time and are now successfully combined with modern tools from computer science and statistics. This post gives an overview of cutting-edge research and approaches to different problems in text analysis.

The first aspect of text analysis I would like to talk about is the etymology of words. Can we conclude something about a text, or about its author, by analyzing the etymology of the words it uses? Over a third of English words come from Latin, especially terms and names in the sciences. So perhaps analyzing the origins of the words used in a text can reveal some facts about it.

Jonathan Reeve (New York University) has created a computer program called “The Macro-Etymological Analyzer” (MEA) that is designed for this purpose. In his paper he presents the results of running the program on various literary works. One of the clearest examples, a run on Virginia Woolf’s novel “The Waves”, revealed that the more educated characters in the novel use more Latinate words than the less educated ones do:


In this diagram, two of the most educated characters (Bernard and Neville) have the best trade-off between words from the Germanic group and words from the Latinate group (which is pretty natural), while the less educated character Susan (a housewife) has, as expected, the worst ratio. It turns out that analyzing the etymology of words is a powerful and effective tool for text analysts, and the MEA can help people discover interesting and important facts about texts.
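To make the idea of comparing Germanic and Latinate vocabulary concrete, here is a minimal Python sketch. It is not the MEA itself: the lookup table `TOY_ETYMOLOGY` and the function `family_proportions` are hypothetical names invented for this illustration, and a real tool would need a full etymological dictionary rather than six hand-picked words.

```python
import re
from collections import Counter

# Hypothetical toy lookup table (an assumption for this sketch only).
# A real analyzer would rely on a complete etymological dictionary.
TOY_ETYMOLOGY = {
    "house": "Germanic",
    "water": "Germanic",
    "mother": "Germanic",
    "observe": "Latinate",
    "construct": "Latinate",
    "liberty": "Latinate",
}


def family_proportions(text, lookup=TOY_ETYMOLOGY):
    """Return the percentage of known words per language family."""
    words = re.findall(r"[a-z]+", text.lower())
    # Words missing from the lookup table are simply ignored.
    counts = Counter(lookup[w] for w in words if w in lookup)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: 100.0 * n / total for family, n in counts.items()}
```

Calling `family_proportions` on the speech of each character separately would give per-character percentages that can be compared, which is the kind of comparison the diagram describes.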

Another problem worth considering is the automatic clustering of texts by author. As is well known, every writer has their own style, and people in text analysis are interested in machine methods for grouping texts by different authors into clusters, one for each author. One approach, proposed by David Hoover, is based on the coefficient of variation (CoV), which is calculated as Stdev / (Avg. Freq.) × 100%, that is, the standard deviation expressed as a percentage of the average frequency. Previous work showed good results with machine learning methods based on Most Frequent Words (MFW), which simply take the top N words from the list of all words ordered by their frequency in the text. However, the method based on Most Variable Words (MVW), which ranks words by CoV instead of frequency, significantly improved the clustering results. Here are some comparisons of the MFW and MVW methods:


Standard, 800MFW


CoV, 800MVW


Standard, 700MFW


CoV, 700MVW
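The CoV formula above, Stdev / (Avg. Freq.) × 100%, can be sketched in Python as follows. This is an illustration under stated assumptions, not Hoover’s actual procedure: the function names are hypothetical, frequencies are taken as percentages of each text’s word count, the sample standard deviation is used, and no filtering of rare words is applied.

```python
import re
from collections import Counter
from statistics import mean, stdev


def relative_frequencies(text):
    """Frequency of each word as a percentage of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: 100.0 * n / total for w, n in Counter(words).items()}


def coefficient_of_variation(values):
    """Stdev / average * 100; needs at least two values (sample stdev)."""
    avg = mean(values)
    return stdev(values) / avg * 100.0 if avg else 0.0


def most_variable_words(texts, n):
    """Top n words ranked by CoV of their frequency across the texts."""
    freqs = [relative_frequencies(t) for t in texts]
    vocab = set().union(*freqs)
    # A word absent from a text counts as frequency 0 in that text.
    cov = {
        w: coefficient_of_variation([f.get(w, 0.0) for f in freqs])
        for w in vocab
    }
    # Highest CoV first; ties are broken alphabetically.
    return sorted(cov, key=lambda w: (-cov[w], w))[:n]
```

An MFW list would instead sort the same vocabulary by average frequency, so the two methods differ only in the ranking criterion used to pick the N words that feed the clustering step.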

But despite the fact that the methods of text analysis keep improving, we should not forget about the old approaches, because their ideas, and an understanding of the thinking behind them, might be useful for designing new ones. For example, ARRAS, one of the first interactive systems for text analysis, designed by John Smith, had a great impact on modern tools and inspired many present-day projects. In his work, Smith designed a special language through which the user can interact with the program.

Vast continents of literature or history or other realms of information, much as our ancestors explored new lands.

– Smith 1984, p. 31

One more interesting example of the tools of the past is Glickman’s idea about printing concordances. He developed the idea of storing (printing) concordances on cards for 2-ring binders.


This form would be useful, as Glickman believed, for an individual reader: you can easily take out the cards you need and lay them on the table in front of you.

To sum up, integrating text analysis with other sciences is very important and can be extremely useful. In today’s world it is the key to developing effective tools and concepts.