
Data mining of literature has long been one of the more interesting aspects of Digital Humanities. Poetry, although only one part of literature, has properties that set it apart from the rest. For instance, a poem usually (but not necessarily) follows a series of rules that place it in a particular poetic form. These rules concern not only the number of words but also meter and beat, which are phonetic properties. It is therefore an intriguing question how one can cope with all of these rules in order to produce the desired data mining output.

The first article (1) that we look at focuses on the distant reading of naïve poetry in Russia through a comparison with classical poetry. Naïve poetry is poetry written by non-professional poets and uploaded to a specialized site. The paper looks in particular at the site “stihi.ru”, which hosts a huge collection of such works. After a filtering process that includes prioritizing and lemmatizing, the project arrives at a collection of the most representative words of naïve poetry. The next step is to analyze this information, mostly in terms of frequency of usage, and to compare it with a corpus of high poetry drawn from two trusted sources that represent classical poetry. The comparison is carried out on three frequency lists: 1) comparing the most frequently used words in the 3 domains, 2) looking for the words whose ranks differ most between the 3 domains, and 3) dividing the top 100 nouns into abstract semantic groups and comparing the frequency of each word in the 3 domains. The results of the 3 comparisons give direct insight into how today's naïve poetry has been transformed relative to classical poetry.
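
The second comparison (words whose ranks differ most between corpora) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `frequency_ranks` and `rank_gaps` are hypothetical, the input is assumed to be already lemmatized, only two corpora are compared for brevity, and the word lists are invented toy data.

```python
from collections import Counter

def frequency_ranks(lemmas):
    """Map each lemma to its 1-based frequency rank (1 = most frequent)."""
    counts = Counter(lemmas)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def rank_gaps(lemmas_a, lemmas_b):
    """Return (lemma, rank_a, rank_b) for shared lemmas, largest rank gap first."""
    ranks_a = frequency_ranks(lemmas_a)
    ranks_b = frequency_ranks(lemmas_b)
    shared = set(ranks_a) & set(ranks_b)
    return sorted(
        ((w, ranks_a[w], ranks_b[w]) for w in shared),
        key=lambda row: (-abs(row[1] - row[2]), row[0]),
    )

# Invented toy corpora, standing in for lemmatized naive and classical texts.
naive = ["love", "love", "love", "soul", "soul", "night"]
classical = ["night", "night", "night", "soul", "soul", "love"]

for lemma, rank_naive, rank_classical in rank_gaps(naive, classical):
    print(lemma, rank_naive, rank_classical)
```

Lemmas that occur in only one corpus are ignored here; a real comparison would need a policy for them, for example assigning a rank just past the end of the shorter list.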

The second article (2) that we look at reports on phase 2 of a project by the Stanford Literary Lab, which takes a close look at data mining of poetry in terms of metrical form. The algorithm relies heavily on a series of poetic rules and on probability-based classification. It was trained first to recognize the meter of each individual line in a poem and then to generalize this information to obtain the metrical scheme of the whole poem. The objective of the overall project is to develop a program that handles all the characteristics of a poem, such as the syllable scheme, beat scheme, and metrical scheme (the subject of phase 2), and then combines them to give an overall description. This information will not only help us understand individual poems but also help trace the history of poetry and, at a more advanced stage, help us understand the interrelations between the different characteristics of a given poem.
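
The two-step idea of classifying each line and then generalizing to the whole poem can be sketched as follows. This is a deliberately simplified stand-in, not the Stanford Literary Lab's algorithm: it uses plain template matching and a majority vote instead of rule-based probabilistic classification, and it assumes each line has already been reduced to a list of stress marks (0 = unstressed, 1 = stressed). The names `FEET`, `line_meter`, and `poem_meter` are hypothetical.

```python
from collections import Counter

# Hypothetical foot templates: 0 = unstressed, 1 = stressed syllable.
FEET = {
    "iambic": (0, 1),
    "trochaic": (1, 0),
    "anapestic": (0, 0, 1),
    "dactylic": (1, 0, 0),
}

def line_meter(stresses):
    """Return the foot whose repeated template matches the line best."""
    def score(foot):
        # Fraction of syllables that agree with the repeated template.
        hits = sum(1 for i, s in enumerate(stresses) if s == foot[i % len(foot)])
        return hits / len(stresses)
    return max(FEET, key=lambda name: score(FEET[name]))

def poem_meter(lines):
    """Generalize from lines to the poem by majority vote over non-empty lines."""
    votes = Counter(line_meter(line) for line in lines if line)
    return votes.most_common(1)[0][0]

# Invented stress patterns: one regular line and two with an irregular opening.
poem = [
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
]
print(poem_meter(poem))
```

Scoring by the fraction of matching syllables lets a line with a substituted first foot still count toward the dominant meter, which is the kind of tolerance a strict rule match would lack.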

The third paper (3) takes a different approach from the so-called mainstream text mining of poetry. The traditional way is to look at the poetry data as text and then apply various algorithms and methods to extract the information of interest; the paper by Dr. Cade-Stewart offers a very different perspective. In short, instead of extracting information directly from the text, the three-year project (which takes place at King’s College London and has been funded by the British Academy) uses a combination of text-to-speech software (called Mary text-to-speech) and an analysis script (developed by the project) to record phonetic data from the speech. The text-to-speech software can not only transform text into audio data but also present the audio data as highlighted text. This is exactly the information that the script analyzes. It identifies several beats in one line of the poem and then, with this information, predicts the positions of the beats in the following lines. The output of the script is valuable information about a poem, such as the number of beats per line, the meter, and the form.
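
The step of finding beats in one line and projecting them onto the next can be sketched as below. The paper's actual script and the output format of the text-to-speech software are not described here, so everything in this sketch is an assumption: lines are represented as lists of 0/1 beat marks per syllable, the helper names `beat_positions` and `predict_next_beats` are hypothetical, and the prediction simply carries over the first beat offset and the most common beat-to-beat interval.

```python
from collections import Counter

def beat_positions(stresses):
    """Indices of the syllables marked as beats (1) in one line."""
    return [i for i, s in enumerate(stresses) if s == 1]

def predict_next_beats(prev_stresses, next_length):
    """Project beats onto the next line of `next_length` syllables.

    Assumption: the next line keeps the previous line's first beat offset
    and its most common interval between consecutive beats.
    """
    beats = beat_positions(prev_stresses)
    if len(beats) < 2:
        # Too little evidence for an interval: carry over a single beat if it fits.
        return [b for b in beats if b < next_length]
    gaps = [b - a for a, b in zip(beats, beats[1:])]
    interval = Counter(gaps).most_common(1)[0][0]
    return list(range(beats[0], next_length, interval))

# Invented example: a four-beat line followed by a shorter six-syllable line.
previous_line = [0, 1, 0, 1, 0, 1, 0, 1]
print(len(beat_positions(previous_line)))      # number of beats in the line
print(predict_next_beats(previous_line, 6))    # predicted beat indices
```

Counting the beats in each line in this way gives the beats-per-line figures from which meter and form could then be inferred.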

We can see that data mining of poetry is a very complex task: in parallel with the traditional word and text mining done in (1), we also need to put considerable effort into extracting other information about the poem, especially phonetic information. This is where (2) and (3) differ from (1). (1) focuses mostly on how to extract the words and then categorize and analyze them to give meaningful insight into the works concerned. (2) and (3), on the other hand, look at a somewhat harder aspect of a poem, namely its meter, which requires more work on detecting the metrical rules of a line or of the whole poem. The way this is done is also what separates (2) from (3). While (2) still follows the road of text mining, (3) takes an interesting path: it looks at the information in the audio domain using a text-to-speech program. In conclusion, these 3 articles give us meaningful insights into how to treat this special type of literary material with data mining and how the process could and should be done.


(1) Bonch-Osmolovskaya, A., & Orekhov, B. (2014). DISTANT READING OF NAÏVE POETRY: CORPORA COMPARISON AS RESEARCH METHODOLOGY. Digital Humanities Lausanne ’14 Conference Archive. Retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-777.xml

(2) Algee-Hewitt, M., Heuser, R., Kraxenberger, M., Porter, J., Sensenbaugh, J., & Tackett, J. (2014). THE STANFORD LITERARY LAB TRANSHISTORICAL POETRY PROJECT PHASE II: METRICAL FORM. Digital Humanities Lausanne ’14 Conference Archive. Retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-788.xml

(3) Cade-Stewart, M. (2014). MINING POETIC RHYTHM: USING TEXT-TO-SPEECH SOFTWARE TO REWRITE ENGLISH LITERARY HISTORY. Digital Humanities Lausanne ’14 Conference Archive. Retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-640.xml