
Whether the aim is to extract factual information such as social connections in past Chinese dynasties, to give insight into the evolution of the French language, or to sort and organize current legislative hearing sessions in the US, text mining is here to lend a hand. It is a powerful tool and a core function of the Digital Humanities field. It is interesting to analyze the various uses of that tool and to see how approaches differ depending on what information is to be derived. In particular, we’ll focus on the different levels of interpretation that text mining can require.

In the abstract [1] « Mining and Discovering Biographical Information in Difangzhi with a Language-Model-based Approach », Peter Bol, Chao-Lin Liu and Hongsu Wang introduce a study whose goal is to « see how people are connected together, through kinship, social connections, and the places and offices in which they served (…) in the Song through Qing periods ». That means digitizing historical gazetteers called difangzhi (地方志) and then text-mining them. One particular challenge in that study is mining classical Chinese. Characters simply follow one another with no spaces between words, and parsing a sentence is hard because a word can consist of one or more characters. Misparsing a sentence can quickly lead to a badly wrong analysis. Still, the purpose here is to automatically extract factual information, for example: person A is linked with person B. Once the text is parsed correctly, little further interpretation is needed.
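
To make the parsing problem concrete, here is a minimal sketch in Python. It is not the language-model-based method of the abstract: it is a plain dictionary lookup that lists every way of cutting a string into known words. The toy lexicon is hypothetical, and the sentence is a well-known modern Chinese example rather than a line from a gazetteer.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that appear in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment whatever follows it.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon (an assumption, chosen only for illustration).
# "江大桥" can be read as a personal name.
LEXICON = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥", "江大桥"}

parses = segmentations("南京市长江大桥", LEXICON)
for parse in parses:
    print(" / ".join(parse))
```

With this lexicon the same seven characters can be cut in three ways, one of which reads as a city and its bridge and another as a mayor with a name. A real system has to choose between such readings, which is where a language model comes in.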

Some interpretation is, however, required in the second abstract [2], « Digital Democracy Project: Making Government More Transparent one Video at a Time ». The premise is striking: it is a pity that US citizens cannot easily access what elected representatives say in public sessions. The goal of the Digital Democracy Project is thus to digitize audio recordings of live sessions and provide « automated, inexpensive, timely, accurate, and informative knowledge » extracted from them. That implies using precise data structures and tagging the different contents, so that one ends up with a web-based search engine that lets citizens, organisations or politicians dig for what was said or done on a given topic at a specific time by a specific politician.
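
As a rough idea of what « precise data structures » and tagging could look like, here is a minimal Python sketch. The abstract does not describe the project's actual schema, so the `Utterance` record, its field names, the `search` helper and the sample records are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Utterance:
    """One tagged stretch of speech from a hearing (hypothetical schema)."""
    speaker: str
    hearing_date: date
    topics: frozenset
    text: str


def search(utterances, speaker=None, topic=None, on=None):
    """Keep the utterances that match every filter that was supplied."""
    return [
        u for u in utterances
        if (speaker is None or u.speaker == speaker)
        and (topic is None or topic in u.topics)
        and (on is None or u.hearing_date == on)
    ]


# Invented placeholder records, not real hearing data.
records = [
    Utterance("Speaker A", date(2015, 3, 2), frozenset({"water"}),
              "Placeholder remark on water policy."),
    Utterance("Speaker B", date(2015, 3, 2), frozenset({"water", "budget"}),
              "Placeholder remark on the water budget."),
    Utterance("Speaker A", date(2015, 4, 9), frozenset({"education"}),
              "Placeholder remark on schools."),
]

hits = search(records, speaker="Speaker A", topic="water")
for hit in hits:
    print(hit.hearing_date, hit.speaker, hit.text)
```

The interpretation step lies in filling the `topics` field: someone, or some model, has to decide what a given remark is about before the search can find it.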

Interpretation is crucial in the third abstract [3], « ‘88milSMS’, A New Digital Corpus Resource Of French Text Messages : Why We Chose To Exclude Full Transcoding and Standardised Tagging », written by Rachel Panckhurst of Paul-Valéry Montpellier University in France. Drawing on a large dataset of nearly 90,000 authentic text messages in French, the project « aims to build a worldwide database and analyse authentic text messages ». Text mining is, here too, at the heart of the initiative. Because of the length limit on SMS, French speakers ended up using what is now called SMS language, which includes shortened words and abbreviations. That prevents the use of standard text-mining methods, which require correct, standardized French. One might say, let’s « translate » from SMS language into correct French, but the author raises an interesting question: how do we draw the line between SMS language, grammatical mistakes, spelling mistakes, deliberate mistakes that mimic a particular pronunciation, and laziness mistakes such as dropping the accented « é » and typing « er » instead because it is faster on a smartphone? Each of these interpretations would lead to drastically different conclusions about the SMS author, which confirms the subtlety that text mining requires in this case.
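
The « er » for « é » case can be sketched in a few lines of Python. This is only a toy illustration of the ambiguity, not anything taken from the 88milSMS project: the `candidate_readings` function and its single rule are assumptions.

```python
def candidate_readings(token):
    """List the plausible standard-French readings of one SMS token.

    Hypothetical single-rule sketch: a token ending in "er" may be an
    infinitive typed as intended, or a word ending in "é" typed without
    the accent.
    """
    readings = [(token, "as written")]
    if token.endswith("er"):
        readings.append((token[:-2] + "é", "'er' typed in place of 'é'"))
    return readings


readings = candidate_readings("manger")
for form, reason in readings:
    print(form, "<-", reason)
```

The function can list the candidates but cannot pick one: choosing between « manger » and « mangé » needs the surrounding sentence and a guess about the writer's habits, which is exactly the kind of interpretation the author is wary of automating.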

We’ve seen through these three abstracts how text mining is a multi-field tool that can be a great asset for many areas of the Digital Humanities, such as history, politics, and language analysis and evolution. I found it interesting to underline the different degrees of interpretation that some projects require, which eventually lead to really complex problems. This confirmed for me that, although it is already an efficient and widespread method, text mining remains a very challenging topic and may keep feeding research for years to come.


References

[1] « Mining and Discovering Biographical Information in Difangzhi with a Language-Model-based Approach », Peter Bol (Harvard University, USA), Chao-Lin Liu (National Chengchi University, Taiwan) and Hongsu Wang (Harvard University, USA).

[2] « Digital Democracy Project: Making Government More Transparent one Video at a Time », Sam Blakeslee, Alex Dekhtyar, Foaad Khosmood, Franz Kurfess, Toshihiro Kuboi, Hans Poshcman, Giovanni Prinzivalli, Christine Roberston and Skylar Durst.

[3] « ‘88milSMS’, A New Digital Corpus Resource Of French Text Messages : Why We Chose To Exclude Full Transcoding and Standardised Tagging », Rachel Panckhurst (Université Paul-Valéry Montpellier, France).
