Automatically recognizing key information in written texts and transcribing it into a digital format is a challenge faced daily by people working in Digital Humanities. It allows researchers to find and classify relevant information hidden in piles of documents or in archives.
It is extremely difficult to build a general algorithm that could read any written language from any period of history and understand what is relevant in the text. Some research teams, however, have managed to build working algorithms that are useful for specific tasks.
To explain the difficulties that such algorithms face, and how to overcome them, we are going to explore three different cases.
The first is a research group that uses Tolstoy’s extraordinarily vast body of texts to train an algorithm to understand how sentences are constructed [1], to identify subjects, verbs, etc., and by extension to identify the characters, their actions and even their relations. In the future, an algorithm like this could parse historical archives and automatically associate events with the actions of important actors, building a large database of people’s achievements through history.
The second is an archaeology school in France whose collection of written research reports has grown so large that finding information in it has become a slow and painful process [2]. Simply scanning the reports would make them easier to access, but it would still not allow a piece of information to be found quickly in the collection. They want an algorithm that can identify important elements in the text, such as the date, location and subject of the research.
The third case is a US/Chinese research group that is trying to expand the China Biographical Database [3] by examining historical gazetteers*. The idea is to start by finding the people already present in the database by looking for already known information, then to fill in missing information and to find their relations and jobs. The group concentrates its effort on officials because government facilities are well documented in the gazetteers. The parsing may reveal new officials not yet present in the database.
The three research groups, although they are trying to achieve similar tasks, face different difficulties. Unlike the other two, the first group doesn’t need to scan its dataset: Tolstoy’s work already exists in digitized form and has even been translated into multiple languages. Focusing on one author also reduces the range of vocabulary to be learned, and the writing patterns are more consistent, which makes recognition easier. The greatest difficulty will be to teach the algorithm how to understand a sentence: which word is the subject of an action, which word expresses the action, and so on.
On the other hand, the two other groups first need to scan their paper documents and then recognize the characters via OCR. This is particularly difficult in Chinese, where over 4000 different characters are commonly used, as opposed to the 26 Latin letters used in Western languages. Once this is done, the French and Chinese groups have datasets that are already organized in some respects, unlike Tolstoy’s work. The French reports have a structure that facilitates identification: information such as the author, the date of discovery, the location and name, and the category and historical period of the discovery can be found by looking at a particular place in the text. Some relevant information, however, can only be found in the description, and it needs to be extracted with a technique similar to, but less sophisticated than, that of the Tolstoy group: training an algorithm to understand the composition of basic sentences.
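The positional lookup described for the French reports can be illustrated with a minimal sketch. Everything specific here is an assumption made for the example: the field labels, the sample report, the keyword list and the `extract_fields` helper are hypothetical, since the actual report layout is not given. The keyword scan over the description is also a much simpler stand-in for real sentence analysis.

```python
import re

# Hypothetical report layout; the real reports' field labels are not known.
SAMPLE_REPORT = """\
Author: A. Example
Date: 1987-06-14
Location: Site A
Period: Iron Age
Description: Excavation of a rampart section; several amphora fragments and a bronze fibula were found.
"""

# Fields that sit at a fixed, labelled place in the report.
FIELD_PATTERN = re.compile(r"^(Author|Date|Location|Period):\s*(.+)$", re.MULTILINE)

# Hypothetical list of terms to look for in the free-text description.
KEYWORDS = ("amphora", "fibula", "coin", "rampart")


def extract_fields(report):
    """Return the labelled fields plus the keywords found in the description."""
    fields = {label.lower(): value.strip()
              for label, value in FIELD_PATTERN.findall(report)}
    match = re.search(r"^Description:\s*(.+)$", report, re.MULTILINE)
    description = match.group(1).lower() if match else ""
    fields["keywords"] = [word for word in KEYWORDS if word in description]
    return fields


record = extract_fields(SAMPLE_REPORT)
print(record)
```

The labelled fields are cheap to extract because their position is known in advance; only the description requires any form of text understanding.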
The Chinese dataset, like the French one, is also structured, since it is similar to a directory. The main difficulty is the language. In traditional Chinese, unlike in Western languages, there are no separators between words or sentences, so nothing marks where a name or a title begins and ends. This also leads to multiple interpretations, since the same run of characters can be grouped into words in several ways. The characters themselves are another source of misinterpretation: many characters share the same pronunciation, so there are sometimes multiple ways of writing the same thing.
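To make the ambiguity concrete, here is a small sketch that lists every way an unsegmented string can be split into known words. The lexicon and the sample string are assumptions chosen for illustration (a well-known modern example, not text from the gazetteers), and `segmentations` is a hypothetical helper.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words that all belong to lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical lexicon: without separators, several groupings are valid.
LEXICON = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥", "江大桥"}

candidates = segmentations("南京市长江大桥", LEXICON)
for words in candidates:
    print(" / ".join(words))
```

With this lexicon the same seven characters can be read as a city followed by a bridge or as a city, a mayor and a personal name, which is the kind of ambiguity a parser has to resolve from context.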
To solve these challenges, the three groups use different methods. The Tolstoy group uses a very advanced text analysis tool developed by ABBYY called Compreno. This software can correctly identify and extract the actors and their characteristics, as well as the actions and their locations. It relies on an immense set of semantic rules. With the help of this software, the research team could establish statistics about the characters that couldn’t be obtained before. For example, they could identify which characters talk the most.
The archaeology team uses simpler software than the Tolstoy team. They developed their own solution, which looks for precise terms in the text or at precise locations in the report structure to get as much useful information as possible. This method works well, as shown below.
The Chinese group’s gazetteers, like any other index, also have a structure. Unlike the French group, which can extract information based on its location in the text, the Chinese group still has to develop a pattern recognition system to analyse its data, because it is not known in advance how many characters a name or an office has. They try to form the longest possible words first, then find the others, and finally check consistency between the names, functions, locations and reigns. If the result seems illogical, for example a function that didn’t exist in a certain reign, a new pattern is tried.
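The longest-first strategy with a consistency check can be sketched as follows. This is only an illustration of the idea, not the group's actual system: the Latin letters stand in for Chinese characters, and the office table, name list, reign labels and helper functions are all hypothetical.

```python
def longest_first(text, lexicon):
    """Yield segmentations of text, trying the longest known word first."""
    if not text:
        yield []
        return
    for end in range(len(text), 0, -1):
        word = text[:end]
        if word in lexicon:
            for rest in longest_first(text[end:], lexicon):
                yield [word] + rest


# Hypothetical data: each office is mapped to the reigns in which it existed.
OFFICE_REIGNS = {"AB": {"reign-1", "reign-2"}, "ABC": {"reign-2"}}
NAMES = {"D", "CD"}


def is_consistent(words, reign):
    """Accept only an office followed by a name, with the office valid in that reign."""
    if len(words) != 2:
        return False
    office, name = words
    return name in NAMES and reign in OFFICE_REIGNS.get(office, set())


def parse_record(text, reign):
    """Return the first longest-first segmentation that passes the check, or None."""
    lexicon = set(OFFICE_REIGNS) | NAMES
    for words in longest_first(text, lexicon):
        if is_consistent(words, reign):
            return {"office": words[0], "name": words[1], "reign": reign}
    return None


print(parse_record("ABCD", "reign-1"))
print(parse_record("ABCD", "reign-2"))
```

For `reign-1` the longest candidate office `ABC` is rejected because it did not exist in that reign, so the parser falls back to the next pattern, `AB` followed by `CD`. For `reign-2` the longest match is accepted directly.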
They extracted 1260 records and compared them to the existing China Biographical Database. A circle means a match with the existing database; a cross means either that the information didn’t exist in the database or that the extracted records contradict previous information. Type 1 is only useful for locating the text structure and for getting more detailed information than just the names, addresses and functions.
As we have seen, the different techniques involved overlap on many occasions. If we could bind all these different algorithms together into one, we would obtain a universal tool stronger than the algorithms taken separately, capable of extracting information from various kinds of written work.
* A gazetteer is a geographical dictionary or index containing information about the geography and the inhabitants of a region.
[1] Daniil Skorinkin, National Research University ‘Higher School of Economics’ and ABBYY software company; Anastasia Bonch-Osmolovskaya, National Research University ‘Higher School of Economics’: Automatic semantic tagging of Leo Tolstoy’s works
[2] Frederique Melanie-Becquet, LATTICE-CNRS, France; Johan Ferguth, LATTICE-CNRS, France; Katherine Gruel, AOROC-CNRS, France; Thierry Poibeau, LATTICE-CNRS, France: Archaeology in the Digital Age: From Paper to Databases
[3] Peter Bol, Harvard University, USA; Chao-Lin Liu, National Chengchi University, Taiwan; Hongsu Wang, Harvard University, USA: Mining and Discovering Biographical Information in Difangzhi with a Language-Model-based Approach