An interesting topic raised at the Digital Humanities Conference in Hamburg was the transcription and transformation of historical and archaeological materials into digital resources. On this note, a Swiss inter-university program stands out: HisDoc. Presented by Nada Naji, one of the students participating in this initiative, HisDoc brings together professors and students from three Swiss universities (Fribourg, Bern and Neuchatel) with the goal of transforming old manuscripts into fully searchable digital text, which would be made available as a basis for future research.

The principle is rather simple, and the tasks have been clearly divided between the three participating institutions. The idea of transcription and recognition is not new; the innovative element is the search engine. Their goal is to create more than a traditional search tool: an efficient and intelligent one (not naïve word matching, but guessing, suggesting, etc.). The real challenge of this project is its independence from human intervention. That is to say, transcription and text recognition are to be done using artificial intelligence, removing the need for linguistic experts to read and transcribe the text (a long and extremely expensive process).
Because the initiative is so ambitious, the construction of the system has been divided into separate steps. With the help of the Swiss National Science Foundation, this synergy research project ensures feedback between the participating institutions while allowing progress on all three pillars at once: the image analysis, text recognition and information retrieval modules. The results achieved so far are as follows:

  • The University of Fribourg handles the layout analysis module (recognition of page elements, columns, rows, etc.). They have used the Pyramid Method, with different image sizes, to create different scales of representation for the computer to recognize. Most elements were successfully recognized by their system, and the error percentage on different types of texts was low. Nonetheless, the main obstacles to overcome were the recognition of decorative elements (and categorizing them as such) and degradation (and the eventual reconstruction of text in the degraded areas).
  • The University of Bern is in charge of the manuscript recognition system. The system they created is able to read the handwriting and produce a digital text by “learning” from different text samples and recognizing special characters based on how they appeared in those samples. Ms. Naji gave a small demonstration of how this scanning system works: on the image of the manuscript, a thin box slides across the text, recognizing letters but also spaces (and thus inferring the end of a word). At the same time, it has to match the letters seen with the corresponding text, so that it can “learn” the shape and size of the characters used and reproduce them in future scans. Newer experiments included the use of Hidden Markov Models as well as Neural Networks to develop the “artificial scientist” and better recognize these texts. For this part of the project, across different text corpora, the error rates were under 20%, which is rather impressive.
  • Last but not least, the University of Neuchatel handles information retrieval and the creation of a search engine based on the data extracted from these images. Ms. Nada Naji works on precisely this part of the project, which is why she was able to provide a more complete overview of what their work entails. For the first steps of the project, they have used single-term queries: one word is matched against the manuscript, and several possible answers are retrieved. Nonetheless, they face big challenges concerning the quality of the data extracted from these manuscripts (orthography and grammar that do not follow modern rules, inflectional morphology, term confusion, etc.). The output is a set of 7 words (as possible correct results), ordered by their likelihood given the input. The error rate for this particular process is around 6%, bearing in mind that the queries are quite short (1-2 words, not whole sentences). On their side, the innovation comes from a system based on Massive Query Expansion (creating a dictionary and scoring every word according to the query, using as a benchmark what they call the “Ground Truth”, i.e. the error-free text manually transcribed by experts). The biggest challenge they face is what Ms. Naji calls “noisy queries”, that is, faulty words looked up by users (due to either confusion or misspelling).
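
To make the single-term query idea more concrete, here is a minimal sketch of how one word could be matched against a vocabulary of recognized word forms, returning up to 7 candidates ordered by similarity. It is not HisDoc's actual method: the talk did not detail the scoring, so plain string similarity from Python's standard `difflib` module stands in for it, and the vocabulary, the `suggest` function and the `cutoff` value are all assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary standing in for word forms produced by a recognizer.
# The entries are invented for illustration; they are not HisDoc data.
VOCABULARY = [
    "dominus", "domini", "domino", "dominum", "domina",
    "gratia", "gratias", "gracia", "regnum", "regis",
]


def suggest(query, vocabulary=VOCABULARY, limit=7, cutoff=0.6):
    """Return up to `limit` vocabulary terms similar to `query`, best first.

    Assumption: string similarity replaces the likelihood scoring
    mentioned in the talk, which was not described in detail.
    """
    return difflib.get_close_matches(
        query.lower(), vocabulary, n=limit, cutoff=cutoff
    )


if __name__ == "__main__":
    # A "noisy query": a misspelled or variant form typed by the user.
    print(suggest("dominvs"))
```

A variant spelling such as "dominvs" still retrieves "dominus" and its inflected forms, which is the kind of tolerance that both noisy queries and historical orthography call for. A real system would presumably score candidates with the recognizer's own confidence values rather than surface similarity alone.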

All in all, their aggregate error rate is around 8% for Latin manuscripts. Nonetheless, it is clear that one of their main challenges will be the different types of handwriting and the linguistic issues they have to “teach” their computers to handle.

Having followed this lecture, we cannot help but note the massive importance of the “digital humanities”, and it is rather interesting to see how experts in two fields as different as digital technologies and social sciences work together so passionately. Their common goal: to use modern technology to preserve old learning and research materials, but also to create a shared database of this information and make it available to other researchers around the world.