After digitizing historical and cultural documents gathered from all over the world, one has to analyze and identify them in order to reveal their content. Given the large and still growing amount of digitized data, such a task would be impossible without efficient tools based on image processing techniques.

When analyzing a digitized historical document, one of the main steps is to determine what type of document we are dealing with. Classifying documents by type eases the subsequent transcription step, because the document's context is then known. With the progress of image analysis techniques, this classification could be done automatically. For handwritten documents, a text line extraction algorithm can reveal the structure of the text and extract it line by line before the transcription step. The algorithm operates directly on the scanned image and recognizes and extracts every single line of the analyzed page. This is done by:

  • computing an energy map of the page in which text lines have high energy values and the gaps between them have low energy values
  • finding the minimum-energy zones on the energy map, then selecting and connecting them to create horizontal separations between the text lines
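The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation from [1]: the energy map here is just horizontally smoothed ink density, the seam is found with a standard dynamic-programming search, and the names `energy_map`, `min_energy_seam` and the synthetic `page` are assumptions made for the example.

```python
import numpy as np

def energy_map(gray, width=15):
    """Smoothed ink density: high on text lines, low in the gaps between them."""
    ink = 1.0 - gray.astype(float) / 255.0           # dark pixels -> values near 1
    width = min(width, gray.shape[1])
    kernel = np.ones(width) / width                  # horizontal box blur (assumption)
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, ink)

def min_energy_seam(energy, top, bottom):
    """Lowest-energy left-to-right path inside the row band [top, bottom)."""
    cost = energy[top:bottom].copy()
    rows, cols = cost.shape
    for x in range(1, cols):
        prev = cost[:, x - 1]
        from_above = np.concatenate(([np.inf], prev[:-1]))
        from_below = np.concatenate((prev[1:], [np.inf]))
        cost[:, x] += np.minimum(prev, np.minimum(from_above, from_below))
    # Backtrack from the cheapest end point, moving at most one row per column.
    seam = np.empty(cols, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(cols - 1, 0, -1):
        lo, hi = max(seam[x] - 1, 0), min(seam[x] + 2, rows)
        seam[x - 1] = lo + int(np.argmin(cost[lo:hi, x - 1]))
    return seam + top                                # row index of the seam per column

# Synthetic page: white background with two dark "text lines".
page = np.full((40, 60), 255, dtype=np.uint8)
page[5:13, :] = 0
page[25:33, :] = 0
seam = min_energy_seam(energy_map(page), top=8, bottom=30)
```

On this synthetic page the seam runs through the white gap between the two dark bands, which is the separation the method looks for. Real manuscripts would need a more careful energy definition and a way to choose the bands automatically.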

As the examples below show, the separation between two lines is computed accurately, and the algorithm is versatile: it works on different types of manuscripts, of varying quality and with varying handwriting styles.

DH2014_548_533-2

Fig 1 – Examples of image blocks and their computed seams. [1]

DH2014_548_533-4

Fig 2 – Seam carving results on manuscripts of the 16th and 18th centuries respectively. Even in the lower left manuscript with extreme bleed-through, the algorithm is able to produce a robust result. [1]

Image processing techniques allow us to extract not only the internal structure of the text but also the page structure. This can quickly reveal the kind of document we are analyzing, much as a human does by seeing the shape of a document before reading it. For instance, we recognize a poem before reading it, just from its visual structure. A computer can do the same with some image processing. First, since we are interested in the shape of the text, we have to extract it by separating the pixels that belong to the text from the background pixels. Once this is done, we end up with a binary image, where the text zones appear in black and the background in white:

DH2014_548_figure1.jpg

Fig 3 – Binary non-poem image snippet; 5×5 blurring [2]

DH2014_548_figure2.jpg

Fig 4 – Binary poem image snippet; 7×7 blurring [2]

The next step consists in recognizing the shape of the text. As human beings, we can do this because we have seen such shapes before, but a computer has to learn it, using the following criteria:

  • left and right margins
  • vertical white spacing between adjacent lines of text
  • the irregular shape of the end of lines
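As a rough sketch of how the binarization step and the three criteria above could be turned into numbers, the following NumPy code measures them on a grayscale page. It is not the method from [2]: the fixed threshold of 128, the feature definitions and the names `binarize`, `shape_features` and `synthetic_page` are all assumptions chosen for illustration.

```python
import numpy as np

def binarize(gray, threshold=128):
    """True where a pixel is dark enough to count as text (fixed threshold: assumption)."""
    return gray < threshold

def shape_features(text_mask):
    """Margins, vertical white space and raggedness of line ends, all in [0, 1]."""
    if not text_mask.any():
        raise ValueError("no text pixels found")
    height, width = text_mask.shape
    xs = np.flatnonzero(text_mask.any(axis=0))       # columns containing text
    ys = np.flatnonzero(text_mask.any(axis=1))       # rows containing text
    inner = text_mask.any(axis=1)[ys[0]:ys[-1] + 1]  # rows between first and last text row
    line_ends = np.array([np.flatnonzero(text_mask[y])[-1] for y in ys])
    return {
        "left_margin": xs[0] / width,
        "right_margin": (width - 1 - xs[-1]) / width,
        "blank_row_ratio": 1.0 - inner.mean(),       # white space between lines
        "ragged_right": line_ends.std() / width,     # irregularity of line ends
    }

def synthetic_page(line_ends, left):
    """White 50x100 page with one dark bar per 'text line' (toy data)."""
    page = np.full((50, 100), 255, dtype=np.uint8)
    for i, end in enumerate(line_ends):
        page[5 + 10 * i: 9 + 10 * i, left:end] = 0
    return page

prose = shape_features(binarize(synthetic_page([95, 95, 95, 95], left=5)))
poem = shape_features(binarize(synthetic_page([50, 70, 40, 65], left=20)))
```

On these toy pages the "poem" has a wider left margin and a clearly non-zero `ragged_right`, while the justified "prose" block has a raggedness of zero. Scanned pages would need a more robust binarization than a fixed threshold.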

Then, using machine learning techniques, a classifier was trained to automatically detect whether the analyzed document is a poem or not. In the future, this classifier could be trained to detect more types of documents and eventually become a very powerful tool in digital humanities.
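To make the idea of training such a classifier concrete, here is a minimal sketch using logistic regression written in plain NumPy. The text above does not say which classifier was used, so logistic regression is only a stand-in, and the feature vectors (left margin, blank-row ratio, raggedness of line ends) are invented placeholder values, not measurements from any corpus.

```python
import numpy as np

def train_poem_classifier(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by gradient descent on standardized features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = np.hstack([(X - mean) / std, np.ones((len(X), 1))])   # add bias column
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Z @ w))                      # predicted P(poem)
        w -= lr * Z.T @ (p - y) / len(y)                      # gradient step
    return {"mean": mean, "std": std, "weights": w}

def is_poem(model, features):
    """Classify one page from its shape features."""
    z = (np.asarray(features, dtype=float) - model["mean"]) / model["std"]
    return bool(np.append(z, 1.0) @ model["weights"] > 0)

# Invented toy training set: [left_margin, blank_row_ratio, ragged_right].
X = [[0.20, 0.55, 0.12], [0.25, 0.60, 0.10], [0.18, 0.50, 0.15],   # poems
     [0.05, 0.30, 0.01], [0.04, 0.35, 0.02], [0.06, 0.28, 0.00]]   # non-poems
y = [1, 1, 1, 0, 0, 0]
model = train_poem_classifier(X, y)
```

Calling `is_poem(model, [0.22, 0.58, 0.11])` then asks the trained model about a new page. Extending this to more document types would mean replacing the binary output with one class per type.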

After the structure of the text, we can look at its font, which can indicate many things, such as the age of a document. To this end, the PRImA Research Laboratory at the University of Salford developed Aletheia, a tool that can read a scanned printed text and identify the lines, words and characters within it.

DH2014_495_Aletheia-image

Fig 5 – Aletheia Desktop with identified glyphs and some of their associated Unicode values. [3]

Thanks to this tool, one can extract every glyph corresponding to a specific letter in the text and gather them. Then, with a tool called Franken+, it becomes possible, on the one hand, to improve tools such as Tesseract (an open source OCR engine, which Aletheia is based on) by training them to better recognize the typeface being processed. On the other hand, with the help of typeface and book history research teams, the user could find historical information about this font, such as the books or documents in which it was used, and thus narrow down the period in which the document could have been produced.
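Gathering all glyphs of the same letter amounts to grouping glyph regions by their Unicode value. The sketch below does this for a small, hypothetical XML layout file. The element names (`Glyph`, `Coords`, `Unicode`) and the sample document are assumptions made for the example and should not be read as a description of the files Aletheia or Franken+ actually produce.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical layout file: one element per glyph, with a polygon and a character.
SAMPLE = """<Layout>
  <Glyph id="g1"><Coords points="10,10 20,10 20,25 10,25"/><Unicode>a</Unicode></Glyph>
  <Glyph id="g2"><Coords points="22,8 30,8 30,25 22,25"/><Unicode>b</Unicode></Glyph>
  <Glyph id="g3"><Coords points="40,10 50,10 50,25 40,25"/><Unicode>a</Unicode></Glyph>
</Layout>"""

def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}Glyph' down to 'Glyph'."""
    return tag.rsplit("}", 1)[-1]

def glyphs_by_character(xml_text):
    """Map each character to the bounding boxes (x0, y0, x1, y1) of its glyphs."""
    groups = defaultdict(list)
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "Glyph":
            continue
        char, box = None, None
        for child in element.iter():
            name = local_name(child.tag)
            if name == "Unicode" and child.text:
                char = child.text
            elif name == "Coords" and child.get("points"):
                points = [tuple(map(int, p.split(",")))
                          for p in child.get("points").split()]
                xs, ys = zip(*points)
                box = (min(xs), min(ys), max(xs), max(ys))
        if char is not None and box is not None:
            groups[char].append(box)
    return dict(groups)

groups = glyphs_by_character(SAMPLE)
```

Each list in `groups` could then be used to crop the corresponding image regions, giving a set of exemplars per letter like the one shown for the "a" glyph.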

DH2014_495_Franken+SampleA

Fig 6 – An image from Franken+ of a set of exemplars of the “a” glyph from one document. [3]

To conclude, we have seen three useful tools based on image processing, among many others. Used together, they could make the automatic classification of a digitized document accurate enough to be applied to the millions of other documents waiting to reveal their cultural and historical content.


References

[1] Arvanitopoulos Darginis, Nikolaos and Süsstrunk, Sabine (2014). Binarization-free Text Line Extraction for Historical Manuscripts. EPFL, Switzerland

[2] Lorang, Elizabeth M., Soh, Leen-Kiat, Lunde, Joseph and Thomas, Grace (2014). Detection of Poetic Content in Historic Newspapers through Image Analysis. University of Nebraska-Lincoln, United States of America

[3] Christy, Matthew, Samuelson, Todd, Torabi, Katayoun, Tarpley, Bryan and Grumbach, Elizabeth (2014). Book History and Software Tools: Examining Typefaces for OCR Training in eMOP. Texas A&M University
