Digital data is now everywhere, and images are one of its most important categories. Figure 1 shows how the volume of video, image and audio data is growing and is expected to keep growing in the near future. Because large amounts of data are difficult to handle manually, automated tools become more and more essential as the available data increases.
Part of this huge data volume comes from the digitisation of analog material, and an important portion of that is historical data, including images. Historical images are valuable resources in the digital humanities because they give researchers a rich playground for their work. Digitising documents and distributing them as digital images will therefore have a great impact and will accelerate ongoing research in the digital humanities.
Images have not yet received as much attention as written sources. Image processing techniques can make work on them much easier, not only now but also in the future, when there will be more data and more work to do on images. The sensitivity issues surrounding the original old documents also explain why digitisation is important and what role image processing can play in the digital humanities. At first glance, such tools may not seem necessary: making a specific image processing technique work properly can require complicated algorithms and a lot of funding, while the human brain handles the same tasks quite easily. Nevertheless, the tools pay off when we deal with huge amounts of data, because they reduce work that would otherwise have to be done by many people.
Images are currently less popular than written sources among digital humanities scholars, possibly because textual data has been available for longer, because it raises fewer copyright issues than image data, and because audiovisual content is still relatively difficult to analyse. Scholars working with image or audio data face limitations at various stages of research, whereas scholars working with textual data have many advanced tools. The first problem is the limited support for indexing and searching visual data: content-based image retrieval still needs improvement. The second is the analysis of visual material: once the relevant data has been identified, tools are needed to analyse and interpret its content, which requires image recognition.
There is ongoing research on both of these problems, and different solutions have already been proposed. For instance, a paper from UC Santa Barbara introduces a solution to the searching and indexing problem for visual data. In the paper, the author describes Arch-V (Archive Vision), which delivers image-based indexing and searching of digital archives of print materials. The author argues that even though other advanced tools for classification and recognition exist, applying them to archives of printed materials raises particular challenges and gives less satisfactory results. The main problem is that current computer vision techniques extract recognisable feature points by processing the refraction of the surface texture of an object in a digital image. In digitised print archives, however, surface texture is not an indicator of the substances present in the print.
Arch-V, developed in C++, normalises different kinds of images to a common format. It then applies a modified feature point extraction methodology and combines it with a process of border contour extraction and comparison. After feature extraction, each image in the archive database is indexed so that the query engine can compare images. The solution has already been implemented, and producing a collection of feature points to define the boundaries turned out to work well. Interested readers can get the complete code from Git, where it is publicly available. After further development and companion documentation, the project team aims for ease of implementation, so that people with different backgrounds can apply the tool to various types of digital archives without academic technical knowledge.
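The pipeline described above (normalise, extract border contours, index, compare) can be illustrated with a very small sketch. This is not Arch-V's algorithm or code: it is written in Python rather than C++, it skips feature point extraction entirely, and the function names, the fixed threshold of 128 and the Jaccard overlap used for comparison are all assumptions made for illustration.

```python
def to_binary(gray, threshold=128):
    """Normalise a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = paper."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def border_pixels(binary):
    """Return the ink pixels that touch paper or the image edge (4-neighbourhood)."""
    h, w = len(binary), len(binary[0])
    border = set()
    for r in range(h):
        for c in range(w):
            if not binary[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < h and 0 <= nc < w)
                if outside or not binary[nr][nc]:
                    border.add((r, c))
                    break
    return border


def contour_similarity(a, b):
    """Jaccard overlap of two border-pixel sets (assumed measure, 1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


W, K = 255, 0  # grey levels for paper and ink
filled = [
    [W, W, W, W, W],
    [W, K, K, K, W],
    [W, K, K, K, W],
    [W, K, K, K, W],
    [W, W, W, W, W],
]
hollow = [row[:] for row in filled]
hollow[2][2] = W  # same outline, empty interior
bar = [[K] * 5] + [[W] * 5 for _ in range(4)]

# Index every archive image by its border contour, then rank against a query.
archive = {"filled": filled, "hollow": hollow, "bar": bar}
index = {name: border_pixels(to_binary(img)) for name, img in archive.items()}
query = border_pixels(to_binary(filled))
ranked = sorted(index, key=lambda n: contour_similarity(query, index[n]), reverse=True)
```

In this toy archive the filled and the hollow square have identical borders, so both rank above the bar. This reflects the idea of comparing outlines rather than surface texture.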
Let us now turn to the other limitation, the analysis of visual material, and look at an example of an available solution. A paper by members of the University of Nebraska-Lincoln addresses the detection of poetic content in historic newspapers through image analysis. Poems are an important part of historic newspapers: nearly a million poems appeared in American newspapers between the 18th and early 20th centuries alone. They are therefore a valuable source for studying American culture and poetry and can give detailed information about that period. To detect poetic content manually, a person would need to scan every page by eye and find the areas that resemble poetry. With a large number of newspapers, this is not an easy task. To show how much work would be needed, it is enough to mention that there were nearly half a million daily newspaper pages in 1860 alone. It is clear that image processing techniques are needed to help.
To do this, we first need to extract the features of typical poetic content in an image and then go through every image of a digitised newspaper page to check whether it has these or similar features. It turns out that poetic content usually has certain visual features that distinguish it from other text in newspapers. People can pick out poetic content with some degree of accuracy just by glancing at a newspaper page, even without understanding the text, and this observation led the authors to design computer algorithms that do the same.
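As a rough illustration of what a visual feature could look like, the sketch below computes two layout measurements from a tiny binary snippet (1 = ink, 0 = blank). Both features, the share of blank rows and the raggedness of the right edge, and the function name `snippet_features` are assumptions chosen for illustration, not the attributes defined in the paper.

```python
def snippet_features(snippet):
    """Return (blank_row_ratio, raggedness) for a binary snippet.

    Illustrative features only, not the attributes used by the paper's authors.
    """
    height = len(snippet)
    width = len(snippet[0])

    # Share of rows with no ink at all, e.g. gaps between stanzas.
    blank_rows = sum(1 for row in snippet if not any(row))
    blank_row_ratio = blank_rows / height

    # Right-most ink column of every non-blank row.
    line_ends = [
        max(i for i, px in enumerate(row) if px)
        for row in snippet
        if any(row)
    ]
    # Spread of the line ends relative to the width: uneven line
    # lengths give a larger value than justified prose.
    if line_ends:
        raggedness = (max(line_ends) - min(line_ends)) / width
    else:
        raggedness = 0.0
    return blank_row_ratio, raggedness


poem_like = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
prose_like = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
]

poem_features = snippet_features(poem_like)
prose_features = snippet_features(prose_like)
```

The poem-like snippet scores higher on both measurements than the prose-like one, which is the kind of difference a classifier can learn from.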
The problem can be reduced to a classification problem in machine learning. In this kind of problem, after collecting data samples, we divide the data into two parts: a training set and a test set. We first train a hypothesis on the training data to obtain a classifier, and then evaluate this classifier on the remaining test data, which were not used in the training phase. For the preprocessing stage, the authors explain the need to extract image snippets and decide whether each snippet contains a poem or resembles one. Figures 2 and 3 below show typical binary poem and non-poem image snippets, respectively. As the figures show, poem image snippets have a specific pattern.
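The split into training and test data can be sketched in a few lines. The sample identifiers, the 80/20 ratio and the function name `train_test_split` are assumptions for illustration; the paper's actual data handling may differ.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle labelled samples and split them into (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled snippets: (snippet_id, label).
samples = [
    (f"snippet_{i}", "poem" if i % 2 == 0 else "non-poem")
    for i in range(10)
]
train_set, test_set = train_test_split(samples, test_fraction=0.2)
```

Because the test samples never reach the training step, the accuracy measured on them estimates how the classifier behaves on unseen newspaper pages.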
After representing the image with binary values, the developers check attributes such as the spacing between lines, the number of left and right columns of the binary image, and other characteristic details, and compare them with typical poetic content. An artificial neural network (ANN) is used to train the classifier. Next, we need to check how well the classifier predicts. For this stage, a machine learning technique called 10-fold cross-validation (k-fold cross-validation in general) is used. It divides the data set into 10 parts, uses 9 parts for training the classifier and the remaining part for testing. Repeating this so that each part serves once as the test set and combining the results gives the overall accuracy of the classifier. After the accuracy test, the best-performing classifier can be used to detect poetic content in old newspaper images.
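The cross-validation procedure can be sketched as follows. To keep the example self-contained, a simple nearest-centroid classifier stands in for the artificial neural network, and the two-dimensional feature vectors are synthetic. The classifier, the data and all function names are assumptions, and the sketch assumes k is not larger than the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def train_centroids(features, labels):
    """Stand-in classifier (not an ANN): mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))


def cross_validate(features, labels, k=10):
    """Return the mean test accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(features), k):
        model = train_centroids(
            [features[i] for i in train], [labels[i] for i in train]
        )
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Synthetic, clearly separable feature vectors (illustrative values only).
features, labels = [], []
for i in range(20):
    if i % 2 == 0:
        features.append((0.25 + 0.01 * i, 0.50))
        labels.append("poem")
    else:
        features.append((0.00, 0.10 + 0.005 * i))
        labels.append("non-poem")

mean_accuracy = cross_validate(features, labels, k=10)
```

Each of the 10 folds is held out once while the other 9 are used for training, and the mean of the 10 fold accuracies is the figure reported for the classifier.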
In conclusion, this text first showed the importance of images in the digital humanities and then discussed the usefulness of image processing techniques. It presented two important limitations on the use of image processing techniques in the digital humanities and introduced two short papers that propose possible solutions to them.
1) R. Ordelman, M. Kemman, M. Kleppe, F. de Jong, Sound and (moving) images in focus - how to integrate audiovisual material in digital humanities research ( http://dharchive.org/paper/DH2014/Workshops-914.xml )
2) C. Stahmer, Arch-V: A platform for image-based search and retrieval of digital archives ( http://dharchive.org/paper/DH2014/Paper-325.xml )
3) E. Lorang, L. Soh, J. Lunde, G. Thomas, Detection of poetic content in historic newspapers through image analysis ( http://dharchive.org/paper/DH2014/Paper-851.xml )