
Humanity now produces a colossal amount of information compared with the twentieth century. In fact, 90% of all recorded data was generated in the last two years. Moreover, recent reports suggest that a completed Google book library will likely hold more than 10 million items. No one could read that many books even by spending a whole lifetime on it. Humanity therefore faces a problem: it cannot use this sea of data in the conventional way. Contemporary data mining tools based on big data techniques bring us closer to a solution.

One of the most important parts of text mining is the visualization of results. There are five principal types of visualization: dendrograms, histograms, network diagrams, word clouds, and scatterplots. No type has an absolute advantage over the others; each has its own pros and cons. The biggest drawback of these visualization methods is their lack of scalability: for large data sets, the results become completely unreadable. Paper [3] proposes multi-type visualization in place of single-type visualization. Its authors developed a tool that lets a researcher explore combinations of those five types to find the one that conveys the required information most clearly. For instance, one can use a histogram to explore a dendrogram. The tool helps researchers make sense of big data by intelligently combining techniques that, on their own, cannot present large data sets clearly.

Another approach to analyzing large volumes of written data quickly is a technique called quantitative analysis [2]. All data is stored in a large database, and one can parse and structure the vast strings of data with SQL queries. In contrast to the visualization techniques mentioned above, the readability of the output of a “good” SQL query does not degrade as the data set grows. The drawback is that this technique requires much more experience and knowledge than visualization does. Quantitative analysis has three useful properties. First, writing SQL queries lets us read information in a completely new way. Second, since a computer search runs thousands of times faster than a search conducted by a human, the technique gives scholars a huge productivity boost. Third, quantitative analysis has great potential to unite a scholarly community that is now divided between the exact sciences and the humanities.
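To make the idea of "reading" a corpus through SQL concrete, here is a minimal sketch using Python's built-in sqlite3 module. The toy corpus, the `tokens` table, and the helper names `build_database` and `word_counts` are all hypothetical illustrations and are not taken from [2]; a real project would likely use a far larger database and a richer schema.

```python
import sqlite3

# Hypothetical toy corpus of (title, year, text); not data from the cited paper.
CORPUS = [
    ("Book A", 1850, "the sea and the whale and the sea"),
    ("Book B", 1900, "a quiet garden by the sea"),
    ("Book C", 1950, "the city never sleeps"),
]


def build_database(corpus):
    """Load one row per word occurrence into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (title TEXT, year INTEGER, word TEXT)")
    rows = [
        (title, year, word)
        for title, year, text in corpus
        for word in text.split()
    ]
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)
    return conn


def word_counts(conn, word):
    """Return (title, count) for each book containing `word`, most frequent first."""
    query = """
        SELECT title, COUNT(*) AS n
        FROM tokens
        WHERE word = ?
        GROUP BY title
        ORDER BY n DESC, title
    """
    return conn.execute(query, (word,)).fetchall()


conn = build_database(CORPUS)
sea_counts = word_counts(conn, "sea")
print(sea_counts)
```

The output is a short table of titles and counts, and it stays short however many rows the `tokens` table holds, which is the scalability property described above.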

The tools mentioned above concentrate only on text analysis, but modern digital libraries also contain other types of information, especially images such as text scans and photos of woodcut impressions, which we should also be able to analyze. Over the last decade, software giants like Google and Facebook have made a major breakthrough in image search and recognition. Modern object recognition systems successfully compete with human abilities. On the other hand, state-of-the-art object recognition algorithms rely heavily on the refraction of light across surface texture. For images of print artifacts this property is a distraction, because the texture belongs only to the delivery medium and not to the objects represented on it. Arch-V [1] is a new automated image recognition platform devoted to searching digital archives. It modifies the feature point extraction methodology by adding a process of border contour extraction and comparison. This allows scholars to recognize objects in images of print artifacts by the boundaries of the objects rather than by variations in surface texture.
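The general idea of comparing shapes by their boundaries rather than their texture can be sketched in a few lines of plain Python. This is not Arch-V's algorithm, whose details are not given here; it is a deliberately simple stand-in that assumes binarized images (1 for ink, 0 for background) with at least one ink pixel each, and the function names are hypothetical.

```python
def boundary_pixels(image):
    """Return the set of (row, col) ink pixels that touch the background.

    `image` is a list of equal-length rows of 0 (background) and 1 (ink).
    Positions outside the image count as background.
    """
    height, width = len(image), len(image[0])
    boundary = set()
    for r in range(height):
        for c in range(width):
            if not image[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or not image[nr][nc]:
                    boundary.add((r, c))
                    break
    return boundary


def normalise(pixels):
    """Shift a non-empty pixel set so its top-left corner sits at (0, 0)."""
    min_r = min(r for r, _ in pixels)
    min_c = min(c for _, c in pixels)
    return {(r - min_r, c - min_c) for r, c in pixels}


def contour_similarity(image_a, image_b):
    """Jaccard overlap of the two translation-normalised boundaries (0.0 to 1.0)."""
    a = normalise(boundary_pixels(image_a))
    b = normalise(boundary_pixels(image_b))
    return len(a & b) / len(a | b)


# Assumed toy inputs: the same 4x4 square at two positions, and a 1x4 bar.
SQUARE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
SHIFTED = [
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
BAR = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

print(contour_similarity(SQUARE, SHIFTED))
print(contour_similarity(SQUARE, BAR))
```

Because only the outline enters the comparison, anything inside the shape (paper grain, ink density) has no effect on the score. A practical system would also need to handle scale, rotation, and noisy outlines, which this sketch ignores.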

Contemporary research in data mining enables Digital Humanities scholars to boost the efficiency of their work. This article presented three up-to-date examples: visualization, quantitative analysis, and digital image search.


[1] Stahmer, Carl. Arch-V: A Platform for Image-Based Search and Retrieval of Digital Archives.

[2] Mauro, Aaron. Small-Scale Big Data: Experimental Literature and Distributed Computing.

[3] Montague, John Joseph; Rockwell, Geoffrey; Ruecker, Stan; Sinclair, Stéfan; Brown, Susan; Chartier, Ryan; Frizzera, Luciano; Simpson, John. Seeing the Trees & Understanding the Forest.