Introduction

Big data is a field in constant expansion. Computing power today is far greater than it used to be, and many projects that looked completely impossible even a few years ago are now feasible. Many people work in this area, constantly searching for the next breakthrough. But as we will see in the next three abstracts, they largely have to build their own tools and methods themselves.

First abstract: Exploring Large Datasets with Topic Model Visualisations

The idea behind this abstract is to present methods for exploring large datasets through visualization. In particular, it addresses the fact that, despite the large number of tools available for topic model visualization, there is still no ideal tool for exploring very large datasets, i.e. datasets with millions of entries spanning a wide range of time, which generate a large number of candidate topics via topic modelling.

The usual method for visualizing topic models is split into two stages. First, a tool is used to build topic models on a specific collection. For example, MALLET is a command-line tool that performs this first stage and outputs text files, which are the default input format for second-stage tools. The second stage is where the problem mentioned above appears.
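As an illustration, here is a minimal Python sketch of the first thing a second-stage tool has to do: read MALLET's document-topics output back into a document-by-topic matrix. The file name and the exact column layout are assumptions, since they vary between MALLET versions.

```python
# Minimal sketch: load a MALLET doc-topics file into a document-by-topic matrix
# for a second-stage visualization tool. Assumes the "dense" layout used by
# recent MALLET versions:
#   <doc index> <doc name> <weight of topic 0> <weight of topic 1> ...
# (check the layout of your own output file, as it differs between versions).
import csv

def load_doc_topics(path):
    """Return (doc_names, matrix) where matrix[i][k] is the weight of topic k in doc i."""
    doc_names, matrix = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):  # skip a possible header/comment line
                continue
            doc_names.append(row[1])
            matrix.append([float(x) for x in row[2:]])
    return doc_names, matrix

if __name__ == "__main__":
    # "doc-topics.txt" is a hypothetical file name for the first-stage output.
    names, weights = load_doc_topics("doc-topics.txt")
    for name, w in zip(names, weights):
        top = max(range(len(w)), key=w.__getitem__)
        print(f"{name}: dominant topic {top} ({w[top]:.2f})")
```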

The existing types of topic modelling visualization can be separated into four categories, and each second-stage tool implements a specific one. Each tool is designed to fit a single type of data, which is why the existing tools cannot visualize arbitrary data: a new tool often has to be designed for each type of data we want to visualize.

As Lauren Frederic-Klein put it, “By forcing all thematic differences into a single two-dimensional presentation, information is inevitably lost”. The tool described in this abstract therefore uses 3D visualization, built with the JavaScript framework famo.us. The key idea behind the implementation is to combine multiple visualizations and to use the third dimension to “fly” through them, so that only the desired amount of information is displayed at a time (see figure below).

[Figure: 3D topic modelling visualization]
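The abstract does not detail the famo.us implementation, so the short matplotlib sketch below is only a conceptual stand-in: it stacks several 2D topic views along a third “depth” axis, which is the same fly-through idea described above.

```python
# Conceptual sketch only: the paper's tool uses the famo.us JavaScript framework,
# which is not reproduced here. This matplotlib stand-in just illustrates the idea
# of stacking several 2D topic views along a third "depth" axis.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_layers, n_points = 4, 60          # e.g. one layer per time slice or per view

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

for depth in range(n_layers):
    # Each layer is a fake 2D scatter of topics (x, y = two layout dimensions).
    xs, ys = rng.normal(size=(2, n_points))
    ax.scatter(xs, ys, zs=depth, zdir="z", alpha=0.6, label=f"view {depth}")

ax.set_xlabel("topic layout dimension 1")
ax.set_ylabel("topic layout dimension 2")
ax.set_zlabel("depth (which visualization)")
ax.legend()
plt.show()
```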

Second abstract: Taking Stylometry to the Limits: Benchmark Study on 5,281 Texts from “Patrologia Latina”

Still in the big data domain, this abstract summarizes a stylometry experiment on a dataset of more than 5,000 texts from the “Patrologia Latina”. The collection gathers texts by over 700 Latin Church Fathers, written over a period of more than a thousand years (2nd to 13th century). It is a very good dataset to study because, even though more than one topic is covered, most of the texts relate to theology. The collection was published in the middle of the 19th century over a ten-year period, which makes it interesting to study, as the publishers did not try to produce a very clean edition. Indeed, their initial assumption was that the volumes would later be replaced with cleaner versions. The collection is therefore very noisy, allowing the researchers to take stylometry one step further.

In this study, the benchmark focused only on authorship attribution. As mentioned above, the “Patrologia Latina” is therefore a good test case, as it is close to a real-world dataset.

The classification method is then classic machine learning. First, the researchers preselected the texts, removing the most irrelevant ones (in this case, texts with fewer than 3,000 tokens and texts by authors represented by fewer than three texts). The data was then separated into two sets: a training set (2 texts per author) and a test set (all remaining texts). The idea was to run multiple experiments covering every combination of parameters (with and without punctuation, for example).
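To make the setup concrete, here is a minimal sketch of such a pipeline in Python with scikit-learn. The feature set (the most frequent tokens, with punctuation marks counted as tokens) and the classifier (a linear SVM) are illustrative assumptions, not the exact choices of the benchmark.

```python
# Minimal sketch of the kind of attribution experiment described above.
# The features and classifier below are assumptions for illustration only.
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def attribute(train_texts, train_authors, test_texts, keep_punctuation=True):
    """Train on a few texts per author, predict the authors of the remaining texts."""
    if not keep_punctuation:
        # one experimental condition: strip punctuation before vectorizing
        table = str.maketrans("", "", string.punctuation)
        train_texts = [t.translate(table) for t in train_texts]
        test_texts = [t.translate(table) for t in test_texts]

    # token frequencies limited to the most frequent items; the token pattern
    # keeps punctuation marks as separate tokens so they can act as features
    vec = TfidfVectorizer(max_features=500, token_pattern=r"\w+|[^\w\s]")
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)

    clf = LinearSVC()
    clf.fit(X_train, train_authors)
    return clf.predict(X_test)

# The full benchmark would loop a function like this over every combination of
# parameters (with/without punctuation, number of features, classifier, ...).
```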

The results are quite interesting: predictions are more accurate when punctuation is taken into account. This is somewhat unexpected, given that the texts are in ancient Latin, a language known for making no use of punctuation (it was introduced later).

The interesting part of these results is the demonstration that punctuation increases the accuracy of the attribution in most cases. This leads to the conclusion that even though punctuation followed no rules at the time, each author followed a specific pattern, allowing the algorithm to identify the author much more clearly when punctuation is used.

Third Abstract: Venice Time Machine: Recreating the density of the past

In the third abstract, the Venice Time Machine is presented as a perfect example of a big data project. The idea is quite simple: recreate the map of the city for each year since Venice's foundation. The map will be interactive, allowing the user to navigate through time and across the city.

To achieve such a goal, a lot of data is needed. The researchers have to go through the Archivio di Stato, one of the largest archives in Venice. The city is a very good candidate for this kind of project: its transformations are well documented, there are still visible traces on Venice's buildings, and the original shape of the city has not changed much.

The key point of the project is to align the different historical maps with the present-day one. Once the maps have been created, the researchers have to choose control points to map each historical map onto the current one. This stage requires a lot of knowledge about the city and about cartographic conventions in general.
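The core of this alignment step can be sketched as a least-squares affine fit on the chosen control points. Real georeferencing would normally rely on GIS tooling and richer transforms, so the following Python sketch, with made-up coordinates, only shows the underlying idea.

```python
# Sketch of the map-alignment step: given control points picked on a historical
# map and their counterparts on the present-day map, estimate an affine
# transform by least squares and use it to reproject any historical coordinate.
import numpy as np

def fit_affine(src_pts, dst_pts):
    """src_pts, dst_pts: (N, 2) arrays of matching control points, N >= 3."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    ones = np.ones((len(src), 1))
    X = np.hstack([src, ones])                   # (N, 3): [x, y, 1] rows
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)  # solve X @ A ~= dst
    return A.T                                   # 2x3 affine matrix

def apply_affine(A, pts):
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Toy usage with made-up control points (historical map units -> modern coordinates).
historical = [(10, 12), (40, 15), (22, 48), (35, 40)]
modern     = [(101, 210), (161, 214), (124, 281), (151, 265)]
A = fit_affine(historical, modern)
print(apply_affine(A, [(30, 30)]))  # where a historical point lands on today's map
```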

The key challenge in big data projects is always to find the best way to use all the sources together. The data available about Venice is unique and comes from six different types of sources. The goal of the project is to find relations between all those data points and to deduce historical facts about people, buildings and events across more than 1,000 years.
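The abstract does not describe how those relations are found, so the toy sketch below only illustrates the general idea of linking records across sources: two hypothetical registers are joined when names are similar and places and dates are compatible. All field names, values and thresholds here are assumptions, not the project's actual pipeline.

```python
# Toy illustration of linking records across two hypothetical sources
# (say, a tax register and a parish register) by fuzzy name matching
# plus simple place and date constraints.
from difflib import SequenceMatcher

tax_register = [
    {"name": "Giovanni Contarini", "year": 1510, "parish": "San Polo"},
    {"name": "Marco Bembo", "year": 1512, "parish": "Castello"},
]
parish_register = [
    {"name": "Zuan Contarini", "year": 1511, "parish": "San Polo"},
    {"name": "Marco Bembo", "year": 1540, "parish": "Castello"},
]

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

links = []
for t in tax_register:
    for p in parish_register:
        same_place = t["parish"] == p["parish"]
        close_in_time = abs(t["year"] - p["year"]) <= 5
        if same_place and close_in_time and similar(t["name"], p["name"]) > 0.6:
            links.append((t["name"], p["name"]))

print(links)  # candidate identifications of the same person across the two sources
```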

Analysis and Conclusion

These three abstracts have a lot in common. The first one shows how to handle big datasets: we learned that for each dataset a new visualization design has to be created in order to navigate the data in the most efficient way. The second one is a typical application of this idea in Digital Humanities. Even though visualization is not the goal there, the researchers had to work through every combination of a few parameters to actually find a trend. In big data, we cannot simply apply the same rules and algorithms to every dataset and expect to figure it out.

For each problem and each dataset, there are unique ways to find what we are looking for. The third abstract shows this in practice: the data about Venice is unique, and nobody has ever gone through it entirely. The researchers' work is new, and they cannot rely on anyone else for the answers; they have to find their own way to understand, sort and interpret the data.

In conclusion, working in the big data field is a constant challenge. As seen in the second and third abstracts, Digital Humanities is no exception.


References

  1. John Montague, John Simpson, Geoffrey Rockwell, Stan Ruecker, Susan Brown.
    Exploring Large Datasets with Topic Model Visualizations
  2. Maciej Eder.
    Taking Stylometry to the Limits: Benchmark Study on 5,281 Texts from “Patrologia Latina”
  3. Isabella di Lenardo, Frédéric Kaplan.
    Venice Time Machine: Recreating the density of the past