
This post describes what the term “Big Data” means, first in the abstract and then through the lens of three practical settings: data visualisation, text mining (information retrieval from texts), and social media mining (information and pattern retrieval from social media data).

There is no exact technical definition of the term “Big Data”. However, it can be understood through the three V’s of Big Data: volume, velocity, and variety. Volume refers to the size of the data: Big Data are data too large to be processed in a single machine’s memory. Velocity refers to the latency of data processing relative to the demand for interactivity. Finally, variety refers to the integration of different data sources: here Big Data means integrating so many different types of data that the integration itself consumes a critical share of the time available.
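To make the volume criterion concrete, here is a minimal sketch in Python of the usual workaround when data do not fit in memory: processing the input as a stream rather than loading it whole. The function name and the sample text are hypothetical, and the in-memory `StringIO` merely stands in for a file too large to read at once.

```python
import io
from collections import Counter
from typing import IO


def count_words_streaming(stream: IO[str]) -> Counter:
    """Count words one line at a time, so only the current line and the
    running tallies are held in memory, never the whole input."""
    counts: Counter = Counter()
    for line in stream:  # file-like objects yield their lines lazily
        counts.update(line.lower().split())
    return counts


# Hypothetical stand-in for a file that is too large to load at once.
sample = io.StringIO("big data is big\ndata about data\n")
word_counts = count_words_streaming(sample)
print(word_counts.most_common(2))  # [('data', 3), ('big', 2)]
```

Note that the tallies still grow with the size of the vocabulary, so streaming only helps when the summary is much smaller than the raw data.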

Topic modelling is a text mining technique that identifies patterns and clusters of words in a corpus. One way to visualise its results, described in paper [1], is a two-step process: the first step runs a command-line tool called MALLET, and its output is passed to a second tool that produces the visualisation. The advantage of this method is that it requires only basic technical skill. The visualisations it produces are traditional charts, network graphs, zoomable tools, and 2D matrices. However, it performs poorly on large corpora of thousands or millions of entries. To overcome this problem, a new tool was designed using the free, open-source JavaScript framework Famo.us. It creates 3D visualisations that let users not only understand the output of the algorithm but also discover new information about the corpus. The visualisation is fully customisable: the user can filter out undesirable features, use layout and perceptual cues, and delve deeper into the data via a zoomable user interface. The visualisations produced by this tool comprise word clouds, histograms, line graphs, and network diagrams.
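As a rough illustration of the filtering step behind a word cloud, the Python sketch below takes the word weights of one topic, drops unwanted words, keeps the heaviest ones, and scales them relative to the largest. This is not the tool from paper [1]; the function name, the topic, and its weights are all invented for the example.

```python
from typing import AbstractSet, Dict, List, Tuple


def word_cloud_entries(
    weights: Dict[str, float],
    n: int,
    exclude: AbstractSet[str] = frozenset(),
) -> List[Tuple[str, float]]:
    """Return the n heaviest words of a topic, scaled so the top word is 1.0.

    Words listed in `exclude` are filtered out before ranking.
    """
    kept = {word: w for word, w in weights.items() if word not in exclude}
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:n]
    if not ranked or ranked[0][1] <= 0:
        return []  # nothing left to draw
    largest = ranked[0][1]
    return [(word, w / largest) for word, w in ranked]


# Hypothetical topic: word -> weight, as a topic model might report it.
topic = {"church": 0.30, "the": 0.45, "bishop": 0.15, "rome": 0.10}
entries = word_cloud_entries(topic, n=2, exclude={"the"})
print(entries)
```

The scaled weight can then drive the font size of each word, so that filtering a word out automatically rescales the rest.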

Stylometry is a statistical method that analyses a text to determine its author. An open question is whether a method that works well on a collection of 100 texts can maintain its performance on thousands of texts. Paper [2] tries to answer that question. The Patrologia Latina (a collection of the writings of the Latin Church Fathers from the 2nd to the 13th century) consists of 5,821 texts by over 700 authors. It is an example of a large, uncleaned dataset that cannot be inspected manually. Data cleaning is of great importance here, because editorial corrections and punctuation introduced by modern scholars, along with other issues, add noise to the dataset and make authorship attribution extremely difficult. For the analysis, the original dataset was preprocessed to exclude texts with fewer than 3,000 tokens and authors of fewer than three works. Many supervised classification algorithms were then applied to this dataset. The results were disappointing: authorship attribution was unexpectedly poor. This poor performance is attributed mainly to the sheer size of the dataset, which made data inspection and cleaning extremely difficult.
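The preprocessing rule described above can be sketched in a few lines of Python. This is an illustration only, not the code used in paper [2]: the `Text` record is hypothetical, tokens are approximated by whitespace splitting, and the order of the two filters (short texts first, then authors with too few remaining works) is an assumption.

```python
from collections import Counter
from typing import List, NamedTuple


class Text(NamedTuple):
    author: str
    title: str
    body: str


def preprocess(texts: List[Text], min_tokens: int = 3000, min_works: int = 3) -> List[Text]:
    """Drop short texts, then drop authors left with too few works."""
    # Assumption: a "token" is a whitespace-separated word.
    long_enough = [t for t in texts if len(t.body.split()) >= min_tokens]
    works_per_author = Counter(t.author for t in long_enough)
    return [t for t in long_enough if works_per_author[t.author] >= min_works]


# Toy corpus with tiny thresholds so the effect is visible.
corpus = [
    Text("A", "a1", "one two three four"),
    Text("A", "a2", "one two three four five"),
    Text("A", "a3", "one two"),        # too short, dropped
    Text("B", "b1", "one two three"),
    Text("B", "b2", "one two three four"),
    Text("C", "c1", "one two three"),  # author has only one work, dropped
]
kept = preprocess(corpus, min_tokens=3, min_works=2)
print([t.title for t in kept])  # ['a1', 'a2', 'b1', 'b2']
```

Applying the filters in the opposite order could keep an author whose works are mostly too short, which is why the order matters and is flagged as an assumption here.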

Social media platforms such as Twitter are now among the largest producers of big data. As described in paper [3], humanities researchers developed new techniques and collaborated with computer scientists in order to analyse these massive datasets. This collaboration led to the development of a Twitter visualisation tool based on Gephi in the NeCTAR cloud environment. NeCTAR is a cloud service that provides sufficient computing power and disk storage to analyse large datasets. Services like NeCTAR usually require advanced technical skills of the kind computer scientists have, so collaboration is key. A versatile team comprised of social scientists and computer scientists lets each group focus on what it does best. The result of this collaboration is a powerful workflow and research tool that facilitates the exploration of complex follower networks within the Twitter platform.
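A follower network is, at its simplest, a list of directed edges from follower to followed account. The Python sketch below counts followers per account and writes the edges as a CSV edge list; it is not the workflow from paper [3]. The account names are invented, and the `Source`/`Target` header reflects the column names Gephi's spreadsheet import is assumed to recognise.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str]  # (follower, followed)


def follower_counts(edges: List[Edge]) -> Counter:
    """Number of followers per account, i.e. the in-degree of each node."""
    return Counter(followed for _, followed in edges)


def edges_to_csv(edges: List[Edge]) -> str:
    """Serialise the edges as a CSV edge list with a Source,Target header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed Gephi column names
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical accounts.
edges = [("alice", "carol"), ("bob", "carol"), ("carol", "alice")]
counts = follower_counts(edges)
edge_csv = edges_to_csv(edges)
print(counts.most_common(1))  # [('carol', 2)]
print(edge_csv)
```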



[1] Exploring Large Datasets with Topic Model Visualizations.
John Montague, University of Alberta, Canada; John Simpson, University of Alberta, Canada; Geoffrey Rockwell, University of Alberta, Canada; Stan Ruecker, Illinois Institute of Technology, USA; Susan Brown, University of Alberta, Canada; University of Guelph, Canada.

[2] Taking Stylometry to the Limits: Benchmark Study on 5,821 Texts from “Patrologia Latina”.
Maciej Eder, Pedagogical University, Krakow, Poland; Institute of Polish Language, Polish Academy of Sciences.

[3] Social Media Data: Twitter Scraping on NeCTAR.
Jonathon Hutchinson, The University of Sydney; Jeremy Hammond, Intersect, Australia; Flora Martin, The University of Sydney; Daniel Yazbek, Intersect, Australia.