This post attempts to describe the notion of “Big Data”, first in abstract terms and then through the lens of three practical settings: data visualisation, text mining (information retrieval from texts), and social media mining (information and pattern retrieval from social media data).
There is no exact technical definition of the term “Big Data”. An understanding can nevertheless be reached through Big Data’s three V’s: volume, velocity, and variety. Volume refers to the size of the data; Big Data are data of such size that they cannot be processed in a single machine’s memory. Velocity refers to the latency of data processing relative to the demand for interactivity. Finally, variety refers to the integration of many different data sources; here, Big Data means combining so many different types of data that the integration itself consumes a critical share of the available time.
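To make the volume dimension concrete, here is a minimal Python sketch of out-of-core processing with pandas, assuming a hypothetical events.csv file with a value column: the file is streamed in bounded chunks rather than loaded into memory at once.

```python
import pandas as pd

CSV_PATH = "events.csv"  # hypothetical file, too large for one machine's memory

total_rows = 0
running_sum = 0.0

# Stream the file in fixed-size chunks instead of loading it whole,
# keeping memory usage bounded regardless of the file's total size.
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    total_rows += len(chunk)
    running_sum += chunk["value"].sum()  # "value" is an assumed column name

print(f"rows: {total_rows}, mean value: {running_sum / total_rows:.3f}")
```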
Stylometry is a statistical method that analyses texts to determine their authorship. A natural question is whether a method that works well on a collection of 100 texts can maintain its performance on thousands of texts. The paper by Eder tries to answer that question. The Patrologia Latina (a collection of the writings of the Latin Church Fathers from the 2nd to the 13th centuries) consists of 5,821 texts by over 700 authors. It is an example of an uncleaned large dataset that cannot be inspected manually. Data cleaning is of great importance here: editorial corrections and punctuation introduced by modern scholars, among other sources of noise, make the authorship attribution problem extremely difficult. To analyse the data, the original dataset was preprocessed by excluding texts of fewer than 3,000 tokens and authors represented by fewer than three works. Several supervised classification algorithms were then applied to this dataset, and the results were disappointing: authorship attribution was unexpectedly poor. This poor performance is caused mainly by the sheer size of the dataset, which makes inspection and cleaning extremely difficult.
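As an illustration of that preprocessing step, here is a minimal Python sketch of the two filtering criteria. The corpus representation (a list of (author, text) pairs) and the whitespace tokenisation are assumptions made for this example, not the paper’s actual implementation.

```python
from collections import Counter

def preprocess(corpus):
    """Filter a corpus following the paper's two criteria.

    `corpus` is assumed to be a list of (author, text) pairs;
    the structure and tokenisation here are illustrative only.
    """
    # Keep only texts of at least 3,000 tokens (whitespace splitting
    # stands in for whatever tokeniser the study actually used).
    long_enough = [(author, text) for author, text in corpus
                   if len(text.split()) >= 3000]

    # Keep only authors represented by at least three remaining works.
    works_per_author = Counter(author for author, _ in long_enough)
    return [(author, text) for author, text in long_enough
            if works_per_author[author] >= 3]
```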
Nowadays, social media platforms like Twitter are among the largest producers of big data. As the paper by Hutchinson et al. notes, humanities researchers have developed innovative new techniques and collaborated with computer scientists to analyse these massive datasets. This collaboration led to the development of a Twitter visualisation tool based on Gephi in the NeCTAR cloud environment. NeCTAR is a cloud service that provides sufficient computing power and disk storage to analyse large datasets. Services like NeCTAR usually require the advanced technological skills that computer scientists have, so collaboration is key: a versatile team comprising social scientists and computer scientists lets each group focus on what it does best. The result of this collaboration is a powerful workflow and research tool that facilitates the exploration of complex follower networks on the Twitter platform.
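As a sketch of the kind of data such a workflow might feed into Gephi, the following Python snippet builds a small directed follower graph with networkx and exports it as GEXF, a format Gephi can open directly. The edge list is invented for illustration; this is not the actual tool’s code.

```python
import networkx as nx

# Hypothetical (follower, followed) pairs, standing in for data
# a collection script would have pulled from the Twitter API.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("dave", "bob"),
]

# Follower relations are directed, so use a directed graph.
G = nx.DiGraph()
G.add_edges_from(edges)

# GEXF is a format Gephi reads natively, so the exported file
# can be opened in Gephi for visual exploration.
nx.write_gexf(G, "follower_network.gexf")
```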
John Montague, University of Alberta, Canada; John Simpson, University of Alberta, Canada; Geoffrey Rockwell, University of Alberta, Canada; Stan Ruecker, Illinois Institute of Technology, USA; Susan Brown, University of Alberta, Canada, and University of Guelph, Canada.
Maciej Eder, Pedagogical University, Krakow, Poland, and Institute of Polish Language, Polish Academy of Sciences.
Jonathon Hutchinson, The University of Sydney; Jeremy Hammond, Intersect, Australia; Flora Martin, The University of Sydney; Daniel Yazbek, Intersect, Australia.