
The digitization of text documents, from parchments to books, has been one of the key activities of Digital Humanities. It has opened worldwide access to historical documents spanning long periods and many places. In that context, newspaper digitization has produced massive datasets containing information highly representative of socio-political environments. Indeed, newspapers do more than describe current events: they also reflect the populations they address. Between the lines lies information that could be very useful to many fields of study, and several groups have found ways to extract some of it. This post reviews the findings of three of these groups, based on their abstracts archived for the 2014 Digital Humanities conference.

The first paper [1] focuses on a method called exploratory thematic analysis. The goal of the study is to build topic models from newspaper archives in order to facilitate the thematic exploration of documents. The problem motivating the study is the lack of contextual insight accompanying today’s newspaper archives. Because newspapers have never been completely neutral or unbiased, understanding the historical context as well as the situation of the author or editor is important. Jacob Eisenstein and his colleagues have therefore developed visualization and analysis tools based on topic modeling. For the study they used archives of the newspaper The Anti-Slavery Bugle, published in New Lisbon, Ohio, between 1845 and 1861, and they aim next to apply the tools to metadata.

The second paper [2] investigates identity formation through the content of newspaper archives. Starting from the database of the Dutch National Library, Hieke Huistra and Toine Pieters used a recent text mining tool, Texcavator, to extract information about the public perception of overweight people. They were able to trace how the criteria defining excessive weight in the public mind evolved over time. These findings illustrate how newspapers, because they span long periods, are rich sources of information about public trends and identity formation.
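
To make the idea of tracing a term through time more concrete, here is a minimal sketch in Python. It is not Texcavator's interface and not the authors' method. The function name `term_share_by_decade` and the four sample articles are invented for illustration. It only shows one simple way to measure, decade by decade, what share of articles mention a given word.

```python
import re
from collections import defaultdict


def term_share_by_decade(articles, terms):
    """Return {decade: share of articles mentioning at least one term}.

    `articles` is an iterable of (year, text) pairs. This is a hypothetical
    helper written for this post, not part of any real text mining tool.
    """
    wanted = {t.lower() for t in terms}
    totals = defaultdict(int)  # number of articles per decade
    hits = defaultdict(int)    # articles per decade that mention a term
    for year, text in articles:
        decade = (year // 10) * 10
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        totals[decade] += 1
        if tokens & wanted:
            hits[decade] += 1
    return {d: hits[d] / totals[d] for d in sorted(totals)}


# Invented toy data, not taken from the paper or from any archive.
sample = [
    (1895, "The stout gentleman was admired for his health."),
    (1898, "Prices of grain rose again this week."),
    (1952, "Doctors warn that overweight readers should diet."),
    (1957, "A new diet for the overweight is announced."),
]

shares = term_share_by_decade(sample, ["overweight"])
for decade, share in shares.items():
    print(decade, share)
```

A real study would of course need far more than word counts, such as handling spelling variants, OCR errors and changes in vocabulary, but the sketch shows why archives spanning long periods lend themselves to this kind of question.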

Finally, the third paper [3] presents a methodology for detecting and extracting poems from newspapers. Poems in newspapers have served multiple functions over time, but they have always been representative of the culture. Identifying them automatically therefore makes it possible to gather them in one place and distribute them more widely.

As shown above, the three papers differ considerably in scope. However, each lays the groundwork for new processes that give future researchers tools to efficiently extract information that was previously hard to reach. As stated in [1], “linking technical innovation with real humanistic inquiry” is an activity worthy of attention that could greatly improve research. In that respect, all three papers aim to make research methodology more complete and to help it yield better and broader information. Furthermore, [1] outlines the future steps needed to take these improvements further. According to the authors, these steps will allow an even better understanding of how ideas move through sociopolitical environments over time. We can therefore expect major developments in the knowledge obtained from digitized newspaper archives in the near future.


  1. Eisenstein, J., Sun, I. and Klein, L. (2014). Exploratory thematic analysis for historical newspaper archives.
  2. Huistra, H. and Pieters, T. (2014). Using digitized newspaper archives to investigate identity formation in long-term public discourse.
  3. Lorang, E., Soh, L., Lunde, J. and Thomas, G. (2014). Detection of poetic content in historic newspapers through image analysis.