
Crowd-sourcing data

The Internet has enabled people to communicate and share more than ever. Some tools let people collaborate on and contribute to the same project, even if they have no connection to one another. Two famous projects, Wikipedia and OpenStreetMap, rely heavily on this philosophy to provide free access to knowledge, following this recipe:

  1. Provide meaningful information
  2. Encourage people to take part in the project
  3. Let everyone add or edit data

Those projects rely on their large number of visitors, the crowd, to create the data, trusting that most people will make relevant contributions that move the project forward rather than damage it. In addition, providing an edit history makes it easy to roll back to an earlier, non-vandalized version.

Crowd-sourcing the work for Digital Humanities

Revealing links

The Linked Jazz project [1] tried to reveal the links between jazz artists by studying written documents, mostly interviews. The graph obtained by this automated analysis is relevant, but only to an extent: it shows that two artists are linked, not how. The project wants a deeper understanding of those links.

This relationship could be anything from close friendship and collaboration to just knowing the other person exists.

Pattuelli, M. Cristina [1]

Solving this issue is a real challenge because the machine doing the analysis would need to actually understand what is written, which is harder than simply counting occurrences of words in a corpus of texts.
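To make the "simply counting" side concrete, here is a minimal Python sketch of co-mention counting. It is not the Linked Jazz pipeline: the function name `co_mentions`, the artist names and the sample transcripts are all invented for illustration, and name matching is a plain substring test.

```python
from collections import Counter
from itertools import combinations


def co_mentions(transcripts, artists):
    """Count how many transcripts mention both artists of each pair."""
    counts = Counter()
    for text in transcripts:
        lowered = text.lower()
        # Naive matching: an artist is "present" if the name appears verbatim.
        present = sorted(name for name in artists if name.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented sample data (hypothetical artists and interview excerpts).
transcripts = [
    "I first met Ada Blue when Ben Cole brought her to the club.",
    "Ben Cole? I only knew him by name.",
    "Ada Blue and Ben Cole recorded together that year.",
]
artists = ["Ada Blue", "Ben Cole"]

links = co_mentions(transcripts, artists)
print(links)
```

The count says that the two names appear together in two transcripts, but nothing about whether they were friends, collaborators, or mere acquaintances.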

That is where this project has involved crowd-sourcing: computers do a first analysis, and the results are sharpened by human analysis. A web application was built for users, from jazz enthusiasts to academic jazz scholars, so they could choose the kind of relationship between two jazz artists and how relevant this relationship is.

Image from the abstract of the Linked Jazz project
The result of this project is freely available on their website, with a rendering of the entire network linking jazzmen.
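As an illustration of how such human judgments could be combined, here is a small Python sketch that keeps the most frequent relationship label per pair of artists. This is an assumption about one possible aggregation rule, not the method used by Linked Jazz; `aggregate_votes`, the `min_votes` threshold, the labels and the names are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_votes(votes, min_votes=2):
    """Return {pair: (majority_label, share_of_votes)} for well-covered pairs.

    votes is an iterable of (artist_a, artist_b, relationship) tuples.
    """
    by_pair = defaultdict(Counter)
    for a, b, relationship in votes:
        pair = tuple(sorted((a, b)))  # the order of the two names is irrelevant
        by_pair[pair][relationship] += 1

    consensus = {}
    for pair, counter in by_pair.items():
        total = sum(counter.values())
        if total < min_votes:
            continue  # too few judgments to trust this pair yet
        label, count = counter.most_common(1)[0]
        consensus[pair] = (label, count / total)
    return consensus


# Invented votes from hypothetical contributors.
votes = [
    ("Ada Blue", "Ben Cole", "collaborator"),
    ("Ben Cole", "Ada Blue", "collaborator"),
    ("Ada Blue", "Ben Cole", "friend"),
    ("Ada Blue", "Cy Dunn", "knows of"),
]

consensus = aggregate_votes(votes)
print(consensus)
```

The share of votes gives a rough idea of how much contributors agree, and pairs with too few judgments are simply left out until more people have answered.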

Massive digitizing

The second project, presented at the 2013 international conference of the Alliance of Digital Humanities Organizations, is crowd-sourcing the Medici Archives [2]. Its goals are the following:

  1. digitizing around four million handwritten letters
  2. publishing digital images of these historical documents
  3. providing a transcription of each document
  4. having an English synopsis of each document

It is quite similar to other projects, but with some interesting differences:

Unlike the above projects, however, the Medici Archive Project faces specific challenges based on the following:

  • Size of collection: ca. four million handwritten letters
  • High level of technical expertise: paleography, language, historical training
  • Varying languages, nationalities, and cultural backgrounds of community

Allori, Lorenzo [2]

To solve this issue, they restrict the “crowd” to a smaller subset of scholars with a strong background in the related domains, with a ranking among users. Also, there are no anonymous or pseudonymous contributions, as every user has to be registered and use their real name. In doing so, the project wants to reach a higher level of expertise by involving known individuals and fact-checking.

This approach is close to that of the Google Knol service, which failed. Its aim was to compete with Wikipedia by providing a reliable encyclopedia, with the reward of having the author's name linked to each of their articles. After seeing that it was a failure, Google closed the service. This probably happened because Wikipedia lowers the entry barrier, letting everyone contribute, from fixing typing mistakes anyone can find to writing very high-level explanations only an expert can provide. The workload is distributed more evenly over a bigger base of users.

As the Medici Archives has no competitor in this field, it cannot fail, but I think opening contributions to everyone could speed up the process without degrading the final quality, as shown by this recent publication about Wikipedia (read Replies to common objections → Trustworthiness for sourced details). For example, anonymous contributions could be accepted as drafts for more experienced reviewers, instead of just being forbidden.
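The draft idea can be sketched as a tiny workflow in Python. This is only an illustration of the suggestion, not anything the Medici Archive Project implements; the `Contribution` class, the status names and the `submit` and `review` functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contribution:
    document_id: str
    transcription: str
    author: Optional[str] = None  # None means an anonymous contributor
    status: str = "draft"


def submit(contribution, registered_scholars):
    """Publish work from registered scholars; hold everything else as a draft."""
    if contribution.author in registered_scholars:
        contribution.status = "published"
    else:
        contribution.status = "draft"
    return contribution


def review(contribution, reviewer, registered_scholars, accept):
    """Let a registered scholar accept or reject a pending draft."""
    if reviewer not in registered_scholars:
        raise PermissionError("only registered scholars can review drafts")
    if contribution.status != "draft":
        raise ValueError("only drafts can be reviewed")
    contribution.status = "published" if accept else "rejected"
    return contribution


# Hypothetical usage: an anonymous transcription waits for an expert review.
scholars = {"scholar_1"}
anonymous = submit(Contribution("doc-001", "draft transcription text"), scholars)
status_before_review = anonymous.status
review(anonymous, "scholar_1", scholars, accept=True)
print(status_before_review, "->", anonymous.status)
```

With such a split, experts keep the final word on what gets published, while the crowd can still take on part of the transcription workload.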

Studying crowd-sourcing

The last excerpt, from Lynne Siemens [3], explains that not many studies have analyzed how to use crowd-sourcing for an academic project while respecting a time schedule, a budget and a quality standard.

The definition given matches what we have seen previously with the first project:

[…] there must be an organization, a particular task to be completed […], and a community, comprised of both experts and novices, which is willing to do the work for little or no money. These interactions are facilitated through an internet-based platform. The interested organization must make a series of decisions regarding the type of expertise, qualification and/or knowledge required, the presence of a contributors, the mechanisms by which they will participate and contribute, project remuneration, motivators to keep participants engaged, and quality control mechanisms

Siemens, Lynne [3]

For the second project, the missing part is the presence of novices, as it is a community-sourced project. It is a choice they made as part of their “quality assurance” policy.

The team involved in the study contacted several crowd-sourced projects, trying to list all the challenges such projects face. It turns out that, on top of handling the infrastructure needed, there is a lot of work in community management, from setting up a proper workflow to keeping the crowd enthusiastic. Combining the results of this study with the experience of the Medici Archives project might open new horizons for mass digitization in larger projects, such as the Venice State Archive.


Crowd-sourcing data for Digital Humanities is really interesting because it eases the processing of data which is not mathematically exact [3]. For example, the first project presented [1] tried to find links between jazzmen, which it did rather easily. Understanding those links is harder, however, which is why they crowd-sourced this analysis, enabling manual processing of large amounts of data quickly, cheaply and reliably.

For the second project [2], they restricted the crowd-sourcing to a narrower community-sourcing of information coming from identified individuals. In a way, they chose a slower process, hoping for a higher level of data quality. As there is no competitor, this can’t be checked, but previous similar works showed no major differences between “the crowd” and “a panel of experts”.


  1. Linked Jazz 52nd Street: A LOD Crowdsourcing Tool to Reveal Connections among Jazz Artists
  2. Opening Aladdin’s cave or Pandora’s box? The challenges of crowdsourcing the Medici Archives
  3. The Crowdsourcing Process: Decisions about Tasks, Expertise, Communities and Platforms