
Two topics that appear frequently on technology and other news sites are information retrieval from datasets (such as documents, images or other media) and the use of linked data. Information retrieval allows you to discover new aspects of existing data and draw further conclusions. Linked data is equally important because it allows you to process data from various sources, understand their connections and retain the ability to add more information in the future.

In order to compare and contrast the papers correctly, it is necessary to give a brief summary of each one, with an emphasis on the two topics mentioned earlier.

The first paper [1] of interest is “Accidental Discovery, Intentional Inquiry: Leveraging Linked Data to Uncover the Women of Jazz”, which uses linked data to better understand the role of women in jazz throughout history. To achieve this, the project works mainly with interview transcripts, to which it applies various parsing and extraction techniques, producing a dataset of names and connections. Creating the list of proper names proved especially difficult: it drew on information from various sources and resulted in URIs, each of which maps to a list of names (real name, stage names, nicknames, and more). During this process, it quickly came to light that the ratio of men to women was heavily skewed in favor of men. This unexpected finding was the starting point of an effort to better represent gender in the linked-data dataset being created. Besides gender, many other properties, such as the instrument played, needed to be added using the various tools for parsing and cross-referencing data. To do this, the project made use of the Resource Description Framework (RDF), which allowed it to add linked metadata to the original dataset and enabled further work and discovery on the data.
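
To make the idea of one URI standing for several name variants more concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the `example.org` namespace, the person "Jane Example" and the helper names `person_triples` and `to_ntriples` are all assumptions made for illustration. The only real-world pieces are the FOAF properties `name`, `nick` and `gender`, and the N-Triples line format.

```python
# Hypothetical namespace for illustration; not the URIs used in paper [1].
BASE = "http://example.org/jazz/"

# Properties from the FOAF vocabulary.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"
FOAF_GENDER = "http://xmlns.com/foaf/0.1/gender"


def person_triples(slug, names, nicknames=(), gender=None):
    """Return (subject, predicate, literal) triples for a single person URI."""
    subject = BASE + slug
    triples = [(subject, FOAF_NAME, name) for name in names]
    triples += [(subject, FOAF_NICK, nick) for nick in nicknames]
    if gender is not None:
        triples.append((subject, FOAF_GENDER, gender))
    return triples


def to_ntriples(triples):
    """Serialize triples with plain literal objects as N-Triples lines.

    Only backslashes and double quotes are escaped, which is enough for
    simple names but not for arbitrary text.
    """
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    # "Jane Example" is a made-up person, not data from the paper.
    triples = person_triples(
        "jane-example",
        names=["Jane Example", "Jane E. Sample"],
        nicknames=["JJ"],
        gender="female",
    )
    print(to_ntriples(triples))
```

Every name variant points back to the same subject URI, so a later step can attach further properties (an instrument, for instance) to that URI without caring which spelling appeared in a given transcript.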

The second paper [2], “Creating Linked Data within Archival Description: Tools for Extracting, Validating, and Encoding Access Points for Finding Aids”, deals with adding semantic tagging to archival descriptions in a mostly automated way. This is done by parsing the finding aids and pinpointing named entities and terms of interest, using semantic analysis services to process these entities and terms, cross-checking the names of the extracted entities and, finally, using Uniform Resource Identifiers (URIs) to validate the encoded entities and topics. Furthermore, the research team plans to incorporate further capabilities: automating the procedure so that most steps do not require manual intervention, linking the newly created data with other data sources, and taking advantage of those sources to reinforce the quality of the semantic tags.
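
The extract, cross-check and validate steps described above can be sketched roughly as follows. This is an illustrative toy and not the tooling from the paper: a simple regular expression stands in for the semantic analysis services, the `AUTHORITY` table is a made-up stand-in for a real name authority file, and all names and URIs are hypothetical.

```python
import re

# Hypothetical authority file: normalized name -> URI (illustrative only).
AUTHORITY = {
    "jane example": "http://example.org/authority/n0001",
}

# Naive stand-in for entity extraction: two or more capitalized words in a row.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_candidates(text):
    """Return candidate names in order of appearance, without duplicates."""
    seen = []
    for match in NAME_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def validate(candidates, authority=AUTHORITY):
    """Split candidates into names with a URI and names needing manual review."""
    matched, unmatched = {}, []
    for name in candidates:
        uri = authority.get(name.lower())
        if uri:
            matched[name] = uri
        else:
            unmatched.append(name)
    return matched, unmatched


if __name__ == "__main__":
    finding_aid = "Correspondence between Jane Example and John Placeholder, 1950."
    matched, unmatched = validate(extract_candidates(finding_aid))
    print("validated:", matched)
    print("needs review:", unmatched)
```

The list of unmatched names is where the manual intervention mentioned in the paper would come in: anything the authority lookup cannot confirm is left for a person to check.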

The third paper [3], “Sounding it Out: The Mariposa Folk Festival and a Linked Open Data Digital Library Prototype”, focuses on the problem that non-textual materials such as music or video are, in the context of libraries, analyzed and processed in a similar manner to text-based items, missing out on the extra information these media carry. This problem can be overcome by using Linked Open Data (LOD): instead of static documents, an entire interconnected system can be created, encompassing the relations of each item and providing structured metadata for each link. By taking advantage of the potential offered by Semantic Web technologies, the proposed system will expose a set of new relationships and contextual information that is currently unknown and unused in library catalogs. This rich encoding also allows the data to interact with other data sources, allowing for stronger and more flexible search systems. The problem with using standardized systems, despite the advantages they offer, is that they force the encoded data to follow a fixed schema, discarding data that does not fit. Music, the subject of this paper, is not easily represented in textual or image format, which makes extracting metadata harder than with the classic models used for texts and images. Furthermore, music also involves complex interpersonal relationships, since an entire team can be required to create a song, and these relationships are currently not exploited in library systems. To create such a system and implement the first prototypes, the conceptual models and the respective datasets must be created first. Following the construction of these conceptual models, the team reinforces them using standard schemas and ontologies, and only then can it move on to the prototypes, which allow new insights into the technical space and the theoretical ideas of the information.
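
As a rough illustration of what an interconnected system of items and relations could look like, the sketch below stores a few statements as subject, predicate, object triples and queries them. The identifiers, the predicate names and the helpers `match` and `collaborators` are all invented for this example; they are not taken from the paper's data model or from any standard ontology.

```python
# Hypothetical identifiers and predicates, for illustration only.
TRIPLES = [
    ("recording:1", "performedAt", "festival:example-1970"),
    ("recording:1", "hasPerformer", "person:a"),
    ("recording:1", "hasPerformer", "person:b"),
    ("recording:1", "hasSoundEngineer", "person:c"),
    ("recording:2", "hasPerformer", "person:a"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def collaborators(triples, person):
    """Return the other people linked to any item this person is linked to."""
    items = {s for s, _, o in triples if o == person}
    return sorted(
        {o for s, _, o in triples
         if s in items and o.startswith("person:") and o != person}
    )


if __name__ == "__main__":
    print(match(TRIPLES, p="hasPerformer"))
    print(collaborators(TRIPLES, "person:a"))
```

A flat catalog record would typically list only a title and a main performer, whereas here the sound engineer is reachable through the same recording, which is the kind of relationship the paper argues is currently lost.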

From all of the papers, it is clear that by using linked data we can greatly enrich the quality of the information provided for a given dataset, especially regarding the relationships the objects have among themselves and with other data. This ability to infuse data with relationships makes further research easier, since all connections are already available and easier to spot, and it also provides a system for adding more data and relations as they arise. Another interesting factor is the ability to search the data intelligently and efficiently, using the linked data metadata tags to better filter the results. Also, by using linked data, the insight gained by these approaches can be added to other datasets, and other datasets can likewise be incorporated into the dataset of interest, allowing new and wider connections to be made that were previously not possible or not known.

The other aspect of the papers is that the current models of parsing, extracting and analyzing data are in many cases constrained to a few types of media, namely text-based documents and images, and that even for text-based documents they often fail to correlate the data correctly. This one-dimensional approach greatly limits the data available and misses out on important connections. This can be clearly seen in the paper dealing with music, where exploring the connections involved in creating, playing and recording a song revealed many different people and many relationships. Likewise, in the case of the finding aids and the analysis of women in jazz, we see that a different approach to data extraction, using multiple sources and combining the data found, creates a new dataset with much richer metadata and many new relationships between the data.

It is important to note that all these cases show that, when the right systems of analysis, extraction and model creation are used for a given problem, it is possible to create a much richer dataset, filled with metadata and with relations both within the data itself and to other connected datasets.

In closing, I would like to focus on one final thought: the underlying problem in all these cases is that they require manual calibration for each problem, rather than relying on a global system that can be automated. It remains to be seen whether such an automated system can ever be created. Paper [3] shows that the use of standardized systems for linked data gives us the benefit of compatibility with other datasets using the same system, but may restrict the freedom to express the data in the desired way, hindering the benefit of linked data itself. Nevertheless, linked data and data extraction are currently under heavy research, and given enough time the systems should become even more sophisticated and allow for continuously better representation of the data.


[1] Accidental Discovery, Intentional Inquiry: Leveraging Linked Data to Uncover the Women of Jazz,
M. Cristina Pattuelli, Pratt Institute, United States of America; Matthew Miller, New York Public Library, United States of America; Karen Hwang, Pratt Institute, United States of America.
[2] Creating Linked Data within Archival Description: Tools for Extracting, Validating, and Encoding Access Points for Finding Aids,
Karen F. Gracy, School of Library and Information Science, Kent State University, United States of America; Marcia Lei Zeng, School of Library and Information Science, Kent State University, United States of America.
[3] Sounding it Out: The Mariposa Folk Festival and a Linked Open Data Digital Library Prototype,
Stacy Allison-Cassin, York University, Canada; MJ Suhonos, Ryerson University, Canada; Nick Ruest, York University, Canada; Anna St. Onge, York University, Canada.