For the last several decades, most systems designed to store, retrieve and manage datasets have followed the relational (tabular) data model. The relational model was first proposed by E. F. Codd in his famous 1970 paper, and for a long period it was the dominant strategy for implementing database management systems. However, with the start of the “Big data” era, people from all areas of science, including Digital Humanities, ran into limitations caused by large data sets: it is difficult to work with big data using traditional systems and models, the relational model among them. Nowadays many systems implement other data structures (key-value, document, graph) in order to provide more flexible and scalable tools.

One such database system is SylvaDB, implemented by researchers at the CulturePlex Lab at Western University, Canada. The essential objective behind SylvaDB was to overcome the problems of data storage, management and analysis that arise in the Digital Humanities context. Data infrastructure is an important factor in being able to run advanced and complex analysis algorithms in Digital Humanities. However, creating and managing this infrastructure is not a trivial task for non-technical people, more specifically for non-programmers. SylvaDB addresses this limitation by providing a user-friendly interface and easy-to-use functionality.

SylvaDB uses Neo4j as its underlying technology. The set of functionalities that SylvaDB provides is what humanities researchers look for to overcome the challenges caused by uncertain, messy and noisy data. SylvaDB offers a set of tools for creating a flexible data model that can be modified or changed very easily, which is very hard in classical relational database systems. Sometimes our data is highly interconnected, which makes the computational complexity of calculations very high. SylvaDB uses efficient graph algorithms to support native graph queries that could traverse millions of vertices and edges in milliseconds.
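To make the idea of a flexible graph data model more concrete, here is a minimal in-memory sketch in Python. It is not SylvaDB's or Neo4j's API: the `PropertyGraph` class, its methods and the sample nodes are all hypothetical names chosen for illustration. It only shows two things mentioned above: nodes can carry any properties without a fixed schema, and following connections is a traversal rather than a chain of table joins.

```python
from collections import deque


class PropertyGraph:
    """Tiny illustrative property graph (hypothetical, not a real library)."""

    def __init__(self):
        self.nodes = {}  # node id -> dict of free-form properties
        self.edges = {}  # node id -> list of (relationship label, target id)

    def add_node(self, node_id, **properties):
        # No fixed schema: every node may carry a different set of properties.
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def reachable(self, start, max_depth):
        """Breadth-first traversal: ids reachable from start within max_depth hops."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the requested depth
            for _label, target in self.edges[node]:
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
        seen.discard(start)
        return seen


# Illustrative sample data only.
graph = PropertyGraph()
graph.add_node("cervantes", kind="Person", name="Miguel de Cervantes")
graph.add_node("quixote", kind="Work", title="Don Quixote")
graph.add_node("madrid", kind="Place")
graph.add_edge("cervantes", "WROTE", "quixote")
graph.add_edge("quixote", "PUBLISHED_IN", "madrid")

# Changing the model later is just adding a property; no table migration.
graph.nodes["madrid"]["country"] = "Spain"

one_hop = graph.reachable("cervantes", 1)
two_hops = graph.reachable("cervantes", 2)
```

Here `one_hop` contains only `"quixote"`, while `two_hops` also contains `"madrid"`. A real graph database adds persistence, indexing and a query language on top of this basic structure.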

Another very interesting project, GeoModelText, was created to explore how texts can be represented in computer systems. GeoModelText is a database system in the context of hypertext systems.

The most widely used text models are graph-based, linear and hierarchical. There are plenty of systems designed primarily to work with one of these models. GeoModelText was created to generate all three models at the same time. GeoModelText was fully implemented in Java. Everything is stored using a hierarchical markup system, XML. However, the hierarchical nature of XML sometimes causes difficult problems. GeoModelText was specially designed so that it can bypass the underlying XML storage whenever needed: it keeps XML as its internal information structure and, depending on needs, inserts new non-hierarchical modules.
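As a rough illustration of how one XML source can yield all three text models, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It is not GeoModelText's implementation (which is written in Java). The sample text, the `to_tree` and `to_edges` helpers, the path-style node ids and the `contains`/`next` edge labels are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = "<text><p>Call me <name>Ishmael</name>.</p><p>Some years ago.</p></text>"
root = ET.fromstring(SAMPLE)

# Linear model: the text as one flat character sequence, markup dropped.
linear = "".join(root.itertext())


# Hierarchical model: nested (tag, children) tuples mirroring the XML tree.
def to_tree(element):
    return (element.tag, [to_tree(child) for child in element])


hierarchical = to_tree(root)


# Graph model: "contains" edges follow the hierarchy, while "next" edges
# link sibling elements and so add a non-hierarchical relation.
def to_edges(element, node_id="0"):
    edges = []
    previous = None
    for index, child in enumerate(element):
        child_id = f"{node_id}.{index}"
        edges.append((node_id, "contains", child_id))
        if previous is not None:
            edges.append((previous, "next", child_id))
        previous = child_id
        edges.extend(to_edges(child, child_id))
    return edges


graph_edges = to_edges(root)
```

The `next` edge between the two paragraphs is exactly the kind of relation that a strict tree cannot express by itself, which is why a graph view is useful alongside the XML hierarchy.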

In the previous paragraphs we discussed how different internal storage schemas affect overall system design. Digital Humanities researchers do not only work with a single database instance; mostly, they need to interconnect different logical and geographical data sources. These connected databases make higher-dimensional data querying and analysis possible. The main functionality of the Heurist project is to make this interconnection easier. Heurist's creators decided to concentrate on the entities themselves rather than on the internal structures of the database system. They realized that a general set of descriptions and fields can end up covering all entities; this set is called a database profile. Heurist uses the database profile as an agreement between interconnected databases. Of course, some specific properties of entities will not be included in the database profile. But by leaving out such specific fields we gain productivity and come close to an optimal solution. Heurist already has two successful integrations in big systems.
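The idea of a shared profile can be sketched in a few lines of Python. This is only an illustration of the principle, not Heurist's actual data model or API: the profile fields, the two sample archives, their field names and the `to_profile` helper are all invented for the example. Each database keeps its own schema and only supplies a mapping from its local field names to the agreed profile fields. Anything outside the profile is simply left out.

```python
# Hypothetical shared profile: the fields every connected database agrees on.
PROFILE_FIELDS = ("title", "creator", "year")


def to_profile(record, field_map):
    """Project a local record onto the shared profile.

    field_map maps local field names to profile field names;
    local fields without a mapping are dropped.
    """
    projected = {field: None for field in PROFILE_FIELDS}
    for local_name, value in record.items():
        profile_name = field_map.get(local_name)
        if profile_name in projected:
            projected[profile_name] = value
    return projected


# Two made-up sources with different local schemas.
archive_a = [{"name": "Letter 12", "author": "A. Writer", "written": 1820, "shelfmark": "MS 4"}]
archive_b = [{"titel": "Karte 7", "urheber": "B. Zeichner", "jahr": 1795, "massstab": "1:5000"}]

MAP_A = {"name": "title", "author": "creator", "written": "year"}
MAP_B = {"titel": "title", "urheber": "creator", "jahr": "year"}

# Once projected, records from both sources can be queried together.
merged = [to_profile(r, MAP_A) for r in archive_a] + [to_profile(r, MAP_B) for r in archive_b]
before_1800 = [r["title"] for r in merged if r["year"] < 1800]
```

In this sketch `before_1800` ends up as `["Karte 7"]`. The source-specific fields (`shelfmark`, `massstab`) are lost in the merged view, which is the tradeoff described above: less detail in exchange for simple cross-database queries.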

In general, all the systems we have seen involve some kind of tradeoff between the functionalities they provide. Graph systems have a great capability to work with big data and map-like structures. However, they are only applicable to a narrow set of tasks. In contrast, XML provides great flexibility for all kinds of problems, but its disadvantage is that hierarchical models of representation are limited in comparison to the relational model. We don’t have a general model or implementation that supports all Digital Humanities demands; the optimal way to create a storage mechanism is to identify the project requirements and decide according to their priorities.

References:

  1. http://www.informatik.uni-trier.de/~ley/db/journals/cacm/Codd70.html
  2. http://dharchive.org/paper/DH2014/Workshops-904.xml
  3. http://dharchive.org/paper/DH2014/Paper-639.xml
  4. http://dharchive.org/paper/DH2014/Poster-781.xml