
Throughout history, lacking electronic means of storing information, people were forced to write everything down in order to keep track of activities of interest. Tons of journals, account ledgers, bills, receipts, and policies from the past have survived to this day. One might think this is a lot of unnecessary information that cannot be used. On the contrary, we can use these documents to obtain important information about the customs, relationships, and events of a specific community, or beyond.

Astonishing growth in document digitization has taken place in the last few years, and it is very important for pursuing the idea given above. There is a tendency to convert the whole literary legacy, books, maps, and documents of all kinds, to digital form by scanning. Big projects like the "Venice Time Machine" are starting all over the world in order to make that huge amount of information available to people through the web and digital libraries.

Thus we will have access to large amounts of official data of this kind. The crucial point is how to interpret these documents accurately, both in the sense of encoding and of text analysis. We will discuss some articles that use these techniques.

In the article "Encoding Historical Financial Records" [1], the authors propose a way of encoding old financial documents using generalized TEI markup. In the second article [2], the authors deal with understanding slavery in Washington, DC by analyzing documents issued after the April 1862 Compensated Emancipation Act, under which slave owners could claim compensation for freed slaves; the study also tries to estimate the present-day equivalent of the slaves' assessed values. The third article [3] describes the analysis of Old Assyrian merchants' letters; by analyzing them, it tries to infer the social rank among merchants.

The first two articles use TEI markup to encode the documents of interest, while the third does not deal with encoding at all and concerns only text analysis. In the slavery documents from the second article, TEI markup is used to identify the names of people and places, important dates, the assessed values of the slaves, and places left blank in the document. All the analyzed documents have the same form, so the TEI schema developed here is very specific: it targets well-defined pieces of information. The TEI markup used for the financial records in the first article aims to be more general because of the higher variety in the form of those documents. It includes three levels of data: layout, textual expression, and financial semantics [1]. It is therefore more robust and able to handle different kinds of financial records that do not necessarily share the same form. Both approaches are well reasoned, and together they split the encoding problem into two strategies: an exact one, when we know the form of the document and search for exact information, and a general one, when we need a deeper understanding of a document in order to be robust and extract the desired information, which can vary in formal shape.
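To make the "exact" strategy concrete, here is a minimal sketch of how a uniformly structured record might be tagged and then mined. The element names (persName, placeName, date, measure) are real TEI elements, but the fragment itself and all the names and values in it are invented for illustration; this is not taken from either project's actual markup.

```python
import xml.etree.ElementTree as ET

# Invented TEI-style fragment illustrating the "exact" encoding strategy:
# each well-defined field of a compensation record gets its own element.
record = """
<record>
  <persName role="owner">John Smith</persName>
  <persName role="enslaved">Harriet</persName>
  <placeName>Washington, DC</placeName>
  <date when="1862-05-01">May 1, 1862</date>
  <measure type="value" unit="USD">500</measure>
</record>
"""

root = ET.fromstring(record)

# Because every record has the same form, extraction is a direct lookup.
owners = [e.text for e in root.findall("persName[@role='owner']")]
values = [int(e.text) for e in root.findall("measure[@type='value']")]

print(owners)  # ['John Smith']
print(values)  # [500]
```

Because the documents share one fixed form, such a script can rely on the tags being present in every record; the more general three-level scheme of [1] would instead have to describe layout and semantics separately.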

The third article, on the other hand, concerns analyzing useful data rather than obtaining it. It tries to derive the structure of the trade network by analyzing orders and contracts between merchants: it builds a network of individual merchants and ranks them by their importance in the trade network. However, many mistakes arise here because of the unconventional addressing between merchants: one merchant can have several different nicknames, and several merchants can share the same name, so the model faced many inconsistencies. This is one example of how hard it is to derive cleanly classified information in a big system. The proposal in the article is to observe smaller instances of the system instead, and later to combine them into the global outline that fits best: for example, to observe pairs of merchants and rank them against each other, then combine the pairwise rankings into the most consistent global one.
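One simple way to combine pairwise judgments into a global ranking is a Copeland-style score (wins minus losses over all observed pairs). This is only a sketch of the general idea, not the article's actual method, and the merchant names and pairs below are invented for illustration.

```python
from collections import defaultdict

# Invented pairwise judgments: (higher-ranked, lower-ranked) merchant pairs,
# as might be inferred from who gives orders to whom in the letters.
pairs = [
    ("Ashur-idi", "Pushuken"),
    ("Ashur-idi", "Imdilum"),
    ("Pushuken", "Imdilum"),
    ("Imdilum", "Buzazu"),
    ("Pushuken", "Buzazu"),
]

# Copeland-style aggregation: score each merchant by (wins - losses)
# across all pairwise comparisons, then sort to get a global ranking.
score = defaultdict(int)
for winner, loser in pairs:
    score[winner] += 1
    score[loser] -= 1

ranking = sorted(score, key=score.get, reverse=True)
print(ranking)  # ['Ashur-idi', 'Pushuken', 'Imdilum', 'Buzazu']
```

The appeal of working pairwise is that each local judgment stays meaningful even when names are ambiguous; conflicting or cyclic judgments simply lower the scores involved rather than breaking the whole ranking.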

To conclude, there is a big variety of encoding and text-analysis techniques; here we have highlighted some that seem promising for future work. As noted, the upcoming mass digitization will force these techniques to develop rapidly over the next years, and hopefully we will soon be able to infer important conclusions by analyzing formal texts.


[1] Encoding Historical Financial Records
[2] Expanding the Interpretive and Analytical Possibilities for Understanding Slavery and Emancipation in Washington, DC
[3] Inferring Social Rank in an Old Assyrian Trade Network