To improve interoperability and make textual information easier to search and exchange, it helps to agree on a representation standard that everyone follows. One such standard underlies both search and exchange: XML. Short for Extensible Markup Language, XML is used to encode documents in a way that can be easily communicated between different people or software applications. Encoding here simply means augmenting the text with “markup” that identifies what each portion is. For example, the following fragment of an XML file indicates that we have a book with the title “Great Book”:

 
<book>
<title> Great Book </title>
</book>
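
To see why this markup is easy for software to consume, here is a minimal sketch in Python that reads the title back out of the fragment above. It uses the standard library module xml.etree.ElementTree; the names BOOK_XML and read_title are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the post, stored as a string (BOOK_XML is an illustrative name).
BOOK_XML = """
<book>
<title> Great Book </title>
</book>
"""


def read_title(xml_text: str) -> str:
    """Return the text of the <title> child of the root element."""
    root = ET.fromstring(xml_text)  # root is the <book> element
    title = root.find("title")      # first direct child named <title>
    if title is None or title.text is None:
        raise ValueError("no <title> element found")
    # Whitespace around the text is part of the element content, so trim it.
    return title.text.strip()


print(read_title(BOOK_XML))  # prints: Great Book
```

The program never has to guess which part of the text is the title; the markup says so explicitly.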

In this post, two standards used with XML for text documents like books and articles are presented. The first is Dublin Core, maintained by the Dublin Core Metadata Initiative (DCMI), a metadata standard that simplifies searching in archives and databases. The second is TEI, an encoding standard that simplifies the exchange of documents.

For searching, metadata is quite useful because it allows each object, in this case a document, to describe itself. There is no need to “open” each document to search for information; a quick look at the metadata will do. However, left to themselves, publishers will each use whatever terms they think best describe their documents. For instance, one publisher might use “writer” and another “author” to refer to the content creator, which makes the job of a search engine more difficult.
To overcome this problem, the Dublin Core Metadata Initiative (DCMI) defines a “core” set of metadata terms such as title, language, and publisher; the full set is listed in the DCMI Metadata Terms specification. Dublin Core can be used for any resource, such as photographs, and not only text documents. Moreover, it can be used alongside other metadata from more specific contexts.
One way to add Dublin Core metadata is with XML, though other options such as RDF are also available. The following is an example of Dublin Core expressed in XML:

<dc:title>Great Book</dc:title>
<dc:language>en</dc:language>
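
As a rough sketch of how shared terms simplify search, the Python snippet below filters a small catalogue by dc:language without looking at the documents themselves. The wrapper elements records and record, the sample entries, and the function name titles_in_language are assumptions made for this illustration; only the dc: terms and the namespace URI http://purl.org/dc/elements/1.1/ come from Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, bound to the "dc" prefix.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical catalogue: <records> and <record> are illustrative wrappers,
# and the entries are invented sample data.
RECORDS_XML = """
<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title>Great Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:language>en</dc:language>
  </record>
  <record>
    <dc:title>Grand Livre</dc:title>
    <dc:creator>Jean Dupont</dc:creator>
    <dc:language>fr</dc:language>
  </record>
</records>
"""


def titles_in_language(xml_text: str, language: str) -> list[str]:
    """Return the dc:title of every record whose dc:language matches."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.findall("record"):
        # Only the metadata is inspected; the document itself is never opened.
        if record.findtext("dc:language", namespaces=DC_NS) == language:
            titles.append(record.findtext("dc:title", default="", namespaces=DC_NS))
    return titles


print(titles_in_language(RECORDS_XML, "en"))  # prints: ['Great Book']
```

Because every record uses dc:creator instead of a publisher-specific “writer” or “author”, the same query logic works across all of them.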

After having used the metadata to find the required resource, we may want the resource itself, so the format is now the focus. The Text Encoding Initiative (TEI) is a consortium that has developed a set of guidelines, the TEI Guidelines, for encoding text. Basically, TEI defines markup for the features usually found in text, such as sentences, paragraphs, line breaks, and punctuation. Interestingly, having markup at these different granularities enables us to perform text analysis. Furthermore, because a TEI document is XML, it can be converted to other formats such as HTML or PDF. The following shows an example from a TEI document with a paragraph:

<p>
This is a simple <hi>example</hi>.
</p>

In addition, it shows a highlighted word, as indicated by the <hi> element. The full set of elements is listed in the TEI Guidelines.
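
To illustrate the kind of text analysis this markup enables, here is a small Python sketch that recovers the plain text of the paragraph and lists its highlighted words. It assumes a well-formed fragment with a closing </hi> tag and no namespace; complete TEI documents normally declare the TEI namespace, which this sketch leaves out for brevity. The names TEI_FRAGMENT, plain_text, and highlighted are made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified TEI-style paragraph (no namespace, written on one line).
TEI_FRAGMENT = "<p>This is a simple <hi>example</hi>.</p>"


def plain_text(xml_text: str) -> str:
    """Return the text of the fragment with all markup removed."""
    root = ET.fromstring(xml_text)
    # itertext() yields the text pieces in document order, including
    # text inside nested elements and text that follows them.
    return "".join(root.itertext())


def highlighted(xml_text: str) -> list[str]:
    """Return the text of every <hi> element in the fragment."""
    root = ET.fromstring(xml_text)
    return [hi.text or "" for hi in root.iter("hi")]


print(plain_text(TEI_FRAGMENT))   # prints: This is a simple example.
print(highlighted(TEI_FRAGMENT))  # prints: ['example']
```

The same document therefore serves both purposes: stripped of markup it is readable text, and with markup it can be queried for specific features.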

To sum up, Dublin Core is a metadata standard for any resource and can be expressed in XML, while TEI is a text encoding standard based on XML. Using both of them facilitates the search and transfer of information.
