Most of us know roughly what XML (Extensible Markup Language) is. A short definition would be:

–          « It is a semi-structured document format defined by the W3C that is widely used on the web to transfer information between services. »

Its main strength is that it is readable by both humans and machines, and that it can represent an arbitrary data structure easily. Here is a small example describing a note, of which the receiver and the body are shown:

<?xml version="1.0" encoding="UTF-8" ?>
<note>
  <to>My Friend Paul</to>
  <body>Don’t forget me this weekend!</body>
</note>
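To make the "readable by machines" part concrete, here is a minimal sketch in Python that extracts the fields of such a note with the standard library. The `<note>` root element, the `NOTE_XML` constant and the `read_note` helper are assumptions made for this illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical note document; the <note> root element is an assumption.
NOTE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>My Friend Paul</to>
  <body>Don't forget me this weekend!</body>
</note>"""


def read_note(xml_bytes):
    """Return the note's fields as a dict mapping tag name to text."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


fields = read_note(NOTE_XML)
print(fields["to"])    # My Friend Paul
print(fields["body"])  # Don't forget me this weekend!
```

The program only works because it knows in advance which element names to expect, which is exactly the kind of agreement a standard provides.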

Everything here is clearly structured. To let other people, and especially machines, process the file and extract the information they need, we have to make it comply with a specific standard. That way, everyone who follows the standard can read the document. So it is fair to ask:

–          What is a standard?
–          And what kind of validation do we use?

Every markup language used on the web follows a set of rules. For XML-based and SGML-based languages, these rules can be written down in a description document called a “DTD”, for Document Type Definition. With such a document, you can define your own language and use it to transmit information to anyone. Many markup languages on the web have an associated DTD: HTML 4.01, XHTML 1.0, SVG 1.1 and more. Languages that are not markup, such as CSS or PHP, are specified by other means.

How do you make it work? It’s simple: you provide the DTD along with the document you are sending, either inline or by reference. To be more explicit, let’s follow an example. First of all, we need to know which version of XML we are using. This is specified at the top of the file:

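<?xml version="1.0" encoding="UTF-8" ?>
Here we use version « 1.0 » of XML with the « UTF-8 » encoding, a Unicode encoding that every XML parser is required to support. This line does not name a DTD: it refers to the XML specification itself, published by the W3C (the organization that standardizes these languages and gives their final definition), which fixes the syntax shared by all XML documents and by DTDs. The DTD describes one particular vocabulary built on top of that syntax, and it travels with the document or is referenced from it. With the DTD in our possession, we are now able to interpret the XML file.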
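As a rough illustration of what a program can read at the top of a file, the sketch below uses Python's built-in expat bindings to pull out the XML version, the encoding, and the DTD referenced by a DOCTYPE declaration. The DOCTYPE line, the file name `note.dtd`, the `<note>` element and the `read_prolog` helper are all hypothetical and only serve the example.

```python
import xml.parsers.expat

# Hypothetical document: the DOCTYPE line and "note.dtd" are assumptions.
DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note><to>My Friend Paul</to></note>"""


def read_prolog(xml_bytes):
    """Collect the XML version, the encoding and the referenced DTD."""
    info = {}

    def on_xml_decl(version, encoding, standalone):
        info["version"] = version
        info["encoding"] = encoding

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name      # element the DTD expects at the root
        info["dtd"] = system_id  # where the DTD is said to live

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)
    return info


print(read_prolog(DOCUMENT))
```

Note that this only reports which DTD is referenced; it does not load the DTD or validate anything against it.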

Let’s now go deeper into the DTD. It mainly contains two things:

  • Definitions – which give the element names and attributes that are allowed
  • Content models – which explain how elements appear in the document in relation to each other.

Let’s see how they are used. Here is the general form of an element declaration in a DTD:

<!ELEMENT elementname (contentmodel) >

Note: the content model is a regular expression that determines which other elements are allowed to appear within (or below) the element. It also gives their order and multiplicity.
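Since a content model really is a regular expression over element names, one way to see it is to translate it into an ordinary Python regex and match it against the sequence of child names. This is only a sketch under simplifying assumptions: it handles names, `,`, `|`, `?`, `*`, `+` and parentheses, ignores `#PCDATA`, `EMPTY` and `ANY`, and the sample model and the helper names are made up for the example.

```python
import re

# One token is either an element name or a single operator character.
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|[(),|?*+]")


def content_model_to_regex(model):
    """Translate a DTD element-content model into a compiled Python regex."""
    parts = []
    for token in TOKEN.findall(model):
        if token == "(":
            parts.append("(?:")  # DTD group -> non-capturing group
        elif token == ",":
            continue             # a sequence is plain concatenation
        elif token in ")|?*+":
            parts.append(token)  # same meaning in both notations
        else:
            # Each child name is encoded as "name " (with a trailing space).
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Check whether a list of child element names satisfies the model."""
    encoded = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(encoded) is not None


MODEL = "(to, from?, heading?, body+)"  # hypothetical content model
print(children_match(MODEL, ["to", "body"]))  # True
print(children_match(MODEL, ["body", "to"]))  # False: wrong order
```

The second call fails because the model fixes the order: `to` must come before `body`.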

Some content models are special cases rather than lists of child elements: an element that contains only text (#PCDATA), an empty element (EMPTY), or an element that may contain anything (ANY). For instance, compare ordinary element content with an empty element:

Element content:
– <!ELEMENT example (a) >

Empty element:
– <!ELEMENT example EMPTY>
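To show how regular these declarations are, here is a small sketch that reads element declarations out of a DTD text and returns each element name with its content model. The regular expression is a simplification written for this example (it ignores comments, attribute lists and entities), and the sample DTD and the helper name are hypothetical.

```python
import re

# Simplified pattern for <!ELEMENT name contentmodel>; an assumption of this
# sketch, not a complete DTD grammar.
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+([A-Za-z_][\w.-]*)\s+(EMPTY|ANY|\(.*?\)[?*+]?)\s*>",
    re.DOTALL,
)

SAMPLE_DTD = """
<!ELEMENT example (a) >
<!ELEMENT a EMPTY>
"""


def element_declarations(dtd_text):
    """Map each declared element name to its content model, as written."""
    return {name: model for name, model in ELEMENT_DECL.findall(dtd_text)}


print(element_declarations(SAMPLE_DTD))
# {'example': '(a)', 'a': 'EMPTY'}
```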

If text content occurs together with user-defined elements in the content model, this is called mixed content. Starting from these atomic elements and definitions, one can construct complex, composite content models that add up to a full language.
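In a DTD, mixed content is declared in the form `(#PCDATA | em)*`, where `em` stands for whichever child elements are allowed. The sketch below shows what mixed content looks like once parsed in Python: text runs and child elements interleaved in document order. The `<em>` element and the `mixed_parts` helper are hypothetical, chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical body with mixed content: text around an <em> child element.
body = ET.fromstring("<body>Don't forget <em>me</em> this weekend!</body>")


def mixed_parts(element):
    """List text runs and child elements in document order."""
    parts = []
    if element.text:
        parts.append(("text", element.text))
    for child in element:
        parts.append(("element", child.tag))
        if child.tail:  # text that follows the child, inside the parent
            parts.append(("text", child.tail))
    return parts


for kind, value in mixed_parts(body):
    print(kind, repr(value))
```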

In terms of validation, a document that follows all the rules given in its DTD is called « valid ». To validate a document, you check that it is consistent with the DTD associated with it. The W3C provides several validation tools that help you check the validity of your documents.
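To give an idea of what such a check involves, here is a deliberately tiny sketch in Python. It is not a real DTD validator: the rules are translated by hand from a few assumed declarations, only the names and order of child elements are checked, and the `RULES` table and `validate` function are hypothetical. A real validating parser checks much more (attributes, entities, text content and so on).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules, hand-translated from these assumed declarations:
#   <!ELEMENT note (to, body)>
#   <!ELEMENT to   (#PCDATA)>
#   <!ELEMENT body (#PCDATA)>
# Child names are encoded as "name " so a plain regex can describe them.
RULES = {
    "note": re.compile(r"to body "),
    "to": re.compile(r""),    # no child elements allowed
    "body": re.compile(r""),  # no child elements allowed
}


def validate(xml_text):
    """Return a list of error messages; an empty list means the check passed."""
    errors = []
    for element in ET.fromstring(xml_text).iter():
        rule = RULES.get(element.tag)
        children = "".join(child.tag + " " for child in element)
        if rule is None:
            errors.append(f"<{element.tag}> is not declared")
        elif rule.fullmatch(children) is None:
            found = children.strip() or "none"
            errors.append(f"<{element.tag}> has unexpected children: {found}")
    return errors


print(validate("<note><to>Paul</to><body>Hi</body></note>"))  # []
print(validate("<note><body>Hi</body><to>Paul</to></note>"))
```

The second document is well-formed XML but fails the check, because its children appear in an order the rules do not allow.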

You can find the W3C Markup Validation Service here: http://validator.w3.org/

This is how new languages are created and defined.