History of XML

'In 1969 . . . researchers at IBM developed a generalised markup language that dealt with the problem of multiple, incompatible markup languages. Essentially, a generalized markup language would consist of tools needed to define a document type and the requisite tags. Anyone who wanted to mark up a document could look to the document type definition (DTD) (a formal document specifying what tags are allowed) to learn how to perform the markup correctly.'[1]

From this beginning, ANSI created a committee in 1978 to develop a standard based on GML. The result was SGML (Standard Generalized Markup Language). SGML provides a way of defining the structure of a document through a DTD (Document Type Definition). 'A DTD specifies the structure of a document type as well as the tags that can be used to mark this structure.'[2] The assumption is made that every document has a hierarchical structure. Each node or branch can have none, one or several sub-branches coming from it, which makes the tree structure very flexible.

The tags used to define the structure also carry semantic meaning. Just as a user would be unlikely to label the subdirectory folders in a directory tree as folder1, folder2, etc., but would rather give them meaningful names, so SGML uses tags that provide information about the data contained within them. The tags mean something. Search engines can use these tags to provide more relevant information to users accessing the document via the Internet. However, SGML is not widely used or known because it is not an easy language to master. Researchers and developers are amongst the few to have made any use of it. One such researcher was Tim Berners-Lee who, in 1990, developed the first version of HTML.

Current use of HTML

HTML is currently the de facto 'language' of the WWW. It is easy to learn and gives those without computing skills the ability to produce visually impressive websites. To exchange information across a WAN, such as the Internet, a document must be text-based. HTML, like SGML, is text-based. This is one of the major keys to its success. Different users use a variety of computers, ranging from Macs to PCs. Each of these has its own proprietary word processors, etc., using incompatible formats. However, all of these computers can read plain text and (best of all) because HTML files are written in plain text you don't even need any special programs to create them. You can do it all in Notepad! [In fact there are many programs currently available (e.g. FrontPage, Claris HomePage, DreamWeaver, etc.) to help you write HTML, but these just take the drudgery out of the work to a greater or lesser extent.]

As mentioned above, HTML was built for information exchange. Since the users wanting to exchange information were not computing experts, the 'language' had to be simple to use and forgiving if the syntax was not followed exactly. The tags in HTML are very reminiscent of the formatting codes used by the word processors of the 1980s.

<b> means turn on bold and </b> means turn off bold.

The tags, therefore, had some meaning. What they did not have was semantic meaning. The tags in HTML are primarily concerned with presentation. While this made presenting information easier (and more tags were introduced separately by Microsoft and Netscape to achieve greater functionality at the presentation level), it also made it difficult for search engines to find out anything meaningful about a document.[3] For instance, an <H1> tag in HTML is the same whether it marks a heading in a scientific paper or on a pornographic site. Search engines instead rely on keywords typed separately from the main document in a meta tag. This means that the document itself cannot be queried, just the list of words supplied in the meta tag. In HTML the contents of a page are inseparable from the presentation aspects of that page. In order to achieve greater searchability, content and presentation should be separated. This was partially achieved through the use of CSS (Cascading Style Sheets), which took out most of the presentation aspects of HTML. However, CSS did not add any meaning to the content. In order to achieve the greatest flexibility, the content and presentation of a page needed to be completely separate. This was one of the original aims of GML.

Summary: problems of HTML

Simplicity: HTML's simplicity was its strength at its inception. However, users now want to access the structure of their documents and HTML is too simple for this requirement.

New tags: You can't invent your own tags. You must use the predefined tags that your browser supports, and these may differ from the tags supported by the browser of a person visiting your page.

Searching: Since HTML has no semantic labels, it cannot provide much information to search engines.

The solution to these problems was to 'bring to the Internet a simplified version of SGML, one that would enable people to create their own markup languages.'[4] This solution is called XML (eXtensible Markup Language).

Current use of XML

XML started in 1996 when the W3C formed a working committee to discuss the problems of SGML and the opportunities offered by the Internet. XML 1.0 became a W3C Recommendation in 1998. XML, like SGML, is essentially a tool for creating markup languages. However, unlike SGML, it is easy to use. As Liberty and Kraley say, 'The idea [of XML] is to maintain the 50% of SGML that is used 90% of the time.'[5] It must be remembered that XML tags are not concerned with how to render the data. XML aims to mark up text or data semantically so that the tags give information about the contents.

<b> in HTML tells you that the text is bold.
but <employee> gives you more information about the text.

Of course, in order to tag a document semantically it must be possible to create new tags. XML allows this. XML has to be processed and checked for well-formedness (i.e. that it conforms to the 'syntax' rules of XML) before it can be displayed. However, it does not have to be valid. That is, it does not have to conform to a DTD (Document Type Definition). In fact you do not even need to create a DTD in order to display XML.
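The difference between well-formed and valid can be sketched in a few lines of Python. This is only an illustration, not part of the original examples: it assumes Python's standard xml.etree.ElementTree parser, which checks syntax but does not validate against a DTD, and the helper name is_well_formed and the sample <employee> documents are invented for the purpose.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text obeys the syntax rules of XML (hypothetical helper)."""
    try:
        ET.fromstring(text)  # parses the text; raises ParseError on a syntax error
        return True
    except ET.ParseError:
        return False


# Invented tags and no DTD at all: still well-formed, because every
# element is closed and properly nested.
good = "<employee><name>Ann</name><dept>Sales</dept></employee>"

# <name> is closed with </dept>, so the nesting rules are broken.
bad = "<employee><name>Ann</dept></employee>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first document is accepted even though no DTD exists for <employee>, while the second is rejected purely on syntax.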

In many ways XML markup of the structure of a document is rather like creating a database. But XML has far more flexibility than a database since it is based on a tree structure. This can be seen by looking at an XML document in IE5. Example (Note: At present only Microsoft has a browser capable of reading XML.) You should note that the structure can be collapsed down to the 'root', and expanded again, by clicking the + and - symbols.
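For readers without IE5 to hand, the same tree can be seen with a short script. The following is a rough sketch using Python's standard library; the <staff> document and the outline function are made up for illustration. Note that one employee has two <phone> branches and the other has none, which a fixed database row would not allow so easily.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: branches may have none, one or several sub-branches.
doc = """<staff>
  <employee>
    <name>Ann</name>
    <phone>555-0100</phone>
    <phone>555-0101</phone>
  </employee>
  <employee>
    <name>Bob</name>
  </employee>
</staff>"""


def outline(element, depth=0):
    """Return the tag names of element and its descendants, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an Element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(doc)
print("\n".join(outline(root)))
```

The printed outline starts at the root element, staff, and indents each level of the tree by two spaces.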

Since XML is not concerned with presentation, another mechanism for displaying the document is needed. XML can be displayed by using CSS (Cascading Style Sheets) or XSL (Extensible Stylesheet Language). Now that the content is separate from the presentation it becomes possible to display the XML document in many different ways without changing the XML file. All the examples linked from the Implementation section are based on the same XML document.

This flexibility of display is not the only benefit that XML offers. Its greatest benefit is that it can be generated dynamically. For instance, a user could query a database and get the results back in XML, which is then displayed in the browser using XSL. An example of this type of application is shown on Microsoft's site: Here
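The first half of that pipeline, turning query results into XML, might look something like the sketch below. It is an assumption-laden illustration rather than the Microsoft example: it uses Python's built-in sqlite3 and xml.etree.ElementTree modules, and the books table, its columns and the query_to_xml function are all invented here.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_to_xml(conn, min_price):
    """Run a query and return the result rows as an XML string (hypothetical helper)."""
    rows = conn.execute(
        "SELECT title, price FROM books WHERE price >= ? ORDER BY title",
        (min_price,),
    )
    root = ET.Element("books")
    for title, price in rows:
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "title").text = title
        ET.SubElement(book, "price").text = f"{price:.2f}"
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


# An in-memory database with made-up data stands in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("XML Basics", 20.0), ("HTML Primer", 9.5)],
)

xml_text = query_to_xml(conn, 10)
print(xml_text)
```

Only the book priced at 10 or more appears in the generated XML. A stylesheet would then decide how that XML is shown to the user.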

Another advantage of XML over HTML is that the information in the XML document does not have to be displayed in the same order as it is written. In HTML if a part of a document is not required for any reason, it must be commented out or removed entirely. In XML this can be achieved by the XSL simply not referring to the respective node. Data from the XML document can be selected (by using some criteria or conditions) either through XSL itself or by using a scripting language. XSL example
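The same idea can be sketched in a scripting language. The example below uses Python rather than XSL, so it is a stand-in for the technique and not the XSL example itself; the <staff> document and the sales_listing function are invented. It selects nodes by a condition, outputs fields in a different order from the source, and omits <salary> simply by never referring to it.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: name comes first, then dept, then salary.
doc = """<staff>
  <employee><name>Ann</name><dept>Sales</dept><salary>30000</salary></employee>
  <employee><name>Bob</name><dept>IT</dept><salary>35000</salary></employee>
  <employee><name>Cy</name><dept>Sales</dept><salary>28000</salary></employee>
</staff>"""


def sales_listing(xml_text):
    """Return 'dept: name' lines for employees in Sales only."""
    root = ET.fromstring(xml_text)
    lines = []
    for emp in root.findall("employee"):
        if emp.findtext("dept") == "Sales":  # the selection condition
            # dept is output before name, reversing the source order;
            # salary is never referenced, so it is never displayed
            lines.append(f"{emp.findtext('dept')}: {emp.findtext('name')}")
    return lines


print(sales_listing(doc))
```

The XML file itself is untouched; only the code that reads it decides what is shown and in what order.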

Future use of XML

XML will provide the means for more specialised mark-up languages. Many have already been submitted to W3C for consideration. These range from VML for producing graphics and CML (Chemical Markup Language) for producing chemical symbols and formulas to MathML for producing mathematical formulas and MNML (Music Notation Markup Language) for producing manuscript music. The range of applications to which XML could be put is enormous.

XML will soon be making a difference to the book trade. Currently many bookshops use EDI (Electronic Data Interchange) to achieve electronic trading with their suppliers and customers. Many EDI providers are now considering XML as the format for their information exchange. 'The main advantage of XML over traditional EDI formats is that a single XML message can contain sample text and image content, if desired, as well as bibliographical and commercial data.'[6]

Footnotes
[1] Pfaffenberger, pages 24-5
[2] Pfaffenberger, page 27
[3] Powell, Thomas A, page 51
[4] Pfaffenberger, page 51
[5] Liberty & Kraley, page 11
[6] Cave, Francis, XML and the book trade - a view from the bridge from Latest Edition (The Newsletter from First Edition), March 2000, Issue no. 8
