XML: The eXtensible Markup Language

XML is a markup language. Markup is something added to a text to specify semantics and presentation.


To understand what XML is and does, you need to understand what is meant by syntax, semantics and presentation in the wider context.

Syntax
Rules for combining symbols; in linguistics, rules for combining words, as opposed to rules for building words (morphology).
Semantics
The purpose or intended use of an element or piece of text, such as being a section, a quotation, a date, a warning etc.
Presentation
How an element looks on a screen or a piece of paper: size and weight of its characters, indenting, background colour and so on. In XML this is usually achieved by referencing a separate file called a stylesheet.

In XML, text is marked up by placing runs of text between tags. The preceding tag is called an opening tag, and the succeeding tag is called a closing tag. A third kind of XML tag is the standalone tag (officially, an empty-element tag), of which we'll see just one example.

XML is not a programming language. It serves two purposes: writing documents, and storing and retrieving information.


Look at the following example.

<p>a <dfn>cheetah</dfn> is a large feline of Africa and SW Asia, the swiftest mammal, having very long legs, non-retractile claws, and a black-spotted, light-brown coat.</p>

I have used HTML tag names in this example. (HTML's XML serialization, XHTML, is the most common XML vocabulary.) The whole text is a paragraph, placed between one opening and one closing tag whose name is p (for paragraph; other vocabularies settle for longer names, like para, which require more typing).

Please note the punctuation marks: if you misplace a single <, > or /, the document will no longer be well-formed, and XML parsers will refuse to process it.
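To see how unforgiving a parser is about punctuation, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser. The helper name is_well_formed and the shortened cheetah sentences are made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<p>a <dfn>cheetah</dfn> is a large feline.</p>"
# Same text, but the / is missing from the closing dfn tag.
bad = "<p>a <dfn>cheetah<dfn> is a large feline.</p>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

A single missing slash turns the closing tag into a second opening tag, so the parser reaches </p> with two dfn elements still open and gives up.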

Also, the word cheetah appears between dfn tags to denote that it is the term being defined (semantics). That piece of information could prove useful for searching the document (it must contain the definition of cheetah!) as well as for presenting the text suitably by, say, italicizing all defined terms or typesetting all paragraphs in a specific manner (presentation).

Some other, non-XML formats make do without tags, such as the JavaScript Object Notation.

There are many other kinds of marks in XML besides tags. Look at the following example.

<!DOCTYPE html>
<html>
  <head>
    <title>the c++ programming language</title>
    <meta charset="UTF-8"/>
    <link rel="stylesheet" type="text/css" href="stylesheet.en.css"/>
  </head>
  <body>
    <h1>The C++ Programming Language</h1>
    <p>C++ is a general-purpose programming language developed as an extension of C.</p>
    [...]

...

This document seems to contain , ...

Please note the syntax in the line <meta charset="UTF-8"/>: it is a single standalone tag, ending in />!

Actually, XML proper should start with an XML declaration like

<?xml version="1.0" encoding="UTF-8"?>

Some of this stuff might be dismissed as paperwork, a kind of bureaucracy, but most of it plays a meaningful rôle.

The good news is that I don't need to remember all this whenever I write a new document: I just copy it off another document. Besides, if you write a web page in HTML, the browser will still show your page properly even if you make some minor mistakes.

Consider this. If you want to write a story and upload it to the Web, you just need two tags: h1 for the title, and p for a paragraph. Maybe you want to italicize some word later on (i tag). Most writing does not require many more tags.

The Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML comprises The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards.

On Being a Metalanguage: eXtensible

XML per se does not define a set of tags with their own semantics and presentation expectations. That is why my presentation of XML may have felt a bit too abstract.

As already said, XML...

Some XML languages have already been defined, besides HTML.

Copy and Knowledge Representation

XML is good for two purposes:

Writing
From literature to technical articles, arts reviews, reports etc.
Encoding Information
As long as it is human-readable, you can use a format of your own designing for representing data so that another programme can process it easily.
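As a sketch of the second purpose, the following Python snippet writes a small record as XML and reads it back, assuming the standard library's xml.etree.ElementTree. The record format (book, title, year, category) is invented for this example and is not any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical record format: these element and attribute names are
# made up for illustration.
book = ET.Element("book", {"category": "fiction"})
ET.SubElement(book, "title").text = "Moby-Dick"
ET.SubElement(book, "year").text = "1851"

# Serialize to text that a person can read and another programme can parse.
serialized = ET.tostring(book, encoding="unicode")
print(serialized)

# Another programme reads the same text back into structured data.
restored = ET.fromstring(serialized)
title = restored.findtext("title")
year = int(restored.findtext("year"))
print(title, year, restored.get("category"))
```

Note that everything comes back as text; turning "1851" into a number is the reading programme's job.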

Most DBMSs (database management systems) can store XML in fields (cells in their rows), and there are at least two open-source XML-oriented DBMSs: eXist-db and BaseX. They use the XQuery language (instead of good old SQL) to, well, query XML documents.

Searching and Transforming

As long as a document follows the XML syntax, it can be transformed into another document in a different XML vocabulary or even into a non-XML document, such as a JSON document, TeX/LaTeX for high-quality typesetting, Markdown etc.

XML is one of the few text formats with a dedicated, standardized language for such comprehensive transformations: eXtensible Stylesheet Language Transformations, or XSLT.
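To give a feel for transforming XML into a non-XML format, here is a minimal sketch that turns a flat XML record into JSON. It uses plain Python (xml.etree.ElementTree and json) rather than a stylesheet language, and the note document and the helper children_to_dict are made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input document, invented for this example.
NOTE = """<note>
  <to>Alice</to>
  <from>Bob</from>
  <body>See you at noon.</body>
</note>"""

def children_to_dict(xml_text: str) -> dict:
    """Map each child element's name to its text (flat documents only)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

as_json = json.dumps(children_to_dict(NOTE))
print(as_json)
```

This only handles one level of nesting without repeated names; a real transformation would need rules for attributes, repeated elements and mixed content.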

XML Namespaces

XML Namespaces provide a method to avoid element name conflicts. In XML, element names are defined by the developer. This often results in a conflict when trying to mix XML documents from different XML applications.


Name conflicts in XML can easily be avoided using a name prefix.

This XML carries information about an HTML table, and a piece of furniture:

<h:table>
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table>
  <f:name>African Coffee Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

The xmlns Attribute

When using prefixes in XML, a namespace for the prefix must be defined.

The namespace can be defined by an xmlns attribute in the start tag of an element.

The namespace declaration has the following syntax: xmlns:prefix="URI".

<root>

<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table xmlns:f="https://www.w3schools.com/furniture">
  <f:name>African Coffee Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

</root>

In the example above:

  • The xmlns attribute in the first <table> element gives the h: prefix a qualified namespace.
  • The xmlns attribute in the second <table> element gives the f: prefix a qualified namespace.

When a namespace is defined for an element, all child elements with the same prefix are associated with the same namespace.


Namespaces can also be declared in the XML root element:

<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="https://www.w3schools.com/furniture">

<h:table>
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table>
  <f:name>African Coffee Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

</root>
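Here is a small sketch of how a parser sees those prefixes, assuming Python's standard xml.etree.ElementTree. The document is a shortened version of the one above; the prefixes html and furn in the lookup table are my own choice, to show that only the URIs matter.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="https://www.w3schools.com/furniture">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table><f:name>African Coffee Table</f:name></f:table>
</root>"""

root = ET.fromstring(DOC)

# ElementTree replaces each prefix with its URI, written as {uri}localname.
tags = [child.tag for child in root]
print(tags)

# The prefixes used for searching are local to this script;
# they need not match the h: and f: used in the document.
ns = {"html": "http://www.w3.org/TR/html4/",
      "furn": "https://www.w3schools.com/furniture"}
cells = [td.text for td in root.findall("html:table/html:tr/html:td", ns)]
name = root.findtext("furn:table/furn:name", namespaces=ns)
print(cells, name)
```

The two table elements end up with different full names, so a programme can tell them apart without guessing.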

Uniform Resource Identifier (URI)

A Uniform Resource Identifier (URI) is a string of characters which identifies an Internet Resource.

The most common URI is the Uniform Resource Locator (URL) which identifies an Internet domain address. Another, not so common type of URI is the Uniform Resource Name (URN).

Default Namespaces

Defining a default namespace for an element saves us from using prefixes in all the child elements. It has the following syntax:

xmlns="namespaceURI"

This XML carries HTML table information:

<table xmlns="http://www.w3.org/TR/html4/">
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

and this XML carries information about a piece of furniture:

<table xmlns="https://www.w3schools.com/furniture">
  <name>African Coffee Table</name>
  <width>80</width>
  <length>120</length>
</table>
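The same check works for default namespaces. This sketch, again assuming Python's xml.etree.ElementTree and using shortened versions of the two documents above, shows that the unprefixed child elements inherit the namespace too.

```python
import xml.etree.ElementTree as ET

html_table = ET.fromstring(
    '<table xmlns="http://www.w3.org/TR/html4/"><tr><td>Apples</td></tr></table>'
)
furniture = ET.fromstring(
    '<table xmlns="https://www.w3schools.com/furniture">'
    '<name>African Coffee Table</name></table>'
)

# Both roots are written as <table>, yet their full names differ.
same_name = html_table.tag == furniture.tag
print(same_name)  # False

# The child <tr> carries no prefix but still belongs to the default namespace.
child_tag = html_table[0].tag
print(child_tag)  # {http://www.w3.org/TR/html4/}tr
```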

Namespaces in Real Use

XSLT (eXtensible Stylesheet Language Transformations) is an XML-syntax language that can be used to transform XML documents into other formats.

The XML document below is a stylesheet used to transform XML into HTML.

The namespace "http://www.w3.org/1999/XSL/Transform" identifies XSLT elements inside an HTML document:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
<html>
<body>
  <h2>My CD Collection</h2>
  <table border="1">
    <tr>
      <th style="text-align:left">Title</th>
      <th style="text-align:left">Artist</th>
    </tr>
    <xsl:for-each select="catalog/cd">
    <tr>
      <td><xsl:value-of select="title"/></td>
      <td><xsl:value-of select="artist"/></td>
    </tr>
    </xsl:for-each>
  </table>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

The Components of an XML Document

XML Prolog (Optional but recommended)

Defines the version and encoding (e.g., <?xml version="1.0" encoding="UTF-8"?>).

DTD (Optional)

Defines the document structure, elements, and attributes.

Root Element

The single, topmost tag enclosing all other content.

Child Elements

Nested elements within the root.

Attributes

Key-value pairs providing additional info about an element (e.g., <book category="fiction">).

Comments
<!-- comment -->
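The components listed above can be picked out of a parsed document. This is a minimal sketch assuming Python's standard xml.dom.minidom; the library element and its content are invented for the example.

```python
from xml.dom import minidom

# Hypothetical document containing a prolog, a root element,
# a comment, a child element and an attribute.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<library>
  <!-- comment -->
  <book category="fiction"><title>Moby-Dick</title></book>
</library>"""

doc = minidom.parseString(DOC)
root = doc.documentElement                       # root element
book = root.getElementsByTagName("book")[0]      # child element
category = book.getAttribute("category")         # attribute value
comments = [node.data for node in root.childNodes
            if node.nodeType == node.COMMENT_NODE]

print(root.tagName)   # library
print(category)       # fiction
print(comments)       # [' comment ']
```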

The Key Structural Rules are:

The Doctype

First, we have the Document Type Declaration, or doctype. This is simply a way to tell the browser — or any other parser — what type of document it's looking at. In the case of HTML files, it means the specific version and flavor of HTML. The doctype should always be the first item at the top of any HTML file. Many years ago, the doctype declaration was an ugly and hard-to-remember mess. For XHTML 1.0 Strict:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

And for HTML4 Transitional:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

Although that long string of text at the top of our documents hasn't really hurt us (other than forcing our sites' viewers to download a few extra bytes), HTML5 has done away with that indecipherable eyesore. Now all you need is this:

<!doctype html>

Simple, and to the point. The doctype can be written in uppercase, lowercase, or mixed case. You'll notice that the "5" is conspicuously missing from the declaration. Although the current iteration of web markup is known as "HTML5," it really is just an evolution of previous HTML standards — and future specifications will simply be a development of what we have today.

Because browsers are usually required to support all existing content on the Web, there's no reliance on the doctype to tell them which features should be supported in a given document. In other words, the doctype alone is not going to make your pages HTML5-compliant. It's really up to the browser to do this. In fact, you can use one of those two older doctypes with new HTML5 elements on the page and the page will render the same as it would if you used the new doctype.
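For the curious, a doctype can also be read back by a programme. The sketch below assumes Python's standard xml.dom.minidom, which is an XML parser, so it only applies to pages written as well-formed XHTML; the tiny page wrapped around the XHTML 1.0 Strict doctype is made up for the example.

```python
from xml.dom import minidom

# A minimal, hypothetical XHTML page using the XHTML 1.0 Strict doctype.
PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test</title></head>
  <body><p>Hello</p></body>
</html>"""

doctype = minidom.parseString(PAGE).doctype

print(doctype.name)      # html
print(doctype.publicId)  # -//W3C//DTD XHTML 1.0 Strict//EN
print(doctype.systemId)  # http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
```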

Entities

Entities are placeholders in XML. You declare an entity in the document prolog or in a DTD, and you can refer to it many times in the document. Different types of entities have different uses. You can substitute characters that are difficult or impossible to type with character entities. You can pull in content that lives outside of your document with external entities. And rather than type the same thing over and over again, such as boilerplate text, you can instead define your own general entities.

An entity consists of a name and a value. When an XML parser begins to process a document, it first reads a series of declarations, some of which define entities by associating a name with a value. The value is anything from a single character to a file of XML markup. As the parser scans the XML document, it encounters entity references, which are special markers derived from entity names. An entity reference consists of an ampersand (&), the entity name, and a semicolon (;). For each entity reference, the parser consults a table in memory for something with which to replace the marker. It replaces the entity reference with the appropriate replacement text or markup, then resumes parsing just before that point, so the new text is parsed too. Any entity references inside the replacement text are also replaced; this process repeats as many times as necessary.

Entities may be:

single character: either predefined or numeric

The predefined character entities are lt, gt, apos, quot and amp, for less than (<), greater than (>), apostrophe ('), quote ("), and ampersand (&). These five characters are reserved in XML syntax.

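And here are some common characters with their HTML entity names and numeric character references. (The names are defined by HTML, not by XML itself; in plain XML only the numeric form works unless the name is declared in a DTD.)

Character   Description            Name      Numeric
            non-breaking space     &nbsp;    &#160;
¢           cent                   &cent;    &#162;
£           pound                  &pound;   &#163;
¥           yen                    &yen;     &#165;
€           euro                   &euro;    &#8364;
©           copyright              &copy;    &#169;
®           registered trademark   &reg;     &#174;

Internal entities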

Internal mixed-content entities are most often used to stand in for oft-repeated phrases, names, and boilerplate text. Not only is an entity reference easier to type than a long piece of text, but it also improves accuracy and maintainability, since you only have to change an entity once for the effect to appear everywhere.

This is an example of an internal entity:

<!ENTITY bobco "Bob's Bolt Bazaar, Inc.">

External entities
An external entity is an entity whose replacement text exists in another file.
Unparsed Entities
An unparsed entity holds content that should not be parsed because it contains something other than text or XML and would probably confuse the parser.
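Going back to the single-character entities at the top of this list, here is a minimal sketch of how they behave in practice, assuming Python's standard xml.sax.saxutils.escape and xml.etree.ElementTree. The sample strings are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters
# are added through the optional mapping.
escaped = escape('Tom & "Jerry" < 3', {'"': "&quot;", "'": "&apos;"})
print(escaped)  # Tom &amp; &quot;Jerry&quot; &lt; 3

# The parser turns the predefined entities back into characters.
restored = ET.fromstring(f"<p>{escaped}</p>").text
print(restored)  # Tom & "Jerry" < 3

# Numeric character references need no declaration at all.
symbols = ET.fromstring("<p>&#8364; &#169;</p>").text
print(symbols)  # € ©

# A name such as nbsp is not predefined in XML, so without a DTD
# declaring it the parser rejects the document.
try:
    ET.fromstring("<p>&nbsp;</p>")
    nbsp_accepted = True
except ET.ParseError:
    nbsp_accepted = False
print(nbsp_accepted)  # False
```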

Here are two instances of entity declaration:

<!DOCTYPE book [
  <!ENTITY nwalsh "Norman Walsh">
  <!ENTITY chap1 SYSTEM "chap1.xml">
]>

And this is how to use them. To include file chap1.xml, just type: &chap1; where you want it in your container file.
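The replacement process can be watched at work with a short sketch, assuming Python's standard xml.etree.ElementTree, which expands internal entities declared in the document itself. The second entity, credit, is my own addition to show an entity reference nested inside replacement text; the external entity chap1 is left out because it would need a real file.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE book [
  <!ENTITY nwalsh "Norman Walsh">
  <!ENTITY credit "Edited by &nwalsh;">
]>
<book><para>&credit;</para></book>"""

# &credit; is replaced first, then the &nwalsh; inside its
# replacement text is replaced in turn.
text = ET.fromstring(DOC).findtext("para")
print(text)  # Edited by Norman Walsh
```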

External Entities

Sometimes you may need to create an entity for such a large amount of mixed content that it is impractical to fit it all inside the entity declaration. In this case, you should use an external entity, an entity whose replacement text exists in another file. External entities are useful for importing content that is shared by many documents, or that changes too frequently to be stored inside the document. They also make it possible to split a large, monolithic document into smaller pieces that can be edited in tandem and that take up less space in network transfers.

External entities effectively break a document into multiple physical parts. However, all that matters to the XML processor is that the parts assemble into a perfect whole. That is, all the parts in their different locations must still conform to the well-formedness rules. The XML parser stitches up all the pieces into one logical document; with the correct markup, the physical divisions should be irrelevant to the meaning of the document.

External entities are a linking mechanism. They connect parts of a document that may exist on other systems, far across the Internet. The difference from traditional XML links (XLinks) is that for external entities the XML processor must insert the replacement text at the time of parsing.

External entities must always be declared so the parser knows where to find the replacement text. In the following example, a document declares the three external entities &part1;, &part2;, and &part3; to hold its content:

<?xml version="1.0"?>
<!DOCTYPE longdoc SYSTEM "http://www.dtds-r-us.com/generic.dtd"
[
  <!ENTITY part1 SYSTEM "p1.xml">
  <!ENTITY part2 SYSTEM "p2.xml">
  <!ENTITY part3 SYSTEM "p3.xml">
]>
<longdoc>
  &part1;
  &part2;
  &part3;
</longdoc>

Whenever possible, make each subdocument contain at most one XML tree. While you can't validate a subdocument on its own, you can usually perform a well-formedness check if it has no more than one tree. The parser will think it's looking at a lone document without a prolog.
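The one-tree advice is easy to check mechanically. This sketch assumes Python's standard xml.etree.ElementTree; the helper name has_single_tree and the two sample subdocuments are made up.

```python
import xml.etree.ElementTree as ET

def has_single_tree(fragment: str) -> bool:
    """Return True if `fragment` parses on its own as one well-formed tree."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# A subdocument with a single top-level element can be checked alone.
one_tree = "<chapter><title>Part 1</title><para>Text.</para></chapter>"
# Two sibling elements are fine inside a container, but not as a lone file.
two_trees = "<para>First.</para><para>Second.</para>"

print(has_single_tree(one_tree))   # True
print(has_single_tree(two_trees))  # False
```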

Unparsed Entities

An unparsed entity holds content that should not be parsed because it contains something other than text or XML and would probably confuse the parser. The only place from which unparsed entities can be referred to is in an attribute value. They are used to import graphics, sound files, and other noncharacter data.

The declaration for an unparsed entity looks similar to that of an external entity, with some additional information at the end. For example:

<!DOCTYPE doc [
<!NOTATION GIF SYSTEM "image/gif">
<!ENTITY mypic SYSTEM "photos/erik.gif" NDATA GIF>
<!ATTLIST graphic src ENTITY #REQUIRED>
]>
<doc>
<para>Here's a picture of me:</para>
<graphic src="mypic" />
</doc>

This declaration differs from an external entity declaration in that there is an NDATA keyword following the system path information. This keyword tells the parser that the entity's content is in a special format, or notation, other than the usual parsed mixed content. The NDATA keyword is followed by a notation identifier that specifies the data format. In this case, the entity is a graphic file encoded in the GIF format, so the word GIF is appropriate. The notation itself is declared with a NOTATION declaration (the identifier "image/gif" here is only illustrative). Note that an unparsed entity is referred to by its bare name, in an attribute declared with type ENTITY, and not with the &name; syntax.
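A parser does not read the picture itself; it only reports the declaration to the application. The sketch below assumes Python's standard xml.parsers.expat and its UnparsedEntityDeclHandler callback. The NOTATION line, the shortened document and the function name on_unparsed_entity are my own additions for the example.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE doc [
<!NOTATION GIF SYSTEM "image/gif">
<!ENTITY mypic SYSTEM "photos/erik.gif" NDATA GIF>
]>
<doc><para>Here's a picture of me:</para></doc>"""

unparsed = []

def on_unparsed_entity(name, base, system_id, public_id, notation):
    # Called once for each entity declared with NDATA.
    unparsed.append((name, system_id, notation))

parser = xml.parsers.expat.ParserCreate()
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(DOC, True)

print(unparsed)
```

What to do with the file name and the notation (open an image viewer, embed the picture, ignore it) is entirely up to the application.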