XML: The eXtensible Markup Language

XML is a markup language. Markup is something added to a text to specify semantics and presentation.

To understand what XML is and does, you need to understand what is meant by syntax, semantics and presentation in the wider context.

Syntax: Rules for combining symbols; in linguistics, rules for combining words, as opposed to rules for building words (morphology).
Semantics: The purpose or intended use of an element or piece of text, such as being a section, a quotation, a date, a warning etc.
Presentation: How an element looks on a screen or a piece of paper: size and weight of its characters, indenting, background colour and so on. In XML this is usually achieved by referencing a separate file called a stylesheet.

In XML, text is marked up by placing runs of text between tags. The preceding tag is called an opening tag, and the succeeding tag is called a closing tag. A third kind of XML tag is the standalone tag (formally, an empty-element tag), of which we'll see just one example.

XML is not a programming language. It is dual purpose. It is designed to be used for writing and for storing and retrieving information.

Look at the following example.

<p>a <dfn>cheetah</dfn> is a large feline of Africa and SW Asia, the swiftest mammal, having very long legs, non-retractile claws, and a black-spotted, light-brown coat.</p>

I have used HTML tag names in this example. (HTML is the best-known markup vocabulary of this kind; strictly speaking, only its XHTML variant is an XML vocabulary.) The whole text is a paragraph, which is placed between one opening and one closing tag whose name is p (for paragraph; other vocabularies settle for longer names, like para, which means more typing).

Please note the punctuation marks: if you misplace a single <, > or /, the document will no longer be well-formed and XML parsers will reject it.
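To make this concrete, here is a minimal sketch that checks well-formedness with the parser from Python's standard library. The helper name is_well_formed and the two sample strings are made up for this illustration; the second string simply drops the slash from the closing dfn tag.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<p>a <dfn>cheetah</dfn> is a large feline</p>"
# Same text, but the closing dfn tag has lost its slash
bad = "<p>a <dfn>cheetah<dfn> is a large feline</p>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A single missing character is enough for the parser to give up on the whole document, which is why the punctuation matters so much.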

Also, the word cheetah appears between dfn tags to denote that it is the term being defined (semantics). That piece of information could turn out to be useful for searching the document (it must contain the definition of cheetah!) as well as for presenting the text suitably by, say, italicizing all defined terms or typesetting all paragraphs in a specific manner (presentation).
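As a rough illustration of the searching part, the following sketch collects every term marked up with dfn. It uses Python's standard xml.etree.ElementTree module; the paragraph is a shortened copy of the cheetah example.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    "<p>a <dfn>cheetah</dfn> is a large feline of Africa and SW Asia.</p>"
)

# iter("dfn") visits every dfn element, however deeply nested
terms = [dfn.text for dfn in paragraph.iter("dfn")]
print(terms)
```

The program never has to guess which word is being defined: the markup says so.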

Some other, non-XML formats make do without tags, such as the JavaScript Object Notation.

There are many other kinds of marks in XML besides tags. Look at the following example.

<!DOCTYPE html>
<html>
  <head>
    <title>the c++ programming language</title>
    <meta charset="UTF-8"/>
    <link rel="stylesheet" type="text/css" href="stylesheet.en.css"/>
  </head>
  <body>
    <h1>The C++ Programming Language</h1>
    <p>C++ is a general-purpose programming language developed as an extension of C.</p>
    [...]

...

This document contains, among other things:

a DOCTYPE (document type) declaration,
a title (to be shown on the frame of the window where the document is displayed),
a charset specification (essentially, to establish how characters, including non-English letters like ñ, are encoded as bytes in the file),
a pointer to a stylesheet, that is, another document that specifies the look or presentation of your elements, such as red letters for definition terms (you can point to the same stylesheet in tens of pages that you write and they will present a uniform, consistent appearance that you have defined just once), and
a heading (between h1 tags).

Please note the syntax in the line <meta charset="UTF-8"/>: it is a single standalone tag, ending in />!
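The standalone tag is only a shorthand for an opening tag immediately followed by its closing tag. A small sketch, using Python's standard parser, suggests that both spellings produce the same element:

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<meta charset="UTF-8"/>')
long_form = ET.fromstring('<meta charset="UTF-8"></meta>')

# Both spellings give the same name, the same attributes and no children
for element in (short_form, long_form):
    print(element.tag, element.attrib, len(element))
```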

Actually, a document in XML proper should start with an XML declaration (optional in XML 1.0, but recommended) like

<?xml version="1.0" encoding="UTF-8"?>

Some of this stuff might be dismissed as paperwork, a kind of bureaucracy, but most of it plays some meaningful rôle.

The good news is that I don't need to remember all this whenever I write a new document: I just copy it from another document. Besides, if you write a web page in HTML, the browser will still show your page properly even if you make some minor mistakes.

Consider this. If you want to write a story and upload it to the Web, you just need two tags: h1 for the title, and p for a paragraph. Maybe you want to italicize some word later on (i tag). Most writing does not require many more tags.

The Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML comprises The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards.

On Being a Metalanguage: eXtensible

XML per se does not define a set of tags with their own semantics and presentation expectations. That is why my presentation of XML may have felt a bit too abstract.

As already said, XML...

Some XML languages have already been defined, besides HTML.

MathML: for typesetting mathematical formulas
SVG: Scalable Vector Graphics, for two-dimensional vector images
DocBook: a richer vocabulary than HTML for writing books, esp. technical books
MusicXML: for exchanging sheet music between music software like MuseScore
TEI: the Text Encoding Initiative, for marking up texts in the Humanities such as plays and poetry, or for annotating ancient texts.

Copy and Knowledge Representation

XML is good for two purposes:

Writing: From literature to technical articles, arts reviews, reports etc.
Encoding Information: You can design a vocabulary of your own for representing data, one that stays human-readable and that another program can process easily.
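As a sketch of the second purpose, the snippet below reads a small data document with Python's standard library. The catalog/cd/title/artist vocabulary and the sample values are invented for this illustration and do not belong to any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical home-made vocabulary with placeholder data
catalog_xml = """
<catalog>
  <cd><title>First Album</title><artist>First Artist</artist></cd>
  <cd><title>Second Album</title><artist>Second Artist</artist></cd>
</catalog>
"""

catalog = ET.fromstring(catalog_xml)

# Turn every cd element into a (title, artist) pair
rows = [(cd.findtext("title"), cd.findtext("artist")) for cd in catalog.findall("cd")]
for title, artist in rows:
    print(title, "by", artist)
```

The same file remains readable to a person opening it in a text editor, which is the point of the exercise.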

Most DBMSs (database management systems) can store XML in fields (the cells in their rows), and there are at least two open-source XML-oriented DBMSs: eXist-db and BaseX. They use the XQuery language (instead of good old SQL) to, well, query XML documents.

Searching and Transforming

As long as a document follows the XML syntax, it can be transformed into another document in a different XML vocabulary or even into a non-XML document, such as a JSON document, TeX/LaTeX for high-quality typesetting, Markdown etc.

XML also has a dedicated, standardized language for describing such transformations: eXtensible Stylesheet Language Transformations, or XSLT.
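Python's standard library does not include a processor for XML's own transformation language, so the following sketch uses plain Python instead, just to illustrate the idea of turning XML into JSON. The helper element_to_dict is hypothetical and assumes a flat element whose children contain only text.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Map each child tag to its text (assumes flat, text-only children)."""
    return {child.tag: child.text for child in element}


table = ET.fromstring(
    "<table><name>African Coffee Table</name>"
    "<width>80</width><length>120</length></table>"
)

as_json = json.dumps(element_to_dict(table))
print(as_json)
```

Note that the numbers come out as strings: XML itself has no data types, so any typing has to be added by the transformation or by a schema.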

XML Namespaces

XML Namespaces provide a method to avoid element name conflicts. In XML, element names are defined by the developer. This often results in a conflict when trying to mix XML documents from different XML applications.

Name conflicts in XML can easily be avoided using a name prefix.

This XML carries information about an HTML table, and a piece of furniture:

<h:table>
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table>
  <f:name>African Coffee Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

The `xmlns` Attribute

When using prefixes in XML, a namespace for the prefix must be defined.

The namespace can be defined by an xmlns attribute in the start tag of an element.

The namespace declaration has the following syntax: xmlns:prefix="URI".

<root>

<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table xmlns:f="https://www.w3schools.com/furniture">
  <f:name>African Coffee Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

</root>

In the example above:

The xmlns:h attribute in the <h:table> element binds the h: prefix to the namespace http://www.w3.org/TR/html4/.
The xmlns:f attribute in the <f:table> element binds the f: prefix to the namespace https://www.w3schools.com/furniture.

When a namespace is defined for an element, all child elements with the same prefix are associated with the same namespace.
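A short sketch of what a namespace-aware parser makes of this, using Python's standard xml.etree.ElementTree: every prefixed name is expanded to the form {URI}name, and a prefix with no declaration is rejected outright. The sample strings are shortened versions of the table example.

```python
import xml.etree.ElementTree as ET

declared = ET.fromstring(
    '<h:table xmlns:h="http://www.w3.org/TR/html4/">'
    "<h:tr><h:td>Apples</h:td></h:tr></h:table>"
)

# ElementTree replaces each prefix with the namespace URI in braces
tags = [element.tag for element in declared.iter()]
print(tags)

try:
    ET.fromstring("<h:table><h:tr/></h:table>")  # no xmlns:h declaration
    undeclared_ok = True
except ET.ParseError as error:
    undeclared_ok = False
    print("rejected:", error)
```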

Namespaces can also be declared in the XML root element:

<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="https://www.w3schools.com/furniture">

<h:table>
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table>
  <f:name>African Coffee Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

</root>
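The following sketch searches a compact copy of this document with Python's standard library. The prefixes html and furn in the ns dictionary are chosen freely for the query and deliberately differ from h and f in the document; only the URIs have to match.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring("""
<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="https://www.w3schools.com/furniture">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table><f:name>African Coffee Table</f:name></f:table>
</root>
""")

# Query-side prefixes (assumed names) mapped to the same namespace URIs
ns = {
    "html": "http://www.w3.org/TR/html4/",
    "furn": "https://www.w3schools.com/furniture",
}

cells = [td.text for td in document.findall("html:table/html:tr/html:td", ns)]
furniture_name = document.findtext("furn:table/furn:name", namespaces=ns)
print(cells, furniture_name)
```

The two table elements never get confused, because each query names the namespace it is interested in.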

Uniform Resource Identifier (URI)

A Uniform Resource Identifier (URI) is a string of characters which identifies an Internet Resource.

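The most common URI is the Uniform Resource Locator (URL), which identifies a resource by its location on the network. Another, not so common type of URI is the Uniform Resource Name (URN). In a namespace declaration the URI serves only as a unique name: parsers do not retrieve anything from it.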

Default Namespaces

Defining a default namespace for an element saves us from using prefixes in all the child elements. It has the following syntax:

xmlns="namespaceURI"

This XML carries HTML table information:

<table xmlns="http://www.w3.org/TR/html4/">
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

and this XML carries information about a piece of furniture:

<table xmlns="https://www.w3schools.com/furniture">
  <name>African Coffee Table</name>
  <width>80</width>
  <length>120</length>
</table>
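A default namespace removes the prefixes from the text, but the elements still belong to the namespace. The sketch below, using Python's standard library on a compact copy of the HTML table, suggests what that means in practice: a search for the bare names finds nothing, while a search for the qualified names succeeds.

```python
import xml.etree.ElementTree as ET

HTML_NS = "http://www.w3.org/TR/html4/"

table = ET.fromstring(
    '<table xmlns="http://www.w3.org/TR/html4/">'
    "<tr><td>Apples</td><td>Bananas</td></tr></table>"
)

print(table.tag)  # the namespace URI is part of the element's name

# Bare names do not match elements that live in a namespace
plain = table.findall("tr/td")

# Qualified names in {URI}name form do match
qualified = table.findall(f"{{{HTML_NS}}}tr/{{{HTML_NS}}}td")
print(len(plain), [td.text for td in qualified])
```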

Namespaces in Real Use

XSLT (eXtensible Stylesheet Language Transformations) is an XML-based language that can be used to transform XML documents into other formats.

The XML document below is a stylesheet used to transform XML into HTML.

The namespace "http://www.w3.org/1999/XSL/Transform" identifies XSLT elements inside an HTML document:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
<html>
<body>
  <h2>My CD Collection</h2>
  <table border="1">
    <tr>
      <th style="text-align:left">Title</th>
      <th style="text-align:left">Artist</th>
    </tr>
    <xsl:for-each select="catalog/cd">
    <tr>
      <td><xsl:value-of select="title"/></td>
      <td><xsl:value-of select="artist"/></td>
    </tr>
    </xsl:for-each>
  </table>
</body>
</html>
</xsl:template>

</xsl:stylesheet>
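Running the stylesheet requires an XSLT processor, which Python's standard library does not provide. The sketch below therefore only parses a trimmed copy of it and sorts the elements by namespace, to show how a processor can tell instructions apart from the HTML to be output.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

stylesheet = ET.fromstring("""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
  <table border="1">
    <xsl:for-each select="catalog/cd">
    <tr>
      <td><xsl:value-of select="title"/></td>
      <td><xsl:value-of select="artist"/></td>
    </tr>
    </xsl:for-each>
  </table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>""")

xslt_elements = []    # instructions for the XSLT processor
output_elements = []  # literal HTML elements copied to the result

for element in stylesheet.iter():
    if element.tag.startswith("{" + XSL_NS + "}"):
        xslt_elements.append(element.tag.split("}", 1)[1])
    else:
        output_elements.append(element.tag)

print(xslt_elements)
print(output_elements)
```

Everything in the XSLT namespace is an instruction; everything else is treated as part of the result document.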