Sunday, October 22, 2006

Creating plain text from CML source

The fourth release of the Blue Obelisk Data Repository (BODR) came out recently. The repository provides basic data about elements and isotopes, such as high-resolution mass information. The data is taken from the literature, and the repository is used by several projects, including OpenBabel, Kalzium, and the CDK.

The data is distributed in CML format to ensure semantic markup, but in some cases it might be easier to work with text files containing just the bits of information you need. This post shows how XSLT can be used to create such plain text files from the XML source.

Consider the XSLT code below, which extracts information from the elements.xml file in the BODR distribution:

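<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cml="http://www.xml-cml.org/schema">

  <!-- Emit plain text instead of XML -->
  <xsl:output method="text"/>

  <!-- For any element without a more specific template: process its children -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Suppress all text from the source unless a template outputs it explicitly -->
  <xsl:template match="text()"/>

  <!-- One output line per atom: element symbol, a space, exact mass -->
  <xsl:template match="cml:atom">
    <xsl:value-of select="cml:label[@dictRef='bo:symbol']/@value"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="cml:scalar[@dictRef='bo:exactMass']"/>
    <!-- &#10; is a newline character -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>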


XSLT is the transformation part of XSL (the Extensible Stylesheet Language), and provides a means to convert XML documents into other documents, such as XML, HTML, or plain text. The stylesheet above creates plain text, as indicated by the method="text" attribute of the <xsl:output> element.

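The three templates define how the XML source is transformed. The first template matches any element in the XML source (match="*") and tells the processor to continue applying templates to that element's children. The second template matches all text nodes in the XML source and outputs nothing for them, so that all text from the source is ignored unless we output it explicitly; this gives us detailed control over the output. Because the more specific pattern cml:atom has a higher default priority than *, atom elements are handled by the third template rather than by the first.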
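The catch-all pair of templates is one way to keep unwanted text out of the output. As a sketch of an alternative, not the approach used in this post, a single template on the document root can select the atom elements directly, so that no other node is ever processed. It assumes the same namespace URI as the stylesheet above:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cml="http://www.xml-cml.org/schema">

  <xsl:output method="text"/>

  <!-- Start at the document root and visit only the atom elements,
       wherever they occur; nothing else is processed. -->
  <xsl:template match="/">
    <xsl:apply-templates select="//cml:atom"/>
  </xsl:template>

  <!-- Same per-atom output as in the stylesheet above -->
  <xsl:template match="cml:atom">
    <xsl:value-of select="cml:label[@dictRef='bo:symbol']/@value"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="cml:scalar[@dictRef='bo:exactMass']"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

This version is shorter, but the catch-all approach walks the whole document tree, which makes it easier to add templates for other elements later.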

The third template is the most interesting one; it defines which information from the source XML is extracted and written to the plain text file. The <xsl:value-of> elements define which content from the XML source is sent to the output. The above XSLT contains two such elements, which extract the element symbol and its exact mass. The logic is defined by XPath expressions given in the @match attribute of the <xsl:template> element and the @select attribute of the <xsl:value-of> elements.
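To make the XPath expressions concrete, here is a hypothetical, trimmed-down input fragment with the shape they expect. It is an illustration only, not a copy of elements.xml: the wrapping <list> element and the hydrogen values are assumptions, and the real file carries more properties and attributes per atom.

```xml
<?xml version="1.0"?>
<!-- Hypothetical fragment for illustration; not taken verbatim from the BODR -->
<list xmlns="http://www.xml-cml.org/schema">
  <atom>
    <!-- The symbol is stored in the value attribute of a label element -->
    <label dictRef="bo:symbol" value="H"/>
    <!-- The exact mass is stored as the text content of a scalar element -->
    <scalar dictRef="bo:exactMass">1.007825032</scalar>
  </atom>
</list>
```

For this fragment the stylesheet writes a single line consisting of the symbol, a space, and the mass: H 1.007825032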

The third template applies to every <cml:atom> element in the CML source, as indicated by its @match attribute. Within that element, the first <xsl:value-of> extracts the element symbol using the XPath expression cml:label[@dictRef='bo:symbol']/@value. In natural language: take the content of the @value attribute of the <cml:label> child element whose @dictRef attribute has the value 'bo:symbol'. The second expression, cml:scalar[@dictRef='bo:exactMass'], says: take the text content of the <cml:scalar> child element whose @dictRef attribute has the value 'bo:exactMass'.
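The same pattern extends to further columns. The sketch below adds a third field and switches to tab-separated output. It assumes that the atomic number is stored in a <cml:scalar> element with dictRef 'bo:atomicNumber'; that name is an assumption here, so verify it against your copy of elements.xml before relying on it.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cml="http://www.xml-cml.org/schema">

  <xsl:output method="text"/>

  <!-- Walk the tree and suppress stray text, as before -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="text()"/>

  <!-- One tab-separated line per atom: atomic number, symbol, exact mass -->
  <xsl:template match="cml:atom">
    <!-- Assumption: 'bo:atomicNumber' is the dictRef used for the atomic number -->
    <xsl:value-of select="cml:scalar[@dictRef='bo:atomicNumber']"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="cml:label[@dictRef='bo:symbol']/@value"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="cml:scalar[@dictRef='bo:exactMass']"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

If an atom lacks one of the selected properties, the corresponding <xsl:value-of> produces an empty string, so the column is left empty rather than the line being dropped.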

Check the elements.xml source, and it should be clear how to extract other elemental properties. Good luck!