Wednesday, November 01, 2006

CML ids are picky about allowed characters

Yesterday, I was hacking on the atom type lists in the CDK (MM2, MMFF94, PDB and a few custom ones), which the project stores in CML. The files had not been updated to CML 2.3, which is the most recent specification we use in the CDK (see bug #1562342).

While doing this I ran into a problem I have seen before: the MM2 and MMFF94 atom type lists contained '=' and other forbidden characters in the <atomType> @id's. The content of this @id attribute is defined by the idType in the XML Schema for CML 2.3. It derives from the XML Schema datatype xsd:string, but is also required to match this regular expression: [A-Za-z][A-Za-z0-9\.\-_]*. That is, it has to start with an upper or lower case letter, and then consist of alphanumerics, dots (.), hyphens (-), or underscores (_).
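A quick way to spot offending identifiers before running a schema validator is to test them against the same pattern. The snippet below is only a minimal sketch in Python: the function name is made up for illustration, and it checks the regular expression alone, not the rest of the schema.

```python
import re

# The idType pattern quoted above, written as a Python regular expression.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.\-_]*")


def is_valid_cml_id(candidate):
    """Return True if the whole string matches the idType pattern."""
    # fullmatch() is used because the schema pattern must cover the
    # complete attribute value, not just a prefix of it.
    return ID_PATTERN.fullmatch(candidate) is not None


for candidate in ["NdoubleBondedC", "N=C", "C.sp3", "3N"]:
    print(candidate, is_valid_cml_id(candidate))
```

Here "N=C" fails because of the '=' character, and "3N" fails because an id may not start with a digit.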

Now, the atom type lists used the identifiers from the MMFF94 and MM2 force fields directly, giving code like:

<atomType id="N=C">
<!-- in imines -->
<atom elementType="N" formalCharge="0">
<scalar dataType="xsd:double" dictRef="cdk:maxBondOrder">2.0</scalar>
<scalar dataType="xsd:double" dictRef="cdk:bondOrderSum">3.0</scalar>
<scalar dataType="xsd:integer" dictRef="cdk:formalNeighbourCount">2</scalar>
<scalar dataType="xsd:integer" dictRef="cdk:valency">5</scalar>
</atom>
<scalar dataType="xsd:string" dictRef="cdk:hybridization">sp2</scalar>
<scalar dataType="xsd:string" dictRef="cdk:DA">A</scalar>
<scalar dataType="xsd:string" dictRef="cdk:sphericalMatcher">N-[1-3];[H]{0,2}+[A-Za-z]*+=[CN].*+</scalar>
</atomType>


The solution is to put the original identifier, "N=C" in this example, into a <label> element and fix the @id:

<atomType id="NdoubleBondedC">
<!-- in imines -->
<label value="N=C"/>
</atomType>


This also required fixing the atom type perception code, because it was using the original identifiers to look up atom types.
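To give an idea of what such a lookup could look like once the original names live in <label> elements, here is a rough sketch in Python. It is not the CDK code (which is Java). The <atomTypeList> wrapper, the second atom type and the function name are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written stand-in for an atom type list; not the actual CDK file.
ATOM_TYPES = """\
<atomTypeList xmlns="http://www.xml-cml.org/schema">
  <atomType id="NdoubleBondedC">
    <label value="N=C"/>
  </atomType>
  <atomType id="Csp3">
    <label value="C"/>
  </atomType>
</atomTypeList>
"""


def index_by_label(cml_text):
    """Map each original force field name (the label) to the schema-valid id."""
    root = ET.fromstring(cml_text)
    index = {}
    for atom_type in root.iter("{http://www.xml-cml.org/schema}atomType"):
        label = atom_type.find("cml:label", CML_NS)
        # Skip atom types that carry no label at all.
        if label is not None and label.get("value") is not None:
            index[label.get("value")] = atom_type.get("id")
    return index


lookup = index_by_label(ATOM_TYPES)
print(lookup["N=C"])  # prints NdoubleBondedC
```

Perception code can then keep working with the force field names, while the file itself only contains ids the schema accepts.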

Sunday, October 22, 2006

Creating plain text from CML source

Recently, the fourth release of the Blue Obelisk Data Repository (BODR) came out. This data repository provides basic data about elements and isotopes, such as high resolution mass information. The data is taken from literature, and the repository is used by several projects, like OpenBabel, Kalzium, and the CDK.

The data is distributed in CML format to ensure semantic markup, but in some cases it might be easier to work with text files containing bits of information. This post shows how XSLT can be used to create such plain text files from the XML source.

Consider the below XSLT code, which extracts information from the elements.xml in the BODR distribution:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:cml="http://www.xml-cml.org/schema">

<xsl:output method="text"/>

<xsl:template match="*">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()"/>

<xsl:template match="cml:atom">
<xsl:value-of select="cml:label[@dictRef='bo:symbol']/@value"/>
<xsl:text> </xsl:text>
<xsl:value-of select="cml:scalar[@dictRef='bo:exactMass']"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>


XSLT is the transformation specification of XSL, and provides means to convert XML documents into other documents, either XML or plain text. The above XSLT creates plain text, as indicated by the <xsl:output> element.

The three templates define how the XML source is transformed. The first template matches every element in the XML source (match="*") and tells the processor to apply templates to its children. The second template matches all text nodes in the XML source and outputs nothing for them, so that source text is ignored by default and we can control the output in detail.

The third template is the most interesting one; it defines which information from the source XML is extracted and written to the plain text file. The <xsl:value-of> elements define which content from the XML source is sent to the output. The above XSLT contains two such elements, and it extracts the element symbol and its exact mass. The logic is defined by XPath expressions given in the @match attribute of the <xsl:template> element and the @select attribute of the <xsl:value-of> element.

This third template applies to all <atom> elements in the CML source, as indicated by the @match attribute. From this element, the first <xsl:value-of> element extracts the element symbol using the XPath query cml:label[@dictRef='bo:symbol']/@value. Or, in natural language: take the content of the @value attribute of the <cml:label> element whose @dictRef attribute has the content 'bo:symbol'. The second query is cml:scalar[@dictRef='bo:exactMass'], and says: take the content of the <cml:scalar> element whose @dictRef attribute has the content 'bo:exactMass'.

Check the elements.xml source code, and it will be clear how to extract other elemental properties. Good luck!
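If no XSLT processor is at hand, the same two queries can be tried out with Python's standard library, which supports this small subset of XPath. The sketch below is not a replacement for the stylesheet. The embedded CML is a hand-made fragment that only assumes the structure described above, so the real elements.xml may well differ in details, and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema"}

# Tiny hand-made fragment mimicking the structure the stylesheet expects.
SAMPLE = """\
<list xmlns="http://www.xml-cml.org/schema">
  <atom id="H">
    <label dictRef="bo:symbol" value="H"/>
    <scalar dataType="xsd:float" dictRef="bo:exactMass">1.007825032</scalar>
  </atom>
  <atom id="C">
    <label dictRef="bo:symbol" value="C"/>
    <scalar dataType="xsd:float" dictRef="bo:exactMass">12.0</scalar>
  </atom>
</list>
"""


def symbol_mass_lines(cml_text):
    """Return one 'symbol mass' line per <atom>, like the stylesheet does."""
    root = ET.fromstring(cml_text)
    lines = []
    for atom in root.iter("{http://www.xml-cml.org/schema}atom"):
        # Same predicates as in the two XPath queries of the stylesheet.
        label = atom.find("cml:label[@dictRef='bo:symbol']", NS)
        mass = atom.find("cml:scalar[@dictRef='bo:exactMass']", NS)
        symbol = label.get("value") if label is not None else ""
        mass_text = mass.text.strip() if mass is not None and mass.text else ""
        lines.append(f"{symbol} {mass_text}")
    return lines


print("\n".join(symbol_mass_lines(SAMPLE)))
```

Swapping 'bo:exactMass' for another dictRef value is enough to pull out a different property.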

Monday, September 11, 2006

InChI's in CML

I've seen some incorrect uses of <identifier> for adding InChI's in CML. The bad code I have seen in the wild looks like:

<molecule>
<identifier convention="iupac:inchi">InChI=1/CH4/h1H4</identifier>
</molecule>

However, the <identifier> element does not allow content other than elements. In XML terms: it does not allow mixed content. You might wonder why it allows any element. This is because the first InChI betas had an XML syntax (InChI still has an XML syntax too).

If you want the one-line InChI format most of us know, it needs to be put into the @value attribute. Thus:

<molecule>
<identifier convention="iupac:inchi" value="InChI=1/CH4/h1H4"/>
</molecule>
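To find files that still use the wrong form, something like the following Python sketch could be used. It assumes the documents declare the CML namespace (the short examples above leave it out), and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org/schema}"


def misplaced_inchis(cml_text):
    """Return InChI strings found as text content instead of in @value."""
    root = ET.fromstring(cml_text)
    found = []
    for identifier in root.iter(CML + "identifier"):
        text = (identifier.text or "").strip()
        # Flag identifiers that have no @value but do carry an InChI as text.
        if identifier.get("value") is None and text.startswith("InChI="):
            found.append(text)
    return found


BAD = """<molecule xmlns="http://www.xml-cml.org/schema">
  <identifier convention="iupac:inchi">InChI=1/CH4/h1H4</identifier>
</molecule>"""

GOOD = """<molecule xmlns="http://www.xml-cml.org/schema">
  <identifier convention="iupac:inchi" value="InChI=1/CH4/h1H4"/>
</molecule>"""

print(misplaced_inchis(BAD))   # prints ['InChI=1/CH4/h1H4']
print(misplaced_inchis(GOOD))  # prints []
```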


Sunday, September 03, 2006

Scalar data types

The CML <scalar> element has an attribute @dataType which allows one to state the type of the scalar value. Consider the following fragment, which could define a nickname for an atom:

<scalar dataType="xsd:string" title="NickName">carbonRulez</scalar>

The CML schema suggests using the XML Schema datatypes, as defined in the XML Schema Part 2: Datatypes specification. For these data types, the xsd prefix is commonly used, as in the above example.

Now, this specification allows restricting the content of the string in many ways. For example, so that it defines a year:

<scalar dataType="xsd:gYear" title="DateOfDiscovery">1891</scalar>

It defines nineteen primitive datatypes: string, boolean, decimal, float, double, duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth, hexBinary, base64Binary, anyURI, QName, and NOTATION. Of these, the first five have obvious meaning. The specification also defines a number of derived datatypes, which include these useful types: language, integer, positiveInteger, negativeInteger, ID and IDREF.
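Software reading CML can use the @dataType value to decide how to interpret the text content. The following Python sketch shows the idea for a handful of types. The converter table and function name are hypothetical, it treats a missing @dataType as a string (an assumption), and a real reader would cover far more of the datatypes.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a few xsd datatype names to Python converters.
CONVERTERS = {
    "xsd:string": str,
    "xsd:integer": int,
    "xsd:float": float,
    "xsd:double": float,
    "xsd:boolean": lambda text: text in ("true", "1"),
}


def scalar_value(scalar_xml):
    """Parse one <scalar> element and convert its text using @dataType."""
    element = ET.fromstring(scalar_xml)
    # Assumption: fall back to a plain string for missing or unknown types.
    data_type = element.get("dataType", "xsd:string")
    convert = CONVERTERS.get(data_type, str)
    return convert((element.text or "").strip())


print(scalar_value('<scalar dataType="xsd:string" title="NickName">carbonRulez</scalar>'))
print(scalar_value('<scalar dataType="xsd:double" title="mass">12.0</scalar>'))
```

The first call returns the Python string 'carbonRulez', the second the float 12.0.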

The CML schema also defines a large set of custom data types. For example, it defines an elementTypeType which derives from the xsd:string and allows only the one/two/three character element name abbreviations ("C", "N", etc), and Dummy, Du and R.

The use of datatypes in XML Schemas is really powerful, and the CML schema allows defining custom datatypes if needed. But I'll save that for a later time.

Saturday, September 02, 2006

CML blog from the source

Peter Murray-Rust started a blog on CML too. I do plan to continue to give small CML tips and comments in this blog.

Monday, August 21, 2006

Crystal structures in Jmol

The following CML shows how unit cell information can be represented. By using the titles 'a', 'b', 'c', 'alpha', 'beta' and 'gamma', Jmol is able to read the unit cell properties:

<?xml version="1.0"?>
<molecule xmlns="http://www.xml-cml.org/schema">
<crystal>
<scalar title="a">8.781</scalar>
<scalar title="b">8.845</scalar>
<scalar title="c">10.171</scalar>
<scalar title="alpha">90.00</scalar>
<scalar title="beta">101.34</scalar>
<scalar title="gamma">90.00</scalar>
</crystal>
<atomArray>
<atom id="Fe1" xFract="1.0" yFract="0.0" zFract="1.0" elementType="Fe"/>
<atom id="N1" xFract="0.7619" yFract="0.0548" zFract="0.9299" elementType="N"/>
</atomArray>
</molecule>

However, if you want to make use of the full power of CML, you would not use the @title attributes on the <crystal>'s <scalar> elements, but use @dictRef instead. Dictionary references make the semantics complete: they say explicitly what you mean by 'a' or 'beta'. The CML project has set up a simple CML dictionary (namespaced: http://www.xml-cml.org/dict/cmlDict) which contains the concepts for these six unit cell properties. The example then becomes:

<?xml version="1.0"?>
<molecule xmlns="http://www.xml-cml.org/schema">
<crystal xmlns:cmldict="http://www.xml-cml.org/dict/cmlDict">
<scalar dictRef="cmldict:a">8.781</scalar>
<scalar dictRef="cmldict:b">8.845</scalar>
<scalar dictRef="cmldict:c">10.171</scalar>
<scalar dictRef="cmldict:alpha">90.00</scalar>
<scalar dictRef="cmldict:beta">101.34</scalar>
<scalar dictRef="cmldict:gamma">90.00</scalar>
</crystal>
<atomArray>
<atom id="Fe1" xFract="1.0" yFract="0.0" zFract="1.0" elementType="Fe"/>
<atom id="N1" xFract="0.7619" yFract="0.0548" zFract="0.9299" elementType="N"/>
</atomArray>
</molecule>

Jmol currently does not recognize this dictionary, and thus ignores it (bug #1543784).
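A reader that wants to accept both variants could prefer @dictRef and fall back on @title. The Python sketch below illustrates that idea only; it is not how Jmol works. The function name is made up, and as a simplification it looks at the local part of the dictRef without checking that the prefix is really bound to the cmlDict namespace.

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org/schema}"
CELL_KEYS = ("a", "b", "c", "alpha", "beta", "gamma")


def unit_cell(cml_text):
    """Collect the six unit cell parameters from a <crystal> element."""
    root = ET.fromstring(cml_text)
    crystal = root.find(CML + "crystal")
    cell = {}
    if crystal is None:
        return cell
    for scalar in crystal.findall(CML + "scalar"):
        dict_ref = scalar.get("dictRef")
        # Simplification: use the part after the prefix, ignoring which
        # namespace the prefix is bound to; otherwise fall back on @title.
        key = dict_ref.split(":", 1)[-1] if dict_ref else scalar.get("title")
        if key in CELL_KEYS:
            cell[key] = float(scalar.text)
    return cell


WITH_DICTREF = """<molecule xmlns="http://www.xml-cml.org/schema">
  <crystal xmlns:cmldict="http://www.xml-cml.org/dict/cmlDict">
    <scalar dictRef="cmldict:a">8.781</scalar>
    <scalar dictRef="cmldict:b">8.845</scalar>
    <scalar dictRef="cmldict:c">10.171</scalar>
    <scalar dictRef="cmldict:alpha">90.00</scalar>
    <scalar dictRef="cmldict:beta">101.34</scalar>
    <scalar dictRef="cmldict:gamma">90.00</scalar>
  </crystal>
</molecule>"""

print(unit_cell(WITH_DICTREF))
```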

Sunday, August 20, 2006

CML Explained

There seems to be a niche in explaining the use of Chemical Markup Language, so here goes. This blog will discuss frequently and less frequently asked questions.