Wednesday, June 27, 2007

There can be only one (namespace)

Today I had a discussion about indexing CML files, such as extracting the molecular formula, author, title, compound names, etc. That brought us to the topic of CML namespaces. CML has seen a few in the past, listed below. However, only the latest should be used: http://www.xml-cml.org/schema.

http://www.xml-cml.org/schema
This is the namespace of the latest CML schema. This one must be used for creating any new CML file.

http://www.xml-cml.org/schema/cml2/core
This was the first namespace used in the XML Schema definition for CML. It has been used by at least OpenBabel and the CDK, but is outdated and should no longer be used.

http://www.xml-cml.org/dtd/cml1_0_1.dtd
I have no idea if this namespace has actually been used, but it is mentioned in this official FAQ. I think it might actually be a remnant from the early days of XML Schema: CML 1.0 (DOI:10.1021/ci990052b) was DTD based, and DTD has no concept like XML Namespaces. But namespaces are oh so useful, and are used in CML 2.0 (DOI:10.1021/ci0256541) applications like CMLRSS (DOI:10.1021/ci034244p). I am not aware of applications using this cml1_0_1.dtd CML namespace.
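When indexing CML files it helps to check which of these namespaces a document actually uses. The following is a minimal sketch in Python using only the standard library; the function names are made up for this illustration and are not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# The namespace that should be used for new CML files (from the list above).
CML_NS = "http://www.xml-cml.org/schema"
# The outdated namespace of the first XML Schema for CML.
OLD_CML_NS = "http://www.xml-cml.org/schema/cml2/core"


def element_namespace(element):
    """Return the namespace URI of an element, or '' if it has none.

    ElementTree stores qualified names as '{namespace}localname'.
    """
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


def uses_current_cml_namespace(xml_text):
    """True if the root element is in the current CML namespace."""
    root = ET.fromstring(xml_text)
    return element_namespace(root) == CML_NS


current = '<molecule xmlns="http://www.xml-cml.org/schema" id="m1"/>'
outdated = '<molecule xmlns="http://www.xml-cml.org/schema/cml2/core" id="m1"/>'

print(uses_current_cml_namespace(current))
print(uses_current_cml_namespace(outdated))
```

An indexer could use such a check to decide whether a file needs to be converted before its content is extracted.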

Thursday, March 08, 2007

Chemical Formulas in CML

Molecular formulas are useful to have in CML, especially when connectivity is not known. Working on Computer Aided Structure Elucidation (CASE) based on NMR and MS data, this is of particular interest, as the input I have is in CML.

CML provides a few options for this, all using the <formula> element (xsd). The first option is to use a concise formula, which is a formal specification in which every element that occurs in the formula is given once, with its count given explicitly (all whitespace separated):
<formula concise="C 5 H 10 O 2"/>
Alternatively, you can use the @inline attribute, and refer to some convention, as here:
<formula convention="iucr:_chemical_formula_structural" inline="Sn (C2 O4) K F"/>
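Because the concise format is just whitespace-separated symbol/count pairs, it is easy to parse. The sketch below is a hedged illustration in Python: the function name parse_concise is hypothetical, and it only handles the plain element/count pairs described above, nothing beyond that.

```python
def parse_concise(concise):
    """Parse a concise formula such as 'C 5 H 10 O 2' into a dict.

    Assumes the string consists of whitespace-separated pairs of an
    element symbol and an integer count, with each element given once.
    """
    tokens = concise.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected symbol/count pairs: %r" % concise)
    counts = {}
    for symbol, count in zip(tokens[0::2], tokens[1::2]):
        if symbol in counts:
            raise ValueError("element given more than once: %s" % symbol)
        counts[symbol] = int(count)
    return counts


print(parse_concise("C 5 H 10 O 2"))
```

The @inline variant cannot be parsed this generically, because its syntax depends on the convention it refers to.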

More flexible formats are possible too, such as in this example:
<molecule id="CuprammoniumSulfate">
<formula id="f2" title="[Cu(NH3)4]2+ SO42-]">
<formula id="f21" formalCharge="2">
<atomArray id="aa1" elementType="Cu" count="1"/>
<formula id="f22" count="4">
<atomArray elementType="N H" count="1 3"/>
</formula>
</formula>
<formula id="f3" formalCharge="-2">
<atomArray id="aa2" elementType="S O" count="1 4"/>
</formula>
</formula>
</molecule>

You can see it in action in the chem-file project.

Wednesday, February 14, 2007

Writing up CML conventions in LaTeX

CML is a rather flexible format, but it allows its specifics to be pinned down in the form of conventions, named in the @convention attribute. Such conventions can put extra restrictions on hierarchy and expected content. For example, we may define a 'simpleMolecule' convention which only allows <atomArray> and <bondArray> inside a <molecule>, <atom> in <atomArray>, and <bond> in <bondArray>:
<molecule convention="simpleMolecule">
<atomArray>
<atom id="c1" elementType="C" hydrogenCount="3"/>
<atom id="n1" elementType="N" hydrogenCount="2"/>
</atomArray>
<bondArray>
<bond atomRefs="c1 n1" order="s"/>
</bondArray>
</molecule>
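The restrictions of such a convention can also be checked programmatically. The following Python sketch encodes the hypothetical 'simpleMolecule' convention from the text as a table of allowed child elements; the table and the function names are assumptions made for this illustration, not part of CML itself.

```python
import xml.etree.ElementTree as ET

# Allowed children per element under the hypothetical 'simpleMolecule'
# convention described in the text.
ALLOWED_CHILDREN = {
    "molecule": {"atomArray", "bondArray"},
    "atomArray": {"atom"},
    "bondArray": {"bond"},
    "atom": set(),
    "bond": set(),
}


def local_name(tag):
    """Strip an optional '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_violations(element):
    """Return a list of messages for children the convention forbids."""
    problems = []
    parent = local_name(element.tag)
    allowed = ALLOWED_CHILDREN.get(parent, set())
    for child in element:
        name = local_name(child.tag)
        if name not in allowed:
            problems.append("<%s> not allowed inside <%s>" % (name, parent))
        else:
            problems.extend(find_violations(child))
    return problems


valid = """<molecule convention="simpleMolecule">
  <atomArray>
    <atom id="c1" elementType="C" hydrogenCount="3"/>
    <atom id="n1" elementType="N" hydrogenCount="2"/>
  </atomArray>
  <bondArray>
    <bond atomRefs="c1 n1" order="s"/>
  </bondArray>
</molecule>"""

invalid = '<molecule convention="simpleMolecule"><formula concise="C 1 H 5 N 1"/></molecule>'

print(find_violations(ET.fromstring(valid)))
print(find_violations(ET.fromstring(invalid)))
```

A written specification is still needed to explain the intent of the convention; a check like this only covers the element hierarchy.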

Now, such conventions need to be specified. One way of doing this is writing it up in a LaTeX document, which would include explanations and likely code examples. To ensure that these code examples are actually valid CML, the following setup may be used, which separates code examples from the LaTeX source.

Including CML examples in the LaTeX source

Because the framework makes use of the LaTeX package listings, it cannot make use of the \include command to include the code examples into the PDF. Instead, a preprocessor inserts the examples. This is the directory layout I have:
.
|-- Makefile
|-- spec.tex.in
|-- examples
|   |-- Makefile
|   |-- schema.xsd
|   `-- simple1d.valid.cml
`-- preproces.pl

The Makefile creates the PDF from the LaTeX source by first running preproces.pl on the .tex.in file, and then running pdflatex on the created .tex file:
all: spec.pdf

spec.pdf: spec.tex
	pdflatex spec.tex
	pdflatex spec.tex
	pdflatex spec.tex

spec.tex: spec.tex.in
	perl preproces.pl < spec.tex.in > spec.tex

The LaTeX source in the .tex.in file looks like:

\begin{lstlisting}[language=XML,
caption={Simple 1D $^{13}$C NMR spectrum.},
label={list:simple1d}]
% INPUT: simple1d.valid.cml
\end{lstlisting}

The string "% INPUT:" is picked up by the Perl script to include that file. The full script looks like:
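#!/usr/bin/perl

use diagnostics;
use strict;

# Copy STDIN to STDOUT, replacing every "% INPUT: <file>" line
# with the content of examples/<file>.
while (my $line = <STDIN>) {
    if ($line =~ /^\%\sINPUT:\s(.*)/) {
        my $file = $1;
        die "Cannot find file 'examples/$file' to insert!\n" if (!(-e "examples/$file"));
        # three-argument open with a lexical handle, and fail loudly on errors
        open(my $input, "<", "examples/$file")
            or die "Cannot open file 'examples/$file': $!\n";
        while (<$input>) { print STDOUT $_; }
        close($input);
    } else {
        print STDOUT $line;
    }
}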

CML Validation

Now that the examples are split out, but tightly integrated in the LaTeX source, validation of the CML examples is easy. I use a simple Makefile for that, which makes use of xmllint:

all: validate

validate: *.valid.cml
	@for f in *.valid.cml; do \
	  echo "** Validating $${f} against XML Schema..."; \
	  xmllint --noout --schema schema.xsd $${f}; \
	done

Wednesday, November 01, 2006

CML id's picky on allowed characters

Yesterday, I was hacking on the atom type lists in the CDK (MM2, MMFF94, PDB and a few custom ones) which the project has stored in CML. The files had not been updated to CML 2.3, which is the most recent specification we use in the CDK (see bug #1562342).

When doing this I found a problem which I have seen more often: the MM2 and MMFF94 atom type lists contained '=' and other forbidden characters in the <atomType> @id's. The content of this @id attribute is defined by the idType in the XML Schema for CML 2.3. It extends the XML Schema datatype xsd:string, but is also required to match this regular expression: [A-Za-z][A-Za-z0-9\.\-_]*. That is, it has to start with an upper or lower case letter, which may be followed by any number of alphanumerics, dots (.), hyphens (-), or underscores (_).
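The pattern is easy to test outside a schema validator. Here is a small Python sketch, assuming only the regular expression quoted above; the helper name is_valid_cml_id is made up. XML Schema patterns must match the whole value, hence the use of fullmatch.

```python
import re

# The idType pattern quoted above. XML Schema patterns are implicitly
# anchored at both ends, so the whole value has to match.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9\.\-_]*")


def is_valid_cml_id(value):
    """True if the value satisfies the idType pattern."""
    return ID_PATTERN.fullmatch(value) is not None


for candidate in ["c1", "N=C", "NdoubleBondedC", "1c", "atom-type_2.b"]:
    print(candidate, is_valid_cml_id(candidate))
```

Running such a check over all @id values before validation makes it quick to spot which identifiers need renaming.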

Now, the atom type lists used identifiers taken from the MMFF94 and MM2 force field atom type lists, giving code like:

<atomType id="N=C">
<!-- in imines -->
<atom elementType="N" formalCharge="0">
<scalar dataType="xsd:double" dictRef="cdk:maxBondOrder">2.0</scalar>
<scalar dataType="xsd:double" dictRef="cdk:bondOrderSum">3.0</scalar>
<scalar dataType="xsd:integer" dictRef="cdk:formalNeighbourCount">2</scalar>
<scalar dataType="xsd:integer" dictRef="cdk:valency">5</scalar>
</atom>
<scalar dataType="xsd:string" dictRef="cdk:hybridization">sp2</scalar>
<scalar dataType="xsd:string" dictRef="cdk:DA">A</scalar>
<scalar dataType="xsd:string" dictRef="cdk:sphericalMatcher">N-[1-3];[H]{0,2}+[A-Za-z]*+=[CN].*+</scalar>
</atomType>


The solution is to put the original identifier, "N=C" in this example, into a <label> element and fix the @id:

<atomType id="NdoubleBondedC">
<!-- in imines -->
<label value="N=C"/>
</atomType>


This also required fixing the atom type perception code, because it was using the original identifiers to look up atom types.

Sunday, October 22, 2006

Creating plain text from CML source

Recently the fourth release of the Blue Obelisk Data Repository (BODR) came out. This data repository provides basic data about elements and isotopes, such as high resolution mass information. The data is taken from literature, and the repository is used by several projects, like OpenBabel, Kalzium, and the CDK.

The data is distributed in CML format to ensure semantic markup, but in some cases it might be easier to work with text files containing bits of information. This post shows how XSLT can be used to create such plain text files from the XML source.

Consider the below XSLT code, which extracts information from the elements.xml in the BODR distribution:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:cml="http://www.xml-cml.org/schema">

<xsl:output method="text"/>

<xsl:template match="*">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()"/>

<xsl:template match="cml:atom">
<xsl:value-of select="cml:label[@dictRef='bo:symbol']/@value"/>
<xsl:text> </xsl:text>
<xsl:value-of select="cml:scalar[@dictRef='bo:exactMass']"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>


XSLT is the transformation specification of XSL, and provides means to convert XML documents into other documents, either XML or plain text. The above XSLT creates plain text, as indicated by the <xsl:output> element.

The three templates define how the XML source is transformed. The first template tells the processor to apply templates for all elements in the XML source (match="*"). The second template matches all text nodes in the XML source, and ensures that all text from the source is ignored, so that we can control the output in detail.

The third template is the most interesting one; it defines which information from the source XML is extracted and written to the plain text file. The <xsl:value-of> elements define which content from the XML source is sent to the output. The above XSLT contains two such elements, and it extracts the element symbol and its exact mass. The logic is defined by XPath statements given in the @match attribute of the <xsl:template> element and the @select attribute of the <xsl:value-of> element.

This third template applies to all <atom> elements in the CML source, as indicated by the @match attribute. From this element, the <xsl:value-of> elements extract the element symbol using the XPath query cml:label[@dictRef='bo:symbol']/@value. Or, in natural language: take the content of the @value attribute, of the <cml:label> element which has an @dictRef attribute of which the content is 'bo:symbol'. The second query is cml:scalar[@dictRef='bo:exactMass'], and says: take the content of the <cml:scalar> element which has an @dictRef attribute of which the content is 'bo:exactMass'.
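For comparison, the same two queries can be run from a general-purpose language. The sketch below uses Python's standard ElementTree module, which supports the subset of XPath needed here. The sample document is a tiny hand-written fragment that only mimics the structure the stylesheet expects; it is not copied from the BODR files, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
NS = {"cml": CML_NS}

# Hand-written sample, assumed to resemble the layout of elements.xml.
SAMPLE = """<list xmlns="http://www.xml-cml.org/schema">
  <atom id="H">
    <label dictRef="bo:symbol" value="H"/>
    <scalar dictRef="bo:exactMass">1.007825032</scalar>
  </atom>
  <atom id="C">
    <label dictRef="bo:symbol" value="C"/>
    <scalar dictRef="bo:exactMass">12</scalar>
  </atom>
</list>"""


def symbols_and_masses(xml_text):
    """Return (symbol, exact mass) string pairs for every cml:atom."""
    root = ET.fromstring(xml_text)
    rows = []
    for atom in root.iter("{%s}atom" % CML_NS):
        # Same predicates as in the stylesheet's select expressions.
        label = atom.find("cml:label[@dictRef='bo:symbol']", NS)
        mass = atom.find("cml:scalar[@dictRef='bo:exactMass']", NS)
        if label is not None and mass is not None:
            rows.append((label.get("value"), (mass.text or "").strip()))
    return rows


for symbol, mass in symbols_and_masses(SAMPLE):
    print(symbol, mass)
```

Unlike the stylesheet, this sketch skips atoms that lack either piece of information instead of printing a partly empty line.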

Check the elements.xml source code, and it will be clear how to extract other elemental properties. Good luck!

Monday, September 11, 2006

InChI's in CML

I've seen some incorrect uses of <identifier> for adding InChI's in CML. The bad code I have seen in the wild looks like:

<molecule>
<identifier convention="iupac:inchi">InChI=1/CH4/h1H4</identifier>
</molecule>

However, the <identifier> element does not allow content other than elements. In XML terms: it does not allow mixed content. You might wonder why it allows any element. This is because the first InChI betas had an XML syntax (InChI still has an XML syntax too).

If you want the one-line InChI format most of us know, it needs to be put into the @value attribute. Thus:

<molecule>
<identifier convention="iupac:inchi" value="InChI=1/CH4/h1H4"/>
</molecule>
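Code that reads InChIs from CML should therefore look at the @value attribute. A minimal Python sketch, with a made-up function name and assuming un-namespaced elements as in the fragments above:

```python
import xml.etree.ElementTree as ET


def read_inchi(xml_text):
    """Return the InChI stored in @value of an <identifier>, or None."""
    root = ET.fromstring(xml_text)
    for identifier in root.iter("identifier"):
        if identifier.get("convention") == "iupac:inchi":
            return identifier.get("value")
    return None


good = """<molecule>
  <identifier convention="iupac:inchi" value="InChI=1/CH4/h1H4"/>
</molecule>"""

bad = """<molecule>
  <identifier convention="iupac:inchi">InChI=1/CH4/h1H4</identifier>
</molecule>"""

print(read_inchi(good))
print(read_inchi(bad))
```

A reader written this way finds nothing in the incorrect variant, which is one practical reason to get the markup right.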


Sunday, September 03, 2006

Scalar data types

The CML <scalar> element has an attribute @dataType which allows one to state the type of the scalar value. Consider the following fragment, which could define a nick name for an atom:

<scalar dataType="xsd:string" title="NickName">carbonRulez</scalar>

The CML schema suggests to use the XML Schema Datatypes, as defined in the XML Schema Part 2: Datatypes specification. For these data types, commonly the xsd prefix is used, as in the above example.

Now, this specification allows restricting the content of the string in many ways. For example, that it defines a year:

<scalar dataType="xsd:year" title="DateOfDiscovery">1891</scalar>

The specification defines fourteen primitive datatypes: string, boolean, float, double, decimal, timeDuration, recurringDuration, binary, uriReference, ID, IDREF, ENTITY, NOTATION, and QName. Of these, the first five have obvious meaning. The specification also defines a number of derived datatypes, which include these useful types: language, integer, positiveInteger, negativeInteger, date, month and year.

The CML schema also defines a large set of custom data types. For example, it defines an elementTypeType which derives from the xsd:string and allows only the one/two/three character element name abbreviations ("C", "N", etc), and Dummy, Du and R.

The use of datatypes in XML Schemas is really powerful, and the CML schema allows defining custom datatypes if needed. But I'll save that for a later time.

Saturday, September 02, 2006

CML blog from the source

Peter Murray-Rust started a blog on CML too. I do plan to continue to give small CML tips and comments in this blog.

Monday, August 21, 2006

Crystal structures in Jmol

The following CML shows how unit cell information can be represented. By using the titles 'a', 'b', 'c', 'alpha', 'beta' and 'gamma', Jmol is able to read the unit cell properties:

<?xml version="1.0"?>
<molecule xmlns="http://www.xml-cml.org/schema">
<crystal>
<scalar title="a">8.781</scalar>
<scalar title="b">8.845</scalar>
<scalar title="c">10.171</scalar>
<scalar title="alpha">90.00</scalar>
<scalar title="beta">101.34</scalar>
<scalar title="gamma">90.00</scalar>
</crystal>
<atomArray>
<atom id="Fe1" xFract="1.0" yFract="0.0" zFract="1.0" elementType="Fe"/>
<atom id="N1" xFract="0.7619" yFract="0.0548" zFract="0.9299" elementType="N"/>
</atomArray>
</molecule>

However, if you want to make use of the full power of CML, you would not use the @title attributes on the <crystal>'s <scalar> elements, but use @dictRef instead. Dictionary references make the semantics complete: they say explicitly what you mean with 'a', or 'beta'. The CML project has set up a simple CML dictionary (namespaced: http://www.xml-cml.org/dict/cmlDict) which contains the concepts for these six unit cell properties. The example then becomes:

<?xml version="1.0"?>
<molecule xmlns="http://www.xml-cml.org/schema">
<crystal xmlns:cmldict="http://www.xml-cml.org/dict/cmlDict">
<scalar dictRef="cmldict:a">8.781</scalar>
<scalar dictRef="cmldict:b">8.845</scalar>
<scalar dictRef="cmldict:c">10.171</scalar>
<scalar dictRef="cmldict:alpha">90.00</scalar>
<scalar dictRef="cmldict:beta">101.34</scalar>
<scalar dictRef="cmldict:gamma">90.00</scalar>
</crystal>
<atomArray>
<atom id="Fe1" xFract="1.0" yFract="0.0" zFract="1.0" elementType="Fe"/>
<atom id="N1" xFract="0.7619" yFract="0.0548" zFract="0.9299" elementType="N"/>
</atomArray>
</molecule>

Jmol currently does not recognize, and thus ignores, this dictionary (bug #1543784).
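A reader that supports both variants has to resolve the prefix of each @dictRef to its namespace and fall back to @title otherwise. The following Python sketch shows one way to do that with the standard library. The function names are hypothetical, and as a simplification it remembers every prefix declaration seen so far rather than tracking where a declaration goes out of scope.

```python
import io
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
CML_DICT_NS = "http://www.xml-cml.org/dict/cmlDict"
CELL_PARAMETERS = ("a", "b", "c", "alpha", "beta", "gamma")


def parameter_name(scalar, prefixes):
    """Prefer @dictRef (resolved via its prefix); fall back to @title."""
    dict_ref = scalar.get("dictRef", "")
    if ":" in dict_ref:
        prefix, local = dict_ref.split(":", 1)
        if prefixes.get(prefix) == CML_DICT_NS:
            return local
    return scalar.get("title")


def read_unit_cell(xml_text):
    """Return a dict with the unit cell parameters found in <crystal>."""
    prefixes = {}  # prefix -> namespace URI, for declarations seen so far
    cell = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag == "{%s}crystal" % CML_NS:
            for scalar in payload.findall("{%s}scalar" % CML_NS):
                name = parameter_name(scalar, prefixes)
                if name in CELL_PARAMETERS:
                    cell[name] = float(scalar.text)
    return cell


WITH_DICTREF = """<?xml version="1.0"?>
<molecule xmlns="http://www.xml-cml.org/schema">
  <crystal xmlns:cmldict="http://www.xml-cml.org/dict/cmlDict">
    <scalar dictRef="cmldict:a">8.781</scalar>
    <scalar dictRef="cmldict:b">8.845</scalar>
    <scalar dictRef="cmldict:c">10.171</scalar>
    <scalar dictRef="cmldict:alpha">90.00</scalar>
    <scalar dictRef="cmldict:beta">101.34</scalar>
    <scalar dictRef="cmldict:gamma">90.00</scalar>
  </crystal>
</molecule>"""

WITH_TITLE = """<?xml version="1.0"?>
<molecule xmlns="http://www.xml-cml.org/schema">
  <crystal>
    <scalar title="a">8.781</scalar>
    <scalar title="beta">101.34</scalar>
  </crystal>
</molecule>"""

print(read_unit_cell(WITH_DICTREF))
print(read_unit_cell(WITH_TITLE))
```

Because the prefix is resolved against its namespace declaration, a file that binds a different prefix to the same dictionary namespace is read identically.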

Sunday, August 20, 2006

CML Explained

There seems to be a niche in explaining the use of Chemical Markup Language, so here goes. This blog will discuss frequently and less frequently asked questions.