CML Explained

Friday, September 18, 2009

Bond order information and aromaticity

CML allows you to mark a bond as single (S), double (D), or triple (T), but also as aromatic (A). The latter has the downside that one loses the localization of the double bonds, which is often used in 2D diagrams, e.g. in JChemPaint.

The CDK addresses this by storing the aromaticity as an additional property on the bond:

<bond atomRefs='a1 a2' order='D'>
  <bondType dictRef="cdk:aromaticBond"/>
</bond>
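
Reading this back is straightforward, because the localized order and the aromaticity flag live in different places. Below is a minimal Python sketch using only the standard library; the function name read_bond is made up for this example, and the fragment is parsed without a namespace declaration for brevity.

```python
import xml.etree.ElementTree as ET

# The bond fragment from above, without a namespace declaration for brevity.
BOND_XML = """
<bond atomRefs='a1 a2' order='D'>
  <bondType dictRef="cdk:aromaticBond"/>
</bond>
"""

def read_bond(bond):
    """Return (atom ids, localized order, aromatic flag) for a <bond> element."""
    atoms = bond.get("atomRefs").split()
    order = bond.get("order")
    # The aromaticity flag is a child element, so the order attribute
    # still carries the localized single/double information.
    aromatic = any(
        child.get("dictRef") == "cdk:aromaticBond"
        for child in bond.findall("bondType")
    )
    return atoms, order, aromatic

print(read_bond(ET.fromstring(BOND_XML)))
```

A 2D renderer can then draw the bond as a double bond, while an algorithm that cares about aromaticity looks at the flag.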

Sunday, September 23, 2007

Chemical Markup Language for spectra: the JCIM publication

Stefan Kuhn wrote a paper on the use of the Chemical Markup Language for markup of spectra: Chemical Markup, XML, and the World Wide Web. 7. CMLSpect, an XML Vocabulary for Spectral Data (doi:10.1021/ci600531a).

The abstract:

CMLSpect is an extension of Chemical Markup Language (CML) for managing spectral and other analytical data. It is designed to be flexible enough to contain a wide variety of spectral data. The paper describes the CMLElements used and gives practical examples for common types of spectra. In addition it demonstrates how different views of the data can be expressed and what problems still exist.

Various projects from the authors already use CMLSpect in production, such as the NMRShiftDB, Bioclipse, and JSpecView.

Wednesday, June 27, 2007

There can be only one (namespace)

Today I had a discussion about indexing CML files, such as extracting the molecular formula, author, title, compound names, etc., and we came to the point of CML namespaces. CML has seen a few in the past, listed below. However, only the latest should be used: http://www.xml-cml.org/schema.

http://www.xml-cml.org/schema
This is the namespace of the latest CML schema. This one must be used for creating any new CML file.

http://www.xml-cml.org/schema/cml2/core
This was the first namespace used in the XML Schema definition for CML. It has been used by at least OpenBabel and the CDK, but is outdated and should no longer be used.

http://www.xml-cml.org/dtd/cml1_0_1.dtd
I have no idea if this namespace has actually been used, but it is mentioned in this official FAQ. I think it might actually be a remnant from the early days of XML Schema: CML 1.0 (DOI:10.1021/ci990052b) was DTD based, and DTD has no concept like XML Namespaces. But namespaces are oh so useful, and are used in CML 2.0 (DOI:10.1021/ci0256541) applications like CMLRSS (DOI:10.1021/ci034244p). I am not aware of applications using this cml1_0_1.dtd CML namespace.
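
For an indexer it helps to know which of these namespaces a file actually uses. Here is a minimal Python sketch, assuming that looking at the root element is enough; the names classify and namespace_of are made up for this example.

```python
import xml.etree.ElementTree as ET

CURRENT_NS = "http://www.xml-cml.org/schema"
OUTDATED_NS = {
    "http://www.xml-cml.org/schema/cml2/core",
    "http://www.xml-cml.org/dtd/cml1_0_1.dtd",
}

def namespace_of(element):
    """Return the namespace URI of an element, or '' if it has none."""
    tag = element.tag  # ElementTree spells qualified names as '{uri}local'
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

def classify(xml_text):
    """Classify the root element's namespace as current, outdated or unknown."""
    ns = namespace_of(ET.fromstring(xml_text))
    if ns == CURRENT_NS:
        return "current"
    if ns in OUTDATED_NS:
        return "outdated"
    return "unknown"

print(classify('<molecule xmlns="http://www.xml-cml.org/schema"/>'))
print(classify('<molecule xmlns="http://www.xml-cml.org/schema/cml2/core"/>'))
print(classify('<molecule/>'))
```

Files that come back as outdated or unknown could then be skipped, or rewritten to the current namespace before indexing.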

Thursday, March 08, 2007

Chemical Formulas in CML

Molecular formulas are useful to have in CML, especially when the connectivity is not known. When working on Computer Aided Structure Elucidation (CASE) based on NMR and MS data, this is of particular interest, as the input I have is in CML.

CML provides a few options to store such formulas, all using the <formula> element (xsd). The first option is to use a concise formula, which is a formal specification in which every element that occurs in the formula is given once, and where the count is given explicitly (all whitespace separated):

<formula concise="C 5 H 10 O 2"/>
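
Because the concise form is so regular, it is easy to parse. The following is a minimal Python sketch, assuming only the element/count pairs described above and integer counts; the function name parse_concise is made up for this example.

```python
def parse_concise(concise):
    """Turn a concise formula like 'C 5 H 10 O 2' into {'C': 5, 'H': 10, 'O': 2}."""
    tokens = concise.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected element/count pairs, got: {concise!r}")
    counts = {}
    for symbol, count in zip(tokens[0::2], tokens[1::2]):
        if symbol in counts:
            # Every element is supposed to be listed only once.
            raise ValueError(f"element {symbol} occurs more than once")
        counts[symbol] = int(count)
    return counts

print(parse_concise("C 5 H 10 O 2"))
```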

Alternatively, you can use the @inline attribute, and refer to some convention, as in here:

<formula convention="iucr:_chemical_formula_structural" inline="Sn (C2 O4) K F"/>

More flexible formats are possible too, such as in this example:

<molecule id="CuprammoniumSulfate">
 <formula id="f2" title="[Cu(NH3)4]2+ SO42-]">
  <formula id="f21" formalCharge="2">
   <atomArray id="aa1" elementType="Cu" count="1"/>
   <formula id="f22" count="4">
    <atomArray elementType="N H" count="1 3"/>
   </formula>
  </formula>
  <formula id="f3" formalCharge="-2">
   <atomArray id="aa2" elementType="S O" count="1 4"/>
  </formula>
 </formula>
</molecule>

You can see it in action in the chem-file project.
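
To get from such a nested formula back to total element counts, the nested counts have to be multiplied out. Here is a minimal Python sketch, assuming that @count on a nested <formula> multiplies everything inside it (that is my reading of the example, not something spelled out here) and that all counts are integers; the function name total_counts is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FORMULA_XML = """
<molecule id="CuprammoniumSulfate">
 <formula id="f2" title="[Cu(NH3)4]2+ SO42-]">
  <formula id="f21" formalCharge="2">
   <atomArray id="aa1" elementType="Cu" count="1"/>
   <formula id="f22" count="4">
    <atomArray elementType="N H" count="1 3"/>
   </formula>
  </formula>
  <formula id="f3" formalCharge="-2">
   <atomArray id="aa2" elementType="S O" count="1 4"/>
  </formula>
 </formula>
</molecule>
"""

def total_counts(formula, multiplier=1):
    """Sum element counts over nested <formula> and <atomArray> elements."""
    multiplier *= int(formula.get("count", "1"))
    totals = Counter()
    for child in formula:
        if child.tag == "atomArray":
            # elementType and count are parallel, whitespace separated lists.
            symbols = child.get("elementType").split()
            counts = child.get("count").split()
            for symbol, count in zip(symbols, counts):
                totals[symbol] += multiplier * int(count)
        elif child.tag == "formula":
            totals.update(total_counts(child, multiplier))
    return totals

molecule = ET.fromstring(FORMULA_XML)
print(dict(total_counts(molecule.find("formula"))))
```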

Wednesday, February 14, 2007

Writing up CML conventions in LaTeX

CML is a rather flexible format, but it allows you to pin down its specifics in the form of conventions, as indicated by the @convention attribute. Such conventions can put extra restrictions on hierarchy and expected content. For example, we may define a 'simpleMolecule' convention which only allows <atomArray> and <bondArray> inside a <molecule>, and <atom> in <atomArray> and <bond> in <bondArray>:

<molecule convention="simpleMolecule">
  <atomArray>
    <atom id="c1" elementType="C" hydrogenCount="3"/>
    <atom id="n1" elementType="N" hydrogenCount="2"/>
  </atomArray>
  <bondArray>
    <bond atomRefs="c1 n1" order="S"/>
  </bondArray>
</molecule>
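
The hierarchy restrictions of such a convention can also be checked by a program. The sketch below is a minimal Python illustration of the hypothetical 'simpleMolecule' convention just described; the table ALLOWED_CHILDREN and the function violations are made up for this example, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Which child elements the hypothetical 'simpleMolecule' convention allows.
ALLOWED_CHILDREN = {
    "molecule": {"atomArray", "bondArray"},
    "atomArray": {"atom"},
    "bondArray": {"bond"},
    "atom": set(),
    "bond": set(),
}

def violations(element, path=""):
    """Return a list of messages for children that the convention forbids."""
    here = f"{path}/{element.tag}"
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    problems = []
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> is not allowed in {here}")
        else:
            problems.extend(violations(child, here))
    return problems

SIMPLE = """
<molecule convention="simpleMolecule">
  <atomArray>
    <atom id="c1" elementType="C" hydrogenCount="3"/>
    <atom id="n1" elementType="N" hydrogenCount="2"/>
  </atomArray>
  <bondArray>
    <bond atomRefs="c1 n1" order="S"/>
  </bondArray>
</molecule>
"""

print(violations(ET.fromstring(SIMPLE)))
```

An empty list means the document sticks to the allowed hierarchy. Restrictions on attribute content would need additional checks.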

Now, such conventions need to be specified. One way of doing this is writing it up in a LaTeX document, which would include explanations and likely code examples. To ensure that these code examples are actually valid CML, the following setup may be used, which separates code examples from the LaTeX source.

Including CML examples in the LaTeX source

Because the framework makes use of the LaTeX package listings, it cannot use the \include command to pull the code examples into the PDF. Instead, it uses a preprocessor that inserts the examples. This is the directory layout I have:

.
|-- Makefile
|-- spec.tex.in
|-- examples
|   |-- Makefile
|   |-- schema.xsd
|   `-- simple1d.valid.cml
`-- preproces.pl

The Makefile creates the PDF from the LaTeX source by first running preproces.pl on the .tex.in file, and then running pdflatex on the created .tex file:

all: spec.pdf

spec.pdf: spec.tex
        pdflatex spec.tex
        pdflatex spec.tex
        pdflatex spec.tex

spec.tex: spec.tex.in
        perl preproces.pl < spec.tex.in > spec.tex

The LaTeX source in the .tex.in file looks like:


\begin{lstlisting}[language=XML,
  caption={Simple 1D $^{13}$C NMR spectrum.},
  label={list:simple1d}]
% INPUT: simple1d.valid.cml
\end{lstlisting}

The string "% INPUT:" is picked up by the Perl script to include that file. The full script looks like:

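#!/usr/bin/perl

use diagnostics;
use strict;

# Copy STDIN to STDOUT, replacing each "% INPUT: <file>" line
# with the content of examples/<file>.
while (my $line = <STDIN>) {
  if ($line =~ /^\%\sINPUT:\s(.*)/) {
    my $file = $1;
    die "Cannot find file 'examples/$file' to insert!\n" if (!(-e "examples/$file"));
    open (my $input, "<", "examples/$file") or die "Cannot open 'examples/$file': $!\n";
    while (<$input>) { print STDOUT $_; };
    close ($input);
  } else {
    print STDOUT $line;
  }
}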
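
For those who prefer Python over Perl, the same substitution can be sketched as follows. This is not the script I use, just a minimal illustration of the same idea; the function name expand_inputs and its examples_dir parameter are made up, and the demo writes a throwaway example file to a temporary directory.

```python
import re
import tempfile
from pathlib import Path

# Same marker as in the Perl script: "% INPUT: <file>" at the start of a line.
INPUT_LINE = re.compile(r"^%\sINPUT:\s(.*)")

def expand_inputs(lines, examples_dir="examples"):
    """Yield lines, replacing each '% INPUT: <file>' line by that file's content."""
    for line in lines:
        match = INPUT_LINE.match(line)
        if match is None:
            yield line
            continue
        path = Path(examples_dir) / match.group(1).strip()
        if not path.is_file():
            raise FileNotFoundError(f"Cannot find file '{path}' to insert!")
        with path.open() as example:
            yield from example

# Small demo with a temporary examples directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "simple1d.valid.cml").write_text("<spectrum/>\n")
    source = [
        "\\begin{lstlisting}\n",
        "% INPUT: simple1d.valid.cml\n",
        "\\end{lstlisting}\n",
    ]
    print("".join(expand_inputs(source, examples_dir=tmp)), end="")
```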

CML Validation

Now that the examples are split out, but tightly integrated in the LaTeX source, validation of the CML examples is easy. I use a simple Makefile for that, which makes use of xmllint:


all: validate

validate: *.valid.cml
        @for f in *.valid.cml; do \
          echo "** Validating $${f} against XML Schema..."; \
          xmllint --noout --schema schema.xsd $${f}; \
        done

Wednesday, November 01, 2006

CML id's picky on allowed characters

Yesterday, I was hacking on the atom type lists in the CDK (MM2, MMFF94, PDB and a few custom ones) which the project has stored in CML. The files had not been updated to CML 2.3, which is the most recent specification we use in the CDK (see bug #1562342).

When doing this I found a problem which I have seen more often: the MM2 and MMFF94 atom type lists contained '=' and other forbidden characters in the <atomType> @id's. The content of this @id attribute is defined by the idType in the XML Schema for CML 2.3. It extends the XML Schema datatype xsd:string, but is also required to match this regular expression: [A-Za-z][A-Za-z0-9\.\-_]*. That is, it has to start with an upper or lower case letter, and can then only contain alphanumerics, dots (.), hyphens (-), and underscores (_).

Now, the atom type lists used identifiers taken from the MMFF94 and MM2 force field atom type lists, giving code like:
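
Checking candidate identifiers against this pattern takes only a few lines. Below is a minimal Python sketch; the function name is_valid_id is made up, and it uses fullmatch because XML Schema patterns always have to match the whole value.

```python
import re

# Pattern quoted above from the idType definition in the CML 2.3 schema.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9\.\-_]*")

def is_valid_id(value):
    """Return True if the whole value matches the idType pattern."""
    return ID_PATTERN.fullmatch(value) is not None

for candidate in ("N=C", "NdoubleBondedC", "c1", "1c", "a.b-c_d"):
    print(candidate, is_valid_id(candidate))
```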


<atomType id="N=C">
  <!-- in imines -->
  <atom elementType="N" formalCharge="0">
    <scalar dataType="xsd:double" dictRef="cdk:maxBondOrder">2.0</scalar>
    <scalar dataType="xsd:double" dictRef="cdk:bondOrderSum">3.0</scalar>
    <scalar dataType="xsd:integer" dictRef="cdk:formalNeighbourCount">2</scalar>
    <scalar dataType="xsd:integer" dictRef="cdk:valency">5</scalar>
  </atom>
  <scalar dataType="xsd:string" dictRef="cdk:hybridization">sp2</scalar>
  <scalar dataType="xsd:string" dictRef="cdk:DA">A</scalar>
  <scalar dataType="xsd:string" dictRef="cdk:sphericalMatcher">N-[1-3];[H]{0,2}+[A-Za-z]*+=[CN].*+</scalar>
</atomType>

The solution is to put the original identifier, "N=C" in this example, into a <label> element and fix the @id:


<atomType id="NdoubleBondedC">
  <!-- in imines -->
  <label value="N=C"/>
</atomType>

This did require fixing the atom type perception code too, because it was using the original identifiers to look up atom types.
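
One way to keep such lookups working is to index the atom types by their label instead of their @id. The following minimal Python sketch illustrates the idea and is not the actual CDK code; the <atomTypeList> wrapper, the second atom type, and the function name ids_by_label are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down atom type list for illustration only.
ATOM_TYPES_XML = """
<atomTypeList>
  <atomType id="NdoubleBondedC">
    <label value="N=C"/>
  </atomType>
  <atomType id="Csp3">
    <label value="C"/>
  </atomType>
</atomTypeList>
"""

def ids_by_label(root):
    """Map the original force field identifier (the label) to the new @id."""
    lookup = {}
    for atom_type in root.iter("atomType"):
        label = atom_type.find("label")
        if label is not None:
            lookup[label.get("value")] = atom_type.get("id")
    return lookup

print(ids_by_label(ET.fromstring(ATOM_TYPES_XML)))
```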

Sunday, October 22, 2006

Creating plain text from CML source

Recently, the fourth release of the Blue Obelisk Data Repository (BODR) was published. This data repository provides basic data about elements and isotopes, such as high resolution mass information. The data is taken from literature, and the repository is used by several projects, like OpenBabel, Kalzium, and the CDK.

The data is distributed in CML format to ensure semantic markup, but in some cases it might be easier to work with text files containing bits of information. This post shows how XSLT can be used to create such plain text files from the XML source.

Consider the below XSLT code, which extracts information from the elements.xml in the BODR distribution:


<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:cml="http://www.xml-cml.org/schema">

  <xsl:output method="text"/>

  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="text()"/>

  <xsl:template match="cml:atom">
    <xsl:value-of select="cml:label[@dictRef='bo:symbol']/@value"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="cml:scalar[@dictRef='bo:exactMass']"/>
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>

XSLT is the transformation specification of XSL, and provides means to convert XML documents into other documents, either XML or plain text. The above XSLT creates plain text, as indicated by the <xsl:output> element.

The three templates define how the XML source is transformed. The first template tells the processor to apply templates for all elements in the XML source (match="*"). The second template matches all text nodes in the XML source, and ensures that all text from the source is ignored, so that we can control the output in detail.

The third template is the most interesting one; it defines which information from the source XML is extracted and written to the plain text file. The <xsl:value-of> elements define which content from the XML source is sent to the output. The above XSLT contains two such elements, and it extracts the element symbol and its exact mass. The logic is defined by XPath statements given in the @match attribute of the <xsl:template> element and the @select attribute of the <xsl:value-of> element.

This third template applies to all <atom> elements in the CML source, as indicated by the @match attribute. From this element, the <xsl:value-of> elements extract the element symbol using the XPath query cml:label[@dictRef='bo:symbol']/@value. Or, in natural language: take the content of the @value attribute, of the <cml:label> element which has an @dictRef attribute of which the content is 'bo:symbol'. The second query is cml:scalar[@dictRef='bo:exactMass'], and says: take the content of the <cml:scalar> element which has an @dictRef attribute of which the content is 'bo:exactMass'.
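
The same two queries also work outside XSLT. Here is a minimal Python sketch using the limited XPath support of the standard library; the sample document is hand-written for this example and only mimics the structure the queries expect (it is not copied from elements.xml), and the function name symbol_and_mass_lines is made up.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written sample that mimics the structure the two queries expect.
SAMPLE = """
<list xmlns="http://www.xml-cml.org/schema">
  <atom id="H">
    <label dictRef="bo:symbol" value="H"/>
    <scalar dictRef="bo:exactMass">1.007825032</scalar>
  </atom>
  <atom id="He">
    <label dictRef="bo:symbol" value="He"/>
    <scalar dictRef="bo:exactMass">4.002603254</scalar>
  </atom>
</list>
"""

def symbol_and_mass_lines(root):
    """Return one 'symbol mass' line per <atom>, like the stylesheet does."""
    lines = []
    for atom in root.iter("{http://www.xml-cml.org/schema}atom"):
        label = atom.find("cml:label[@dictRef='bo:symbol']", NS)
        mass = atom.find("cml:scalar[@dictRef='bo:exactMass']", NS)
        if label is None or mass is None:
            continue  # unlike the XSLT, skip atoms that lack either value
        lines.append(f"{label.get('value')} {mass.text}")
    return lines

print("\n".join(symbol_and_mass_lines(ET.fromstring(SAMPLE))))
```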

Check the elements.xml source code, and it will be clear how to extract other elemental properties. Good luck!
