Sunday, September 23, 2007

Chemical Markup Language for spectra: the JCIM publication

Stefan Kuhn wrote a paper on the use of the Chemical Markup Language for markup of spectra: Chemical Markup, XML, and the World Wide Web. 7. CMLSpect, an XML Vocabulary for Spectral Data (doi:10.1021/ci600531a).

The abtract:
    CMLSpect is an extension of Chemical Markup Language (CML) for managing spectral and other analytical data. It is designed to be flexible enough to contain a wide variety of spectral data. The paper describes the CMLElements used and gives practical examples for common types of spectra. In addition it demonstrates how different views of the data can be expressed and what problems still exist.

Various projects from the authors already put CMLSpect in production use, such as the NMRShiftDB, Bioclipse, JSpecView.

Wednesday, June 27, 2007

There can be only one (namespace)

Today I had a discussion with indexing CML files, such as extraction of molecular formula, author, title, compound names etc. And we came to the point of CML namespaces. And CML has seen a few in the past, listed below. However, only the latest should be used:
This is the namespace of the latest CML schema. This one must be used for creating any new CML file.
This was the first namespace used in the XML Schema definition for CML. It has been used by at least OpenBabel and the CDK, but is outdated and should no longer be used.
I have no idea if this namespace has actually been used, but it is mentioned in this official FAQ. I think it might actually be remnant from the early days of XML Schema: CML 1.0 (DOI:10.1021/ci990052b) was DTD based, and DTD has no concept like XML Namespaces. But namespaces are oh so useful, and used in CML 2.0 (DOI:10.1021/ci0256541) applications like CMLRSS (DOI:10.1021/ci034244p). I am not aware of applications using this cml1_0_1.dtd CML namespace.

Thursday, March 08, 2007

Chemical Formulas in CML

Molecular formulas are useful to have in CML, especially when connectivity is not know. Working on Computer Aided Structure Elucidation (CASE) based on NMR and MS data, this is of particular interest, as the input I have is in CML.

CML provides a few options to contain such, all using the <formula> element (xsd). The first option is to use a concise formula, which is a formal specification with all elements that occur on the formula given once, and where the count is explicitly given (all whitespace separated):
<formula concise="C 5 H 10 O 2"/>
Alternatively, you can use the @inline attribute, and refer to some convention, as in here:
<formula convention="iucr:_chemical_formula_structural" inline="Sn (C2 O4) K F"/>

More flexible formats are possible too, such as in this example:
<molecule id="CuprammoniumSulfate">
<formula id="f2" title="[Cu(NH3)4]2+ SO42-]">
<formula id="f21" formalCharge="2">
<atomArray id="aa1" elementType="Cu" count="1"/>
<formula id="f22" count="4">
<atomArray elementType="N H" count="1 3"/>
<formula id="f3" formalCharge="-2">
<atomArray id="aa2" elementType="S O" count="1 4"/>

You can see it in action in the chem-file project

Wednesday, February 14, 2007

Writing up CML conventions in LaTeX

CML is a rather flexible format, which allows, however, to specify it's specifics in form of conventions, as specified in the @convention attribute. Such conventions can put extra restrictions on hierarchy and expected content. For example, we may define a 'simpleMolecule' convention which only allows <atomArray> and <bondArray> inside a <molecule>, and <atom> in <atomArray> and <bond> in <bondArray>:
<molecule convention="simpleMolecule">
<atom id="c1" elementType="C" hydrogenCount="3">
<atom id="n1" elementType="N" hydrogenCount="2">
<bond atomRefs="c1 n1" order="s"/>

Now, such conventions need to be specified. One way of doing this is writing it up in a LaTeX document, which would include explanations and likely code examples. To ensure that these code examples are actually valid CML, the following setup may be used, which separates code examples from the LaTeX source.

Including CML examples in the LaTeX source

Because the framework makes use of the LaTeX package listings, it cannot make use of the \include command to include the code examples into the PDF. Instead, it makes use of a preprocessor that includes the examples instead. This is the directory layout I have:
|-- Makefile
|-- examples
| |-- Makefile
| |-- schema.xsd
| |-- simple1d.valid.xml

The Makefile creates the PDF from the LaTeX source by first running on the file, and then running pdflatex on the created .tex file:
all: spec.pdf

spec.pdf: spec.tex
pdflatex spec.tex
pdflatex spec.tex
pdflatex spec.tex

perl < > spec.tex

The LaTeX source in the file looks like:

caption={Simple 1D ${^13}C$ NMR spectrum.},
% INPUT: simple1d.valid.cml

The string "% INPUT:" is picked up by the Perl script to include that file. The full script looks like:

use diagnostics;
use strict;

while (my $line = <STDIN>) {
if ($line =~ /^\%\sINPUT:\s(.*)/) {
my $file = $1;
die "Cannot find file 'examples/$file' to insert!\n" if (!(-e "examples/$file"));
open (INPUT, "<examples/$file");
while (<INPUT>) { print STDOUT $_; };
} else {
print STDOUT $line;

CML Validation

Now that the examples are split out, but tightly integrated in the LaTeX source, validation of the CML examples is easy. I use a simple Makefile for that, which makes use of xmllint:

all: validate

validate: *.valid.cml
@for f in *.valid.cml; do \
echo "** Validating $${f} against XML Schema..."; \
xmllint --noout --schema schema.xsd $${f}; \