Wednesday, November 01, 2006

CML id's picky on allowed characters

Yesterday, I was hacking on the atom type lists in the CDK (MM2, MMFF94, PDB and a few custom ones), which the project stores as CML. The files had not been updated to CML 2.3, the most recent specification we use in the CDK (see bug #1562342).

While doing this I ran into a problem I have seen before: the MM2 and MMFF94 atom type lists contained '=' and other forbidden characters in the <atomType> @id's. The content of this @id attribute is defined by the idType in the XML Schema for CML 2.3. It derives from the XML Schema datatype xsd:string, but is also required to match this regular expression: [A-Za-z][A-Za-z0-9\.\-_]*. That is, it has to start with an upper or lower case letter, followed by any number of alphanumerics, dots (.), hyphens (-), or underscores (_).
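To make the rule concrete, here is a minimal sketch in Python that checks candidate identifiers against the pattern quoted above. The function name is_valid_cml_id and the sample identifiers other than "N=C" and "NdoubleBondedC" are made up for illustration. XML Schema patterns have to match the whole value, so the sketch uses fullmatch rather than a partial search.

```python
import re

# The idType pattern as quoted in the post.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9\.\-_]*")


def is_valid_cml_id(candidate: str) -> bool:
    """Return True if the whole string matches the idType pattern."""
    # XML Schema patterns are implicitly anchored, hence fullmatch().
    return ID_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    # "C.sp2" and "2C" are hypothetical identifiers used only as test input.
    for candidate in ["NdoubleBondedC", "N=C", "C.sp2", "2C"]:
        verdict = "ok" if is_valid_cml_id(candidate) else "rejected"
        print(f"{candidate}: {verdict}")
```

A check like this is handy for scanning a list of ids before running a full schema validation, but it is not a replacement for validating against the CML schema itself.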

Now, the atom type lists used to take their identifiers directly from the MMFF94 and MM2 force fields, giving code like:

<atomType id="N=C">
<!-- in imines -->
<atom elementType="N" formalCharge="0">
<scalar dataType="xsd:double" dictRef="cdk:maxBondOrder">2.0</scalar>
<scalar dataType="xsd:double" dictRef="cdk:bondOrderSum">3.0</scalar>
<scalar dataType="xsd:integer" dictRef="cdk:formalNeighbourCount">2</scalar>
<scalar dataType="xsd:integer" dictRef="cdk:valency">5</scalar>
<scalar dataType="xsd:string" dictRef="cdk:hybridization">sp2</scalar>
<scalar dataType="xsd:string" dictRef="cdk:DA">A</scalar>
<scalar dataType="xsd:string" dictRef="cdk:sphericalMatcher">N-[1-3];[H]{0,2}+[A-Za-z]*+=[CN].*+</scalar>
</atom>
</atomType>

The solution is to put the original identifier, "N=C" in this example, into a <label> element and fix the @id:

<atomType id="NdoubleBondedC">
<!-- in imines -->
<label value="N=C"/>
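The same rewrite could be scripted for a whole file. The following is only a sketch under a few assumptions: the mapping NEW_IDS and the function fix_atom_type are hypothetical names, the input is a stripped-down element without XML namespaces (real CML files declare them), and the new ids are chosen by hand rather than derived automatically.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written mapping from original names to schema-valid ids.
NEW_IDS = {"N=C": "NdoubleBondedC"}


def fix_atom_type(atom_type, new_ids):
    """Move the original id into a <label> child and set a valid id."""
    old_id = atom_type.get("id")
    if old_id not in new_ids:
        return  # nothing to fix for this atom type
    atom_type.set("id", new_ids[old_id])
    # Keep the force field name around so lookups by that name still work.
    label = ET.Element("label", {"value": old_id})
    atom_type.insert(0, label)


# Simplified input without namespaces, for illustration only.
snippet = '<atomType id="N=C"><atom elementType="N" formalCharge="0"/></atomType>'
atom_type = ET.fromstring(snippet)
fix_atom_type(atom_type, NEW_IDS)
print(ET.tostring(atom_type, encoding="unicode"))
```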

This also required fixing the atom type perception code, because it used the original identifiers to look up atom types.
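For illustration, a lookup that keys on the label instead of the @id might look like the sketch below. This is not the CDK's actual perception code (which is Java); index_by_label is a hypothetical helper, the toy document omits namespaces, and the second atom type ("Csp3") is invented to show the fallback to the @id when no <label> is present.

```python
import xml.etree.ElementTree as ET

# Toy atom type list; the "Csp3" entry is made up and has no label.
DOC = """<atomTypeList>
  <atomType id="NdoubleBondedC"><label value="N=C"/></atomType>
  <atomType id="Csp3"/>
</atomTypeList>"""


def index_by_label(root):
    """Map the original force field name (label, else id) to its element."""
    index = {}
    for atom_type in root.iter("atomType"):
        label = atom_type.find("label")
        # Fall back to the id for atom types that never needed renaming.
        key = label.get("value") if label is not None else atom_type.get("id")
        index[key] = atom_type
    return index


atom_types = index_by_label(ET.fromstring(DOC))
print(atom_types["N=C"].get("id"))
```

With an index like this, code that still asks for "N=C" keeps working while the file itself only contains schema-valid ids.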