ConcreteChemViz: February 2007

PDB heuristics

The PDB file format has many advantages: it is well documented and it can describe many different kinds of information. But it also has a lot of shortcomings, not least the fact that you need heuristics to extract the real structure stored inside.

I just stumbled upon the paper "PDB: Cruft to Content", which collects a long list of heuristics needed to extract the real structure described by a PDB file while avoiding the errors and inconsistencies present in many files deposited in the Protein Data Bank.

As soon as possible, I will implement some of them in my STM3 PDB file reader.
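To give a flavor of what such a heuristic looks like, here is a minimal Python sketch that guesses the element of an atom when the element columns are empty. It is not taken from the paper and it is not the STM3 reader code; the function name `guess_element` is made up for illustration. It assumes the usual PDB column layout (atom name in columns 13-16, element symbol in columns 77-78) and the common convention that one-letter elements start in column 14, so " CA " is an alpha carbon while "CA  " is calcium.

```python
def guess_element(line):
    """Guess the element symbol of a PDB ATOM/HETATM record (heuristic)."""
    # Columns 77-78 hold the element symbol when the writer filled them in.
    element = line[76:78].strip()
    if element.isalpha():
        return element.capitalize()

    # Otherwise fall back to the atom name in columns 13-16.
    name = line[12:16].ljust(4)
    prefix = name[:2]
    # Two letters starting in column 13 and a name shorter than four
    # characters: assume a two-letter element ("CA  " -> Ca, "FE  " -> Fe).
    if prefix.isalpha() and len(name.strip()) < 4:
        return prefix.capitalize()

    # Everything else: take the first letter found in the name.
    # This covers " CA " (carbon), "1HB " and "HD11" (hydrogen).
    for ch in name:
        if ch.isalpha():
            return ch.upper()
    return None


if __name__ == "__main__":
    alpha_carbon = "ATOM  " + "    1" + " " + " CA " + " " + "ALA"
    calcium = "HETATM" + "    2" + " " + "CA  " + " " + " CA"
    print(guess_element(alpha_carbon), guess_element(calcium))
```

The sketch is deliberately naive: a file that writes a hydrogen name left-justified, such as "HG1 ", would be read as mercury, which is exactly the kind of inconsistency that makes a longer list of heuristics necessary.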

Labels: data formats

Documented file formats

Where is the ultimate documentation on chemistry file formats? Here is what I have found:

MOL2 - It comes from a commercial company, so it is well documented
Gaussian cube - There is a note in the VMD documentation that slightly changes the meaning of some of the information contained in the file
XYZ - Again, the VMD molfile page is a gold mine of file format descriptions
MOL - Well documented
PDB - Well documented
PDB-Q - An old version of PDB, prior to 1992. Should be documented somewhere...
CHGCAR - The insight into the format comes not from the usual VMD page but from reading the VMD reader plugin code directly
DCD - I remember a difference between what has been described and the real file format
DL_POLY - The HISTORY file is well documented in the manual
Gulp
POSCAR
Shel-X
Siesta - Seems to be no longer active
XDATCAR

Labels: data formats

Misplaced creativity in file formats

I just finished revamping my chemistry data readers for STM4. And again I am wondering why so much creativity is misplaced in inventing new, similarly unparseable file formats!

Here are the formats STM4 supports for now. After the list I try to collect some of the nice (groan!) features of each of them. Why don't chemistry program developers try to converge on a single format?

CHGCAR - VASP format that also contains volume data
Gaussian cube - Gaussian format that also contains volume data
DCD - FMD, adds trajectory data to PDB
DL_POLY - DL_POLY HISTORY file
FpStudio - FullProf Suite
Gulp - GULP input file
MOL - From MDL
MOL2 - From Tripos
PDB - Protein Data Bank
PDB-Q - An old version of PDB
POSCAR - VASP, also as concatenated file
SHEL-X - Shel-X crystallography program
Siesta - Siesta
XDATCAR - Another VASP animated format
XYZ - The simple xyz format, animated also
XYZ plus unit cell - idem plus a supporting file containing the unit cell
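As an illustration of how little is needed when a format stays simple, here is a minimal Python sketch of a reader for animated XYZ files. It is not the STM4 reader; the name `read_xyz_frames` is made up, and it assumes the usual XYZ convention: a line with the atom count, a comment line, then one "symbol x y z" line per atom, with the frames of an animation simply concatenated.

```python
def read_xyz_frames(lines):
    """Read all frames of a (possibly animated) XYZ file.

    `lines` is any iterable of text lines. Returns a list of
    (comment, atoms) tuples, where atoms is a list of (symbol, x, y, z).
    """
    frames = []
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between frames
        n_atoms = int(header)
        comment = next(it).rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            # Extra columns after z (charges, velocities...) are ignored.
            symbol, x, y, z = next(it).split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append((comment, atoms))
    return frames


if __name__ == "__main__":
    text = """2
frame 0
O 0.0 0.0 0.0
H 0.0 0.0 0.96
2
frame 1
O 0.0 0.0 0.0
H 0.0 0.0 1.00
"""
    for comment, atoms in read_xyz_frames(text.splitlines()):
        print(comment, len(atoms))
```

Note that nothing in the file says where the unit cell is, which is why the "XYZ plus unit cell" variant needs a supporting file.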

And now some of the complaints:

CHGCAR - Can contain one or two sets of volumetric data. Why should it be so difficult and unreliable to find the start of the second block? And why does the division of the values by the cell volume depend on the file name? And why is there no sensible file extension? And there is no atom type in the file.
Gaussian cube - Measurement units: Angstrom or Bohr?
DCD - It is a binary format, so it almost works
FpStudio - Why two structures in the same file, with different methods to describe symmetry? Why does it contain rendering options mixed with structural options?
Gulp - It is more of a human-readable input format
MOL - At least it is documented, but why does it use fixed-width numeric fields?
MOL2 - Documented
PDB - Atom numbers in fixed-width fields, number of atoms limited to 99999, creativity a go-go in the atom name field (obviously without putting the element type in the appropriate field)
PDB-Q - A column is 10 bytes shorter, so another reader is needed
POSCAR - Simple format, but the atom types are not in the file
SHEL-X - No big problems, except understanding symmetry definitions
Siesta - No problems
XDATCAR - No problems
XYZ - No problem
XYZ plus unit cell - No problem
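The fixed-width complaint about PDB can be made concrete with a small Python sketch. It is not the STM4 reader; the helper names are made up, and it assumes the standard PDB column layout for ATOM records (serial in columns 7-11, name in 13-16, residue name in 18-20, x, y, z in 31-38, 39-46 and 47-54). The point is that fields can touch each other, so the record must be sliced by column and cannot be split on whitespace.

```python
def make_atom_line(serial, name, res_name, res_seq, x, y, z):
    """Build the first 54 columns of a PDB ATOM record (illustration only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} A{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line):
    """Parse an ATOM record by column position, not by whitespace."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


if __name__ == "__main__":
    line = make_atom_line(1, " CA", "ALA", 1, -100.123, -200.456, 3.0)
    print(line)
    # Whitespace splitting glues x and y into a single token.
    print(line.split())
    # Column slicing recovers the three coordinates correctly.
    print(parse_atom_line(line))
```

The same layout explains the 99999 limit: the serial number has only five columns, so a writer that formats 100000 with a five-character field pushes every following column one position to the right.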

The saga continues...

Labels: chemistry visualization, data formats

ConcreteChemViz

February 17, 2007

PDB heuristics

February 05, 2007

Documented file formats

February 02, 2007

Misplaced creativity in file formats
