February 17, 2007

PDB heuristics

The PDB file format has many advantages: it is well documented and it can describe many different kinds of information. But it also has a lot of shortcomings, not least the fact that you need heuristics to extract the real structure stored inside.

I just stumbled upon the paper "PDB: Cruft to Content", which collects a long list of heuristics needed to extract the real structure described by a PDB file while avoiding the errors and inconsistencies present in many files deposited in the Protein Data Bank.

As soon as possible, I will implement some of them in my STM3 PDB file reader.
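To give an idea of what such a heuristic looks like, here is a minimal Python sketch that guesses the element of an atom when the element field (columns 77-78) is empty. It is not taken from the paper and it is not the STM3 code. It only relies on the usual PDB convention that the element symbol is right-justified in columns 13-14 of the atom name, with four-character hydrogen names as the classic exception. The names guess_element and make_line are invented for this example.

```python
def guess_element(line: str) -> str:
    """Guess the element symbol for one ATOM/HETATM line (heuristic sketch)."""
    line = line.rstrip("\n").ljust(80)

    # Preferred source: the element field, columns 77-78.
    element = line[76:78].strip()
    if element.isalpha():
        return element.capitalize()

    # Fallback: the atom name, columns 13-16.
    name = line[12:16]

    # Four-character hydrogen names such as "HD11" start in column 13.
    if name[0] == "H" and len(name.strip()) == 4:
        return "H"

    # Otherwise the symbol sits in columns 13-14; drop blanks and digits
    # (old files write names like "1HB ").
    symbol = "".join(ch for ch in name[:2] if ch.isalpha())
    return symbol.capitalize()


def make_line(name: str, element: str = "") -> str:
    """Hypothetical helper: build a minimal ATOM line for the examples."""
    return f"{'ATOM':<6}{1:>5} {name:<4}".ljust(76) + f"{element:>2}"


print(guess_element(make_line(" CA ")))  # alpha carbon
print(guess_element(make_line("CA  ")))  # calcium
print(guess_element(make_line("HD11")))  # hydrogen
```

The weak point is evident: a hydrogen written as "HG1 " starting in column 13 would be read as mercury, which is exactly the kind of inconsistency that makes more heuristics necessary.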


February 05, 2007

Documented file formats

Where is the ultimate documentation on chemistry file formats? Here is what I have found:
  • MOL2 - It is from a commercial company so it is well documented
  • Gaussian cube - There is a note in the VMD documentation that slightly changes the meaning of some of the information contained in the file
  • XYZ - Again the VMD molfile page is a gold mine of file format descriptions
  • MOL - Well documented
  • PDB - Well documented
  • PDB-Q - An old version of PDB, prior to 1992. It should be documented somewhere...
  • CHGCAR - The insight about the format comes not from the usual VMD page but from reading the VMD reader plugin code directly
  • DCD - As I remember, the real file format differs from what has been described
  • DL_POLY - The HISTORY file is well documented in the manual
  • Gulp
  • POSCAR
  • Shel-X
  • Siesta - Seems to be no longer active
  • XDATCAR


February 02, 2007

Misplaced creativity in file formats

Just finished revamping my chemistry data readers for STM4. And again I'm wondering why so much creativity is misplaced in inventing new, similarly unparseable file formats!

Here are the formats STM4 supports for now. After the list I try to collect some of the nice (groan!) features of each of them. Why don't the developers of chemistry programs try to converge on a single format?

  • CHGCAR - VASP format that also contains volume data
  • Gaussian cube - Gaussian format that also contains volume data
  • DCD - FMD, adds trajectory data to PDB
  • DL_POLY - DL_POLY HISTORY file
  • FpStudio - FullProf Suite
  • Gulp - GULP input file
  • MOL - From MDL
  • MOL2 - From Tripos
  • PDB - Protein Data Bank
  • PDB-Q - An old version of PDB
  • POSCAR - VASP, also as concatenated file
  • SHEL-X - Shel-X crystallography program
  • Siesta - Siesta
  • XDATCAR - Another VASP animated format
  • XYZ - The simple xyz format, animated also
  • XYZ plus unit cell - The same, plus a supporting file containing the unit cell
And now some of the complaints:
  • CHGCAR - It can contain one or two sets of volumetric data. Why should it be so difficult and unreliable to find the start of the second block? And why is the division of the values by the cell volume based on the file name? And why is there no sensible file extension? And there is no atom type in the file.
  • Gaussian cube - Measurement units: Angstrom or Bohr?
  • DCD - It is a binary format, so it almost works
  • FpStudio - Why are there two structures in the same file, with different methods to describe symmetry? Why does it contain rendering options mixed with structural options?
  • Gulp - It is more of a human-readable input format
  • MOL - At least it is documented, but why does it use fixed-width numeric fields?
  • MOL2 - Documented
  • PDB - Atom numbers in fixed width fields, number of atoms limited to 99999, creativity a go-go in the atom name field (obviously without putting the element type in the appropriate field)
  • PDB-Q - A column is 10 bytes shorter, so another reader is needed
  • POSCAR - Simple format, but the kind of atoms is not in the file
  • SHEL-X - No big problems, except understanding symmetry definitions
  • Siesta - No problems
  • XDATCAR - No problems
  • XYZ - No problem
  • XYZ plus unit cell - No problem
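To show why the fixed-width fields in PDB are a nuisance, here is a minimal Python sketch (not the STM4 reader) that extracts a few fields of an ATOM record by column position. The Atom tuple, parse_atom and the example line are invented for illustration; the column ranges are the ones given in the PDB format documentation. Note the five-column serial number, which is where the 99999 atom limit comes from.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one ATOM/HETATM line by column position (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: at most 99999
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Made-up example line: the three coordinates fill their 8-column fields
# completely, so there is no blank between them.
example = (
    f"{'ATOM':<6}{99999:>5} {' CA ':<4} {'ALA':>3} {'A'}{123:>4}    "
    f"{-100.123:8.3f}{-200.456:8.3f}{1234.567:8.3f}"
)

atom = parse_atom(example)
print(atom)

# A whitespace split sees the three coordinates as a single token.
print(example.split()[-1])
```

Splitting on blanks works for most files and then fails on the first large structure, so reading by column is the only reliable way.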
The saga continues...
