February 02, 2007

Misplaced creativity in file formats

Just finished revamping my chemistry data readers for STM4. And again I'm wondering why so much creativity is misplaced in inventing new, similarly unparseable file formats!

Here are the formats STM4 supports for now. After the list I try to collect some of the nice (groan!) features of each of them. Why don't developers of chemistry programs try to converge on a single format?

  • CHGCAR - VASP format containing also volume data
  • Gaussian cube - Gaussian containing also volume data
  • DCD - FMD, adds trajectory data to a PDB file
  • DL_POLY - DL_POLY HISTORY file
  • FpStudio - FullProf Suite
  • Gulp - GULP input file
  • MOL - From MDL
  • MOL2 - From Tripos
  • PDB - Protein Data Bank
  • PDB-Q - An old version of PDB
  • POSCAR - VASP, also as concatenated file
  • SHEL-X - Shel-X crystallography program
  • Siesta - Siesta
  • XDATCAR - Another VASP animated format
  • XYZ - The simple xyz format, animated also
  • XYZ plus unit cell - idem plus a supporting file containing the unit cell
And now some of the complaints:
  • CHGCAR - It can contain one or two sets of volumetric data. Why should it be so difficult and unreliable to find the start of the second block? And why does dividing the values by the cell volume depend on the file name? And why does the file have no sensible extension? And there is no atom type in the file.
  • Gaussian cube - Measurement units: Angstrom or Bohr?
  • DCD - It is a binary format, so it almost works
  • FpStudio - Why are there two structures in the same file, with different methods to describe symmetry? Why does it mix rendering options with structural options?
  • Gulp - It is more of a human-readable input format
  • MOL - At least it is documented, but why does it use fixed-width numeric fields?
  • MOL2 - Documented
  • PDB - Atom numbers in fixed width fields, number of atoms limited to 99999, creativity a go-go in the atom name field (obviously without putting the element type in the appropriate field)
  • PDB-Q - A column is 10 bytes shorter, so another reader is needed
  • POSCAR - Simple format, but the atom types are not in the file
  • SHEL-X - No big problems, except understanding symmetry definitions
  • Siesta - No problems
  • XDATCAR - No problems
  • XYZ - No problem
  • XYZ plus unit cell - No problem
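
To make the PDB complaint concrete, here is a minimal Python sketch of reading one ATOM record by column position. It is not STM4's actual reader. The function name parse_pdb_atom and the returned dictionary layout are made up for illustration, and the column ranges are the ones commonly given for the PDB ATOM/HETATM record. The fallback that guesses the element from the atom name is deliberately crude, which is exactly the problem: "CA" could be an alpha carbon or calcium.

```python
def parse_pdb_atom(line):
    """Parse one PDB ATOM/HETATM record (hypothetical helper, illustration only)."""
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        return None

    # Pad short lines so the fixed-width slices below never run off the end.
    line = line.rstrip("\n").ljust(80)

    # Fields are located by column, not by whitespace:
    # adjacent numbers may touch, so split() is not reliable here.
    serial = int(line[6:11])      # 5 columns, hence the 99999 atom limit
    name = line[12:16].strip()    # free-form atom name
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    element = line[76:78].strip() # often left blank in real files

    if not element:
        # Crude guess (an assumption, not a rule): first letter of the name.
        letters = "".join(c for c in name if c.isalpha())
        element = letters[:1].upper()

    return {"serial": serial, "name": name, "xyz": (x, y, z), "element": element}
```

A real reader also has to cope with files that overflow the serial field (more than 99999 atoms), where int() on those five columns would simply fail.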
The saga continues...
