Shape Data Format
An XML format for storing shape-decomposed data. As agreed between the PRODECOMP group (Martin Billeter, Doroteya Staykova), the MDD group (Vladislav Orekhov, Victor Jaravine) and CCPN (Rasmus Fogh).
The format is intended as an export format for MDD and PRODECOMP, and as a standard data storage format for CCPN. What follows is a commented example:
<-- All XML attribute names are in all lower case -->
<-- Element identifiers (e.g. Decomposition number n, axis number a, component number c, ...)
are arbitrary non-negative integers. They uniquely identify that element within its containing element -->
<-- The Decomposition element must be either the top element of the document or contained directly within the top element, and there can be only one per file. -->
<Decomposition n=0 method=MDD name=something nprojsets=3 nshapes=4 nregions=25 ncomponents=97 reconstructable=false resolved=true refexperiment=H{CA|Cca}coNH>
<-- n is the Decomposition number (identifier)
- nprojsets gives the number of projection sets in the Decomposition, nshapes the number of shapes per component, nregions the number of regions, and ncomponents the total number of components. If there are no projsets nprojsets can be omitted. nregions counts the number of different regionids in use; since the Region element is optional the number of Region elements may be less (or zero).
- if reconstructable is true, all components can be summed to give the reconstructed spectrum; if it is false or absent this is not possible (e.g. because the decomposition contains partially overlapping components from different calculations). Note that components in a region can always be summed to reconstruct that region, and non-overlapping regions can always be summed together.
- If resolved is true the set of clean components is supposed to represent all real signals in the spectrum; if it is false or absent signal identification is not finished. Any decomposition without raw components should count as resolved.
- refexperiment is the name of the CCPN RefExperiment or the CCPN NmrExpPrototype. It is optional and is only set where it is valid for all projsets (if any). If the name denotes an NmrExpPrototype, the actual experiment could be any of the relevant RefExperiments. For instance, you might give the refexperiment as 'HC_CH.TOCSY' if you did not know whether your experiment was 'Hc_CH.TOCSY' or 'HC_cH.TOCSY'.
- method is optional. Values could be MDD or PRODECOMP -->
<Axis a=0 name=C nucleus=13C domain=time type=complex size=64 swppm=80.0 startppm=84.2 sfo=201.135773 carppm=44.2/>
<Axis a=1 name=N nucleus=15N domain=time type=complex size=256 blocksize=8 swppm=30.0 startppm=132.1 acqtimesec=0.372 rdims=6 recursion="2 2 2 2 2 2" sfo=81.059357 carppm=117.1/>
<Axis a=2 name=HN nucleus=1H domain=freq type=real size=178 swppm=4.6 startppm=10.1 sfo=799.870544 carppm=4.7 origsize=512/>
...
<-- a is the axis number (identifier)
- domain can be 'time' or 'freq' - the default is 'freq'.
- type can be 'real' or 'complex'
- size is the size in points
- blocksize is the block size for external binary files (if used). Optional. Defaults to the same value as size.
- swppm is the spectral width (sw) in ppm (for size 128 it is the distance between point 1 and point 129). For time dimensions swppm refers to sw after Fourier transformation.
- startppm is the frequency in ppm of point 1, the lowest numbered point. E.g. for an HN dimension you might have swppm=4.6, startppm=10.2
- nucleus is mandatory, required for making the file independently readable.
- acqtimesec is optional, even for time dimensions.
- sfo is mandatory, required for making the file independently readable. It is the spectrometer frequency in MHz.
- 'endppm', the ppm value of the right hand edge of the axis, can be calculated as
endppm = startppm - swppm
- origsize is optional. For axes truncated after Fourier transformation it is the original untruncated size in points. The offset of the truncation can be calculated from startppm, swppm, size, and carppm.
- carppm is optional. It is the ppm value of the carrier. For non-truncated dimensions it is the frequency of point 1+n/2 after Fourier transformation, e.g. of point 65 where origsize is 128. -->
<Datafile numbertype=float nbyte=4 headersize=0 blockheadersize=0 hasblockpadding=false
isbigendian=true complexstoredby=dimension filetype=Bruker/>
<-- Datafile describes the storage parameters for external binary data files. It can be omitted if external data files are not used, or if all parameters are equal to their defaults.
- numbertype ('int' or 'float'). Type of numbers in file. Defaults to float.
- nbyte. Number of bytes for numbers in external file. Defaults to 4
- headersize. Size of file header in bytes. Defaults to 0
- blockheadersize. Size of block header in bytes. Defaults to 0
- hasblockpadding (boolean). Are partial blocks padded out to full size? Defaults to false.
- isbigendian. Are numbers stored as big endian? Defaults to true.
- complexstoredby ('dimension', 'timepoint' or 'quadrant'). Storage order of real and complex points.
- 'timepoint' means that hypercomplex points are stored together (all phases change before time values)
- 'quadrant' means that quadrants (e.g. 'rrr', 'iri', 'rri', ...) are stored together (all time values change before phases)
- 'dimension' means that phase1 changes fastest, then time1, then phase2, then time2, and so on
Defaults to 'dimension'. Only relevant if external files contain complex data.
- filetype ('Azara', 'Bruker', 'Felix', 'NMRPipe', 'NMRView', 'UCSF', 'Varian'). Type of external data files. Optional (defaults to unknown). -->
<Projset s=0 name=blah refexperiment=H{CA|Cca}coNH ndims=5 nproj=16 >
<-- Projset describes projection sets (projections acquired with the same pulse sequence). It is used only where relevant (i.e. for Prodecomp but not MDD)
s=0 is the number of the projection set and p=0 is the number of the projection within the set. They work the same way as a for axes and c for components.
Projset and its contents are optional.
- refexperiment is the name of the CCPN RefExperiment. It is optional. The refexperiment corresponds to the axes in the projection set after demultiplexing of the original data. There is no guaranteed one-to-one correspondence between the projection set and the original pulse sequence, because of the way Prodecomp treats dimensions with both positive and negative peaks; accordingly the refexperiment attribute may not identify the pulse sequence correctly, though it will define the molecular connectivity. See the comment on Decomposition.refexperiment for further details. -->
<Projdim d=0 axes=0/>
<Projdim d=1 axes="1 2 3 4"/>
<-- For each dimension in the projection set there is a Projdim with an identifier (d) and the list of axis identifiers a for the axes that appear in this dimension -->
<Projection p=0 name="1107" factors="1 0 0 -1 0 0 0 1"/>
<-- Each projection has an identifying name and a list of factors.
For each Axis in the Decomposition (in order), the list gives the factor that is applied to the frequency from that axis in the projection. Note that there is also a factor (which must be 0) for axes that do not form part of any dimension of this projection set. -->
<Projection p=1 name="1108" factors="1 0 0 1 0 0 0 -1"/>
...
</Projset>
<Region r=1 ncomp=10 name="SHABBA"/>
<-- A Region describes a set of Components calculated together with the same sizes of shape and the same parameters - this assumption is not enforced in the file I/O.
Regions are optional and serve to hold information about individual calculations (name and optionally other information). They do not contain any other elements. Components can have a regionid to identify which region they come from, but there need not be a corresponding Region element. Components that belong to the same region can always be summed to reconstruct the region of the spectrum that they cover. Components that belong to the same region should have the same shape on each axis.
- ncomp is the number of components in this file corresponding to the region
Regions can be edited after creation by adding or removing components and changing ncomp correspondingly. ncomp=0 is allowed (empty region).
-->
<Component c=0 ampl=1.0e3 regionid=17 status=clean annotation=main>
<Shape a=0>
1.2753233 2.00045, ...
</Shape>
<Shape a=1 rdims=4 recursion="2 2 2 8">
0.33345 7.111695, ...
</Shape>
<Shape a=2 size=12 offset=20>
0.000755, 11.51339, ...
<Peaks list=pk,intensity,pos,width>
...
</Peaks>
</Shape>
<Shape a="3 4" file=Comp99_Asp127>
0.100755, 16.51339, ...
<Peaks list=pk,intensity,pos,pos,width,width>
</Peaks>
</Shape>
<Shape a="5 6" size="17 256" offset="10 0" rdims="3 5"
recursion="2 2 8 2 2 2 2 6">
</Shape>
</Component>
<Component status=raw>
</Component>
<--Component:
- c is the index (starting at 0) of the component in the current file
- ampl is the amplitude of the component.
- regionid is the Region ID of the region the component belongs to. It is optional, and the Region element pointed to need not actually exist.
- status describes the status of the component. It can have the following values:
- 'clean' is a component that represents a signal in the spectrum.
All clean components in a Decomposition must be independent (i.e. no duplicates) so that the sum of all clean components represents all known signals in the spectrum.
- 'noise' is a component that represents noise or artifacts. It serves for reconstructing the full spectrum but does not contain signal information.
- 'border' is a component that contains signal that corresponds to one or more clean components elsewhere. Border components could be shoulders of components from another interval, or duplicates of one or more components present elsewhere. When Prodecomp calculates an interval for every peak, each interval should (when fully analyzed) contain one clean component, with the rest being either noise or border.
- 'raw' is a component that has not yet been analyzed and assigned to one of the other categories.
- 'data' is a component that is *not* the result of a decomposition but simply a
subregion of the spectrum with an unknown number of signals. Used for files
where only certain regions are transformed, e.g. by partial ('slow') FT.
Shape:
- size, offset, rdims, recursion are optional. If not given, they default to, respectively, the axis size, 0, 1, no recursion.
- The attribute 'file' is optional. It is the name of the file that contains the actual data. If file is given, the data will be in the external file and there will be no data inside the element. By convention the data file(s) must be in the same directory as the XML file.
- name, nucleus, domain, type, and carppm for a shape are always identical to the values for the corresponding axis.
- size and offset are used for shapes that are cut down relative to the original axis. If size is not given for the shape, size, swppm, startppm, and endppm are the same as for the original axis.
If size is given, the values can be calculated as follows:
Shape.swppm = Axis.swppm * Shape.size / Axis.size
Shape.startppm = Axis.startppm - Axis.swppm * Shape.offset / Axis.size
Shape.endppm = Shape.startppm - Shape.swppm
-->
<-- Shapes can have more than one dimension. The simplest case is given as
<Shape a="3 4"> and other parameters are deduced from the original axes.
Other parameters can be given, simply by concatenating the values for the
dimensions. The most complicated case would be the <Shape a="5 6"...
Note that this means that axis 5 in this shape has rdims=3 and
recursion="2 2 8", while axis 6 has rdims=5, recursion="2 2 2 2 6" -->
<-- peaks are put inside the shape they refer to. They must appear *after*
the numbers that define the shapes themselves
In parallel to intensity, pos, and width, the parameters dintensity, dpos, dwidth, describe
a measure of the error in the corresponding parameter. These are of course optional.-->
<Ndpeaks
list=pk,c,intensity,pos,pos,pos,pos,pos,width,width,width,width,width>
...
</Ndpeaks>
<Ndpeaks a="1 3 4" list=pk,c,intensity,pos,pos,pos,,width,width,width,>
...
</Ndpeaks>
</Region> <-- Components can be either within regions or directly within the Decomposition-->
</Decomposition>
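The Shape rules in the example can be checked with a few lines of code. The following is only a sketch in Python, not part of the format or of any CCPN, MDD or PRODECOMP code; the function names split_recursion and shape_ppm_range are made up for illustration. The first function splits a concatenated recursion list according to rdims, as described for multi-dimensional shapes, and the second applies the formulas for Shape.swppm, Shape.startppm and Shape.endppm.

```python
def split_recursion(rdims, recursion):
    """Split a concatenated recursion list into one list per shape axis.

    rdims and recursion are the attribute strings of a Shape element,
    e.g. rdims="3 5" and recursion="2 2 8 2 2 2 2 6".
    (Hypothetical helper, named here for illustration only.)
    """
    counts = [int(x) for x in rdims.split()]
    values = [int(x) for x in recursion.split()]
    if sum(counts) != len(values):
        raise ValueError("recursion length does not match rdims")
    result, start = [], 0
    for count in counts:
        result.append(values[start:start + count])
        start += count
    return result


def shape_ppm_range(axis_size, axis_swppm, axis_startppm, size=None, offset=0):
    """Return (swppm, startppm, endppm) for a shape cut down from its axis.

    (Hypothetical helper; it only restates the three Shape formulas.)
    """
    if size is None:
        size = axis_size  # no size given: the shape covers the whole axis
    swppm = axis_swppm * size / axis_size
    startppm = axis_startppm - axis_swppm * offset / axis_size
    endppm = startppm - swppm
    return swppm, startppm, endppm


# The <Shape a="5 6" ...> element of the example
print(split_recursion("3 5", "2 2 8 2 2 2 2 6"))

# <Shape a=2 size=12 offset=20>, cut from Axis a=2 (size=178 swppm=4.6 startppm=10.1)
print(shape_ppm_range(178, 4.6, 10.1, size=12, offset=20))
```

The sketch assumes that rdims and recursion are always given together; how a reader should treat a Shape that gives only one of them is not covered here.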
General comments:
- For development purposes it is always possible to add new XML attributes as well as new elements in the 'list=' of Peaks and Ndpeaks.
- Apart from the tags in the example there are 'posppm' and 'widthppm', which give positions and widths in ppm. 'pos' and 'width' are positions and widths in points. dposppm and dwidthppm are the corresponding error parameters.
- XML attributes and list elements are generally given without surrounding quotes. Code from CCPN will read files correctly whether quotes are present (as they should be for standard XML) or absent. Quotes are mandatory for values that contain whitespace.
- Storing shapes in external files.
With 2D and higher dimensional shapes, storage in text XML becomes problematic. Here the XML file can be used as a header, pointing to external binary files for storing individual shapes. The rules are:
- A shape stored in an external file has an extra attribute 'file' that gives the file name. The external file must be in the same directory as the XML file. The characteristics of the external file are given by the attributes of the <Datafile> element, and by the 'blocksize' attribute of the <Shape> element, or by their defaults. Note that these descriptions are valid for all external files pointed to from a given XML file. For shapes stored in external files any content of the Shape element is ignored.
- Use for general storage of sparsely processed matrices:
The format can be used to store arbitrary subcubes of a high-dimensional matrix. This serves for instance to store individual planes that are Fourier transformed (using a 'slow' transform) in a high-dimensional spectrum acquired by non-uniform sampling. This usage can employ either in-XML or external file storage for the actual data. For this usage you should:
- Set the Component.status attribute to 'data', signifying that the component is not the result of a decomposition.
- Have all shapes except one with size=1, and with the actual shape equal to 1.0. The offset will then define the position of the subcube.
- Put the subcube in a single shape, defining the appropriate axes, offsets, and sizes. Note that the axes must be given in the order of the dimensions in the external matrix, fastest varying dimension first.
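
As an illustration of how the Datafile parameters map onto reading an external file, here is a minimal Python sketch using only the standard struct module. It is not CCPN code; the names number_format and read_numbers are hypothetical. It covers numbertype, nbyte, isbigendian and headersize with the defaults listed for Datafile, assumes that 'int' means signed integers (the format description does not say), and ignores block headers, block padding and the storage order of complex data.

```python
import struct

# struct format characters for each (numbertype, nbyte) pair.
# Assumption: 'int' values are signed; the format description does not specify this.
_FORMAT_CHARS = {
    ("float", 4): "f", ("float", 8): "d",
    ("int", 1): "b", ("int", 2): "h", ("int", 4): "i", ("int", 8): "q",
}


def number_format(numbertype="float", nbyte=4, isbigendian=True):
    """Return the struct format for one number; defaults follow the Datafile defaults.

    isbigendian is a Python bool here; the 'true'/'false' text from the
    XML attribute must be converted by the caller.
    """
    try:
        char = _FORMAT_CHARS[(numbertype, nbyte)]
    except KeyError:
        raise ValueError("unsupported number layout: %s, %d bytes" % (numbertype, nbyte))
    return (">" if isbigendian else "<") + char


def read_numbers(raw, count, headersize=0, **layout):
    """Decode `count` consecutive numbers from the bytes `raw`, skipping the file header."""
    fmt = number_format(**layout)
    # e.g. ">f" and count=3 become ">3f"
    return list(struct.unpack_from(fmt[0] + str(count) + fmt[1], raw, headersize))


# Usage: a made-up 4-byte header followed by three big-endian 4-byte floats
raw = b"HDR!" + struct.pack(">3f", 1.0, 0.5, 0.25)
print(read_numbers(raw, 3, headersize=4))
```

A real reader would also have to honour blocksize, blockheadersize, hasblockpadding and complexstoredby, which depend on details of the block layout that are not spelled out here.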