Minutes, Cambridge 10-12 Sep 2007
Decision summary of EUROCarbDB-CCPN meeting
Present:
Martin Frank, Magnus Lundborg, Ana Arda Freire, Thomas Luettke, Bas Leeflang, Siegfried Schloissnig, Rasmus Fogh, Wim Vranken, Wayne Boucher, Tim Stevens.
APIs and code:
EuroCarbDB needs:
A project meeting is coming up in November. The NMR part is behind and tangible progress would be good for this meeting.
The main need is for the Java/SQL API, ChemComps, and data import. There is also some need (Magnus Lundborg) for the C API.
Data will be partly static. Most data will be uploaded in batch, but there is a need for corrections, editing, and adding some fields through applications. NMR data are a separate block, in parallel to MS and HPLC data. The whole is linked to a core structure ('molecule', 'evidence', references and bookkeeping) through a few links to a central table. It may or may not be necessary to add a new table for selecting sets of CCPN data, comparable to the NmrEntry package. If so, this will be no problem.
Agreed plans:
Finish the trunk Java/XML API first. This will allow coding and prototyping to start, and provide tangible results for November.
For the Java/SQL API, draw on Matt Harrison for Hibernate expertise, and set up a Hibernate-based API. Hibernate is well established, stable, and used by many, including EuroCarbDB and PIMS. We count on Matt to know enough to tell which options should be used, which files need generating, how they should look, etc. Taking this information directly from him should save CCPN a couple of months of finding out and deciding. Within this, plans are:
1) Decide on database schema to use and write schema generator. This provides more tangible results for November.
2) Generate Hibernate mappings.
3) Generate Java/SQL API on top of Hibernate.
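As a rough illustration of step 1, the sketch below generates SQL table definitions from a small in-memory model description. The MODEL dictionary, the class and attribute names, and the type mapping are hypothetical placeholders; they are not the CCPN data model or the existing generators. It only shows the general shape: one table per class, with links expressed as foreign keys.

```python
import sqlite3

# Hypothetical model description (NOT the CCPN data model):
# class name -> list of (attribute name, type, link target or None).
MODEL = {
    "Molecule": [("name", "String", None)],
    "NmrEntry": [("title", "String", None), ("molecule", "Link", "Molecule")],
}

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {"String": "TEXT", "Int": "INTEGER", "Float": "REAL"}


def generate_schema(model):
    """Return a list of CREATE TABLE statements, one per model class."""
    statements = []
    for class_name, attributes in model.items():
        columns = ["id INTEGER PRIMARY KEY"]
        for attr_name, attr_type, target in attributes:
            if attr_type == "Link":
                # A link becomes a foreign key column pointing at the target table.
                columns.append(f"{attr_name}_id INTEGER REFERENCES {target}(id)")
            else:
                columns.append(f"{attr_name} {SQL_TYPES[attr_type]}")
        statements.append(f"CREATE TABLE {class_name} ({', '.join(columns)});")
    return statements


def apply_schema(statements):
    """Create the tables in an in-memory SQLite database and return the connection."""
    connection = sqlite3.connect(":memory:")
    for statement in statements:
        connection.execute(statement)
    return connection
```

SQLite is used here only so that the generated statements can be checked without a database server; the real target database was still to be decided.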
A Hibernate implementation may have problems, including speed. It is accepted to start by getting it working and to solve problems as they appear.
A first meeting with Matt did not throw up any particular problems. It was decided that CCPN should:
- Buy the standard Hibernate reference, which has high-quality, detailed discussions of the expected problems.
- Write a database schema generator based on the existing generators.
- Pass the schema to Matt, who will use it to generate a default Hibernate mapping and Java API.
- Use the generated mappings as source and inspiration to write a Hibernate mapping generator.
- Contact Matt for advice as needed.
For the C API we should build on the Python trunk API, as hitherto. We should consider calling some of the more complex Python functions from C instead of recoding them. A version 1.1 C API could be produced fairly quickly. Discussions between Wayne and Magnus solved some problems, and confirmed the approach of calling complex Python functions from C. Magnus seems happy with the branch 4 C API for now, so a version 1.1 is not immediately urgent.
Other issues:
Java Typing: The standard Java approach is to use 'superclass typing' for overriding functions in subclasses. This is also the least messy option, and Java programmers (unlike Python programmers) should not find it irritating (thanks to Matt).
Java Exceptions: No strong preferences between runtime and full exceptions.
Static variables should be avoided in the Java/SQL API, for threading reasons.
Database features, like concurrency, access control, audit, roll-back, client-server set-up, 'for-all-class-instances' queries etc. are not urgent considerations at the moment. They will be considered later.
ChemComp names are too long to be useful for assignment and selection menus in Analysis. A system with user-selected synonyms that can be used instead needs to be implemented. Synonyms would come from the carbohydrate databases or be set directly by users.
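A minimal sketch of how such a synonym lookup might behave is given below. The function name, the two dictionaries, and the rule that a user-set synonym takes priority over a database synonym are assumptions for illustration only; the priority order was not decided at the meeting.

```python
def display_name(ccp_code, user_synonyms, database_synonyms):
    """Return the short name to show in menus for a given ccpCode.

    Assumed priority: user-set synonym, then database synonym,
    then the full ccpCode as a fallback.
    """
    if ccp_code in user_synonyms:
        return user_synonyms[ccp_code]
    if ccp_code in database_synonyms:
        return database_synonyms[ccp_code]
    return ccp_code
```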
ChemComps and data import:
Base unit/chemComp level
The basic setup we agreed upon is that the monoSaccharideDB server (MSS) will provide all the reference information that is necessary to generate the chemComps in CCPN. When a user wants a carbohydrate chemComp in CCPN, the following happens (in a separate 'carbohydrate creation' plugin):
- The user selects the base unit for the monosaccharide (from a list provided by the MSS)
- The user can then select substituents and add them to the base unit, if required
- The plugin generates the monoSaccharideDB code
- Is this code available from the standard CCPN distribution?
  - YES: use the chemComp from the standard distribution
  - NO:
    - Does the user want standardised data?
      - YES:
        - query the MSS server
        - information is stored and processed there
        - get back a 'standardised' chemComp XML file with glycoCtCode and monoSaccharideDB code
      - NO:
        - the chemComp is generated from the reference data, but no glycoCtCode or monoSaccharideDB code is set
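The decision steps above can be sketched as a single function. This is only an outline under stated assumptions: the function name, the dictionary standing in for the standard distribution, and the two callables (query_mss, build_from_reference) are hypothetical and do not correspond to existing CCPN or MSS code.

```python
def resolve_chem_comp(msdb_code, distribution, want_standardised,
                      query_mss, build_from_reference):
    """Return (chemComp, origin) for a generated monoSaccharideDB code.

    distribution: dict of monoSaccharideDB code -> chemComp, standing in
        for the standard CCPN distribution (assumption).
    query_mss: hypothetical callable that asks the MSS for a standardised
        chemComp, with glycoCtCode and monoSaccharideDB code set.
    build_from_reference: hypothetical callable that builds the chemComp
        locally from reference data, with neither code set.
    """
    if msdb_code in distribution:
        # YES branch: the standard distribution already has this chemComp.
        return distribution[msdb_code], "distribution"
    if want_standardised:
        # The information is stored and processed on the MSS side.
        return query_mss(msdb_code), "mss"
    # Non-standardised: generated locally, no codes are set.
    return build_from_reference(msdb_code), "local"
```

Passing the two operations in as callables keeps the decision logic separate from the network and file handling, which makes each branch easy to check on its own.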
The ccpCode will likely be different from the monoSaccharideDB code, although this is not certain yet.
This system requires the following initial steps:
- XML files with all information (atom names, bonds, coordinates, stereo codes, linkage sites, ...) for base units (Martin/Thomas)
- XML files with all information (also connection possibilities) for substituents. This should include 'preferred' name aliases for substituents (e.g. use N2 for a NAc connected to a C2) (Thomas/Martin)
- Python code to generate CCPN chemComp from this information (takes monoSaccharideDB code as single input, pops up window where preferred atom names can be set for substituents only) (Wim)
- Change ccpCode to 80 Char length (length of Line) to handle long monosaccharide names (Rasmus)
Problems/issues to be addressed later are:
- Mapping between monoSaccharideDB codes and PDB codes (Thomas)
- Are phosphates, ... substituents or separate residues? This is valid for all possibly bivalent substituents. In CCPN, this can be handled in both ways: as a substituent with linkage for another carbohydrate, or as a separate chemComp (of type 'other') with two linkEnd sites. (all)
- If possible, the substituent library should also be usable for non-carbohydrates, e.g. proteins. How this might work should be taken into consideration from the start.
- OMe on the anomeric carbon is NOT a substituent (although it is handled as such in CCPN right now). This will become a separate chemComp of type 'other'. We need to be able to handle Magnus's old projects when this change happens! (Wim) OMe on non-anomeric carbons is handled as a Me substituent on the sugar O.
- The final atom names for the substituent should NOT overlap with the ones for the base units. The original proposal was to use a prefix separated by underscore, but these might conflict with LinkAtom names. Instead substituents will be given chemically sensible atom names, and the names in the merged ChemComp will have the suffix ':n' added, where n is the substituted position. E.g. an S-CH2-CH3 substituent at position C4 might have names S:4, C1:4 and C2:4.
- Analysis should use the 'preferred' EuroCarbDB name for substituent atoms if available (Tim)
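The ':n' suffix naming for substituent atoms described above can be sketched as follows. The function name and the list-based representation of atom names are assumptions for illustration; the base-unit atom names in the usage example are placeholders.

```python
def merged_atom_names(base_atom_names, substituent_atom_names, position):
    """Combine base-unit and substituent atom names for a merged ChemComp.

    Each substituent atom name gets the suffix ':n', where n is the
    substituted position on the base unit.
    """
    suffixed = [f"{name}:{position}" for name in substituent_atom_names]
    clashes = set(suffixed) & set(base_atom_names)
    if clashes:
        # Should not happen if base-unit names never contain ':'.
        raise ValueError(f"atom name clash: {sorted(clashes)}")
    return list(base_atom_names) + suffixed
```

For the S-CH2-CH3 example at position C4, merged_atom_names(["C4", "O4"], ["S", "C1", "C2"], 4) gives the substituent atoms the names S:4, C1:4 and C2:4.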
Possibly done later:
- Code to generate (pro-)R/(pro-)S stereochemistry codes correctly within CCPN from chemComp(Coord) information? (CCPN)
- Code to generate good coordinates for substituted monosaccharides (Martin)
Molecule level
Molecule level information on linkages and monosaccharide units will be exchanged in the GLYDE format. A layer will be added to this to handle the monoSaccharideDB names.
By default, the carbohydrate units get a sequence code based on their linkage path (e.g. 344)
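One possible reading of the 'linkage path' sequence code is sketched below. The exact encoding was not fixed at the meeting, so this is an assumption: the code for a unit is taken to be the concatenation of the linkage positions on each parent unit, read from the root unit outwards, with the root itself getting an empty code. The function name and the parent_links structure are hypothetical.

```python
def linkage_seq_code(unit, parent_links):
    """Build a sequence code from the linkage path of a unit.

    parent_links: dict of unit -> (parent unit, linkage position on parent).
        The root unit has no entry (assumed representation).
    """
    positions = []
    visited = set()
    while unit in parent_links:
        if unit in visited:
            raise ValueError("cycle in linkage data")
        visited.add(unit)
        parent, position = parent_links[unit]
        positions.append(str(position))
        unit = parent
    # Positions were collected from the unit back to the root; reverse them.
    return "".join(reversed(positions))
```

With parent_links = {"B": ("A", 3), "C": ("B", 4), "D": ("C", 4)}, unit D gets the code 344. Plain concatenation becomes ambiguous for linkage positions above 9, so a real implementation would need a separator or another convention for such cases.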
Things to do:
- Code to generate the 'linkage seqCode' in CCPN (Wim) (accessible from button in Analysis (Tim))
- Code to read/write GLYDE as part of FC setup (Wim)
- Code to build good 3D coordinates for a molecule (Martin)
Conversion of glycosciences.de NMR data
- The data will first be cleaned up internally with respect to atom names and sequence codes (Martin/Thomas)
- This will then be exported, per entry, as GLYDE for molecule info, and custom files for shift/coupling constant info (Martin/Thomas)
- All data will then be converted into CCPN projects with a large scale setup (Wim)