Collaboratory for MS3D
Data Portal Enabling New Protein Structure Collaboration: Overview
Principal Investigator: Carmen Pancerella
Institutional Points-Of-Contact:
- Sandia National Laboratories: Carmen Pancerella
- University of Maryland, Baltimore County: Dan Fabris
- University of California, San Francisco: Irwin Kuntz
We are building on an established data-centric collaboration infrastructure,
the Knowledge Environment for Collaborative Science (KnECS), and adaptively
adding new data management and analysis capabilities to support emerging
research communities that are pursuing innovative approaches in biomedical
science. The resulting infrastructure will incorporate emerging middleware
standards for data/metadata management, security, application integration, and
collaboration. In the longer term, the project will target a standards-based
knowledge synthesis and management capability. This work is being carried out
in direct collaboration with scientists leading the development of MS3D[1, 2],
a new method that combines intra-molecular chemical crosslinking with
high-resolution mass spectrometry to glean structural information about
proteins and other biological macromolecules. The 'Collaboratory for MS3D',
or C-MS3D, will integrate the evaluation of the tools and the measurement of
their impact on a newly developing community. The tools developed will be
open sourced, and made available as a 'collaboration tool kit' for other
interested communities.
The specific aims of this project are:
-
To build an extensible portal for sharing data and tools, supporting
both public and private group collaborations of geographically distributed
biologists.
-
To enable information interoperability by creating new community data
schemas and tools for sharing data in the MS3D domain, taking advantage of
existing and emerging standards and technologies, where possible.
-
To modify existing tools that generate and analyze data to enable the
creation and storage of new MS3D metadata in a format that allows
interoperability with other tools and collaboratory functions; and to create
new tools as the portal and data schemas mature.
-
To research and develop methods for automating the capture of data
provenance and workflow, towards the goal of a comprehensive knowledge
management system.
-
To demonstrate the impact and effectiveness of the portal in enabling
new science by piloting these developments with collaborating scientists in
the MS3D community.
This work is taking advantage of previous work by the Collaboratory for
Multi-scale Chemical Science (CMCS)
[http://cmcs.org/], a multi-institution
project funded by the U.S. Department of Energy to develop and pilot an
advanced collaborative community data system for chemical science. CMCS is an
open, public resource supporting a systems approach to chemical science
including sub-disciplines from quantum chemistry to reacting-flow simulations
of chemical combustion. KnECS is the discipline-independent infrastructure
that was built in the CMCS project. C-MS3D will inherit these KnECS
technologies:
-
A collaboration infrastructure to enable real-time and asynchronous
collaborative development of standards for data/metadata description,
multi-discipline scientific communication, geographically distributed
collaboration, and project management.
-
Repositories to store data and metadata in a way that preserves data
integrity and allows web access.
-
Tools to browse, search and query metadata, and to retrieve, analyze,
and visualize data across all scales, disciplines, and locations.
-
APIs to enable new and existing scientific tools to generate, access,
and store data and metadata in the repositories.
KnECS is built on a web-based portal using the CompreHensive collaborativE
Framework (CHEF)
[http://www.chefproject.org/], which itself leverages the
Apache Jetspeed portal framework. As a data and metadata management framework,
KnECS employs the Scientific Annotation Middleware (SAM) to provide federated
data/metadata access, extensible metadata annotation, and transformations of
data and metadata [3]. This capability enables scientific knowledge
management.
The initial design of C-MS3D capabilities has been guided by an overarching
use case scenario. Based on this guiding use case, we are targeting several
domain applications for C-MS3D portal integration. These include the
following tools:
-
Automatic support for new mass spectra being stored directly in the
shared data repository in an interoperable format.
-
Integration of developing crosslink assignment tools, with built-in
data interfaces for acquisition of all facets of input data, automatically in
most cases by taking advantage of detailed annotation of mass list data.
-
Integration of bio-molecular structure modeling tools. Extension of these
tools to support the analysis of distance constraints,
initially through consistency checks, later through incorporating these
constraints into the structure optimization algorithms.
-
Tabular and graphical visualization tools for mass spectra, mass
lists, assignment lists, and partial and total protein structures.
-
Interface for the development of a collaborative crosslinking
chemistry knowledge base.
-
A workflow environment that integrates tools and data.
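As an illustration of the consistency checks on distance constraints mentioned
in the list above, the following is a minimal sketch and not the project's
implementation. It assumes that a candidate structure is available as
per-residue 3D coordinates in angstroms and that each crosslink implies an
upper bound on the distance between the two linked residues. The names
Crosslink and check_crosslinks, the coordinates, and the distance bounds are
all hypothetical and chosen only for this example.

```python
import math
from typing import Dict, List, NamedTuple, Tuple

Coord = Tuple[float, float, float]


class Crosslink(NamedTuple):
    """Hypothetical record for one crosslink-derived distance constraint."""
    residue_a: int
    residue_b: int
    max_distance: float  # assumed upper bound in angstroms


def check_crosslinks(
    coords: Dict[int, Coord], crosslinks: List[Crosslink]
) -> List[Tuple[Crosslink, float, bool]]:
    """Return (constraint, measured distance, satisfied?) for each crosslink."""
    results = []
    for link in crosslinks:
        # Euclidean distance between the two linked residues in the model.
        distance = math.dist(coords[link.residue_a], coords[link.residue_b])
        results.append((link, distance, distance <= link.max_distance))
    return results


# Made-up coordinates for three residues of a candidate model.
model_coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (0.0, 0.0, 30.0)}
# Made-up constraints; the bounds are illustrative, not measured values.
constraints = [Crosslink(1, 2, 10.0), Crosslink(1, 3, 24.0)]

results = check_crosslinks(model_coords, constraints)
for link, distance, ok in results:
    status = "consistent" if ok else "violated"
    print(f"{link.residue_a}-{link.residue_b}: {distance:.1f} A ({status})")
```

A check of this kind only reports which constraints a given model violates.
Using the constraints inside the structure optimization itself, as the later
stage in the list describes, would be a separate step.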
Rapid development of this field dictates an adaptive strategy of reviewing
requirements at regular intervals and directing efforts opportunistically.
For example, it is anticipated that other types of low-resolution or otherwise
qualified structural information (such as reactivity data, partial structures,
sparse NMR data) would add significant value to the bio-molecular structure modeling
process. An open source model for the integrated bio-molecular structure modeling
tools might provide a forum for a growing community to contribute to the speed
and accuracy of 3D structure discovery.
The C-MS3D project will open source its software following relevant NIH and
Sandia National Laboratories guidelines, pending the approval of those
organizations. Members of our team have experience with open sourcing project
software through their current CMCS efforts. We anticipate the C-MS3D project
software (consisting of the informatics infrastructure, data interoperability
technologies, annotation and provenance management software, workflow
management tools, and those domain applications that C-MS3D gains the right to
distribute) to be sufficiently mature to be open-sourced by the mid-project
timeframe. Once the software is open-sourced, the C-MS3D team will manage the
ongoing open-source project, direct community involvement and contributions,
and provide incremental releases of the software suite.
[1]
Young, M.M., et al., High-Throughput Structure Determination: Rapid
Identification of Protein Folds Using Mass Spectrometry and Intramolecular
Cross-linking. Proc Natl Acad Sci U S A, 2000. 97(11): p. 5802-6.
[2]
Schilling, B., et al., MS2Assign, automated assignment and nomenclature of
tandem mass spectra of chemically crosslinked peptides. J Am Soc Mass
Spectrom, 2003. 14(8): p. 834-50.
[3]
Myers, J.D., et al., Re-Integrating the Research Record, in IEEE Computing in
Science and Engineering. 2003. p. 44-50.