Protein Structure Initiative (PSI)

Steering Subcommittee on Goals and Milestones

January 2007 (updated 8.30.07) 

Mission Statement. The long-range goal of the Protein Structure Initiative is to make the three-dimensional atomic-level structures of most proteins easily obtainable from knowledge of their corresponding DNA sequences. 

Broad Overall Goals.  The National Institutes of Health-National Institute of General Medical Sciences (NIH-NIGMS) Protein Structure Initiative (PSI) was created to expand the impact and value of the Human Genome Project, and other genome sequencing projects, using three-dimensional (3D) protein structure analysis.  The primary goals of the PSI include (i) large-scale protein structure determination by X-ray crystallography and nuclear magnetic resonance (NMR) methods, along with broad structural coverage of protein sequences through homology modeling, (ii) development of new technologies and infrastructure that accelerate the process of 3D protein structure analysis, and (iii) community outreach.  

The most important goal of the PSI is to maximize the coverage of protein sequences with structural information. Selected gene products are being systematically prioritized with the aim of attaining structural coverage of every major protein domain family found in nature. Within this comprehensive target selection program, emphasis is placed on obtaining 3D structures of human proteins; fundamental and disease-causing proteins from bacterial, fungal, protozoal, and viral pathogens; proteins from model organisms such as M. musculus, C. elegans, D. melanogaster, and S. cerevisiae; and proteins from gram-positive and gram-negative bacteria. These include many proteins thought to represent drug discovery targets.

By 2010, the PSI aims to deliver more than 4,000 new 3D structures of proteins to the biological and biomedical research community, including more than 1,000 structures produced in the initial five-year pilot phase. Through the leverage provided by the Computational Modeling and the Knowledge Base Centers, these experimental 3D structures will be used to generate structure/function information for millions of gene products. In many cases, these PSI structures provide key clues to evolutionary and functional relationships among proteins that are not evident from sequence information alone, creating new opportunities for biological discovery. These novel biological insights can often only be gleaned by elucidation of 3D protein structure. The mission of the PSI also includes development of new technologies and methods aimed at reducing costs of structure production, and at providing 3D structures of particularly challenging proteins, such as membrane proteins, certain classes of eukaryotic proteins, and multi-protein complexes. A key goal of the PSI’s community outreach efforts is to make 3D structure an important component of biological research.  Finally, the PSI organizes and maintains an extensive database of protein sample production protocols, data, and reagents that are available to the broad scientific community. 

 

Background. The PSI was initiated in 2000 by the National Institute of General Medical Sciences.  The initial five-year phase of the program funded eleven pilot projects, aimed at developing core technologies for structural genomics and for creating the infrastructure required for large-scale protein structure production.  In this pilot phase, ~ 1,300 protein structures were deposited into the public domain.  The second phase of PSI (PSI-2), initiated on July 1st, 2005, supports four Large-Scale Research Centers, together with six additional Specialized Research Centers, two Computational Modeling Centers, the PSI Materials Repository, and a PSI Knowledge Base.    

PSI-2 puts strong emphasis on determining 3D structures from (i) large families of protein domains (with tens to hundreds of members) for which essentially no 3D structural information is presently available, and (ii) very large families (with hundreds to tens-of-thousands of members) for which only limited 3D structure information is available. These include proteins from human and other model organisms of significant biological or biomedical interest. Protein target selection and 3D structure determination are coordinated across the PSI-2 centers to minimize redundancy. The program is also supported by extensive structural bioinformatics efforts that leverage these experimental data through structural and functional annotation, including large-scale homology modeling. The PSI Materials Repository provides infrastructure for distributing tens of thousands of physical reagents generated by the PSI program to the broader biological community. This highly integrated program is designed to enhance the value of the Human Genome Project and other large-scale gene sequencing projects using protein structure/function analyses, and to provide information, reagents, and technologies that will strengthen hypothesis-driven research programs in biology, chemistry, and medicine.
 

Specific Goals and Measures of Success 

The following sections summarize the Specific Goals and Measures of Success in each of three areas: (i) Protein Structures, (ii) New Technologies, and (iii) Outreach to the Scientific Community.

Throughout these sections, an Experimental Structure is defined as one determined by X-ray crystallography or NMR methods having satisfactory structure quality assessment statistics, as defined in the Appendix, and deposited in the Protein Data Bank (PDB). Some of these statistics will be assessed centrally, by the PSI Knowledge Base, and others will be reported on a regular basis by PSI-2 centers.

 

I. Protein Structures  

Goals for Protein Structure Production 

The central objective of PSI-2 is to increase the total number of proteins whose structure can be inferred from knowledge of their respective DNA sequences. Toward this aim, the PSI-2 program will determine more than 3,000 high-quality unique experimental protein (or protein domain) structures (Experimental Structures) using X-ray crystallography or NMR spectroscopy. At the time of determination, nearly all of these structures will have non-redundant protein sequences that are distinctly different from those of structures previously deposited in the PDB. Although these structures will be produced primarily by the Large-Scale Research Centers, the Specialized Research Centers will also contribute to the overall PSI-2 production of protein structures, particularly for challenging proteins.

Many of the proteins targeted in PSI-2 will be the first structural representatives from large families of protein domains with ten to thousands of members. In addition, there is high scientific value in obtaining structures of multiple members from highly diverse protein domain “Mega families” that include hundreds to tens-of-thousands of members and many subfamilies that cannot presently be modeled. These Mega families will be targeted both to advance our knowledge about the evolution of protein structure and function and to improve our understanding of normal physiology and disease in humans. Accordingly, an additional goal of PSI-2 is to sample extensively across these Mega families so as to provide structural and functional coverage.

In pursuing these central goals, PSI-2 will maximize the impact of experimental structures through computational homology modeling, and leverage the information content of the experimentally-determined protein structures using structural bioinformatics approaches.  In particular, the Experimental Structures produced in PSI-2 will provide templates required for modeling the 3D structures of millions of proteins, including tens of thousands of human proteins.   
 

Measures of Success for Experimental Structure Determination and Modeling Leverage 

The following sections describe some key Measures of Success for PSI-2.  These metrics provide a standardized means of counting Experimental Structures, assessing the impact of these Experimental Structures in the community, measuring the value of Experimental Structures in terms of structural models for related proteins, and estimating structural coverage for specific proteomes.  Additional details and definitions of the metrics outlined in this section are presented in the Appendix.  
 

I.1. Numbers of Experimental Structures and Residues 

I.2. Impact and Classification of Experimental Structures 

I.3. Numbers of Sequences for Which Homology Models Can Be Produced from PSI Structures and Corresponding Coverage of Specific Proteomes

A second key goal of the PSI is to leverage the information provided by these four thousand Experimental Structures through computational modeling, generating millions of homology models that will be invaluable for advancing many different areas of scientific investigation. These measures attempt to estimate how many such protein models can be constructed using a specific Experimental Structure, and to assess the coverage of specific proteomes by experimentally-determined and modeled 3D structures.

A critical challenge in reporting such “Modeling Leverage” is assessment of the reliability of the resulting models.  This is an area of current active research with no broadly accepted standards or conventions for assessing model accuracy.  For the purpose of PSI-2, Modeling Leverage will be operationally defined based on sequence similarity using the conventions outlined in the Appendix. As modeling technologies improve, these conventions may be refined over time by the PSI Target Selection Committee.  

The following metrics provide estimates of Modeling Leverage and Structural Coverage of specific proteomes. They will each be assessed in terms of numbers of protein structures and numbers of residues in these protein structures which can, in principle, be modeled from Experimental Structures. The Appendix provides detailed operational definitions and conventions for these measures that will be used to assess modeling leverage for PSI. 
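As an illustration only, the two counting conventions named above (numbers of proteins and numbers of residues) can be sketched for a single proteome as follows. This is a minimal sketch, not the PSI's operational definition, which is left to the Appendix. The function names, the input layout (protein lengths plus 1-based inclusive modelable regions), and the example values are all hypothetical assumptions.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) regions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Region overlaps or touches the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def structural_coverage(proteome, modelable):
    """Count proteins and residues that can, in principle, be modeled.

    proteome:  dict of protein id -> sequence length (hypothetical input)
    modelable: dict of protein id -> list of (start, end) regions assumed
               to lie within the protein and to be modelable from some
               Experimental Structure (hypothetical input)
    Returns (covered proteins, total proteins, covered residues, total residues).
    """
    covered_proteins = 0
    covered_residues = 0
    for protein_id in proteome:
        regions = merge_intervals(modelable.get(protein_id, []))
        residues = sum(end - start + 1 for start, end in regions)
        if residues > 0:
            covered_proteins += 1
        covered_residues += residues
    return covered_proteins, len(proteome), covered_residues, sum(proteome.values())


# Made-up example: three proteins, two of which have modelable regions.
proteome = {"P1": 300, "P2": 200, "P3": 500}
modelable = {"P1": [(1, 150), (100, 250)], "P3": [(401, 500)]}
print(structural_coverage(proteome, modelable))
```

Merging overlapping regions before counting means a residue that can be modeled from more than one template is counted only once, so the residue-level figure is not inflated by redundant templates.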

II. New Technologies  

Goals for Technology Development 

The PSI-2 is committed to developing and making available technological and methodological advances that provide enabling infrastructure for biology, chemistry, and medicine.  In addition to the Large-Scale Research Centers, the PSI-2 supports a number of Specialized Research Centers, whose mission is to develop novel technologies for target selection, protein production, and structure determination, particularly for challenging eukaryotic proteins, membrane proteins, and multi-protein complexes.  The primary goal of these technology development efforts in both the Large-Scale and Specialized centers is to provide to the scientific community new technologies and protocols that reduce costs and improve the efficiency of protein sample production, and the speed and accuracy of experimental structure determination.  This goal includes making accessible to the public the corresponding data, protocols, reagents, hardware, and software associated with PSI-2 supported technology development efforts. 

Measures of Success for Technology Development 

III. Outreach to the Scientific Community 

Goals for Community Outreach  

A key goal of the PSI is to make 3D structure an important component of biological research. This includes propagation of new technologies and structural information to the broad scientific community, and incorporation of project nominations from the community, whether they be specific targets, groups of targets, or methodological or technological advances. Approximately 15% of the effort of PSI-2 centers will be devoted to Community Outreach Targets. Given that these nominated projects are likely to be more difficult than the majority of PSI-2 structures, the PSI-2 plans to deposit at least 100, and as many as 300, such structures of Community Outreach Targets in the PDB.

The PSI-2 is also providing comprehensive documentation of experimental protocols and interim results for gene cloning, gene expression, protein purification, protein crystallization, and structural characterization, including negative results.  All PSI-2 Centers must deposit standardized sets of data on protein sample preparation and structure production in the public-domain PepcDB database, shortly after the data are generated.   

A further goal of PSI-2 Community Outreach is to provide expression clones for all program targets, initially directly from the Centers, and eventually from the PSI-2 Materials Repository. Other reagents, such as small quantities of purified proteins, will also be provided when readily available.   
 

Measures of Success for Community Outreach 

III.1.  Number of protein targets “accepted” from community requests for investigation by PSI-2 Centers. 

III.2.  Number of PDB depositions of Community Targets. 

III.3.  Number and lists of publications with joint authorship between PSI Centers and non-PSI investigators. 

III.4.  Numbers and lists of expression vectors, expression hosts, purified proteins, and other materials distributed to non-PSI investigators. 

III.5.  Numbers of hits on TargetDB and PepcDB web sites per unit time. It is recognized that this metric underestimates the use of these data, as it does not include offline usage of these databases. This metric will be assessed centrally by TargetDB and PepcDB. 

III.6.  Number of downloads of PSI-2 Experimental Structures from the PDB. It is recognized that this metric underestimates the use of these data, as it does not include offline usage of these databases. This metric will be assessed centrally by the PDB. 

III.7. Number of accesses (hits) to PSI-2 Experimental Structures in PDB, PDBsum, SCOP, or CATH databases. 

III.8.  Numbers of attendees at Workshops offered by PSI-2 Centers, as self-declared by each center. 

III.9.  Numbers of seminars on PSI activities given by PSI scientists at non-PSI institutions, department meetings, and national and international conferences, as self-declared by each center.

III.10. Numbers and names of student trainees (undergraduate, graduate, postdoctoral) (i) directly supported by and (ii) otherwise involved in PSI-2 sponsored research programs. 

III.11. Numbers and names of visiting scientists (i) directly supported by and (ii) otherwise involved in PSI-2 sponsored research programs. 

III.12. Numbers and names of underrepresented minority scientists involved in PSI-2 sponsored research programs. 

III.13. Numbers of citations of published papers describing PSI-2 structures. 

Steering Subcommittee on Goals and Milestones

Chair:  Gaetano Montelione

Members:   Helen Berman, David Eisenberg, Wayne Hendrickson, Andrzej Joachimiak, George Phillips, Janna Wehrle, Stephen Burley, Ian Wilson

Advisor: Steven Brenner  

 

DRAFT 

Appendix: Definitions of Measures of Success  

The following is a draft version of the Definitions of Measures of Success. These definitions and metrics will be formally defined by experts of the PSI Knowledge Base, and then approved by the PSI-2 Milestones and Goals Subcommittee. They may also be updated by the committee in the future. These metrics can be assessed retrospectively, since the dates of all PDB depositions are recorded. The following are guidelines to direct this process.

I. Protein Structures 

I.1. Numbers of Experimental Structures 

I.3. Numbers of Sequences for which Homology Models can be Produced from PSI-2 Structures and Corresponding Coverage of Specific Proteomes. 

For the purposes of PSI-2, Modeling Leverage of an experimental structure will be defined as the number of protein sequences (or the number of protein residues covered in these protein sequences) with E-value < 10^-10 (corresponding to ~30% sequence identity) relative to the target experimental structure. The modeling leverage depends on both the sequence database used and the date of the analysis. The Modeling Leverage of a particular Experimental Structure increases with time, as the sequence databases expand. For the purpose of PSI-2, the modeling leverage will be computed against a specified version of the UniProt database. This choice will facilitate comparative analysis at different time points.
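The definition above amounts to a thresholded count, which the following sketch illustrates. It assumes that a sequence search of one Experimental Structure against a fixed database version has already been run and reduced to simple records. The Hit record, its field names, and the example values are hypothetical. The only figure taken from the text is the E-value cutoff.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-10  # threshold stated in the definition above


@dataclass(frozen=True)
class Hit:
    """One database sequence matched to the Experimental Structure (hypothetical record)."""
    sequence_id: str
    e_value: float
    aligned_residues: int  # residues of the database sequence covered by the match


def modeling_leverage(hits, cutoff=E_VALUE_CUTOFF):
    """Return (number of sequences, number of residues) with E-value below the cutoff.

    If a sequence appears more than once, only its best (lowest E-value)
    hit is kept. That is a simplifying assumption of this sketch.
    """
    best = {}
    for hit in hits:
        if hit.e_value < cutoff:
            previous = best.get(hit.sequence_id)
            if previous is None or hit.e_value < previous.e_value:
                best[hit.sequence_id] = hit
    n_sequences = len(best)
    n_residues = sum(h.aligned_residues for h in best.values())
    return n_sequences, n_residues


# Made-up search results for a single Experimental Structure.
hits = [
    Hit("seqA", 1e-40, 210),
    Hit("seqB", 3e-12, 180),
    Hit("seqC", 1e-10, 150),  # equal to the cutoff, not below it: excluded
    Hit("seqD", 5e-3, 90),    # too weak: excluded
    Hit("seqA", 1e-15, 100),  # weaker duplicate of seqA: ignored
]
print(modeling_leverage(hits))  # (2, 390)
```

Because the result depends on the database contents, repeating the same count against a later database release would generally give a larger number, which is why the text fixes a specified UniProt version for comparisons over time.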