Protein Structure Initiative (PSI)
Steering Subcommittee on Goals and Milestones
January 2007
(updated 8.30.07)
Mission Statement. The long-range
goal of the Protein Structure Initiative is to make the three-dimensional
atomic-level structures of most proteins easily obtainable from knowledge
of their corresponding DNA sequences.
Broad Overall Goals.
The National Institutes of Health-National Institute of General Medical
Sciences (NIH-NIGMS) Protein Structure Initiative (PSI) was created
to expand the impact and value of the Human Genome Project, and other
genome sequencing projects, using three-dimensional (3D) protein structure
analysis. The primary goals of the PSI include (i) large-scale
protein structure determination by X-ray crystallography and nuclear
magnetic resonance (NMR) methods, along with broad structural coverage
of protein sequences through homology modeling, (ii) development of
new technologies and infrastructure that accelerate the process of 3D
protein structure analysis, and (iii) community outreach.
The most important goal of the
PSI is to maximize the coverage of protein sequences with structural
information. Selected gene products are being systematically prioritized
with the aim of attaining structural coverage of every major protein
domain family found in nature. Within this comprehensive target
selection program, emphasis is placed on obtaining 3D structures of
human proteins, fundamental and disease-causing proteins from bacterial,
fungal, protozoal, and viral pathogens, proteins from model organisms
such as M. musculus, C. elegans, D. melanogaster, and
S. cerevisiae, as well as proteins from gram-positive and gram-negative
bacteria. These include many proteins thought to represent drug discovery
targets.
By 2010, the PSI aims to deliver
more than 4,000 new 3D structures of proteins to the biological and
biomedical research community, including more than 1,000 structures
produced in the initial five-year pilot phase. Through the leverage
provided by the Computational Modeling and the Knowledge Base Centers,
these experimental 3D structures will be used to generate structure/function
information for millions of gene products. In many cases, these PSI
structures provide key clues to evolutionary and functional relationships
among proteins that are not evident from sequence information alone,
creating new opportunities for biological discovery. These novel biological
insights can often only be gleaned by elucidation of 3D protein structure.
The mission of the PSI also includes development of new technologies
and methods aimed at reducing costs of structure production, and at
providing 3D structures of particularly challenging proteins, such as
membrane proteins, certain classes of eukaryotic proteins, and multi-protein
complexes. A key goal of the PSI’s community outreach efforts is to
make 3D structure an important component of biological research.
Finally, the PSI organizes and maintains an extensive database of protein
sample production protocols, data, and reagents that are available to
the broad scientific community.
Background. The PSI was
initiated in 2000 by the National Institute of General Medical Sciences.
The initial five-year phase of the program funded eleven pilot projects,
aimed at developing core technologies for structural genomics and for
creating the infrastructure required for large-scale protein structure
production. In this pilot phase, ~ 1,300 protein structures were
deposited into the public domain. The second phase of PSI (PSI-2),
initiated on July 1st, 2005, supports four Large-Scale Research
Centers, together with six additional Specialized Research Centers,
two Computational Modeling Centers, the PSI Materials Repository, and
a PSI Knowledge Base.
PSI-2 puts strong emphasis on determining 3D structures from (i) large families of protein domains (with tens to hundreds of members) for which essentially no 3D structural information is presently available, and (ii) very large families (with hundreds to tens-of-thousands of members) for which only limited 3D structure information is available. These include proteins from human and other model organisms
of significant biological or biomedical
interest. Protein target selection and 3D structure determination
are coordinated across the PSI-2 centers to minimize redundancy.
The program is also supported by extensive structural bioinformatics
efforts that leverage these experimental data by structural and functional
annotation, including large-scale homology modeling. The PSI Materials
Repository provides infrastructure for distributing tens of thousands
of physical reagents generated by the PSI program to the broader biological
community. This highly integrated program is designed to enhance
the value of the Human Genome Project and other large-scale gene sequencing
projects using protein structure/function analyses, and to provide information,
reagents, and technologies that will strengthen hypothesis-driven research
programs in biology, chemistry, and medicine.
Specific Goals and Measures
of Success
The following sections summarize
the Specific Goals and Measures of Success in each of three areas: (i)
Protein Structures, (ii) New Technologies; and (iii) Outreach to the
Scientific Community.
Throughout these sections, an Experimental Structure is defined as one determined by X-ray crystallography or NMR methods having satisfactory structure quality assessment statistics, as defined in the Appendix, and deposited in the Protein Data Bank (PDB). Some of these statistics will be assessed centrally, by the PSI Knowledge Base, and others will be reported on a regular basis by PSI-2 centers.
I. Protein Structures
Goals for Protein Structure
Production
The central objective of PSI-2
is to increase the total number of proteins whose structure can be inferred
from knowledge of their respective DNA sequences. Toward this
aim, the PSI-2 program will determine more than 3,000 high-quality unique
experimental protein (or protein domain) structures (Experimental Structures)
using X-ray crystallography or NMR spectroscopy. At the time each
of these structures is determined, nearly all will have “distinct”
non-redundant protein sequences, distinctly different from those previously
determined and deposited in the PDB. Although these structures
will be produced primarily by the Large-Scale Research Centers, the
Specialized Research Centers will also contribute to the overall PSI-2
production of protein structures, particularly for challenging proteins.
Many of the proteins targeted in
PSI-2 will be the first structural representatives from large families
of protein domains with ten to thousands of members. In addition,
there is high scientific value in obtaining structures of multiple members
from highly-diverse protein domain “Mega families”, that include
hundreds to tens-of-thousands of members and many subfamilies which
cannot presently be modeled. These Mega families will be
targeted both to advance our knowledge about the evolution of protein
structure and function and to improve our understanding of normal physiology
and disease in humans. Accordingly, an additional goal of PSI-2
is to sample extensively across these Mega families so as to provide
structural and functional coverage.
In pursuing these central goals,
PSI-2 will maximize the impact of experimental structures through computational
homology modeling, and leverage the information content of the experimentally-determined
protein structures using structural bioinformatics approaches.
In particular, the Experimental Structures produced in PSI-2 will provide
templates required for modeling the 3D structures of millions of proteins,
including tens of thousands of human proteins.
Measures of Success for Experimental
Structure Determination and Modeling Leverage
The following sections describe
some key Measures of Success for PSI-2. These metrics provide
a standardized means of counting Experimental Structures, assessing
the impact of these Experimental Structures in the community, measuring
the value of Experimental Structures in terms of structural models for
related proteins, and estimating structural coverage for specific proteomes.
Additional details and definitions of the metrics outlined in this section
are presented in the Appendix.
I.1. Numbers of Experimental
Structures and Residues
I.1.A. Number of Novel
Experimental PSI-2 Structures. This metric enumerates the number
of Experimental Structures (or domains within multi-domain Experimental
Structures) deposited into the PDB for which, at the time of deposition,
no 3D structure was publicly available for a close homolog, defined
operationally as one with more than ~ 30% sequence identity over the
length of the relevant segment of the polypeptide chain. These structures
may resemble known protein structures, but are novel at the time they
are deposited in the PDB in the sense that their structures cannot be
predicted reliably by comparative modeling methods. The technical
process for defining a Novel Experimental Structure is outlined in the
Appendix. The majority of the 3,000 structures determined by PSI-2 would
contribute to this metric.
I.1.B. Number of Distinct
Experimental PSI-2 Structures with Nonredundant Sequences. This metric
enumerates structures of proteins (or protein domains) with sequences
distinctly different (i.e. not identical in sequence, as specifically
defined for a Distinct Experimental Structure in the Appendix) from
sequences deposited in the PDB prior to completing the targeted PSI-2
structure. This metric counts separately the multiple homologues
across a protein domain family that are not ‘novel’ by criterion
I.1.A. Although most proteins are selected using criterion I.1.A, by
the time some structures are completed they may no longer be Novel Experimental
Structures. The deliverable of more than 3,000 Distinct Experimental
Structures in PSI-2 refers specifically to this metric.
I.1.C. Number and Size
of Domain Families for which PSI-2 provides the first Experimental Structure
Representative. This metric enumerates the numbers and sizes of
Domain Families, or Mega Family subclusters, for which PSI-2 provides
the first Experimental Structure. As part of the process of target
selection, the PSI-2 Production Centers are assigned specific families
of protein domains, referred to as “BIG” or “Mega” families,
which are selected and organized in a coordinated bioinformatics effort.
Each of these domain families, selected on the basis of high novelty
and leverage value by the PSI-2 Target Selection Committee, includes
ten to thousands of members. Many of the 3,000 structures determined
by PSI-2 would contribute to this metric.
I.1.D. Total Number of Experimental
PSI-2 Structures. This metric enumerates all PDB depositions,
including multiple structures of the same protein sequence determined
by different methods (i.e., NMR versus X-ray crystallography), in different
crystal forms, different solution conditions, or bound to different
ligands. It would also count separately proteins that differ at
just a few amino acid sites, which are not distinct by criterion I.1.B.
The number of protein structures determined in PSI-2 that would contribute
to this metric should significantly exceed the expected 3,000 unique
Experimental Structures.
I.1.E. Numbers of Experimentally
Determined Residues. Each of the measures above (in I.1.A - D.)
will also be assessed on a residue basis; i.e., the number of residues
for which structural information is provided will also be estimated,
reflecting the value and challenge of determining larger protein (or
domain) structures.
I.2. Impact and Classification
of Experimental Structures
The following measures assess
impact of Experimental Structures in expanding our knowledge about specific
classes of proteins, and provide statistics on the classes of Experimental
Structures determined in PSI-2.
I.2.A. Number of Experimental
Structures from Specifically-Targeted BIG and Mega Domain Families.
These families of domains, defined by the PSI-2 Target Selection Committee
as having high value for extensive coverage, contain hundreds to tens-of-thousands
of members and many subfamilies which cannot presently be modeled.
They also include representatives in a broad range of proteomes, often
including the human proteome.
I.2.B. Number of Experimental
Structures from Biomedical Theme Target Lists. Center-specific
biomedical themes of PSI-2 projects include (i) widely conserved domain
families constituting central processes conserved across all kingdoms
of life; (ii) domain families of phosphatases; (iii) proteins and domains
involved in biological networks associated with cancers and other human
diseases; and (iv) proteins and domains from the proteomes of pathogenic
bacteria. In order to provide high quality homology models of
these important proteins, sequences with greater than 30% sequence
identity with homologues in the PDB are often targeted for these biomedical
targets.
I.2.C. Number of Experimental
Structures from Community Outreach Target Lists. Community Outreach
targets are defined by nominations from the broad biological community.
Protein sequences with greater than 30% sequence identity with homologues
in the PDB may be targeted in these community outreach efforts.
I.2.D. Number of Experimental
Structures from Specifically-Targeted Organisms or Groups of Organisms.
These include enumerations of protein structures from individual organisms,
as well as metagenomes or metabiomes, as defined by the PSI-2 Target
Selection Committee.
I.2.E. Numbers of Novel
Chain Folds, New Multidomain Structures or New Arrangements of Domains
in Multidomain Proteins
I.2.F. Number of Experimental
Structures that Identify Previously Unrecognized Relationships Between
Protein Domain Families. This metric enumerates cases where a
homologous relationship that was not previously recognized by sequence
similarity is discovered by structural similarity.
I.2.G. Number of Experimental
Structures that are the First Protein Structural Representatives from
Specific Functional Classes.
I.2.H. Number of Experimental
Structures that Suggest Previously Unrecognized Biochemical (Molecular)
Function(s).
I.2.I. Number of Experimental
Structures that Provide Substantially New Biomedical Insights.
I.2.J. Number of Experimental
Structures of Human Proteins.
I.2.K. Number of Experimental
Structures of Eukaryotic Proteins.
I.2.M. Number of Experimental
Structures of Membrane Proteins. Membrane proteins (or membrane protein
domains) are defined operationally as proteins (or domains) that require
detergent extraction from cellular membrane fractions for structural
analysis.
I.2.N. Number of Experimental
Structures Determined at the Atomic Level using X-ray Crystallography,
Solution State NMR, and Solid State NMR methods, respectively.
I.2.O. Number and List
of Publications Describing PSI-2 3D Structures.
I.3. Numbers of Sequences For
Which Homology Models Can Be Produced from PSI Structures and
Corresponding Coverage of Specific Proteomes.
A second key goal of the PSI is
to leverage the information provided by these four thousand Experimental
Structures through computational modeling, generating millions of homology
models that will be invaluable for advancing many different areas of
scientific investigation. These measures attempt to estimate how
many such protein models can be constructed using a specific Experimental
Structure, as well as assess the coverage of specific proteomes
by experimentally-determined and modeled 3D structures.
A critical challenge in reporting
such “Modeling Leverage” is assessment of the reliability of the
resulting models. This is an area of current active research with
no broadly accepted standards or conventions for assessing model accuracy.
For the purpose of PSI-2, Modeling Leverage will be operationally defined
based on sequence similarity using the conventions outlined in the Appendix.
As modeling technologies improve, these conventions may be refined over
time by the PSI Target Selection Committee.
The following metrics provide estimates
of Modeling Leverage and Structural Coverage of specific proteomes.
They will each be assessed in terms of numbers of protein structures
and numbers of residues in these protein structures which can, in principle,
be modeled from Experimental Structures. The Appendix provides detailed
operational definitions and conventions for these measures that will
be used to assess modeling leverage for PSI.
I.3.A. Total Modeling
Leverage.
I.3.B. Novel Modeling
Leverage.
I.3.C. Modeling Leverage
and Coverage of the Human Proteome.
I.3.D. Modeling Leverage
and Coverage of Proteomes of Model Organisms and Pathogenic Microorganisms.
II. New Technologies
Goals for Technology Development
The PSI-2 is committed to developing
and making available technological and methodological advances that
provide enabling infrastructure for biology, chemistry, and medicine.
In addition to the Large-Scale Research Centers, the PSI-2 supports
a number of Specialized Research Centers, whose mission is to develop
novel technologies for target selection, protein production, and structure
determination, particularly for challenging eukaryotic proteins, membrane
proteins, and multi-protein complexes. The primary goal of these
technology development efforts in both the Large-Scale and Specialized
centers is to provide to the scientific community new technologies and
protocols that reduce costs and improve the efficiency of protein sample
production, and the speed and accuracy of experimental structure determination.
This goal includes making accessible to the public the corresponding
data, protocols, reagents, hardware, and software associated with PSI-2
supported technology development efforts.
Specific Goals for Technology
Development include:
Cost reduction of protein sample
preparation and experimental 3D structure determination by X-ray crystallography
and NMR spectroscopy.
Advances providing improved
efficiency in gene synthesis and cloning.
Advances providing improved
efficiency in protein expression.
Advances providing improved
efficiency in protein purification.
Advances providing improved
efficiency in protein crystallization.
Advances in methods providing
improved efficiency and/or accuracy of protein structure determination
by NMR and X-ray diffraction.
Advances in structural genomics
approaches for human and other eukaryotic proteins.
Advances in structural genomics
approaches for membrane proteins.
Advances in structural genomics
approaches for protein-protein and protein-ligand complexes.
Advances providing improved
efficiency and accuracy in computational homology modeling of structures
from sequences.
Advances in protein structure
quality assessment and refinement.
Advances in determination of
biochemical (molecular) functions from 3D protein structures.
Advances in developing laboratory
information management systems (LIMS) for organizing and integrating
information generated in large-scale structure production.
Measures of Success for Technology
Development
The following measures will
be reported regularly by each PSI-2 Center.
II.1. Numbers and list of publications
on new technologies.
II.2. Numbers of citations
of publications on new technologies.
II.3. Numbers and list of workshops
organized by PSI-2 centers for the scientific community.
II.4. Numbers and list of intellectual
property disclosures, licensing agreements, patents, and patent applications
on technologies invented in PSI-2 Centers.
II.5. Adoption of technologies
or methodologies by PSI and other structural genomics centers.
II.6. List of accomplishments
relative to each of the Goals outlined above.
II.7. Decrease in Total Cost
per Structure. Total Cost is defined as the sum of annual indirect
plus direct costs awarded to each center. These would be assessed
separately for the Large-scale and Specialized Research Centers, using
methods outlined in the Appendix. Of particular interest is the
change in cost-per-structure from year to year. This metric will
be refined further by the Milestones and Goals Committee.
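As a rough illustration of metric II.7, the arithmetic can be sketched as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the dollar figures in the usage note are invented placeholders rather than PSI data, and the divisor (structures counted per year) is an assumption, since the text says the metric will be refined further.

```python
def cost_per_structure(direct_cost, indirect_cost, structures):
    """Total Cost (direct plus indirect, as defined in II.7) per structure.

    Assumption: 'structures' is the number of structures a center
    deposited in the same year the costs were awarded.
    """
    if structures <= 0:
        raise ValueError("structures must be a positive count")
    return (direct_cost + indirect_cost) / structures


def year_over_year_change(cost_by_year):
    """Map each year (after the first) to its change versus the prior year.

    A negative value means cost-per-structure decreased.
    """
    years = sorted(cost_by_year)
    return {
        later: cost_by_year[later] - cost_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }
```

For example, with placeholder figures, cost_per_structure(6_000_000, 2_000_000, 100) gives 80000.0, and year_over_year_change({2006: 80000.0, 2007: 60000.0}) reports a change of -20000.0 for 2007.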
III. Outreach to the Scientific
Community
Goals for Community Outreach
A key goal of the PSI is to make
3D structure an important component of biological research. This
includes propagation of new technologies and structural information
to the broad scientific community, and incorporation of project nominations
from the community, whether they be specific targets, groups of targets,
or methodological or technological advances. Approximately 15%
of the effort of PSI-2 centers will be devoted to Community Outreach
Targets. Given that these nominated projects are likely to be
more difficult than the majority of PSI-2 structures, the PSI-2 plans
to deposit at least 100 – 300 such structures of Community Outreach
Targets to the PDB.
The PSI-2 is also providing comprehensive
documentation of experimental protocols and interim results for gene
cloning, gene expression, protein purification, protein crystallization,
and structural characterization, including negative results. All
PSI-2 Centers must deposit standardized sets of data on protein sample
preparation and structure production in the public-domain PepcDB database,
shortly after the data are generated.
A further goal of PSI-2 Community
Outreach is to provide expression clones for all program targets, initially
directly from the Centers, and eventually from the PSI-2 Materials Repository.
Other reagents, such as small quantities of purified proteins, will
also be provided when readily available.
Measures of Success for Community
Outreach
The following metrics will
be reported annually by each of the PSI-2 Centers, except where indicated
otherwise.
III.1. Number of protein
targets “accepted” from community requests for investigation by
PSI-2 Centers.
III.2. Number of PDB depositions
of Community Targets.
III.3. Number and lists of
publications with joint authorship between PSI Centers and non-PSI investigators.
III.4. Numbers and lists
of expression vectors, expression hosts, purified proteins, and other
materials distributed to non-PSI investigators.
III.5. Numbers of hits on
TargetDB and PepcDB web sites per unit time. It is recognized that this
metric underestimates the use of these data, as it does not include
offline usage of these databases. This metric will be assessed centrally
by TargetDB and PepcDB.
III.6. Number of downloads
of PSI-2 Experimental Structures from the PDB. It is recognized that
this metric underestimates the use of these data, as it does not include
offline usage of these databases. This metric will be assessed centrally
by the PDB.
III.7. Number of accesses (hits)
to PSI-2 Experimental Structures in PDB, PDBsum, SCOP, or CATH databases.
III.8. Numbers of attendees
at Workshops offered by PSI-2 Centers, as self-declared by each center.
III.9. Numbers of seminars
at non-PSI institutions/department meetings/national and international
conferences on PSI activities given by PSI scientists, as self-declared
by each center.
III.10. Numbers and names of student
trainees (undergraduate, graduate, postdoctoral) (i) directly supported
by and (ii) otherwise involved in PSI-2 sponsored research programs.
III.11. Numbers and names of visiting
scientists (i) directly supported by and (ii) otherwise involved in
PSI-2 sponsored research programs.
III.12. Numbers and names of underrepresented
minority scientists involved in PSI-2 sponsored research programs.
III.13. Numbers of citations of
published papers describing PSI-2 structures.
Steering Subcommittee on Goals and Milestones
Chair: Gaetano Montelione
Members: Helen Berman, David Eisenberg, Wayne Hendrickson, Andrzej Joachimiak, George Phillips, Janna Wehrle, Stephen Burley, Ian Wilson
Advisor: Steven Brenner
DRAFT
Appendix: Definitions
of Measures of Success
The following is a draft version
of Definitions of Measures of Success. These definitions and metrics
will be formally defined by experts of the PSI Knowledge Base, and then
approved by the PSI-2 Milestones and Goals Subcommittee. They may also
be updated by the committee in the future. These metrics can be assessed
retrospectively, since the dates of all PDB depositions are recorded.
The following are guidelines to direct this process.
I. Protein Structures
I.1. Numbers of Experimental
Structures
An Experimental Structure
is defined as one determined by X-ray crystallography or NMR methods,
having satisfactory structure quality assessment statistics, and deposited
in the Protein Data Bank (PDB). These structural quality
assessment criteria will be defined by experts in the PSI Knowledge
Base and approved by the PSI-2 Milestones and Goals Subcommittee.
I.1.A. Number of Novel
Experimental PSI-2 Structures. This metric enumerates the number
of Experimental Structures (or domains within multi-domain Experimental
Structures) deposited into the PDB for which, at the time of deposition,
no 3D structure was publicly available for a close homolog, defined
operationally as one with more than 30% sequence identity over the length
of the relevant segment of the polypeptide chain.
The following procedure is
proposed to compute these statistics:
1) Comparisons will be made on the date the target structure is released into the PDB.
2) Comparisons will be made with PDB sequence data.
3) Methodology: BLAST the sequence of the target experimental structure against all released PDB structures (sequences from mmCIF/PDBML data files) available on the date the target Experimental Structure was released. These are the “prior PDB structures”. Mask all regions of the target experimental structure sequence between the beginning and end residues (including those aligned to gaps) in the local BLAST alignment with the prior PDB structure. Do this for alignments with E-values at least as significant as 10^-10, and with percent identity reported by BLAST at least 30%. If there is an unmasked region of at least 50 consecutive residues, the target experimental structure is considered “Novel”.
4) Count the number of residues
in target experimental structures that are considered “Novel” by
adding up the total number of unmasked residues.
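The masking procedure in steps 1 to 4 can be sketched in a few lines of Python. This is a minimal sketch, not the PSI Knowledge Base implementation. It assumes the BLAST alignments have already been run and parsed into simple records. The names Hit and assess_novelty are hypothetical. Reading step 4 as counting unmasked residues only for structures that pass the 50-residue test is also an assumption.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-10   # "at least as significant as 10^-10"
MIN_NOVEL_RUN = 50       # consecutive unmasked residues needed for "Novel"


@dataclass
class Hit:
    """One local BLAST alignment of the target to a prior PDB sequence."""
    start: int           # first aligned target residue (1-based, inclusive)
    end: int             # last aligned target residue (1-based, inclusive)
    e_value: float
    pct_identity: float  # percent identity as reported by BLAST


def unmasked_flags(length, hits, min_identity=30.0):
    """Return one boolean per target residue; True means still unmasked."""
    flags = [True] * length
    for hit in hits:
        if hit.e_value <= E_VALUE_CUTOFF and hit.pct_identity >= min_identity:
            # Mask everything between the alignment ends, gaps included.
            for i in range(max(hit.start, 1) - 1, min(hit.end, length)):
                flags[i] = False
    return flags


def longest_run(flags):
    """Length of the longest stretch of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def assess_novelty(length, hits, min_identity=30.0):
    """Return (is_novel, novel_residue_count) for one target sequence."""
    flags = unmasked_flags(length, hits, min_identity)
    is_novel = longest_run(flags) >= MIN_NOVEL_RUN
    return is_novel, (sum(flags) if is_novel else 0)
```

For a 200-residue target with one qualifying hit over residues 1 to 120, assess_novelty(200, [Hit(1, 120, 1e-30, 45.0)]) returns (True, 80). The identity threshold is a parameter so that the same routine can be reused with a stricter cutoff, such as the 98% value used for Distinct structures.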
I.1.B. Number of Distinct
Experimental PSI-2 Structures with Nonredundant Sequences. This metric
enumerates structures of proteins (or protein domains) with sequences
distinctly different (i.e. not identical in sequence) from sequences
deposited in the PDB prior to completing the targeted PSI-2 structure.
This metric counts separately the multiple homologues across a protein
domain family which are not novel by criterion I.1.A. The methodology
for this would be the same as in I.1.A, except that 98% ID is used instead
of 30% ID.
I.1.C. Number and Size
of Domain Families for which PSI-2 Provides the First Experimental Structural
Representative. This metric enumerates the numbers and sizes of
Domain Families or Mega Family Clusters for which PSI-2 provides the
first Experimental Structure. The sequence length of the part of the
structure that comprises the representative structure will also be tabulated.
Protein sequences of these Domain Families or Mega Family Clusters,
and their assignments to specific PSI-2 centers, will be provided at
a standard URL and will be versioned as needed.
I.1.D. Total Number of Experimental
PSI-2 Structures. This metric enumerates all PDB depositions,
including multiple structures of the same protein sequence determined
by different methods (i.e., NMR versus X-ray crystallography), in different
crystal forms, different solution conditions, or bound to different
ligands. It would also count separately proteins that differ at
just a few amino acid sites, which are not distinct by criterion I.1.B.
I.3. Numbers of Sequences for
which Homology Models can be Produced from PSI-2 Structures and Corresponding
Coverage of Specific Proteomes.
For the purposes of PSI-2, Modeling
Leverage of an experimental structure will be defined as the number
of protein sequences (or the number of protein residues covered in these
protein sequences) with E value < 10^-10 (corresponding
to ~ 30% sequence identity) relative to the target experimental
structure. The modeling leverage depends on both the sequence
database used and the date of the analysis. The Modeling Leverage of
a particular Experimental Structure increases with time, as the sequence
databases expand. For the purpose of PSI-2, the modeling leverage
will be computed against a specified version of the UniProt database.
This choice will facilitate comparative analysis at different time points.
I.3.A. Total Modeling Leverage
This metric is defined as (i)
number of protein sequences (all or a part spanning at least 50 residues),
and (ii) the number of residues covered in these protein sequences,
for which the target Experimental Structure has modeling leverage, as
defined above.
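A minimal sketch of how Total Modeling Leverage might be tallied from BLAST results is given below. It assumes the hits of one Experimental Structure against the chosen UniProt release are already available as (sequence id, start, end, E value) tuples. The function names are hypothetical. Counting residues only in sequences that meet the 50-residue span is one reading of the definition, not a confirmed convention.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-10   # leverage requires E value < 10^-10
MIN_SPAN = 50            # "a part spanning at least 50 residues"


def longest_consecutive(positions):
    """Length of the longest run of consecutive integers in a collection."""
    best = run = 0
    prev = None
    for p in sorted(positions):
        run = run + 1 if prev is not None and p == prev + 1 else 1
        best = max(best, run)
        prev = p
    return best


def total_modeling_leverage(hits):
    """Return (sequence_count, residue_count) for one Experimental Structure.

    hits: iterable of (seq_id, start, end, e_value), with 1-based inclusive
    coordinates on the database sequence.
    """
    covered = defaultdict(set)
    for seq_id, start, end, e_value in hits:
        if e_value < E_VALUE_CUTOFF:
            covered[seq_id].update(range(start, end + 1))

    n_sequences = n_residues = 0
    for positions in covered.values():
        # Assumption: a sequence counts only if one covered stretch
        # reaches the minimum span.
        if longest_consecutive(positions) >= MIN_SPAN:
            n_sequences += 1
            n_residues += len(positions)
    return n_sequences, n_residues
```

Because the result depends on which sequences are in the database, the same structure scored against a later release would normally yield a larger count, which is why a fixed version is specified.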
I.3.B. Novel Modeling Leverage
This metric is defined as (i)
the total number of protein sequences (all or a part spanning at least
50 residues), and (ii) the number of residues covered in these protein
sequences, for which the target Experimental Structure has modeling
leverage, excluding the protein sequences (or residues in these sequences)
which fit these same criteria using the set of protein structures released
in the PDB prior to the release date of the Experimental Structure.
The following procedure is
proposed to compute these statistics:
Calculate this on both a per-sequence and per-residue basis, using similar methodology to I.1.A:
1) Mask all regions of UniProt sequences between beginning and end residues of BLAST alignments (with E value at least as significant as 10^-10 AND at least 30% identity) to all PDB entries released prior to the release of the target experimental structure.
2) Find additional regions of UniProt sequence that are masked by BLAST alignment (same criteria as above) to the target experimental structure.
3) If there is ANY additional masking, the Center that solved the target experimental structure gets credit for the number of additional residues masked in the per-residue metric.
4) If there is a stretch of at least 50 consecutive residues that are masked by the target experimental Structure that were not previously masked, the Center gets credit for 1 new modelable structure in the per-sequence metric.
5) Note that two Centers that
each contribute a target experimental structure that allows modeling
of a different 50-residue region of the same protein could BOTH get
credit for 1 new modelable structure by criterion
4. It is also possible to tabulate which Center first provided
ANY structural information for a sequence.
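The crediting rules in steps 1 to 4 can be sketched for a single UniProt sequence as follows. This is an illustrative sketch only. It assumes the alignment intervals have already been filtered by the E value and identity criteria. The function names are hypothetical.

```python
def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def longest_consecutive(pos):
    """Length of the longest run of consecutive integers in a collection."""
    best = run = 0
    prev = None
    for p in sorted(pos):
        run = run + 1 if prev is not None and p == prev + 1 else 1
        best = max(best, run)
        prev = p
    return best


def novel_leverage(prior_intervals, target_intervals, min_run=50):
    """Return (per_residue_credit, per_sequence_credit) for one sequence.

    prior_intervals:  regions masked by PDB entries released earlier (step 1)
    target_intervals: regions masked by the target structure (step 2)
    """
    newly_masked = positions(target_intervals) - positions(prior_intervals)
    per_residue = len(newly_masked)                      # step 3
    per_sequence = 1 if longest_consecutive(newly_masked) >= min_run else 0  # step 4
    return per_residue, per_sequence
```

For example, if prior structures mask residues 1 to 100 and the target masks 80 to 180, novel_leverage([(1, 100)], [(80, 180)]) returns (80, 1). If prior coverage splits the new region, as in novel_leverage([(40, 60)], [(1, 100)]), the result is (79, 0): residue credit is given, but no stretch of 50 new residues exists.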
I.3.C. Modeling Leverage
and Coverage of the human proteome
Total modeling leverage and
novel modeling leverage will be assessed for the human proteome using
the set of human proteins in a specified version of UniProt.
Assessments will be made both for (i) number and percentage of distinct
human proteins that are spanned by the sequences of all experimental
structures (all or a part spanning at least 50 residues), and (ii) the
number and percentage of all non-overlapping residues structurally covered
in all human proteins in UniProt. This can be done using the same
methodology as in I.3.B, but with human proteins only.
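The two coverage figures, (i) proteins spanned and (ii) non-overlapping residues covered, can be sketched as below. This is a minimal sketch under assumptions: each protein's covered regions are supplied as already-filtered alignment intervals, and the names merge_intervals and proteome_coverage are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def proteome_coverage(proteome, min_span=50):
    """Summarize structural coverage of a proteome.

    proteome: dict mapping protein id -> (length, list of covered intervals).
    Returns (proteins_spanned, pct_proteins, residues_covered, pct_residues).
    """
    if not proteome:
        return 0, 0.0, 0, 0.0
    proteins_spanned = residues_covered = total_residues = 0
    for length, intervals in proteome.values():
        merged = merge_intervals(intervals)
        total_residues += length
        # Merging first ensures overlapping alignments are not double counted.
        residues_covered += sum(end - start + 1 for start, end in merged)
        if any(end - start + 1 >= min_span for start, end in merged):
            proteins_spanned += 1
    pct_proteins = 100 * proteins_spanned / len(proteome)
    pct_residues = 100 * residues_covered / total_residues
    return proteins_spanned, pct_proteins, residues_covered, pct_residues
```

The same routine applies to any single-organism proteome, since only the input set of proteins changes.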
I.3.D. Modeling Leverage
and Coverage of proteomes of model organisms and pathogenic microorganisms.
Total modeling leverage and
novel modeling leverage will be assessed for proteomes of specific organisms,
including M. musculus, C. elegans, D. melanogaster,
A. thaliana, zebrafish, S. cerevisiae, E. coli (gram
negative), B. anthracis (gram positive), T. maritima,
M. tuberculosis, and others, using a specified version of UniProt.
Assessments will be made both for (i) number and percentage of distinct
proteins in each organism that are spanned by the sequences of all experimental
structures (all or a part spanning at least 50 residues), and (ii) the
number and percentage of all non-overlapping residues that are structurally
covered for all proteins in each organism. This can be done using the
same methodology as in I.3.B, but calculated on the proteome of each
organism.