Structural genomics target selection for the New York consortium on membrane protein structure

Marco Punta¹, James Love, Samuel Handelman, John F Hunt, Lawrence Shapiro, Wayne A Hendrickson, Burkhard Rost

PMID: 19859826
PMCID: PMC2780672
DOI: 10.1007/s10969-009-9071-1

Marco Punta et al. J Struct Funct Genomics. 2009 Dec;10(4):255-68. doi: 10.1007/s10969-009-9071-1. Epub 2009 Oct 27.

Affiliation

¹ Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY, 10032, USA. punta@rostlab.org

Abstract

The New York Consortium on Membrane Protein Structure (NYCOMPS), a part of the Protein Structure Initiative (PSI) in the USA, has as its mission to establish a high-throughput pipeline for determination of novel integral membrane protein structures. Here we describe our current target selection protocol, which applies structural genomics approaches informed by the collective experience of our team of investigators. We first extract all annotated proteins from our reagent genomes, i.e. the 96 fully sequenced prokaryotic genomes from which we clone DNA. We filter this initial pool of sequences and obtain a list of valid targets. NYCOMPS defines valid targets as those that, among other features, have at least two predicted transmembrane helices, no predicted long disordered regions and, except for community nominated targets, no significant sequence similarity in the predicted transmembrane region to any known protein structure. Proteins that feed our experimental pipeline are selected by defining a protein seed and searching the set of all valid targets for proteins that are likely to have a transmembrane region structurally similar to that of the seed. We require sequence similarity aligning at least half of the predicted transmembrane region of seed and target. Seeds are selected according to their feasibility and/or biological interest, and they include both centrally selected targets and community nominated targets. As of December 2008, over 6,000 targets have been selected and are currently being processed by the experimental pipeline. We discuss how our target list may impact structural coverage of the membrane protein space.

Figures

**Fig. 1**
Target selection protocol at NYCOMPS. a *Building the NYCOMPS98 dataset of valid targets.* We selected targets from 96 fully sequenced prokaryotic genomes. We used TMHMM2 [31] to predict TMHs in this set and retained only sequences with ≥2 TMHs. Finally, we applied a series of additional filters: we reduced redundancy at 98% using CD-HIT [34], we removed all sequences with 2 predicted TMHs for which the first TMH overlapped with a predicted signal peptide (using SignalP [35]) and we discarded sequences with at least 15 consecutive residues predicted to be disordered (using IUPred [36]). All sequences left constitute our set of valid targets, which we call NYCOMPS98. b *Expanding a protein seed into a family of related proteins within NYCOMPS98.* The seed is aligned against the whole NYCOMPS98 dataset using PSI-BLAST [39]. Retained sequences are those that satisfy our similarity criterion (Fig. 2). From this list we eliminate: sequences that are significantly similar to PDB proteins (filter is not applied to nominated targets), sequences known to constitute individual subunits of hetero-oligomeric complexes and sequences that differ significantly from the seed with respect to sequence length and number of predicted TMHs. We also discard proteins that align well with the family N-terminus consensus sequence (if any such consensus can be identified) but add some extra N-terminal residues, i.e. are possibly mis-annotated (Fig. S3). All remaining sequences are finally sent to cloning
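
The filters in panel (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NYCOMPS implementation: the `Protein` record, its field names and the function names are hypothetical, the TMH, signal peptide and disorder predictions are assumed to have been computed beforehand (by TMHMM2, SignalP and IUPred in the paper), "overlap" is taken to mean any shared residue, and the 98% redundancy reduction with CD-HIT is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Protein:
    """Hypothetical record holding precomputed predictions for one sequence."""
    name: str
    tmh_spans: List[Tuple[int, int]]    # predicted TMHs as 1-based inclusive (start, end)
    signal_peptide_end: Optional[int]   # last residue of a predicted signal peptide, or None
    disordered: List[bool]              # per-residue disorder call


def longest_disordered_run(disordered: List[bool]) -> int:
    """Length of the longest stretch of consecutive disordered residues."""
    longest = current = 0
    for flag in disordered:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def is_valid_target(p: Protein, min_tmh: int = 2, disorder_cutoff: int = 15) -> bool:
    """Apply the three per-sequence filters described in the caption."""
    # Filter 1: keep only sequences with at least two predicted TMHs.
    if len(p.tmh_spans) < min_tmh:
        return False
    # Filter 2: drop 2-TMH sequences whose first TMH overlaps a predicted
    # signal peptide (assumption: any shared residue counts as overlap).
    if len(p.tmh_spans) == 2 and p.signal_peptide_end is not None:
        first_start, _ = min(p.tmh_spans)
        if first_start <= p.signal_peptide_end:
            return False
    # Filter 3: drop sequences with 15 or more consecutive disordered residues.
    if longest_disordered_run(p.disordered) >= disorder_cutoff:
        return False
    return True
```

A sequence with two predicted TMHs at residues 30–50 and 70–90, no signal peptide and no disordered stretch would pass; the same sequence with a 15-residue disordered run would not.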

**Fig. 2**
αIMPs similarity criterion. We align sequence α to sequence β (both αIMPs) using PSI-BLAST [39]. If the alignment has an E-value <10⁻³ and extends over ≥50% of the predicted TM regions of both proteins, then we consider β similar to α. This criterion is used throughout the paper to establish similarity between αIMPs, e.g. between a seed protein and proteins in the NYCOMPS98 dataset
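
As a rough sketch, the criterion reduces to an E-value test plus a coverage test on each protein. The function names and argument layout below are hypothetical, and two simplifications are assumed: the TM region is taken as the union of predicted TMH residues, and coverage is measured from the alignment end points only, ignoring gaps inside the alignment. The paper's exact definition may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based inclusive (start, end)


def covered_fraction(tm_spans: List[Span], aln_start: int, aln_end: int) -> float:
    """Fraction of predicted TM residues lying inside the aligned range."""
    total = sum(end - start + 1 for start, end in tm_spans)
    if total == 0:
        return 0.0
    covered = sum(
        max(0, min(end, aln_end) - max(start, aln_start) + 1)
        for start, end in tm_spans
    )
    return covered / total


def is_similar(
    e_value: float,
    alpha_tm: List[Span],
    alpha_aln: Span,
    beta_tm: List[Span],
    beta_aln: Span,
    max_e: float = 1e-3,
    min_cov: float = 0.5,
) -> bool:
    """True if the alignment meets the E-value and TM-coverage thresholds."""
    # The E-value must be strictly below the cutoff.
    if not e_value < max_e:
        return False
    # The alignment must cover at least half of the TM region of BOTH proteins.
    return (
        covered_fraction(alpha_tm, *alpha_aln) >= min_cov
        and covered_fraction(beta_tm, *beta_aln) >= min_cov
    )
```

Requiring coverage on both sequences keeps a short protein from counting as similar to a much larger one on the strength of a single shared helix.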

**Fig. 3**
Diversity of NYCOMPS targets. a Distribution of sequence lengths. x-axis tick labels represent ranges, e.g. 100 means between 0 and 100 residues. The last bin (1,100) includes all proteins longer than 1,000 residues. b Distribution of number of TMHs predicted by TMHMM2 in all selected targets

**Fig. 4**
Potential novel αIMP leverage provided by NYCOMPS targets. a The x-axis gives the number of seed families for which we hypothetically determine a structure (corresponding to 10 seed families or to 25–100% of all seed families; e.g. 25% corresponds to 43 seed families and 100% to 174 seed families); the y-axis reports the number of predicted αIMPs with more than 2 TMHs for which more than 50% of the residues in the TM region could be modeled using the NYCOMPS targets on the x-axis as templates (leverage). *Numbers* on the y-axis are for proteins in: UniProtKB-TMH (i.e. all predicted αIMPs in UniProtKB with more than 2 TMHs, see “Methods”; *blank circles* and *continuous line*), Swiss-Prot-TMH (*filled diamonds* and *long-dash line*) and UniProtKB-TMH-Human (i.e. human proteins in UniProtKB-TMH, *crossed squares* and *short-dash line*). *Error bars* are obtained by bootstrapping [48] (“Methods”). b Comparison between NYCOMPS target and PDB protein leverage. On the y-axis we report the ratio between the respective leverage values. Notations are as in (a). See “Methods” for the way UniProtKB-TMH leverage by PDB proteins is calculated
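
The bootstrapped error bars in panel (a) can be illustrated with a generic resampling sketch. The actual procedure is described in the paper's "Methods" and is not reproduced here, so everything below is an assumption: the `modelable` mapping from seed family to the set of proteins it could serve as a template for, the function names, and the choice to resample families with replacement.

```python
import random
import statistics
from typing import Dict, List, Set, Tuple


def leverage(families: List[str], modelable: Dict[str, Set[str]]) -> int:
    """Number of distinct proteins modelable from at least one chosen family."""
    covered: Set[str] = set()
    for fam in families:
        covered |= modelable[fam]
    return len(covered)


def bootstrap_leverage(
    modelable: Dict[str, Set[str]],
    n_solved: int,
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and standard deviation of leverage over resampled family sets."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    names = sorted(modelable)      # stable order, independent of dict insertion
    values = [
        # Draw n_solved families with replacement and score the draw.
        leverage(rng.choices(names, k=n_solved), modelable)
        for _ in range(n_resamples)
    ]
    return statistics.mean(values), statistics.stdev(values)
```

Here the standard deviation across resamples plays the role of the error bar at a given number of solved seed families.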
