Mol Biol Evol. 2013 Jan;30(1):36-44.
doi: 10.1093/molbev/mss217. Epub 2012 Sep 12.

Integrating sequence variation and protein structure to identify sites under selection


Austin G Meyer et al. Mol Biol Evol. 2013 Jan.

Abstract

We present a novel method to identify sites under selection in protein-coding genes. Our method combines the traditional Goldman-Yang model of coding-sequence evolution with the information obtained from the 3D structure of the evolving protein, specifically the relative solvent accessibility (RSA) of individual residues. We develop a random-effects likelihood sites model in which rate classes are RSA dependent. The RSA dependence is modeled with linear functions. We demonstrate that our RSA-dependent model provides a significantly better fit to molecular sequence data than does a traditional, RSA-independent model. We further show that our model provides a natural, RSA-dependent neutral baseline for the evolutionary rate ratio ω = dN/dS. Sites that deviate from this neutral baseline likely experience selection pressure for function. We apply our method to the influenza proteins hemagglutinin and neuraminidase. For hemagglutinin, our method recovers positively selected sites near the sialic acid-binding site and negatively selected sites that may be important for trimerization. For neuraminidase, our method recovers the oseltamivir resistance site and otherwise suggests that few sites deviate from the neutral baseline. Our method is broadly applicable to any protein sequences for which structural data are available or can be obtained via homology modeling or threading.
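As a rough illustration of what "rate classes that depend linearly on RSA" could look like, the sketch below gives each rate class an intercept and a slope and computes a posterior-weighted ω for one site. It is a minimal sketch and not the authors' implementation: the names `RateClass`, `expected_omega`, and `EXAMPLE_CLASSES`, the clamping at zero, and all parameter values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RateClass:
    """One rate class whose omega (dN/dS) is a linear function of RSA."""
    intercept: float  # omega at RSA = 0 (fully buried residue)
    slope: float      # change in omega per unit of RSA

    def omega(self, rsa: float) -> float:
        # Clamping at zero is a choice of this sketch: a rate ratio cannot be negative.
        return max(0.0, self.intercept + self.slope * rsa)


def expected_omega(classes: Sequence[RateClass],
                   posteriors: Sequence[float],
                   rsa: float) -> float:
    """Posterior-weighted mean omega for a single site."""
    if len(classes) != len(posteriors):
        raise ValueError("need one posterior weight per rate class")
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posterior weights must sum to a positive value")
    return sum(p * c.omega(rsa) for c, p in zip(classes, posteriors)) / total


# Invented parameters for three rate classes (NOT estimates from the paper).
EXAMPLE_CLASSES = [RateClass(0.05, 0.10), RateClass(0.20, 0.60), RateClass(0.80, 1.50)]

if __name__ == "__main__":
    # A half-exposed site that most likely belongs to the middle class.
    print(expected_omega(EXAMPLE_CLASSES, [0.1, 0.8, 0.1], rsa=0.5))
```

In an actual analysis the intercepts, slopes, and posterior weights would come from fitting the likelihood model to a codon alignment; here they are placeholders that only show the shape of the calculation.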

Figures

Fig. 1.
Regions of interest in the ω–RSA plot. Most sites in proteins fall into a trapezoidal region that we consider the neutral baseline. Sites with ω > 1 are generally considered to be under positive diversifying selection. In addition to such sites, our method can also identify sites with ω < 1 whose ω is nonetheless larger or smaller than expected given their RSA. These sites fall into the triangular regions below ω = 1 that are either above or below the neutral baseline. Sites in these regions experience either an accelerated or a reduced rate of evolution relative to the baseline and are likely to be functionally important.
Fig. 2.
Model fit as a function of the number of slopes and intercepts in the model for the influenza hemagglutinin trimer. The shading reflects the difference in AIC between the best model (three slopes and three intercepts in this case) and all other models.
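The AIC comparison described in this caption can be sketched as follows, using the standard definition AIC = 2k − 2 ln L. This is a minimal sketch under stated assumptions: the function names `aic` and `delta_aic`, the model labels, and every log-likelihood and parameter count below are invented for illustration and are not values from the paper.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(models: Dict[str, Tuple[float, int]]) -> Dict[str, float]:
    """Map each model name to its AIC minus the AIC of the best model.

    `models` maps a name to (log-likelihood, number of free parameters).
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical fits, keyed by "<slopes> slopes, <intercepts> intercepts".
    hypothetical_fits = {
        "0 slopes, 3 intercepts": (-10250.0, 5),
        "1 slope, 3 intercepts": (-10210.0, 6),
        "3 slopes, 3 intercepts": (-10190.0, 8),
    }
    for name, delta in sorted(delta_aic(hypothetical_fits).items(), key=lambda kv: kv[1]):
        print(f"{name}: dAIC = {delta:.1f}")
```

The best model always has a difference of zero, so a grid of these differences over slope and intercept counts is the kind of quantity the shading in such a figure would encode.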
Fig. 3.
Assignments of sites to rate classes, for the influenza hemagglutinin trimer. Each graph shows each site’s dN/dS plotted against the site’s RSA. Sites are assumed to evolve at a dN/dS determined by the rate class they are most likely to fall into. Top left: The best model with multiple intercepts and no slope (no RSA dependence). Top right: The best model with multiple intercepts and one slope. Bottom left: The best model with multiple slopes and a single intercept. Bottom right: The overall best model with three intercepts and three slopes. ΔAIC values are calculated relative to the overall best model. Figure S2 shows the same results but averaged over rate classes.
Fig. 4.
Average ω versus RSA for hemagglutinin, obtained from the optimal model (three slopes and three intercepts). Dashed lines indicate the trapezoidally shaped neutral baseline (as ascertained by eye). Sites highlighted in red are within 8 Å of the sialic acid-binding region. Sites above the upper dashed line are significantly enriched in sites near the sialic acid-binding region (Fisher’s exact test, OR = 6.6, formula image).
Fig. 5.
Sites of interest identified for hemagglutinin. Sites that fall above the upper dashed line in fig. 4 are colored orange. Sites that fall below the lower dashed line in fig. 4 are colored light blue. The polypeptide backbone is colored green. Sialic acid is represented by the space-filling model near the top of the molecule. (A) View of the entire hemagglutinin monomer. (B) View of the sialic acid-binding region. Sites that are highlighted as “SA binding?” are unusually conserved and close to (though not within 8 Å of) the sialic acid. Sites that are highlighted as “trimer interface” are unusually conserved and seem to be important for trimerization. (C) View of the trimer-interface region. Labeling of sites is as in part (B).
Fig. 6.
Comparison of our results with previous work on neuraminidase. Left: Sites found by Bloom et al. (2010) to be involved in the evolution of oseltamivir resistance are highlighted in red. Right: Site 274 and sites found by Kryazhimskiy et al. (2011) to have 274 as trailing site are highlighted in red.

