Mol Biol Evol. 2017 Nov 1;34(11):3006-3022.
doi: 10.1093/molbev/msx213.

Detection of Regional Variation in Selection Intensity within Protein-Coding Genes Using DNA Sequence Polymorphism and Divergence

Zi-Ming Zhao et al. Mol Biol Evol.

Abstract

Numerous approaches have been developed to infer natural selection based on the comparison of polymorphism within species and divergence between species. These methods are especially powerful for the detection of uniform selection operating across a gene. However, empirical analyses have demonstrated that regions of protein-coding genes exhibiting clusters of amino acid substitutions are subject to different levels of selection relative to other regions of the same gene. To quantify this heterogeneity of selection within coding sequences, we developed Model Averaged Site Selection via Poisson Random Field (MASS-PRF). MASS-PRF identifies an ensemble of intragenic clustering models for polymorphic and divergent sites. This ensemble of models is used within the Poisson Random Field framework to estimate selection intensity on a site-by-site basis. Using simulations, we demonstrate that MASS-PRF has high power to detect clusters of amino acid variants in small genic regions, can reliably estimate the probability of a variant occurring at each nucleotide site in sequence data and is robust to historical demographic trends and recombination. We applied MASS-PRF to human gene polymorphism derived from the 1,000 Genomes Project and divergence data from the common chimpanzee. On the basis of this analysis, we discovered striking regional variation in selection intensity, indicative of positive or negative selection, in well-defined domains of genes that have previously been associated with neurological processing, immunity, and reproduction. We suggest that amino acid-altering substitutions within these regions likely are or have been selectively advantageous in the human lineage, playing important roles in protein function.

Keywords: Poisson Random Field; divergence; human evolution; model averaged site selection; natural selection; polymorphism.

Figures

Fig. 1.
The workflow of the Model Averaged Site Selection via Poisson Random Field (MASS-PRF) approach. Step 1 consists of the construction of clustering models. Step 2 consists of the estimation of model-averaged selection intensity γ and its 95% model uncertainty intervals for each site. Step 1 can be applied separately to aligned sequences of (A) replacement polymorphism (RP), (B) synonymous (silent) polymorphism (SP), (C) synonymous divergence (SD), and (D) replacement divergence (RD). Step 2 uses observed probabilities of SP, SD, RP, and RD, and merges them into PRF theory to estimate (E) selection intensity. RP and RD are used to estimate model-averaged selection intensities and their 95% model uncertainty intervals (solid line in blue); SP and SD can be combined to represent intragenic inhomogeneity of mutation rate (dashed line in red) for calculations of site-specific divergence time by default, or assuming homogeneity of mutation rate, can be replaced by gene-level divergence time calculated from total counts of SP and SD, or species divergence time can be exogenously supplied by the user.
Fig. 2.
Comparison of accuracy and precision based on the Kullback–Leibler (KL) divergence between simulated probabilities and estimated probabilities for the clustered and nonclustered models. The KL divergence quantifies the divergence of the distribution of expected (simulated) probabilities from the distribution of estimated probabilities. A KL divergence closer to zero indicates that the estimated probability from the clustered model (green triangles) or the nonclustered model (red triangles) more closely matches the expected probability imposed within the simulation. The simulated 1,200 bp sequences featured variant sites restricted to cluster sizes (lengths of regions with all variants) ranging from 100 bp (a tight cluster) to 1,200 bp, incrementing cluster size by 100 bp. Analyses of accuracy (solid lines) and precision (dashed lines) are displayed for four classes of variants: (A) synonymous (silent) polymorphism, (B) replacement polymorphism, (C) synonymous divergence, and (D) replacement divergence.
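As a rough illustration of the KL divergence used in this comparison, the following Python sketch computes it for two vectors of per-site probabilities. The function name `kl_divergence`, the normalization of both vectors to sum to 1, the direction (simulated relative to estimated), and the example values are all assumptions made for illustration; the caption does not specify how the authors computed it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats between two discrete distributions.

    Both inputs are rescaled to sum to 1 (an assumption of this sketch).
    Terms with p_i == 0 contribute nothing; q_i is floored at eps to
    avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    p_sum, q_sum = sum(p), sum(q)
    if p_sum <= 0 or q_sum <= 0:
        raise ValueError("p and q must each have a positive sum")
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / p_sum, qi / q_sum
        if pi > 0:
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical per-site probabilities, not taken from the paper.
simulated = [0.10, 0.40, 0.40, 0.10]
estimated = [0.15, 0.35, 0.35, 0.15]
print(kl_divergence(simulated, estimated))
```

A value near zero means the estimated profile is close to the simulated one, which is how the caption reads the plotted points.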
Fig. 3.
Profiles of selection intensity (γ) across nucleotide positions for seven genes: (A) SLC6A5, (B) GRIN2C, (C) RNASEL, (D) IL18RAP, (E) TGM4, (F) WBP2NL, (G) SPAG5. A line indicates the model-averaged γ (red if the lower bound of γ > 4 or if the upper bound of γ < −1, otherwise blue), and a grey band indicates the 95% model uncertainty interval. The black horizontal line in each plot indicates γ = 0. Each figure reports synonymous polymorphic sites (Ps), replacement polymorphic sites (Pr), synonymous divergent sites (Ds), replacement divergent sites (Dr), and the Fisher Exact P value for the corresponding MK test.
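The highlighting rule in this caption can be written as a small Python sketch. The function name `classify_site` and the returned labels are hypothetical; only the two thresholds (lower bound of γ > 4, upper bound of γ < −1) come from the caption, and reading them as "positive" and "negative" is an interpretation rather than the authors' wording.

```python
def classify_site(lower, upper, pos_threshold=4.0, neg_threshold=-1.0):
    """Label a site from the bounds of its 95% model uncertainty interval.

    Returns "positive" when the whole interval lies above pos_threshold,
    "negative" when it lies below neg_threshold, and "not highlighted"
    otherwise (drawn in blue in the figure).
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower > pos_threshold:
        return "positive"
    if upper < neg_threshold:
        return "negative"
    return "not highlighted"


# Hypothetical intervals, not taken from the paper.
for bounds in [(4.5, 9.0), (-6.0, -1.5), (-0.5, 3.0)]:
    print(bounds, classify_site(*bounds))
```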
Fig. 4.
Four scenarios for comparing selection inference by MASS-PRF and the MK test. (A) SLC8A1 was inferred to be under selection both by MASS-PRF and the MK test (P = 0.02); (B) NT5C1B was inferred to be under selection by MASS-PRF, but not by the MK test (P = 0.67); (C) MGAM was not inferred to be under selection by MASS-PRF, but was inferred to be under selection by the MK test (P = 0.04); (D) TPH2 was not inferred to be under selection by either MASS-PRF or the MK test (P = 1). A line indicates the model-averaged γ (red if the lower bound of γ > 4 or if the upper bound of γ < −1, otherwise blue), and a grey band indicates the 95% model uncertainty interval. The black horizontal line in each plot indicates γ = 0. Each figure reports synonymous polymorphic sites (Ps), replacement polymorphic sites (Pr), synonymous divergent sites (Ds), replacement divergent sites (Dr), and the Fisher Exact P value for the corresponding MK test.
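For readers unfamiliar with the gene-level MK test reported alongside each profile, the sketch below computes a two-sided Fisher exact P value on the 2×2 table of Ps, Pr, Ds, and Dr counts using only the Python standard library. The function names, the table layout, and the example counts are assumptions for illustration, and the two-sided convention used here (summing all tables no more probable than the observed one) may differ from the one the authors applied.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def mk_test(ps, pr, ds, dr):
    """MK-style test: rows are synonymous/replacement, columns are
    polymorphic/divergent (a layout assumed for this sketch)."""
    return fisher_exact_two_sided(ps, ds, pr, dr)


# Hypothetical counts, not taken from any gene in the paper.
print(mk_test(ps=20, pr=5, ds=15, dr=18))
```

Because this test pools counts across the whole gene, it cannot show where along the sequence a signal arises, which is the contrast the four panels illustrate.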
