BMC Evol Biol. 2007 Oct 12:7:191.
doi: 10.1186/1471-2148-7-191.

DNA indels in coding regions reveal selective constraints on protein evolution in the human lineage

Nicole de la Chaux et al. BMC Evol Biol.

Abstract

Background: Insertions and deletions of DNA segments (indels) are, together with substitutions, the major mutational processes that generate genetic variation. Here we focus on recent DNA insertions and deletions in protein coding regions of the human genome to investigate selective constraints on indels in protein evolution.

Results: Frequencies of inserted and deleted amino acids differ from background amino acid frequencies in the human proteome. Small amino acids are overrepresented, while hydrophobic, aliphatic and aromatic amino acids are strongly suppressed. Indels are preferentially located in protein regions that do not form important structural domains. Amino acid insertion and deletion rates in genes associated with elementary biochemical reactions (e.g., catalytic activity, ligase activity, electron transport, or catabolic process) are lower than those in other genes, indicating that these genes are subject to stronger purifying selection.

Conclusion: Our analysis indicates that indels in human protein coding regions are subject to distinct levels of selective pressure with regard to their structural impact on the amino acid sequence, as well as to general properties of the genes they are located in. These findings confirm that many commonly accepted characteristics of selective constraints for substitutions are also valid for amino acid insertions and deletions.

Figures

Figure 1
Length distribution of non-frameshifting indels. Length distributions of coding insertions and deletions decay rapidly with increasing indel length. The longest insertion in our set is a 405 bp segment; the largest deletion covers 168 bp.
Figure 2
Conservative and non-conservative indel events. a) Examples of 3 bp insertions in a protein coding region. An insertion can occur between two codons (phase 0), between the first and second nucleotide of a codon (phase 1), or between the second and third nucleotide (phase 2). Phase 1 and 2 insertions can be divided into conservative events, which only insert a new amino acid without changing the translated amino acid of the ancestral codon (phase 1- and 2-), and non-conservative events, which additionally change it (phase 1+, 2+). Insertions in phase 0 are always conservative. Deletions can be partitioned into the same five categories (reversing the time arrows in the figure yields the corresponding examples). Note that the indel in phase 1- could also have been assigned as a phase 0 or phase 2 indel, depending on where the alignment algorithm prefers to place the gap (all three gap placements have equal numbers of matches and gaps and therefore equal alignment scores). b) Measured frequencies of non-conservative insertion and deletion events in observed data and simulations.
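The phase and conservation categories described in panel a) can be illustrated with a minimal Python sketch. It assumes an in-frame ancestral coding sequence, a 0-based insertion position and the standard genetic code. The function name `classify_insertion` and the rule used here (an insertion counts as conservative when the ancestral amino acid survives in one of the two derived codons) are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_insertion(ancestral, pos, insert):
    """Return (phase, is_conservative) for a 3 bp insertion.

    ancestral: in-frame ancestral coding sequence (hypothetical input).
    pos: 0-based index in `ancestral` before which `insert` is placed.
    insert: the three inserted nucleotides.
    """
    if len(insert) != 3:
        raise ValueError("this sketch only handles 3 bp insertions")
    phase = pos % 3
    if phase == 0:
        # Insertion between two codons: the ancestral codons are untouched.
        return phase, True
    start = pos - phase                      # start of the split codon
    old_aa = CODON_TABLE[ancestral[start:start + 3]]
    derived = ancestral[:pos] + insert + ancestral[pos:]
    new_aas = (
        CODON_TABLE[derived[start:start + 3]],
        CODON_TABLE[derived[start + 3:start + 6]],
    )
    # Assumed rule: conservative if the ancestral amino acid is retained.
    return phase, old_aa in new_aas
```

For example, with the hypothetical ancestral sequence `ATGGCTAAA` (Met-Ala-Lys), inserting `CAG` at position 4 yields Met-Ala-Ala-Lys (phase 1, conservative), whereas inserting `TTT` at the same position yields Met-Val-Ser-Lys (phase 1, non-conservative).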
Figure 3
Frequencies of inserted and deleted amino acids. a) Frequency distribution of inserted/deleted amino acids resulting from coding indels in our set compared to the background amino acid frequencies in all human proteins. b) Frequencies of inserted/deleted amino acids grouped according to 10 different physicochemical categories. Note that amino acids can be assigned to more than one category. Error bars in a) and b) are standard deviations calculated by $\Delta f_i = \sqrt{N_i} / \sum_j N_j$, where $N_i$ is the total number of inserted/deleted amino acids i, or amino acids in category i, respectively.
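The error bars can be sketched in a few lines of Python, assuming they are Poisson counting errors, i.e. the square root of each count divided by the total count. The function name `frequencies_with_errors` and the example counts are hypothetical and not taken from the paper.

```python
from math import sqrt


def frequencies_with_errors(counts):
    """Map each key to (frequency, error) given raw counts.

    frequency_i = N_i / sum_j N_j
    error_i     = sqrt(N_i) / sum_j N_j   (assumed Poisson counting error)
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one count must be non-zero")
    return {key: (n / total, sqrt(n) / total) for key, n in counts.items()}


# Hypothetical counts of inserted amino acids, for illustration only.
example = frequencies_with_errors({"G": 16, "A": 9, "W": 0})
```

With these made-up counts, glycine has a frequency of 16/25 with an error of 4/25, and an amino acid that was never observed gets an error of zero under this formula.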
Figure 4
Indel frequencies in different structural regions of proteins. Frequency distribution of indel events in the four secondary structure categories helix, sheet, turn and no structure. The background distribution is the relative fraction of amino acids residing in each structure among all analyzed proteins. Error bars were calculated by $\Delta f_i = \sqrt{N_i} / \sum_j N_j$, where $N_i$ is the total number of indels in structure i.
Figure 5
Distribution of dN/dS values among indel-containing genes. The histograms show the measured distributions of gene frequencies with dN/dS values in binned intervals of length 0.1, starting from 0. Gene frequencies generally peak in the interval 0 ≤ dN/dS ≤ 0.1 and decay for larger dN/dS values, indicating strong purifying selection on protein coding regions throughout evolution. However, the distributions of the subsets of genes that contain at least one insertion/deletion decay more slowly than the background distribution of all analyzed genes.
Figure 6
Indel rates in 63 GO slim categories. For each GO slim category, indel rates were measured in events (insertions + deletions) per 100 kbp in the protein coding regions of all genes assigned to that category. The horizontal black line is the average indel rate in all protein coding regions with available GO annotation. We assumed that errors of indel rates are given by $\Delta r_i = \sqrt{N_i} / L_i$, where $N_i$ is the overall number of indels in GO slim category i, and $L_i$ is the total length of all protein coding regions assigned to that category. For GO slim categories with $N_i = 0$, errors were obtained by setting $N_i = 1$. The category nucleic acid metabolic process combines nucleobase, nucleoside, nucleotide and nucleic acid metabolic processes.
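As a rough sketch of this calculation, the rate and its error for one category could be computed as follows. It assumes the error is the square root of the indel count divided by the coding length, expressed in the same per-100 kbp units, with a count of zero replaced by one for the error only. The function name `indel_rate_per_100kbp` and the numbers in the usage note are hypothetical.

```python
from math import sqrt


def indel_rate_per_100kbp(n_indels, coding_length_bp):
    """Return (rate, error) in events per 100 kbp for one category.

    n_indels: insertions + deletions observed in the category.
    coding_length_bp: total coding length assigned to the category.
    """
    if coding_length_bp <= 0:
        raise ValueError("coding length must be positive")
    scale = 100_000 / coding_length_bp
    rate = n_indels * scale
    # For empty categories the error is obtained by setting N = 1.
    error = sqrt(max(n_indels, 1)) * scale
    return rate, error
```

For instance, 25 indels in 1,000,000 bp of coding sequence would give 2.5 ± 0.5 events per 100 kbp, and a category with no indels in 200,000 bp would give 0 ± 0.5.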
Figure 7
Identifying insertion and deletion events. The figure shows an exemplary multiple alignment of orthologous sequence segments in human, chimp and rhesus. The gap-containing regions I and D can each be explained unambiguously by a single insertion (I) or deletion (D) event in the human lineage since its speciation from the common ancestor with chimp. In contrast, region I* has non-overlapping gaps in chimp and rhesus and therefore requires at least two indel events. Such scenarios are always ambiguous. For instance, I* can be explained by an insertion in human and a deletion in chimp, but also by a deletion in chimp and a deletion in rhesus.
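The parsimony reasoning in this caption can be sketched as a simple classifier over the gap pattern of one aligned region. This is a simplified illustration, not the authors' pipeline: it assumes the caller passes a single gap-containing region of a human/chimp/rhesus alignment, with `-` as the gap character, and the function name `classify_gap_region` is hypothetical.

```python
GAP = "-"


def classify_gap_region(human, chimp, rhesus):
    """Classify one aligned region by its gap pattern.

    Returns "insertion" or "deletion" for events that a single indel in
    the human lineage explains, "none" if the region has no gaps, and
    "unresolved" for anything else (I*-like regions, or events that lie
    outside the human lineage).
    """
    if not (len(human) == len(chimp) == len(rhesus)):
        raise ValueError("aligned sequences must have equal length")
    # Collect the distinct gap patterns seen in gap-containing columns.
    patterns = {
        (h == GAP, c == GAP, r == GAP)
        for h, c, r in zip(human, chimp, rhesus)
        if GAP in (h, c, r)
    }
    if not patterns:
        return "none"
    if patterns == {(False, True, True)}:
        # Only human has sequence: one insertion in the human lineage.
        return "insertion"
    if patterns == {(True, False, False)}:
        # Only human is gapped: one deletion in the human lineage.
        return "deletion"
    return "unresolved"
```

A region in which human reads `ACGTTTACG` while chimp and rhesus both read `ACG---ACG` would be reported as an insertion, whereas non-overlapping gaps in chimp and rhesus, as in region I*, would be reported as unresolved.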
