BMC Genomics. 2012 Jan 4;13:5.
doi: 10.1186/1471-2164-13-5.

Controversies in modern evolutionary biology: the imperative for error detection and quality control

Francisco Prosdocimi et al. BMC Genomics.

Abstract

Background: The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes.

Results: We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events.

Conclusions: Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.
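
The selection step outlined in the Results and in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that each human reference gene has already been paired with a syntenic homolog and a highest-similarity homolog, and that a rate comparison has already produced a p-value. The class `GeneTriplet`, the functions `is_aed_event` and `split_events`, the field names, the sequence identifiers, and the `alpha` threshold are all hypothetical and are not taken from the study; the statistical test itself is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class GeneTriplet:
    """One human reference gene with its two homologs in a vertebrate genome.

    All fields are illustrative assumptions, not the study's data model.
    """
    human_gene: str
    syntenic_homolog: str      # homolog keeping the human genome neighborhood
    similarity_homolog: str    # homolog with the highest similarity to human
    rate_syntenic: float       # evolutionary rate estimate, syntenic copy
    rate_similarity: float     # evolutionary rate estimate, relocated copy
    p_value: float             # from a rate-difference test chosen by the caller


def is_aed_event(t: GeneTriplet, alpha: float = 0.05) -> bool:
    """Candidate AED event: the similarity homolog is a different (relocated)
    gene and evolves significantly faster than the syntenic homolog."""
    relocated = t.similarity_homolog != t.syntenic_homolog
    faster = t.rate_similarity > t.rate_syntenic
    return relocated and faster and t.p_value < alpha


def split_events(triplets, flagged_ids, alpha=0.05):
    """Separate candidate events into putative and artifactual ones.

    An event is treated as artifactual when any sequence in the triplet
    appears in `flagged_ids`, a set of sequences predicted to contain errors.
    """
    putative, artifactual = [], []
    for t in triplets:
        if not is_aed_event(t, alpha):
            continue
        members = {t.human_gene, t.syntenic_homolog, t.similarity_homolog}
        (artifactual if members & flagged_ids else putative).append(t)
    return putative, artifactual


# Made-up identifiers and values, for illustration only.
example = [
    GeneTriplet("H1", "S1", "R1", 0.10, 0.40, 0.001),  # clean candidate
    GeneTriplet("H2", "S2", "R2", 0.10, 0.50, 0.002),  # R2 is flagged
    GeneTriplet("H3", "S3", "S3", 0.10, 0.10, 0.900),  # same gene, no event
]
putative, artifactual = split_events(example, flagged_ids={"R2"})
```

Keeping the error flags separate from the event test makes it straightforward to report the artifactual fraction per genome, in the spirit of Figure 4.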

Figures

Figure 1
Evolutionary scenario involving asymmetrical evolution after duplication (AED). A schematic view of the AED events included in this study. Using the human gene Hi as a reference, homologs are detected in each vertebrate genome that maintain the same genome neighborhood as the human gene. At the same time, the homologs from each genome with the highest similarity to the human reference gene are identified (full arrows indicate similarity homologs and dashed arrows indicate syntenic homologs). We then selected AED events where the relocated similarity homolog has evolved significantly faster than the local syntenic homolog.
Figure 2
Estimation of sequence error rates. A) Percentage of predicted sequence errors in 19,778 protein families in 14 vertebrate genomes. In blue, the percentage of sequences with at least one error. In red, the percentage of total errors observed. B) Classification of sequence errors into 7 types according to their position in the sequence and their nature (see Methods). The histogram shows the frequencies of each error type observed in all protein sequences (C-deletion = C-terminal deletion; C-extension = C-terminal extension; N-deletion = N-terminal deletion; N-extension = N-terminal extension; segment = suspicious sequence segment; deletion = internal deletion; insertion = internal insertion).
Figure 3
Number of putative ortholog relationships between human and 13 vertebrate genomes. A. Putative ortholog relationships between human and each of the 13 vertebrate genomes used in this study were identified by similarity-based and synteny-based approaches. B. The proportion of orthologs predicted by the synteny approach for which the same ortholog was predicted by the similarity-based approach.
Figure 4
Effect of erroneous sequences on prediction of asymmetrical evolution in 13 vertebrate genomes. A. The presence of erroneous sequences gives rise to a number of artifactual AED events (shown in red). The remaining events are defined as putative AED events (shown in blue). B. Comparison of the percentage of protein sequences predicted to contain errors and the percentage of artifactual AED events for each genome.
Figure 5
Characterization of sequence errors in predicted asymmetrical evolution events. Errors are classified into 7 types according to their position in the sequence and their nature (see Methods). The proportions of the different classes found in the human reference sequences, the syntenic homolog (V_syn) and the highest similarity homolog (V_sim) are shown, as well as the proportions observed in the pooled sequences in the gene triplets (C-deletion = C-terminal deletion; C-extension = C-terminal extension; N-deletion = N-terminal deletion; N-extension = N-terminal extension; segment = suspicious sequence segment; deletion = internal deletion; insertion = internal insertion).
Figure 6
An example of an artifactual AED event. Part of the multiple sequence alignment of the human COPG protein sequence [Ensembl:ENSP00000325002] and putative orthologs in the macaque genome. The suspicious segment is boxed in grey. For the Ensembl macaque sequences, exons are colored alternately in black and blue. Residues overlapping splice sites are shown in red.
Figure 7
A putative AED event. A) Multiple sequence alignment of hepatoma-derived growth factor (HDGF) and HDGF-like proteins. Black lines indicate the two main subgroups corresponding to the duplication node in the phylogenetic tree. Known phosphorylation sites are labeled with asterisks. B) The phylogenetic tree constructed using the Neighbour-Joining algorithm with 500 bootstraps. Bootstrap values for each node are shown in red. The distance between human and mouse HDGF1 sequences (in blue) is longer than the distance between human HDGF1 and mouse HDGF sequences (in green).
Figure 8
Detection of potential sequence errors. Examples of sequence discrepancies (highlighted in blue) that are identified in the subfamily alignments. A) Potential mispredicted exons, resulting in suspicious sequence segments, are identified based on the conserved blocks in the subfamily alignment. B) Potential start and stop site errors are predicted based on the distribution of the positions of the N/C-terminal residues. C) Identification of a potential inserted intron, based on the presence of a single sequence with the insertion in a given subfamily. D) Identification of a potential missing exon, based on the presence of a single sequence with a deletion in a given subfamily.
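
As a rough illustration of panel B in Figure 8, the sketch below flags sequences whose N-terminal start position is far from the typical start in a subfamily alignment. It is only one possible reading of "based on the distribution of the positions of the N/C-terminal residues": the use of the median, the `max_shift` threshold of 10 columns, and the function names `first_residue_column` and `flag_n_terminal_outliers` are assumptions made for this example, not details taken from the study.

```python
from statistics import median


def first_residue_column(aligned_seq, gap="-"):
    """Return the alignment column of the first non-gap residue, or None."""
    for column, char in enumerate(aligned_seq):
        if char != gap:
            return column
    return None


def flag_n_terminal_outliers(alignment, max_shift=10):
    """Flag sequences whose start deviates from the subfamily median start.

    `alignment` maps sequence names to gapped strings of equal length.
    A start well after the median suggests a missing N-terminus
    ("N-deletion"); a start well before it suggests an "N-extension".
    The threshold `max_shift` is an arbitrary illustrative value.
    """
    starts = {}
    for name, seq in alignment.items():
        column = first_residue_column(seq)
        if column is not None:          # skip all-gap rows
            starts[name] = column
    if not starts:
        return {}
    centre = median(starts.values())
    flags = {}
    for name, column in starts.items():
        shift = column - centre
        if shift > max_shift:
            flags[name] = "N-deletion"
        elif shift < -max_shift:
            flags[name] = "N-extension"
    return flags


# Toy alignment with made-up sequences: seq5 starts 15 columns late.
toy = {
    "seq1": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq2": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq3": "MKSAYIAKQRQISFVKSHFSRQ",
    "seq4": "MKTAYLAKQRQISFVKSHFSRQ",
    "seq5": "---------------KSHFSRQ",
}
flags = flag_n_terminal_outliers(toy)
```

The same idea applies to the C-terminus by scanning each aligned sequence from the right; in practice the threshold would need tuning against curated examples.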
