BMC Evol Biol. 2007 Apr 4;7:53.
doi: 10.1186/1471-2148-7-53.

On homology searches by protein Blast and the characterization of the age of genes

M Mar Albà et al. BMC Evol Biol.

Abstract

Background: It has been shown in a variety of organisms, including mammals, that genes that appeared recently in evolution, for example orphan genes, evolve faster than older genes. Low functional constraints at the time of origin of novel genes may explain these results. However, this observation has been recently attributed to an artifact caused by the inability of Blast to detect the fastest genes in different eukaryotic genomes. Distinguishing between these two possible explanations would be of great importance for any studies dealing with the taxon distribution of proteins and the origin of novel genes.

Results: Here we used simulations of protein sequences to examine the capacity of Blast to detect proteins of diverse evolutionary rates in the different species of a eukaryotic phylogenetic tree that included metazoans, fungi and plants. We simulated the evolution of protein genes with the same evolutionary rates as those observed in functional mammalian genes and with among-site rate heterogeneity. Under these conditions, we found that only a very small percentage of simulated ancestral eukaryotic proteins was affected by the Blast artifact. We show that the good detectability of Blast is due to the heterogeneity of protein evolutionary rates at different sites, since a single small conserved motif in a sequence suffices to detect its homologues. Our results indicate that Blast, at least when applied within eukaryotes, only misses homologues of extremely fast-evolving sequences, which are rare in the mammalian genome, as well as pseudogenes and sequences that evolve homogeneously across sites.
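
The effect of among-site rate heterogeneity on sequence conservation can be illustrated with a minimal sketch. This is not the Rose simulation used in the study: it assumes a simple equal-rates (Poisson) amino acid model, gamma-distributed site rates with a hypothetical shape parameter of 0.5, and rates drawn independently per site, so conserved positions end up scattered rather than clustered into a motif. All function names are illustrative. It compares the longest run of identical residues between an ancestor and its descendant at the same mean number of substitutions per site, with and without rate heterogeneity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng):
    """Draw a random ancestral sequence with uniform residue frequencies."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def site_rates(length, shape, rng):
    """Relative rate per site, rescaled to mean 1.

    shape=None gives homogeneous rates; otherwise rates follow a gamma
    distribution with the given shape (an assumed value, not from the paper).
    """
    if shape is None:
        return [1.0] * length
    raw = [rng.gammavariate(shape, 1.0 / shape) for _ in range(length)]
    mean = sum(raw) / length
    return [r / mean for r in raw]


def evolve(seq, rates, distance, rng):
    """Evolve seq along one branch of `distance` substitutions/site.

    Equal-rates model with 20 states: with probability
    1 - exp(-(20/19) * rate * distance) the site is redrawn uniformly,
    which may return the original residue.
    """
    out = []
    for residue, rate in zip(seq, rates):
        p_event = 1.0 - math.exp(-(20.0 / 19.0) * rate * distance)
        out.append(rng.choice(AMINO_ACIDS) if rng.random() < p_event else residue)
    return "".join(out)


def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


def mean_longest_run(shape, distance, length=400, replicates=100, seed=1):
    """Average longest identical run over independent simulated proteins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        ancestor = random_protein(length, rng)
        rates = site_rates(length, shape, rng)
        descendant = evolve(ancestor, rates, distance, rng)
        total += longest_identical_run(ancestor, descendant)
    return total / replicates


if __name__ == "__main__":
    print("with rate heterogeneity:   ", mean_longest_run(0.5, 2.0))
    print("without rate heterogeneity:", mean_longest_run(None, 2.0))
```

Under these assumptions, slowly evolving sites stay unchanged even when the mean divergence is high, so longer identical stretches survive than under homogeneous rates. A local alignment search relies on short conserved stretches of this kind, although the sketch does not run Blast itself.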

Conclusion: Although great care should be exercised in the recognition of remote homologues, most functional mammalian genes can be detected in eukaryotic genomes by Blast. That is, most functional mammalian genes do not evolve so fast that Blast would fail to detect them in other metazoans, fungi or plants, had they been present in these organisms. Thus, the correlation previously found between age and rate does not seem to be due purely to a Blast artifact, at least for mammals. This may have important implications for understanding the mechanisms by which novel genes originate.

Figures

Figure 1
Phylogenetic tree used in the simulations. Numbers along the branches are proportional to the branch lengths and indicate amino acid substitutions/site. The taxonomic groupings used for the age classification of genes are also shown.
Figure 2
Distributions of amino acid genetic distances in genes of different age categories. (a) Distributions of genetic distances of the 1558 mammalian genes coding for proteins with lengths between 300 and 500 amino acids. (b) Distributions of genetic distances of the simulated genes with rate heterogeneity. (c) Distributions of genetic distances of the simulated genes without rate heterogeneity. The total numbers of genes for the simulations with and without rate heterogeneity were 1618 and 1578, respectively. Different pools of simulated alignments always produced very similar results (see Methods).
Figure 3
Fragments of alignments of simulated sequences. Simulations were done with (a) or without (b) among-site rate heterogeneity. Both simulations were performed with Rose, following the tree represented in Figure 1, scaled by a factor of 1.5. Positions of the alignments where more than 50% of the sequences are identical are shown with black boxes. The trees recalculated from the respective complete alignments are also shown, with the scale in amino acid substitutions/site. Interestingly, although the genetic distance between A and G is very similar in both alignments, A finds G by Blast in the alignment evolved under rate heterogeneity (a) but not in the alignment without rate heterogeneity (b).
