Blast sampling for structural and functional analyses
- PMID: 17319945
- PMCID: PMC1819393
- DOI: 10.1186/1471-2105-8-62
Abstract
Background: The post-genomic era is characterised by a torrent of biological information flooding the public databases. As a direct consequence, similarity searches starting with a single query sequence frequently identify hundreds, or even thousands, of potential homologues. This volume of data makes the subsequent structural, functional and evolutionary analyses very difficult. It is therefore essential to develop new strategies for efficient sampling of this large sequence space, in order to reduce the number of sequences to be processed while retaining the sequences most pertinent to structural and functional studies.
Results: An exhaustive analysis of a large-scale test set (284 protein families) was performed to compare the efficiency of four different sampling methods aimed at selecting the most pertinent sequences. These four methods sample the proteins detected by BlastP searches and fall into two categories: two customisable methods, in which the user defines either the maximal number or the percentage of sequences to be selected, and two automatic methods, in which the number of sequences selected is determined by the program. We focused our analysis on the potential information content of the sampled sets of sequences, using multiple alignment of complete sequences as the main validation tool. The study considered two criteria: the total number of sequences in the BlastP results and their associated E-values. The subsequent analyses investigated, as a function of these criteria, the influence of the sampling methods on the E-value distributions, the sequence coverage, the final multiple alignment quality and the active site characterisation at various residue conservation thresholds.
Conclusion: The comparative analysis of the four sampling methods allows us to propose a suitable sampling strategy that significantly reduces the number of homologous sequences required for alignment, while at the same time maintaining the relevant information concerning the active site residues.
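The two customisable categories described in the Results (a user-defined maximal number of sequences, or a user-defined percentage) can be illustrated with a minimal Python sketch. The abstract does not state how the retained sequences are chosen, so the even spacing along the E-value ranking used here is an assumption made purely for illustration, not the authors' method. The names `Hit`, `sample_by_max_number` and `sample_by_percentage` are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical record type: (sequence identifier, BlastP E-value).
Hit = Tuple[str, float]


def sample_by_max_number(hits: List[Hit], max_n: int) -> List[Hit]:
    """Keep at most max_n hits.

    Assumption (not from the paper): hits are ranked by increasing
    E-value and picked at evenly spaced ranks, so that both close and
    distant homologues remain represented.
    """
    if max_n <= 0:
        return []
    ranked = sorted(hits, key=lambda hit: hit[1])
    n = len(ranked)
    if n <= max_n:
        return ranked
    if max_n == 1:
        return [ranked[0]]
    # Integer arithmetic: the first index is 0 and the last is n - 1.
    indices = [i * (n - 1) // (max_n - 1) for i in range(max_n)]
    return [ranked[i] for i in indices]


def sample_by_percentage(hits: List[Hit], percent: float) -> List[Hit]:
    """Keep the given percentage of hits, with at least one when any exist."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the interval (0, 100]")
    target = max(1, math.ceil(len(hits) * percent / 100))
    return sample_by_max_number(hits, target)


if __name__ == "__main__":
    # Synthetic E-values for demonstration only; not data from the study.
    demo_hits = [(f"seq{i}", 10.0 ** (-i)) for i in range(10)]
    for seq_id, evalue in sample_by_max_number(demo_hits, 4):
        print(seq_id, evalue)
```

Under this assumption, both functions always keep the best-scoring hit and the most distant one. The automatic methods, in which the program determines the number of sequences, are not sketched because the abstract gives no detail on how that number is derived.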