A phylogenetic approach for weighting genetic sequences
- PMID: 34049487
- PMCID: PMC8164272
- DOI: 10.1186/s12859-021-04183-8
Abstract
Background: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented.
Results: We formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column, which is important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes.
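The role of sequence weights in character frequency estimation can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `weighted_frequencies` and the weight values are hypothetical, chosen only to show how any per-sequence weighting changes the estimate at one alignment column. Computing actual phylogenetic novelty scores would require a tree and the definition given in the paper, neither of which appears in this abstract.

```python
def weighted_frequencies(column, weights):
    """Estimate character frequencies at one alignment column.

    column  -- one character per sequence (hypothetical input layout)
    weights -- one non-negative weight per sequence; these stand in for
               any weighting scheme and are NOT the paper's novelty scores
    """
    if len(column) != len(weights):
        raise ValueError("column and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Accumulate the total weight carried by each character.
    mass = {}
    for ch, w in zip(column, weights):
        mass[ch] = mass.get(ch, 0.0) + w

    # Normalise so the frequencies sum to 1.
    return {ch: m / total for ch, m in mass.items()}


# Three closely related sequences carry 'A'; one divergent sequence carries 'G'.
column = ["A", "A", "A", "G"]

# Uniform weights simply count characters.
uniform = weighted_frequencies(column, [1.0, 1.0, 1.0, 1.0])

# Made-up weights that down-weight the over-represented group.
adjusted = weighted_frequencies(column, [0.5, 0.5, 0.5, 1.5])

print(uniform)
print(adjusted)
```

With uniform weights the over-represented group dominates the estimate; giving the divergent sequence a larger weight shifts the estimated frequencies toward it, which is the effect a weighting scheme is meant to achieve.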
Conclusions: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.
Keywords: Alignment; Conservation scores; Phylogenetics; Protein profile; Sequence weights.
Conflict of interest statement
The authors declare that they have no competing interests.
Similar articles
- Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005 Apr 1;6:83. doi: 10.1186/1471-2105-6-83. PMID: 15804354. Free PMC article.
- SEPP: SATé-enabled phylogenetic placement. Pac Symp Biocomput. 2012:247-58. doi: 10.1142/9789814366496_0024. PMID: 22174280.
- Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling. Bull Math Biol. 2020 Jan 22;82(2):21. doi: 10.1007/s11538-020-00693-3. PMID: 31970502.
- Alignment methods: strategies, challenges, benchmarking, and comparative overview. Methods Mol Biol. 2012;855:203-35. doi: 10.1007/978-1-61779-582-4_7. PMID: 22407710. Review.
- A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics. 2017 Oct;109(5-6):419-431. doi: 10.1016/j.ygeno.2017.06.007. Epub 2017 Jun 29. PMID: 28669847. Review.
Cited by
- NetAllergen, a random forest model integrating MHC-II presentation propensity for improved allergenicity prediction. Bioinform Adv. 2023 Oct 16;3(1):vbad151. doi: 10.1093/bioadv/vbad151. eCollection 2023. PMID: 37901344. Free PMC article.