Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes
- PMID: 2315319
- PMCID: PMC53667
- DOI: 10.1073/pnas.87.6.2264
Abstract
An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
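The idea of a "region with high aggregate score" can be made concrete with a small sketch. The Python below is an illustration only, not the authors' code: the charge scoring scheme (`CHARGE_SCORES`, `DEFAULT_SCORE`), the function names, and the toy sequence are all assumptions made for this example. It finds the maximal-scoring segment of a single sequence, and it solves numerically for the positive root λ of Σ pᵢ·exp(λ·sᵢ) = 1, the scale parameter that appears in this kind of theory when the expected score per residue is negative and at least one score is positive. The significance formulas themselves are given in the full paper and are not reproduced here.

```python
import math

# Hypothetical scoring scheme for acidic charge clusters (an assumption for
# illustration): acidic residues score +2, basic residues -2, all others -1.
CHARGE_SCORES = {"D": 2, "E": 2, "K": -2, "R": -2}
DEFAULT_SCORE = -1


def max_scoring_segment(seq, scores=CHARGE_SCORES, default=DEFAULT_SCORE):
    """Return (best_score, start, end) of the highest-scoring contiguous
    segment, with `end` exclusive. Returns (0, 0, 0) if no segment is positive."""
    best, best_start, best_end = 0, 0, 0
    run, run_start = 0, 0
    for i, residue in enumerate(seq):
        if run <= 0:
            # A non-positive running total cannot help any later segment.
            run, run_start = 0, i
        run += scores.get(residue, default)
        if run > best:
            best, best_start, best_end = run, run_start, i + 1
    return best, best_start, best_end


def solve_lambda(score_probs, iterations=200):
    """Find the positive root of sum(p * exp(lam * s)) = 1 by bisection.

    `score_probs` is a list of (probability, score) pairs under the random
    model. Requires a negative expected score and at least one positive score.
    """
    if sum(p * s for p, s in score_probs) >= 0:
        raise ValueError("expected score must be negative")
    if not any(s > 0 and p > 0 for p, s in score_probs):
        raise ValueError("at least one positive score is required")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in score_probs) - 1.0

    hi = 1.0
    while f(hi) <= 0:      # grow until we are past the root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:      # shrink until we are strictly between 0 and the root
        lo /= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    score, start, end = max_scoring_segment("KAADEEDAAK")
    print(score, start, end)
    print(solve_lambda([(0.25, 1), (0.75, -1)]))
```

For the two-valued check in the last line (score +1 with probability 0.25, score -1 with probability 0.75), the equation can be solved by hand and gives λ = ln 3, which makes a convenient sanity test for the solver.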