Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

G Z Hertz¹, G D Stormo

PMID: 10487864
DOI: 10.1093/bioinformatics/15.7.563

G Z Hertz et al. Bioinformatics. 1999 Jul-Aug;15(7-8):563-77. doi: 10.1093/bioinformatics/15.7.563.

Affiliation

¹ Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder, CO 80309-0347, USA. hertz@colorado.edu

Abstract

Motivation: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.

Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.
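Illustration: The abstract does not give formulas, so the following is only a minimal sketch of the first and third components. It assumes that information content takes the usual relative-entropy form, the sum over alignment columns i and letters j of f_ij ln(f_ij / p_j), where f_ij is the observed frequency of letter j in column i and p_j is its background frequency. The function names, the use of natural logarithms, and the omission of any small-sample correction are assumptions made for illustration and are not taken from the authors' programs.

```python
import math
from collections import Counter


def information_content(alignment, background):
    """Relative-entropy score of an ungapped alignment (assumed form).

    alignment  -- list of equal-length strings, one per aligned site
    background -- dict mapping each letter to its a priori frequency p_j
    """
    n = len(alignment)
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same width")

    total = 0.0
    for i in range(width):
        # Count how often each letter occurs in column i.
        counts = Counter(seq[i] for seq in alignment)
        for letter, count in counts.items():
            f = count / n  # observed frequency f_ij
            total += f * math.log(f / background[letter])
    return total


def expected_frequency(p_value, num_alignments):
    """Expected number of alignments scoring at least this well:
    the P value multiplied by the count of possible alignments."""
    return p_value * num_alignments


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    sites = ["ACGT", "ACGT", "ACGT"]  # hypothetical, perfectly conserved sites
    print(information_content(sites, uniform))
    print(expected_frequency(1e-6, 1000))
```

In this sketch a perfectly conserved column scores ln(1 / p_j), and a column whose letter frequencies match the background scores zero. The P value and the count of possible alignments are taken as inputs here, because the abstract states only that they are multiplied, not how each is computed.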

Availability: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.

Grants and funding

HG-00249/HG/NHGRI NIH HHS/United States
