Curr Protoc Bioinformatics. 2016 Mar 24:53:3.9.1-3.9.25.

doi: 10.1002/0471250953.bi0309s53.

Finding Protein and Nucleotide Similarities with FASTA

William R Pearson¹

PMID: 27010337
PMCID: PMC5072362
DOI: 10.1002/0471250953.bi0309s53

William R Pearson. Curr Protoc Bioinformatics. 2016.

Affiliation

¹ University of Virginia School of Medicine, Charlottesville, Virginia.

Abstract

The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including MySQL and PostgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.

Keywords: E()-value; alignment annotation; expectation; homology; scoring matrices; similarity.

Figures

**Figure 3.9.1. fasta36 search results**
A simple fasta36 search using an *E. histolytica* putative phosphate transporter (UniProt C4M1E7_ENTHI) as the query sequence in a search of the UniProt reference human proteome. (A) Search summary and statistical output. The command line used to perform the search is shown, along with the name and version of the program, the name and length of the query sequence, and the name of the database searched. (B) The list of top-scoring sequences, with their raw similarity scores (opt), normalized bit scores, and expectation values. (C) The highest-scoring alignment between C4M1E7_ENTHI and ABD1_HUMAN.

**Figure 3.9.2. Alignment annotation**
Annotations on the alignment of a putative *E. histolytica* protein (C4M1E7_ENTHI) with human ABD1, an α/β-hydrolase protein family member (Figure 3.9.1C). The annotations on this alignment were produced by the scripts/ann_upfeat_pfam_www.pl script provided with the FASTA package distribution. (A) A text and graphical presentation of the annotation and domain data. UniProt-annotated variant (Variant:) and functional site (Site:) information is shown, as well as the conservation state of each annotated site, e.g. 166E=137E. Pfam domain boundaries (Region:) are also used to produce sub-alignment scores. The boundaries in the query (152-403) and subject (123-363) sequences are shown, as are the raw score, bit score, fraction identical, and Q-score (−10log(p)). (B) Compact alignment information, alignment encoding, and annotation information. The scores and the alignment start and end coordinates shown in Figure 3.9.1C, and schematically in part (A), are reported here as tab-delimited fields. The single line beginning tr|C4M1E7|C4M1E7_ENTHI has been split into five lines to fit the page. The <cont> notations and parenthetical comments are not included in the single-line output. Each of the fields in the -m 8CC output is separated by a <tab> character. The output fields match BLAST tabular output, with the addition of a CIGAR string and an annotation string. The annotation string includes all the annotation information shown in part (A).
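The tab-delimited layout described for the -m 8CC output could be read with a short script. The sketch below is illustrative only and is not part of the FASTA distribution. It assumes the first twelve fields follow the usual BLAST tabular column order (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E()-value, bit score), that the CIGAR string and annotation string are the thirteenth and fourteenth fields, and that lines beginning with `#` are comments. The names `Hit`, `parse_hit`, and `parse_hits` are hypothetical, and the sample line uses invented placeholder values, not real search results.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    """One tabular hit: 12 BLAST-style columns plus CIGAR and annotation."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float
    cigar: str
    annotation: str


def parse_hit(line: str) -> Hit:
    """Split one tab-delimited line into typed fields.

    Assumption: the column order is the standard BLAST tabular order,
    followed by a CIGAR string and an optional annotation string.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 13:
        raise ValueError(f"expected at least 13 fields, got {len(fields)}")
    # Treat a missing 14th field as an empty annotation (an assumption).
    annotation = fields[13] if len(fields) > 13 else ""
    return Hit(
        qseqid=fields[0],
        sseqid=fields[1],
        pident=float(fields[2]),
        length=int(fields[3]),
        mismatch=int(fields[4]),
        gapopen=int(fields[5]),
        qstart=int(fields[6]),
        qend=int(fields[7]),
        sstart=int(fields[8]),
        send=int(fields[9]),
        evalue=float(fields[10]),
        bitscore=float(fields[11]),
        cigar=fields[12],
        annotation=annotation,
    )


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse every non-blank, non-comment line of tabular output."""
    return [
        parse_hit(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Invented placeholder line, for illustration only (not real output).
SAMPLE = (
    "query1\tsubject1\t30.00\t100\t68\t2\t5\t102\t11\t110\t1e-05\t50.0"
    "\t60M2D38M\t(hypothetical annotation)\n"
)

if __name__ == "__main__":
    hit = parse_hit(SAMPLE)
    print(hit.qseqid, hit.sseqid, hit.evalue, hit.cigar)
```

The annotation string is kept as an opaque string here because its internal layout is not specified in the caption. A pipeline that needs the individual Variant:, Site:, and Region: entries would have to consult the FASTA documentation for that encoding.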

**Figure 3.9.3**
Output from scripts/summ_domain_ident.pl.
