Finding Protein and Nucleotide Similarities with FASTA

William R Pearson

Curr Protoc Bioinformatics. 2016 Mar 24:53:3.9.1-3.9.25.
doi: 10.1002/0471250953.bi0309s53.

Abstract

The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including MySQL and PostgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.

Keywords: E()-value; alignment annotation; expectation; homology; scoring matrices; similarity.


Figures

Figure 3.9.1. fasta36 search results
A simple fasta36 search using an E. histolytica putative phosphate transporter (UniProt C4M1E7_ENTHI) as the query sequence against the UniProt reference human proteome. (A) Search summary and statistical output. The command line used to perform the search is shown, along with the program name and version, the name and length of the query sequence, and the name of the database searched. (B) The list of top-scoring sequences, with their raw similarity scores (opt), normalized bit scores, and expectation values. (C) The highest-scoring alignment, between C4M1E7_ENTHI and ABD1_HUMAN.
Figure 3.9.2. Alignment annotation
Annotations on the alignment of a putative E. histolytica protein (C4M1E7_ENTHI) with human ABD1, an α/β-hydrolase protein family member (Figure 3.9.1C). The annotations on this alignment were produced by the scripts/ann_upfeat_pfam_www.pl script provided with the FASTA package distribution. (A) A text and graphical presentation of the annotation and domain data. UniProt-annotated variant (Variant:) and functional site (Site:) information is shown, as well as the conservation state of each annotated site (e.g., 166E=137E). Pfam domain boundaries (Region:) are also used to produce sub-alignment scores. The boundaries in the query (152-403) and subject (123-363) sequences are shown, as are the raw score, bit score, fraction identical, and Q-score (−10log(p)). (B) Compact alignment information, alignment encoding, and annotation information. The scores and the alignment start and end coordinates shown in Figure 3.9.1C, and schematically in part (A), are reported here as tab-delimited fields. The single line beginning tr|C4M1E7|C4M1E7_ENTHI has been split into five lines to fit the page. The <cont> notations and parenthetical comments are not part of the single-line output. Each field in the -m 8CC output is separated by a <tab> character. The output fields match BLAST tabular output, with the addition of a CIGAR string and an annotation string. The annotation string includes all the annotation information shown in part (A).
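As a rough illustration of how the tab-delimited -m 8CC output described above might be consumed in a pipeline, the following Python sketch splits one output line into named fields. It rests on several assumptions not confirmed by the caption: that the first twelve columns follow the usual BLAST tabular order (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), that the CIGAR string and annotation string are the 13th and 14th fields, that the CIGAR string uses the conventional count-then-operation form, and that the Q-score logarithm is base 10. The function names and the sample values are hypothetical.

```python
import math
import re

# Assumed column order: the conventional BLAST tabular layout.
BLAST_COLUMNS = (
    "query", "subject", "percent_identity", "length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
)
INT_COLUMNS = ("length", "mismatches", "gap_opens",
               "q_start", "q_end", "s_start", "s_end")
FLOAT_COLUMNS = ("percent_identity", "evalue", "bit_score")


def parse_m8cc_line(line):
    """Split one tab-delimited result line into a dict of typed fields.

    Assumes 12 BLAST-style columns, then an optional CIGAR string and an
    optional annotation string (hypothetical positions 13 and 14).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(BLAST_COLUMNS):
        raise ValueError("expected at least 12 tab-delimited fields")
    hit = dict(zip(BLAST_COLUMNS, fields[:12]))
    for name in INT_COLUMNS:
        hit[name] = int(hit[name])
    for name in FLOAT_COLUMNS:
        hit[name] = float(hit[name])
    hit["cigar"] = fields[12] if len(fields) > 12 else ""
    hit["annotation"] = fields[13] if len(fields) > 13 else ""
    return hit


def cigar_ops(cigar):
    """Tokenize a CIGAR string such as '60M2D40M' into (count, op) pairs.

    Assumes the conventional count-then-operation encoding.
    """
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Za-z=])", cigar)]


def q_score(p):
    """Q-score as -10 * log10(p); base 10 is an assumption."""
    return -10.0 * math.log10(p)


if __name__ == "__main__":
    # Made-up values for illustration only; not taken from Figure 3.9.2.
    sample = "\t".join([
        "query1", "subject1", "25.0", "100", "70", "5",
        "1", "100", "11", "110", "1e-05", "42.0", "60M2D40M", "anno",
    ])
    hit = parse_m8cc_line(sample)
    print(hit["subject"], hit["evalue"], cigar_ops(hit["cigar"]))
```

Keeping the annotation string as an opaque field is deliberate here: its internal layout is not described in the caption, so the sketch does not attempt to decode it.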
Figure 3.9.3
Output from scripts/summ_domain_ident.pl.

