Curr Protoc Bioinformatics. 2012 Mar;Chapter 10:10.8.1-10.8.24.
doi: 10.1002/0471250953.bi1008s37.

Using BLAT to find sequence similarity in closely related genomes

Medha Bhagwat et al. Curr Protoc Bioinformatics. 2012 Mar.

Abstract

The BLAST-Like Alignment Tool (BLAT) is used to find genomic sequences that match a protein or DNA sequence submitted by the user. BLAT is typically used for searching similar sequences within the same or closely related species. It was developed to align millions of expressed sequence tags and mouse whole-genome random reads to the human genome at a higher speed. It is freely available either on the Web or as a downloadable stand-alone program. BLAT search results provide a link for visualization in the University of California, Santa Cruz (UCSC) Genome Browser, where associated biological information may be obtained. Three example protocols are given: using an mRNA sequence to identify the exon-intron locations and associated gene in the genomic sequence of the same species, using a protein sequence to identify the coding regions in a genomic sequence and to search for gene family members in the same species, and using a protein sequence to find homologs in another species.

Figures

Figure 1
Web BLAT search screen for Protocol 1. The interface allows the user to easily specify (from left to right) the genome, assembly, mode of search, the desired sorting of the results, and the output format. The Genome pull-down menu provides a choice of over 50 species from mammals, fish, invertebrates, yeast, and others. Some, such as human, have more than one choice of genome assembly in the Assembly pull-down menu. The Query type pull-down menu selects the mode of the search. The DNA Query type used in this protocol searches a DNA query against a DNA database. Additional available options are described in the text. The Sort output pull-down menu can be used to sort the results table. The options are “query, score”; “query, start”; “chromosome, start”; “chromosome, score”; and “score”. The “query, score” option first sorts by query ID (if multiple sequences are pasted into the input box) and then by score. The “Output type” pull-down menu provides three options for presenting the results: hyperlink, psl, and psl no header. Choosing hyperlink as the “Output type” yields a table with a link (browser) to display each alignment in the UCSC Genome Browser and a link (details) to the details of the alignment. The psl output type provides details about mismatches, gaps, and blocks in a tabular format and does not provide links to alignments or the genome browser. The PSL output format is described in detail in the text. The large text box is for the input sequence. The sequence needs to be in FASTA format, as shown here for the query NM_000531.5, which is used in Protocol 1. Clicking on the submit button generates the table of alignments. The “I’m feeling lucky” button goes directly to the genome browser to display the best-scoring genome alignment of the first input sequence. The clear button resets the input text box. The “Browse” and “submit file” buttons are for uploading sequences from a file instead of copying them into the text box. In this Protocol, the selected options are “Human” genome, “Feb 2009 GRCh37/hg19” assembly, “DNA” as query type, “query, score” for sorting the results, and “hyperlink” as the output type.
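The legend notes that the input must be in FASTA format and that several sequences may be pasted at once. As a minimal sketch of what that format implies, the Python below splits pasted text into identifier and sequence pairs. The function name `parse_fasta` and the example bases are illustrative assumptions; the bases are placeholders and not the real NM_000531.5 sequence.

```python
def parse_fasta(text):
    """Split pasted FASTA text into (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # The identifier is the first whitespace-delimited token after ">".
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Placeholder input: these bases are made up for illustration only.
example = """>NM_000531.5 placeholder description
ACGTACGTAC
GGTTAACC
>query2
TTTT
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

Each header line starts a new record, which is why the “query, score” sort option can group results by query ID when more than one sequence is submitted.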
Figure 2
Results table for Protocol 1. The BLAT search results provide the following columns: 1) ACTIONS – links for visualization of the alignment in the UCSC Genome Browser (browser link) and a more detailed alignment text view (details link); 2) QUERY – an identifier for the query sequence; 3) SCORE – the number of matches with a penalty for mismatches and gaps (see subsection “Score calculation” in the Commentary); 4) START – the location of the beginning of the alignment in the query sequence; 5) END – the location of the end of the alignment in the query sequence; 6) QSIZE – the length of the query sequence; 7) IDENTITY – an indication of the number of matching bases and gaps (see subsection “Percent identity calculation” in the Commentary); 8) CHRO – the chromosome; 9) STRAND – both query strands (‘+’ and ‘−’) are checked in the alignment. In the translated alignment mode, a second ‘+’ or ‘−’ for the genomic strand is provided; 10) START – the location of the beginning of the alignment in the genome sequence; 11) END – the location of the end of the alignment in the genome sequence; and 12) SPAN – the number of bases on the genome covered by the alignment. The information in the table – such as score, span and identity – indicates the extent of the match. The results in this Protocol are sorted by score; the top result has a much higher score than the others. The first row in the results shows that the QUERY NM_000531.5 matches the human genome with a score of 1638 (SCORE column) from its nucleotide 1 to 1647 (START and END columns next to the SCORE column). The query size is 1647 (QSIZE column). Thus, the entire query is covered by the alignment to the human genome, with 100% identity (IDENTITY column). This alignment is on chromosome X (CHRO column), on the plus/forward strand (STRAND column) from nucleotide 38211736 to 38280703 (START and END columns next to the STRAND column), covering a range of 68968 (SPAN column) nucleotides.
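The SPAN and SCORE values in the first row can be checked with simple arithmetic. The sketch below does so in Python under stated assumptions: the genome coordinates are treated as 1-based and inclusive, and the score formula (matches, plus half of repeat matches, minus mismatches and the numbers of query and genome inserts) follows the commonly documented scoring of the UCSC web BLAT for DNA searches, not anything stated in this legend. The input of nine genome inserts is inferred from the 10 alignment blocks reported for this query, and the zero repeat-match count is an assumption.

```python
def genome_span(t_start, t_end):
    """Bases covered on the genome when both coordinates are 1-based and inclusive."""
    return t_end - t_start + 1


def blat_web_score(matches, rep_matches, mismatches, q_num_insert, t_num_insert):
    """Assumed DNA-search score: matches (repeat matches count half)
    minus mismatches and minus the number of gaps on either sequence."""
    return matches + rep_matches // 2 - mismatches - q_num_insert - t_num_insert


# Values from the first row of the results table.
span = genome_span(38211736, 38280703)

# 1647 matching bases at 100% identity; 10 blocks imply 9 gaps on the genome side.
# rep_matches=0 is an assumption, since the legend does not report it.
score = blat_web_score(matches=1647, rep_matches=0, mismatches=0,
                       q_num_insert=0, t_num_insert=9)

print(span)   # 68968
print(score)  # 1638
```

Under these assumptions, the computed values agree with the SPAN (68968) and SCORE (1638) columns. The agreement is consistent with the assumed formula but does not by itself confirm it; the authoritative description is the “Score calculation” subsection of the Commentary.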
Figure 3
Detailed alignment information for a part of the query cDNA sequence in Protocol 1. Capital blue letters indicate nucleotides in the cDNA sequence NM_000531.5 that match the human genomic sequence. Because the entire query aligns with 100% identity, every nucleotide in the cDNA is capitalized. Light blue letters indicate where the blocks of the query sequence begin and end on the aligned genomic sequence, thus indicating the start and end positions of exons. The query is aligned to the genome in 10 blocks, as listed on the left bar of the page. Each block represents an exon; thus, there are 10 exons in the cDNA NM_000531.5.
Figure 4
Detailed alignment information for a part of the aligned target genome sequence (chromosome X) in Protocol 1. Upstream non-aligned bases are shown in lowercase black letters; the bases of the first aligned block (an exon) are shown in uppercase blue letters, followed by another non-aligned stretch, an intron, in lowercase black letters. The start and end of the exon are shown in light blue. Details can be found in the text.
Figure 5
Detailed side-by-side alignment information for the query cDNA and target genome sequence for the first match in Protocol 1. The figure shows the first block of the first alignment between the query and target genome divided in sections of 50 nucleotides. In each section, the top line represents the query cDNA NM_000531.5, and the line beneath it represents the human genomic DNA, chromosome X. In this block, the NM_000531.5 cDNA query nucleotides with coordinates from 1 to 291 align with human chromosome X nucleotides (in the Feb 2009 GRCh37/hg19 assembly) from 38211736 to 38212026. The identity, shown by a vertical bar between the query and genome nucleotides, is 100%. Details can be found in the text.
Figure 6
UCSC Genome Browser display of the first match in Protocol 1 with the default gene annotation track added. The position/search box lists the displayed region coordinates, chromosome X:38211736 to 38280703, corresponding to the first match. This region is depicted by a red rectangle in the p arm of the chromosome X ideogram. The box below the ideogram displays two tracks. The top track, labeled “Your Sequence from BLAT Search”, shows the alignment of the query NM_000531.5. Each thick bar represents an alignment block identified in the “details” view in Figure 3. The second track is the UCSC gene track (labeled “UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics”). This track identifies the gene, OTC (the symbol written on the left side of the track), associated with the query transcript NM_000531.5. The exons in the gene, indicated by thick bars on the gene line, match the thick bars on the query line above. The gene is annotated on the plus strand, as indicated by the >>> symbols on the line representing it. Information about the gene can be obtained by clicking on the gene name, OTC. Details can be found in the text.
Figure 7
The BLAT search screen for Protocol 2. Note that the option “protein” is selected for the Query type. Refer to the legend for Figure 1 for additional information on the content of each search menu option.
Figure 8
The results of the BLAT search for Protocol 2, shown in Figure 7. The first hit has 100% identity (IDENTITY column) on chromosome 11. Other hits also have high identity and long alignments. All of the hits in this example are on chromosome 11. Refer to the legend for Figure 2 and the text for additional information on the content of each column of the display.
Figure 9
Detailed alignment information for the first match of the query protein and chromosome 11 sequences in Protocol 2. The protein query NP_000509.1 sequence is listed at the top, and the region of the chromosome 11 sequence that aligned to the query is shown below it. In this match, all of the query sequence was aligned to the genome (translation), so all of the letters in the query sequence are capitals and colored blue. In the genomic DNA sequence below, the (translation of the) nucleotides that aligned to the protein sequence are shown in capital letters and colored blue. Refer to the legends for Figures 3 and 4 and to the text for additional information on the content of the display.
Figure 10
Detailed side-by-side alignment information for the query protein and genome sequences in Protocol 2. The result shows the alignment is in 3 blocks indicating the gene has 3 coding exons. Each alignment block is separated by a horizontal line, and is divided into sections of 60 coordinates. In each section, the first sequence line provides the query amino acid sequence and the second sequence line gives the nucleotide sequence of the aligning genome. For both lines, the starting and ending coordinates are at the nucleotide level. Thus, in the top section of the first block, the first line shows the sequence of the first 20 amino acids in the query. The coordinates for this line are 1 to 60 corresponding to the codons for the first 20 amino acids. The second sequence line shows the nucleotides with starting and ending coordinates (5248251 to 5248192) on the aligning chromosome, 11. The second section of this block shows alignment of query amino acids 21 to 30 (again the coordinates for the top line, 61–90, are for the nucleotides in the codons) to the nucleotides 5248191 to 5248162 on chromosome 11. See the text for details.
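The coordinate bookkeeping in this legend, where amino acid positions are reported as the nucleotide positions of their codons, can be sketched as follows. The function names are hypothetical. The genome mapping assumes the alignment runs along the minus genomic strand, which is inferred from the decreasing chromosome 11 coordinates, and it is only valid within a single alignment block.

```python
def aa_to_query_nt(aa_start, aa_end):
    """Nucleotide-level coordinates of the codons for amino acids aa_start..aa_end (1-based)."""
    return 3 * (aa_start - 1) + 1, 3 * aa_end


def query_nt_to_genome(nt_pos, block_query_start, block_genome_start, strand):
    """Map a query nucleotide coordinate to the genome within one gap-free block.

    On the minus genomic strand the genome coordinate decreases as the query
    coordinate increases; on the plus strand it increases.
    """
    offset = nt_pos - block_query_start
    if strand == "-":
        return block_genome_start - offset
    return block_genome_start + offset


# First section of the first block: amino acids 1 to 20.
first_section = aa_to_query_nt(1, 20)    # (1, 60)
# Second section: amino acids 21 to 30.
second_section = aa_to_query_nt(21, 30)  # (61, 90)

# Genome positions for the section boundaries, starting from 5248251 on chromosome 11.
genome_ends = [query_nt_to_genome(p, 1, 5248251, "-") for p in (1, 60, 61, 90)]

print(first_section, second_section)
print(genome_ends)  # [5248251, 5248192, 5248191, 5248162]
```

The computed boundaries match the coordinates quoted in the legend (1 to 60 and 61–90 on the query line; 5248251 to 5248192 and 5248191 to 5248162 on the genome line).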
Figure 11
UCSC Genome Browser display of the aligned genome region for the first match in Protocol 2. The region displayed in this view is nucleotides 5246831 to 5248251 of chromosome 11. The query sequence alignment is represented in the top track and the UCSC gene track is represented in the second line. A label to the left of this track indicates the symbol for the gene, HBB, represented in that track. Refer to the legend for Figure 6 and the text for additional information on the content of the display.
Figure 12
The BLAT search screen for the “Support Protocol” section. Note that the “Sort output” selection is “chrom, start”. Refer to the legend for Figure 1 for additional information on the content of each search menu option.
Figure 13
Results of the BLAT search in Figure 12. Since the sorting option “chrom, start” was used, the results are sorted by their position on the chromosome as opposed to the “query, score” sorting in Figure 8. In this case, “start” refers to the column START after the STRAND column. Note that the hits are on chromosome 11 between positions 5246831 and 5290908. Refer to the legend for Figure 2 and the text for additional information on the content of each column of the display.
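The difference between the “query, score” and “chrom, start” orderings can be illustrated with a small sort over result rows. This is only a sketch: the rows below are made-up placeholder values, not the hits shown in the figures, and the plain string comparison of chromosome names is a simplification that may not match how the web interface orders chromosomes.

```python
# Hypothetical result rows; every value here is a placeholder.
hits = [
    {"query": "q1", "score": 80,  "chrom": "chr11", "start": 300},
    {"query": "q1", "score": 150, "chrom": "chr11", "start": 200},
    {"query": "q1", "score": 120, "chrom": "chr11", "start": 100},
]

# "query, score": group by query ID, then highest score first.
by_query_score = sorted(hits, key=lambda h: (h["query"], -h["score"]))

# "chrom, start": group by chromosome, then by the genome START column, ascending.
by_chrom_start = sorted(hits, key=lambda h: (h["chrom"], h["start"]))

print([h["score"] for h in by_query_score])  # [150, 120, 80]
print([h["start"] for h in by_chrom_start])  # [100, 200, 300]
```

Sorting by genome position places neighboring hits next to each other, which makes it easier to read off the overall range that a cluster of matches covers on the chromosome.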
Figure 14
UCSC Genome Browser display of the aligned genome region in the search described in the “Support Protocol”. After changing the view in the genome browser to include positions 5246831 and 5290908, the range identified in the previous figure, the browser shows genes corresponding to all six matches: HBB, HBD, HBBP1, HBG1, HBG2, and HBE1. Refer to the legend for Figure 6 and the text for additional information on the content of the display.
Figure 15
The BLAT search screen for Protocol 3. When the Genome was changed to Chimp, the Assembly changed automatically. Refer to the legend for Figure 1 for additional information on the content of each search menu option.
Figure 16
The results of the BLAT search shown in Figure 15. The first match has a much higher score than the second, and it matches over a longer span. Based on the columns CHRO, STRAND, START and END, the second match lies entirely within the span of the first. Refer to the legend for Figure 2 and the text for additional information on the content of each column of the display.
Figure 17
Detailed alignment information for the first match of the query protein and the chimp genome sequences in Protocol 3. The top section of the details page shows, in capital blue letters, the portion of the queried human protein that matched the chimp genome. Below that is the section for the chimp chromosome X sequence, again showing the aligning sequence in capital blue letters. The lowercase black letters indicate regions that are not aligned. The red arrow points to the amino acid threonine, shown as a lowercase black letter t, at the 125th position of the query NP_000531.1, indicating a mismatch at that position with respect to the (translation of the) genome sequence. Refer to the legends for Figures 3 and 4 and to the text for additional information on the content of the display.
Figure 18
Detailed side-by-side alignment information for a portion of the first match of the query protein and chimp genome sequences in Protocol 3. Exact matches between the query and genome sequences are shown by a vertical line. Mismatches are shown by the letter code for the amino acid encoded by the aligned genomic sequence. To illustrate, the mismatch between the 125th amino acid threonine in the human query protein and methionine encoded by the chimp genome at the corresponding position is highlighted by a red rectangle. Refer to the legend for Figure 10 and the text for additional information on the content of the display.
Figure 19
Results of the first match of Protocol 1 displayed as a custom track in the UCSC Genome Browser. The track is labeled “User Supplied Track”. Instructions to generate a custom track using the PSL output format are described in the text.
