BMC Bioinformatics. 2010 Oct 12;11:506.
doi: 10.1186/1471-2105-11-506.

Geoseq: a tool for dissecting deep-sequencing datasets

James Gurtowski et al. BMC Bioinformatics. 2010.

Abstract

Background: Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA), both hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they remain underused because datasets of interest are difficult to locate and analyze.

Results: Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep-sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and uses suffix arrays to measure the coverage of each tile in a sequence library. The user can upload custom target sequences or search by gene or miRNA name, and results are returned as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment.

Conclusions: Geoseq simplifies the analysis of small sets of sequences against deep-sequencing datasets, as well as the identification of public datasets of interest. We applied Geoseq to a) identify differential isoform expression in mRNA-seq datasets, b) identify miRNAs (microRNAs) in libraries, along with their mature and star sequences, and c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.

Figures

Figure 1
Geoseq Architecture. Deep-sequencing datasets are retrieved from public data stores such as NCBI's SRA. The metadata is processed and organized in a database. The sequences are indexed using a suffix array, and the suffix array indexes are saved to a storage cluster. The client can browse or search the metadata via a browser. When the client selects a dataset of interest, an analysis request is submitted to the processing cluster through a JSON service. The processing cluster retrieves the suffix array indexes from the storage cluster and performs the analysis. All results are returned to the browser either as graphs or as downloadable links.
Figure 2
Dataset Browsing/Selection. The upper-left pane allows filtering of the datasets. The main pane shows the results of the filtering. Details of a run can be expanded by clicking on the arrow before its name. If the run is indexed, a button appears that selects the dataset for analysis and places its name in the lower-left pane. Once the selection is complete, clicking on the Analyze button launches the form for building a search query (Figure 3). Only pre-indexed libraries can be selected for further analysis.
Figure 3
Query Building. A box at the bottom of the page confirms the datasets previously selected. The user can either enter a custom sequence (left) or use a miRNA from our database (right). This will be used to search the previously selected datasets. The result of the analysis is shown in Figures 6 and 7.
Figure 4
Geoseq Analysis. Geoseq uses a tiered process to analyze sequencing data. A. Reads from a deep sequencing experiment are converted into a suffix index for rapid querying. Querying of an input sequence against the experiment is done by breaking the sequence into all possible tiles of a given size and finding the frequency of each tile in the suffix index. B. Visualization of the results can be done in either the Starts View which shows the hits for a tile at the position of the start of the tile, or the Coverage View which integrates hits from all tiles covering a particular position.
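The tiling and lookup described in this caption can be illustrated with a minimal Python sketch. It is not Geoseq's implementation: the suffix array is built naively, the reads are toy data, and the function names (`build_suffix_array`, `count_occurrences`, `starts_view`, `coverage_view`) are hypothetical names chosen for this example. Reads are joined with a `$` separator, an assumption made here so that a tile cannot match across a read boundary.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern by binary search over sorted suffixes."""
    k = len(pattern)
    # Truncating sorted suffixes to k characters keeps them in sorted order.
    # They are materialized here for clarity, not for efficiency.
    prefixes = [text[i:i + k] for i in suffix_array]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


def starts_view(text, suffix_array, query, tile_size):
    """Hits for each tile, reported at the tile's start position."""
    n_tiles = len(query) - tile_size + 1
    return [
        count_occurrences(text, suffix_array, query[i:i + tile_size])
        for i in range(n_tiles)
    ]


def coverage_view(starts, query_length, tile_size):
    """Sum, per position, the hits of every tile covering that position."""
    coverage = [0] * query_length
    for start, hits in enumerate(starts):
        for pos in range(start, start + tile_size):
            coverage[pos] += hits
    return coverage


# Toy library of two reads, joined with '$' so tiles cannot span reads.
reads = ["ACGTAC", "GTACGT"]
text = "$".join(reads) + "$"
suffix_array = build_suffix_array(text)

query = "ACGTACGT"
starts = starts_view(text, suffix_array, query, tile_size=4)
coverage = coverage_view(starts, len(query), tile_size=4)
print(starts)    # [2, 1, 2, 1, 2]
print(coverage)  # [2, 3, 5, 6, 6, 5, 3, 2]
```

In the Starts View each value belongs to a single tile, while the Coverage View smooths the signal because every position accumulates the hits of all tiles that overlap it.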
Figure 5
Choice of Tile-size. The number of hits for a particular miRNA changes with the tile-size used to query an sRNA-Seq dataset. The size of the tile is varied, while the start of the tile is held fixed at the beginning of the mature sequence of mmu-mir-24-1. At lower tile-sizes, the number of hits corresponds to fragments of the mature sequence that were sequenced. As the tile-size approaches the size of the miRNA, there is a drop-off in hits because fewer reads span the entire mature sequence. Finally, when the tile-size exceeds the length of the mature miRNA (22-bp), the number of hits drops to zero. The tile-size thus controls sensitivity and specificity: larger tile-sizes increase specificity, while smaller tile-sizes increase sensitivity.
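The tile-size sweep in this caption can be mimicked on toy data. The sketch below rests on stated assumptions: the 22-nt "mature" sequence, its flanking bases and the read counts are invented for illustration (they are not mmu-mir-24-1 data), and a plain substring test stands in for the suffix array lookup. The function name `hits_by_tile_size` is hypothetical.

```python
def hits_by_tile_size(reads, reference, start, tile_sizes):
    """For each tile size, count reads containing the tile anchored at start."""
    hits = {}
    for size in tile_sizes:
        tile = reference[start:start + size]
        # Substring test as a stand-in for a suffix array query.
        hits[size] = sum(1 for read in reads if tile in read)
    return hits


# Invented 22-nt mature sequence plus invented downstream flank.
mature = "ACGTTGCAAGCTTAGGCCTAGA"
precursor = mature + "TTTCCCGGG"

# Toy library: full-length mature reads and shorter 5' fragments.
reads = [mature] * 3 + [mature[:18]] * 2 + [mature[:15]] * 4

hits = hits_by_tile_size(reads, precursor, start=0, tile_sizes=range(10, 26))
for size, count in hits.items():
    print(size, count)
```

With this toy library, tiles of up to 15 nt match all nine reads, the count falls as the tile grows past the shorter fragments, and any tile longer than 22 nt matches nothing because no read extends into the flank.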
Figure 6
mRNA-seq. Querying Geoseq with the sequences of the splice variants of a gene (here NM_182470 and NM_002654 for PKM2) identifies which variant is expressed in the sample being sequenced. The second isoform, NM_182470, shows gaps in coverage (highlighted), indicating that only the first isoform, NM_002654, is expressed.
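Reading a coverage track for gaps, as done by eye in this figure, can also be sketched in code. This is only an illustration under assumptions: the coverage values are invented, the isoform labels are placeholders, and `find_gaps` with its `min_len` parameter is a hypothetical helper, not part of Geoseq.

```python
def find_gaps(coverage, min_len=1):
    """Return half-open (start, end) intervals where coverage is zero."""
    gaps = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth == 0:
            if start is None:
                start = pos  # a zero-coverage run begins here
        elif start is not None:
            if pos - start >= min_len:
                gaps.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_len:
        gaps.append((start, len(coverage)))
    return gaps


# Invented per-position coverage for two isoforms of one gene.
coverage_by_isoform = {
    "isoform_A": [5, 4, 6, 3, 2, 4],
    "isoform_B": [5, 4, 0, 0, 0, 4],
}

for name, coverage in coverage_by_isoform.items():
    print(name, find_gaps(coverage))
```

An isoform whose track has no gaps is consistent with expression in the sample, while a run of zero-coverage positions, as in the second toy track, points to a region that is absent from the sequenced transcripts.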
Figure 7
microRNA Analysis I. The output of a Geoseq query run on sRNA-Seq libraries. In the Starts view, the histogram displays the number of times the k-mer starting at a position was found in the library (there is one histogram per user-selected library). In the Coverage view, the numbers represent the sum of contributions from each tile that covers the position. This figure shows the results for the miRNA dme-mir-10 queried against two D. melanogaster libraries, SRR001664 and SRR001343. The tracks below the histograms indicate the positions of known mature and star sequences for mir-10 according to miRBase [8]. Below this is the folded structure of the pre-miRNA from RNAfold [15]. In library SRR001664, the expression of the mature sequence is greater than that of the star sequence, which is the canonical case. However, in library SRR001343, the expression of the canonical star sequence is similar to that of the mature sequence. This suggests that the roles of star and mature sequences may, on occasion, be context-dependent.
Figure 8
microRNA Analysis II. Geoseq allows us to identify expression patterns of microRNAs in libraries. Here we show expression for mmu-miR-712 in SRR023850, which does not exhibit the canonical pattern seen for most other microRNAs (Figure 7). This was used to identify mis-annotated miRNAs from miRBase, which are listed in Table 1.

References

    1. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotech. 2008;26(10):1135–1145. doi: 10.1038/nbt1486.
    2. Shumway M, Cochrane G, Sugawara H. Archiving next generation sequencing data. Nucleic Acids Research. 2010. pp. D870–871.
    3. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Research. 2005;15(10):1451–5. doi: 10.1101/gr.4086505.
    4. Homann R, Fleer D, Giegerich R, Rehmsmeier M. mkESA: enhanced suffix array construction tool. Bioinformatics (Oxford, England) 2009;25(8):1084–1085. doi: 10.1093/bioinformatics/btp112.
    5. Faith JJ, Olson AJ, Gardner TS, Sachidanandam R. Lightweight genome viewer: portable software for browsing genomics data in its chromosomal context. BMC Bioinformatics. 2007;8:344. doi: 10.1186/1471-2105-8-344.
