Geoseq: a tool for dissecting deep-sequencing datasets

James Gurtowski¹, Anthony Cancio, Hardik Shah, Chaya Levovitz, Ajish George, Robert Homann, Ravi Sachidanandam

PMID: 20939882
PMCID: PMC2972303
DOI: 10.1186/1471-2105-11-506

James Gurtowski et al. BMC Bioinformatics. 2010 Oct 12;11:506. doi: 10.1186/1471-2105-11-506.

Affiliation

¹ Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, 1425 Madison Avenue, New York, NY 10029, USA.

Abstract

Background: Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA), both hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they remain underused because datasets of interest are difficult to locate and analyze.

Results: Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep-sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and uses suffix arrays to measure the coverage of each tile in a sequence library. The user can upload custom target sequences or search by gene/miRNA name, and gets back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment.
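
The tile-counting idea described above can be illustrated with a minimal Python sketch. This is not Geoseq's implementation: the function names, the `$` read separator and the naive suffix-array construction are all assumptions chosen for brevity, and a production index would use a dedicated suffix-array builder.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, suffix_array, pattern, upper):
    """Binary search for the first suffix whose prefix is >= pattern
    (or > pattern when upper is True)."""
    lo, hi = 0, len(suffix_array)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = text[start:start + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text, suffix_array, pattern):
    """Number of times pattern occurs in text, via two binary searches."""
    return (_bound(text, suffix_array, pattern, True)
            - _bound(text, suffix_array, pattern, False))


def tile_counts(query, reads, tile_size):
    """Count library occurrences of every tile of length tile_size in query."""
    # Assumption: reads are joined with '$' so no tile can match across reads.
    text = "$".join(reads) + "$"
    suffix_array = build_suffix_array(text)
    return [
        count_occurrences(text, suffix_array, query[i:i + tile_size])
        for i in range(len(query) - tile_size + 1)
    ]


# Toy library, invented for illustration only.
toy_reads = ["ACGTAC", "GTACGT", "ACGT"]
print(tile_counts("ACGTAC", toy_reads, 4))  # [3, 1, 2]
```

The index is built once per library and then queried once per tile, which is what makes it practical to scan a short query against a large, pre-indexed read set.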

Conclusions: Geoseq simplifies the analysis of small sets of sequences against deep-sequencing datasets, as well as the identification of public datasets of interest. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries, along with their mature and star sequences, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.

Figures

**Figure 1**
**Geoseq Architecture - Deep sequencing datasets are retrieved from public datastores such as NCBI's SRA**. The metadata gets processed and organized in a database. The sequences are indexed using a suffix array. The suffix array indexes are saved to a storage cluster. The client can browse or search the metadata via a browser. When the client selects a dataset of interest, an analysis request is submitted to the processing cluster through a JSON service. The processing cluster retrieves the suffix array indexes from the storage cluster and performs the analysis. All results are returned to the browser either as graphs or as downloadable links.

**Figure 2**
**Dataset Browsing/Selection**. The upper-left pane allows filtering of the datasets. The main pane shows the results of the filtering. Details of the run can be expanded by clicking on the arrow before the name. If the read is indexed, a button appears that selects the dataset for analysis and places the name in the lower-left pane. After the selection is done, clicking on the *Analyze* button will launch the form to build a search query (Figure 3). Only pre-indexed libraries can be selected for further analysis.

**Figure 3**
**Query Building**. A box at the bottom of the page confirms the datasets previously selected. The user can either enter a custom sequence (left) or use a miRNA from our database (right). This will be used to search the previously selected dataset. The result of the analysis is shown in Figures 6 and 7.

**Figure 4**
**Geoseq Analysis**. Geoseq uses a tiered process to analyze sequencing data. A. Reads from a deep sequencing experiment are converted into a suffix index for rapid querying. Querying of an input sequence against the experiment is done by breaking the sequence into all possible tiles of a given size and finding the frequency of each tile in the suffix index. B. Visualization of the results can be done in either the Starts View which shows the hits for a tile at the position of the start of the tile, or the Coverage View which integrates hits from all tiles covering a particular position.
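
The relationship between the two views in panel B can be sketched as follows. This is an illustrative reading of the caption rather than Geoseq's code; `starts_view` and `coverage_view` are hypothetical names, and the per-tile hit counts are assumed to be already available, indexed by tile start position.

```python
def starts_view(tile_hits, query_length):
    """Hits for each tile, reported at the tile's start position.

    Positions too close to the end of the query to start a full tile get 0.
    """
    return list(tile_hits) + [0] * (query_length - len(tile_hits))


def coverage_view(tile_hits, tile_size):
    """Sum, for every query position, the hits of all tiles covering it."""
    query_length = len(tile_hits) + tile_size - 1
    coverage = [0] * query_length
    for start, hits in enumerate(tile_hits):
        # A tile starting at `start` covers positions start .. start + tile_size - 1.
        for pos in range(start, start + tile_size):
            coverage[pos] += hits
    return coverage


# Invented hit counts for a 6-base query tiled with size 4.
hits = [3, 1, 2]
print(starts_view(hits, 6))    # [3, 1, 2, 0, 0, 0]
print(coverage_view(hits, 4))  # [3, 4, 6, 6, 3, 2]
```

The Starts View keeps the exact position at which reads begin, while the Coverage View smooths the signal over the width of a tile.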

**Figure 5**
**Choice of Tile-size**. The number of hits for a particular miRNA changes with the tile-size used to query an sRNA-Seq dataset. The size of the tile is varied, while the start of the tile is held fixed at the beginning of the mature sequence of mmu-mir-24-1. At lower tile-sizes, the number of hits corresponds to fragments of the mature sequence that were sequenced. As the tile-size approaches the size of the miRNA, there is a drop-off in hits as there exist fewer reads that span the entire mature sequence. Finally, when the tile-size exceeds the length of the mature miRNA (22 bp), the number of hits drops to zero. The tile-size thus controls the trade-off between sensitivity and specificity: larger tile-sizes increase specificity, while smaller tile-sizes increase sensitivity.
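
The tile-size sweep described in this caption can be mimicked on toy data. The sequences below are invented for illustration and are not the mmu-mir-24-1 sequence or real reads; `hits_by_tile_size` is a hypothetical helper that uses plain substring matching instead of a suffix-array index.

```python
def hits_by_tile_size(query, reads, start, sizes):
    """For each tile size, count reads containing the tile that begins at `start`."""
    result = {}
    for size in sizes:
        tile = query[start:start + size]
        if len(tile) < size:
            # The tile would run past the end of the query, so nothing can match.
            result[size] = 0
            continue
        result[size] = sum(1 for read in reads if tile in read)
    return result


# Toy example: an 8-base "mature" sequence followed by flanking bases.
toy_query = "ACGTTGCA" + "TTT"
toy_reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTG", "ACGT"]  # full-length reads and fragments

print(hits_by_tile_size(toy_query, toy_reads, start=0, sizes=[4, 6, 8, 9]))
# {4: 4, 6: 3, 8: 2, 9: 0}
```

Short tiles are matched by fragments as well as full-length reads, and the count falls to zero once the tile extends past the mature sequence, which is the same qualitative pattern the caption describes.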

**Figure 6**
**mRNA-seq**. Using the sequences for splice-variant forms of a gene, NM_182470 and NM_002654 for PKM2, in Geoseq allows identification of the correct version that is expressed in the sample being sequenced. The second isoform, NM_182470, shows gaps in coverage (highlighted) indicating that only the first isoform, NM_002654, is expressed.
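
One simple way to flag gaps of this kind in a coverage track is to look for runs of zero coverage. The sketch below is an assumption about how such gaps could be located, not a description of how Geoseq highlights them; `find_gaps` and its `min_length` parameter are hypothetical.

```python
def find_gaps(coverage, min_length=1):
    """Return half-open (start, end) intervals where coverage is zero
    for at least min_length consecutive positions."""
    gaps = []
    gap_start = None
    for pos, depth in enumerate(coverage):
        if depth == 0 and gap_start is None:
            gap_start = pos  # a new zero-coverage run begins here
        elif depth != 0 and gap_start is not None:
            if pos - gap_start >= min_length:
                gaps.append((gap_start, pos))
            gap_start = None
    # Close a run that extends to the end of the track.
    if gap_start is not None and len(coverage) - gap_start >= min_length:
        gaps.append((gap_start, len(coverage)))
    return gaps


# Invented coverage track for illustration.
track = [5, 4, 0, 0, 0, 3, 0, 2, 0, 0]
print(find_gaps(track))                # [(2, 5), (6, 7), (8, 10)]
print(find_gaps(track, min_length=2))  # [(2, 5), (8, 10)]
```

Requiring a minimum gap length helps ignore isolated uncovered positions; an isoform whose track shows long gaps that its sibling's track lacks would be a candidate for not being expressed in that sample.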

**Figure 7**
**microRNA Analysis I**. The output of a Geoseq query run on sRNA-Seq libraries. In the *Starts* view, the histogram displays the number of times the k-mer starting at a position was found in the library (there is one histogram per user-selected library). In the *Coverage* view, the numbers represent the sum of contributions from each tile that covers the position. This figure shows the results for the miRNA dme-mir-10 queried against two D. melanogaster libraries, SRR001664 and SRR001343. The tracks below the histograms indicate the positions of known mature and star sequences for mir-10 according to miRBase [8]. Below this is shown the folded structure of the pre-miRNA from RNAfold [15]. In experiment SRR001664, the expression of the mature sequence is greater than the expression of the star, which is the canonical case. However, in library SRR001343, the expression of the canonical star sequence is similar to the expression of the mature sequence. This suggests that the roles of star and mature sequences may, on occasion, be context-dependent.

**Figure 8**
**microRNA analysis II**. Geoseq allows us to identify expression patterns of microRNAs in libraries. Here we show expression for mmu-miR-712 in SRR023850 which does not exhibit the canonical pattern seen for most other microRNA libraries (Figure 7). This was used to identify mis-annotated miRNAs from miRBase that are listed in Table 1.

References

1. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotech. 2008;26(10):1135–1145. doi: 10.1038/nbt1486.
2. Shumway M, Cochrane G, Sugawara H. Archiving next generation sequencing data. Nucleic Acids Research. 2010. pp. D870–871.
3. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Research. 2005;15(10):1451–5. doi: 10.1101/gr.4086505.
4. Homann R, Fleer D, Giegerich R, Rehmsmeier M. mkESA: enhanced suffix array construction tool. Bioinformatics (Oxford, England). 2009;25(8):1084–1085. doi: 10.1093/bioinformatics/btp112.
5. Faith JJ, Olson AJ, Gardner TS, Sachidanandam R. Lightweight genome viewer: portable software for browsing genomics data in its chromosomal context. BMC Bioinformatics. 2007;8:344. doi: 10.1186/1471-2105-8-344.
