Streaming fragment assignment for real-time analysis of sequencing experiments

Adam Roberts¹, Lior Pachter

PMID: 23160280
PMCID: PMC3880119
DOI: 10.1038/nmeth.2251

Nat Methods. 2013 Jan;10(1):71-3. doi: 10.1038/nmeth.2251. Epub 2012 Nov 18.

Affiliation

¹ Department of Computer Science, University of California, Berkeley, Berkeley, California, USA.

Abstract

We present eXpress, a software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine abundances of sequenced molecules in real time and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data. We demonstrate its use on RNA-seq data and show that eXpress achieves greater efficiency than other quantification methods.

Figures

**Figure 1**
Overview of eXpress. The input consists of either single or paired-end reads aligned to a set of target sequences and provided in a file or streamed to eXpress. For single fragments that map to multiple sites, assignment probabilities are calculated for each site given previous estimates of target sequence abundances (initially a uniform prior is used). Next, a “forgetting mass” is calculated and partial counts are distributed to the target sequences according to the assignment probability. Parameters for fragment length distribution, sequence bias, and sequence read errors are updated in a similar fashion and used in the next round of alignment. Once the input data has been processed, relative abundances are calculated from the count distributions, along with distributions of estimated and effective counts. An alignment file that includes mapping probabilities can be generated. eXpress can determine whether further sequencing is needed by monitoring relative abundances, making it applicable to real-time sequencing and analysis.
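
The assignment loop described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the eXpress implementation. The function name `stream_assign`, the exponent `c`, and the particular forgetting-mass schedule are assumptions of this sketch, and it ignores target length, the fragment length distribution, sequence bias and read errors. The uniform prior is modelled as one pseudo-count per target.

```python
from typing import Dict, Iterable, List


def stream_assign(
    fragments: Iterable[List[str]],
    targets: List[str],
    c: float = 0.85,
) -> Dict[str, float]:
    """Process each fragment once and return relative abundances.

    `fragments` yields, per fragment, the list of distinct targets it maps to.
    `c` controls a hypothetical forgetting schedule gamma_i = 1 / i**c.
    """
    # Uniform prior: every target starts with the same pseudo-count.
    counts = {t: 1.0 for t in targets}
    mass, gamma_prev = 1.0, 1.0

    for i, sites in enumerate(fragments, start=1):
        if i > 1:
            # Assumed schedule: the mass grows so that later fragments,
            # assigned with better estimates, carry more weight.
            gamma = 1.0 / i ** c
            mass *= gamma / (1.0 - gamma) / gamma_prev
            gamma_prev = gamma

        # Assignment probabilities from the previous abundance estimates.
        total = sum(counts[t] for t in sites)
        probs = {t: counts[t] / total for t in sites}

        # Distribute partial counts according to those probabilities.
        for t, p in probs.items():
            counts[t] += mass * p

    norm = sum(counts.values())
    return {t: n / norm for t, n in counts.items()}
```

Only the per-target counts and two scalars are kept between fragments, so memory does not grow with the number of fragments processed. With `c = 1.0` the assumed schedule gives every fragment the same mass, which reduces the update to plain running counts.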

**Figure 2**
(a) Accuracy of eXpress, RSEM, and Cufflinks at multiple sequencing depths in a simulation of one billion read pair fragments generated with (dashed lines) and without (solid lines) sequencing bias. Accuracy for different abundance levels can be found in Supplementary Figure 4. (b) Comparison of time and memory requirements. Since eXpress only stores counts for each of the targets and auxiliary parameters, its memory use is constant in the number of fragments processed. The running time scales linearly with the number of fragments. Stars represent an imposed memory constraint of 24 GB or a software crash.

**Figure 3**
Example of abundance estimation by eXpress, RSEM, and Cufflinks at different depths of simulated data for the three-isoform human gene UGT3A2. The RefSeq annotation is shown at top. Dashed lines indicate the ground-truth relative abundances used for the simulation. eXpress only processes each fragment once whereas RSEM and Cufflinks perform many iterations before converging to the maximum likelihood solution. Nevertheless, as more fragments are observed, all three algorithms converge toward the correct answer at approximately the same depth. In fact, eXpress is more robust than the batch algorithms at low depth due to its use of a prior. The stop sign shows where eXpress using an optimal forgetting factor would automatically stop if a convergence threshold was set to 10^-6 in terms of the Kullback-Leibler divergence between the abundance estimates at intervals of 100 fragments. The lower x-axis shows the estimated depth required to observe the corresponding number of reads mapping to this gene (upper x-axis) at a fixed gene-level abundance. Abundance was calculated using a human embryonic stem cell RNA-seq dataset (Online Methods).
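
The stopping rule mentioned in the Figure 3 caption compares successive abundance estimates by their Kullback-Leibler divergence. A minimal sketch of such a check follows. The function names and the direction of the divergence (current estimate relative to the previous one) are assumptions of this sketch, and it assumes every target has a nonzero previous abundance, as it would with a prior.

```python
import math
from typing import Dict


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Kullback-Leibler divergence D(p || q) over a shared set of targets.

    Terms with p[t] == 0 contribute nothing; q[t] is assumed to be positive.
    """
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0.0)


def has_converged(
    previous: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 1e-6,
) -> bool:
    """Return True when the estimates moved less than `threshold`."""
    return kl_divergence(current, previous) < threshold
```

In a streaming run, `has_converged` would be called on snapshots of the relative abundances taken at a fixed interval, for example every 100 fragments as in the caption, and processing would stop at the first interval where it returns true.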
