Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Jan;10(1):71-3.
doi: 10.1038/nmeth.2251. Epub 2012 Nov 18.

Streaming fragment assignment for real-time analysis of sequencing experiments

Affiliations

Streaming fragment assignment for real-time analysis of sequencing experiments

Adam Roberts et al. Nat Methods. 2013 Jan.

Abstract

We present eXpress, a software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine abundances of sequenced molecules in real time and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data. We demonstrate its use on RNA-seq data and show that eXpress achieves greater efficiency than other quantification methods.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Overview of eXpress. The input consists of either single or paired-end reads aligned to a set of target sequences and provided in a file or streamed to eXpress. For single fragments that map to multiple sites, assignment probabilities are calculated for each site given previous estimates of target sequence abundances (initially a uniform prior is used). Next, a “forgetting mass” is calculated and partial counts are distributed to the target sequences according to the assignment probability. Parameters for fragment length distribution, sequence bias, and sequence read errors are updated in a similar fashion and used in the next round of alignment. Once the input data has been processed, relative abundances are calculated from the count distributions, along with distributions of estimated and effective counts. An alignment file that includes mapping probabilities can be generated. eXpress can determine whether further sequencing is needed by monitoring relative abundances, making it applicable to real-time sequencing and analysis.
Figure 2
Figure 2
(a) Accuracy of eXpress, RSEM, and Cufflinks at multiple sequencing depths in a simulation of one billion read pair fragments generated with (dashed lines) and without (solid lines) sequencing bias. Accuracy for different abundance levels can be found in Supplementary Figure 4. (b) Comparison of time and memory requirements. Since eXpress only stores counts for each of the targets and auxiliary parameters, its memory use is constant in the number of fragments processed. The running time scales linearly with the number of fragments. Stars represent an imposed memory constraint of 24 GB or a software crash
Figure 3
Figure 3
Example of abundance estimation by eXpress, RSEM, and Cufflinks at different depths of simulated data for the three-isoform human gene UGT3A2. The RefSeq annotation is shown at top. Dashed lines indicate the ground-truth relative abundances used for the simulation. eXpress only processes each fragment once whereas RSEM and Cufflinks perform many iterations before converging to the maximum likelihood solution. Nevertheless, as more fragments are observed, all three algorithms converge toward the correct answer at approximately the same depth. In fact, eXpress is more robust than the batch algorithms at low depth due to its use of a prior. The stop sign shows where eXpress using an optimal forgetting factor would automatically stop if a convergence threshold was set to 10-6 in terms of the Kullback-Leibler divergence between the abundance estimates at intervals of 100 fragments. The lower x-axis shows the estimated depth required to observe the corresponding number of reads mapping to this gene (upper x-axis) at a fixed gene-level abundance. Abundance was calculated using a human embryonic stem cell RNA-seq dataset (Online Methods).

References

    1. Lipman D, Flicek P, Salzberg S, Gerstein M, Knight R. Genome Biology. 2011;12:3. - PMC - PubMed
    1. Wold B, Myers RM. Nature Methods. 2008;5:1. - PubMed
    1. Hashimoto T, de Hoon MJL, Grimmond SM, Daub CO, Hayashizaki Y, Faulkner GJ. Bioinformatics. 2009;25:2613–2614. - PubMed
    1. Li B, Dewey CN. BMC Bioinformatics. 2011;12:323. - PMC - PubMed
    1. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Nature Biotechnology. 2010;28:511–515. - PMC - PubMed

Publication types