PLoS One. 2012;7(10):e46679.
doi: 10.1371/journal.pone.0046679. Epub 2012 Oct 4.

Improving PacBio long read accuracy by short read alignment

Kin Fai Au et al. PLoS One. 2012.

Abstract

Recently developed third-generation sequencing (TGS) generates much longer reads than second-generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback of most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show that LSC can correct PacBio long reads to reduce the error rate more than 3-fold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC achieves more than double the sensitivity at similar specificity.
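The HC transformation mentioned above can be illustrated with a minimal Python sketch. This is not LSC's own code: the function names `hc_compress` and `hc_decompress` are hypothetical, and the sketch assumes that compression collapses each homopolymer run to a single base while recording the run lengths so that the transformation can later be reversed.

```python
from itertools import groupby


def hc_compress(seq):
    """Collapse each homopolymer run to one base and record the run lengths.

    Hypothetical helper for illustration; not taken from the LSC source.
    """
    bases = []
    runs = []
    for base, group in groupby(seq):
        bases.append(base)
        runs.append(len(list(group)))
    return "".join(bases), runs


def hc_decompress(compressed, runs):
    """Rebuild the original sequence from compressed bases and run lengths."""
    return "".join(base * n for base, n in zip(compressed, runs))


read = "AAACCGTTTTA"
compressed, runs = hc_compress(read)
restored = hc_decompress(compressed, runs)

# A read with homopolymer-length errors compresses to the same string,
# which is why alignment in compressed space tolerates such errors.
noisy_compressed, _ = hc_compress("AACCGTTTA")
```

Because `read` and the noisy variant share the same compressed form, an aligner working on compressed sequences sees no mismatch between them; the run lengths are what a later correction step would have to restore.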

Conflict of interest statement

Competing Interests: J.G.U. and L.L. are full-time employees and stock holders of Pacific Biosciences, a company commercializing single-molecule, real-time nucleic acid sequencing technologies.

Figures

Figure 1. The workflow of standard LSC and the outline of error correction based on HC transformation.
(a) LSC consists of five steps: HC transformation of SRs and LRs, SR quality control, SR-LR alignment, error correction and decompression transformation. LSC outputs the sequence from the left-most SR-covered point to the right-most SR-covered one. (b) In the SR-LR layout, correction points are of four types: HC points, point mismatches, deletions and insertions. Each correction point is treated independently and, for the first three types, replaced by the consensus sequence from the SRs. An insertion sequence at position i is treated as a whole at the gap between positions i and i+1 of the compressed LR. A consensus sequence for this gap is inserted into the final output at the corresponding position.
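The per-point correction described in panel (b) can be sketched as follows. The caption does not say how the consensus is computed, so this sketch assumes a simple majority vote over the SR segments aligned to one correction point; `consensus` and `correct_point` are hypothetical names, not LSC functions.

```python
from collections import Counter


def consensus(candidates):
    """Return the most frequent candidate, or None if there are none.

    Assumption: majority vote; ties resolve to the candidate seen first.
    """
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


def correct_point(lr_segment, sr_segments):
    """Replace the LR segment at one correction point with the SR consensus.

    If no SR covers the point, the LR segment is kept unchanged.
    """
    winner = consensus(sr_segments)
    return lr_segment if winner is None else winner


# HC point: the LR reports a run of two A's, most SRs report three.
hc_fixed = correct_point("AA", ["AAA", "AAA", "AA"])

# Point mismatch: the LR has G, the SRs agree on T.
mismatch_fixed = correct_point("G", ["T", "T", "T"])

# Uncovered point: nothing to vote with, so the LR base stays.
uncovered = correct_point("C", [])
```

Treating each point independently keeps the sketch simple; an insertion would be handled the same way, with the SR segments being the sequences observed in the gap between positions i and i+1.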
Figure 2. The histogram of sequence identities of ecLRs by LSC and rLRs.
Red bars are the LSC ecLRs and purple ones are rLRs. After correction, many more ecLRs have accuracy higher than 0.9.
Figure 3. The comparison of sequence identities between SR-covered and SR-uncovered regions in each ecLR.
The sequence identity distribution of SR-covered regions is in red and that of SR-uncovered regions is in green.
Figure 4. The scatter plots of SR-covered sequence percentage (SP) and sequence identity of ecLRs.
(a) Overview. (b) Zoom-in view for SP from 0.2 to 1.0 and sequence identity from 0.8 to 1.0. Sequence identity is positively related to SP.
Figure 5. The histogram of lengths of LSC ecLRs (red bars) and PacBioToCA ecLRs (purple bars).
There are many more ecLRs from LSC than from PacBioToCA in every bin.
Figure 6. The pie chart of the LSC ecLRs (I ≥ 0.9).
The LSC ecLRs are categorized by their identities and lengths. 22.40% of these outputs are comparable to the PacBioToCA results, while LSC also outputs many other ecLRs of various lengths with good accuracy.
Figure 7. The overview and the zoom-in view of a new 3′ UTR isoform of GPM6B detected by an LSC ecLR (4,259 bp).
(a) Without error correction by LSC, the rLR cannot detect two 3′ end exons of this isoform because of the high error rate. GPM6B encodes a membrane glycoprotein that belongs to the proteolipid protein family. Proteolipid protein family members are expressed in most brain regions, and different isoforms of GPM6B could alter cell-to-cell communication. (b) After correction, there are far fewer errors (marked in orange and red) in the exons.

