PLoS One. 2011 May 11;6(5):e19816.
doi: 10.1371/journal.pone.0019816.

Targeted assembly of short sequence reads

René L Warren et al. PLoS One. 2011.

Abstract

As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphisms, fusions and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
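
The read-recruitment step described in the abstract (keeping only reads that share an exact short word with the target) can be illustrated with a minimal Python sketch. This is not TASR's implementation: the function names `target_words` and `recruit_reads` are hypothetical, the default word length of 15 is taken from the "15-mer read recruitment" mentioned in the Figure 2 caption, and indexing both strands of the target is an assumption based on the same caption. The stringent assembly step that follows recruitment is not shown.

```python
from typing import Iterable, List, Set

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def target_words(target: str, k: int = 15) -> Set[str]:
    """Collect every k-mer from the target and from its reverse complement."""
    target = target.upper()
    words: Set[str] = set()
    for strand in (target, reverse_complement(target)):
        for i in range(len(strand) - k + 1):
            words.add(strand[i:i + k])
    return words


def recruit_reads(reads: Iterable[str], target: str, k: int = 15) -> List[str]:
    """Keep reads that contain at least one exact k-mer match to the target."""
    words = target_words(target, k)
    recruited: List[str] = []
    for read in reads:
        seq = read.upper()
        # A read qualifies as soon as any of its k-mers is found in the word set.
        if any(seq[i:i + k] in words for i in range(len(seq) - k + 1)):
            recruited.append(read)
    return recruited


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    target = "ACGTTGCAAGCTTAGGCTAACGGATCCATG"
    reads = [
        "TTTT" + target[5:25] + "GGGG",      # overlaps the target (forward strand)
        reverse_complement(target[0:20]),    # overlaps the target (reverse strand)
        "A" * 30,                            # unrelated read
    ]
    for read in recruit_reads(reads, target):
        print(read)
```

Because only reads sharing a word with the target are retained, the subsequent assembly operates on a small subset of the data set rather than on all reads.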

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Detection of true positive versus false positive SNVs in lobular breast cancer.
TASR was run incrementally on up to 2 billion whole-genome shotgun sequencing (WGSS) reads of 51 nt and 76 nt from a lobular breast cancer (LBC) sample, providing 5- to 36-fold coverage of the 3 Gbp human genome. As targets, we used 51 nt sequences of three kinds: 31 containing an SNV detected by NGS read alignment and confirmed by Sanger sequencing (true positive), 31 matching sequences containing the reference base instead (reference), and 31 containing an SNV detected by NGS read alignment but not confirmed by Sanger sequencing (false positive). Although close to twice as much WGSS data had been generated from the LBC sample, a fraction of it (∼19-fold) is sufficient to confirm most (68%) true positive SNVs.
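
The fold-coverage figures in this caption follow the usual relation: coverage = (number of reads × read length) / genome size. The short Python sketch below works through that arithmetic. The helper names are hypothetical, and the example assumes a single uniform read length, whereas the actual data set mixes 51 nt and 76 nt reads in proportions the caption does not state, so the numbers are illustrative only.

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads of a given length that reaches the coverage."""
    return math.ceil(coverage * genome_size / read_length)


if __name__ == "__main__":
    HUMAN_GENOME = 3_000_000_000  # 3 Gbp, as stated in the caption

    # Illustrative only: how many uniform-length reads give ~19-fold coverage?
    for length in (51, 76):
        n = reads_needed(19, length, HUMAN_GENOME)
        print(f"{length} nt reads for 19-fold coverage: {n:,}")
```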
Figure 2. De novo assembly of prostate carcinoma RNA-seq data.
Using a TMPRSS2:ERG target sequence that differs from a TMPRSS2 target by a single base (underlined), TASR generated a contig that captures 18 ERG-specific bases fused to exon 1 of TMPRSS2 in a prostate adenocarcinoma sample (SRA accession SRX027125). These bases were not specified in the target sequence and were thus unknown from the original hypothesis. A total of 121 reads span the TMPRSS2:ERG fusion coordinate (underlined base). Higher base coverage is expected in the middle of the contig, where 15-mer read recruitment reaches a maximum for both strands and is unaffected by the limiting effects of the minimum overlap (-m) option at the edges of the sequence target. This highlights the importance of using a sequence target that is sufficiently long, at least the same length as the input reads. From this result, it is very likely that the prostate adenocarcinoma sample contains an admixture of TMPRSS2 transcripts, including the TMPRSS2{NM_005656.2}:r.1_71_ERG{NM_004449.3}:r.226_3097 fusion, and that these have varied abundance, as reflected by the high depth of coverage.
