Targeted assembly of short sequence reads

René L Warren¹, Robert A Holt

PMID: 21589938
PMCID: PMC3092772
DOI: 10.1371/journal.pone.0019816

René L Warren et al. PLoS One. 2011 May 11;6(5):e19816. doi: 10.1371/journal.pone.0019816.

Affiliation

¹ Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada. rwarren@bcgsc.ca

Abstract

As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphisms, fusions and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
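The read-recruitment step described in the abstract can be illustrated with a minimal Python sketch. It is not TASR's implementation: the function names (`reverse_complement`, `target_words`, `recruit_reads`) and the example sequences are hypothetical. The word length of 15 follows the 15-mer recruitment mentioned in the Figure 2 caption, and the sketch assumes that a read is kept when any of its words exactly matches a word from either strand of the target. The stringent assembly step that follows recruitment is not shown.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def target_words(target, k=15):
    """Collect every k-length word from both strands of the target."""
    words = set()
    for strand in (target, reverse_complement(target)):
        for i in range(len(strand) - k + 1):
            words.add(strand[i:i + k])
    return words


def recruit_reads(reads, target, k=15):
    """Keep reads that share at least one exact k-mer with the target.

    Assumption: any single shared word is enough to recruit a read.
    """
    words = target_words(target, k)
    recruited = []
    for read in reads:
        if any(read[i:i + k] in words for i in range(len(read) - k + 1)):
            recruited.append(read)
    return recruited


# Hypothetical example data, for illustration only.
target = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
forward_read = target[10:] + "TTTTT"                      # extends past the target's 3' end
reverse_read = reverse_complement("GGG" + target[:20])    # opposite strand, 5' flank
unrelated_read = "T" * 25                                 # shares no 15-mer with the target

kept = recruit_reads([forward_read, reverse_read, unrelated_read], target)
print(len(kept), "of 3 reads recruited")
```

Because recruited reads may extend beyond the target, they carry the flanking sequence that a subsequent assembly step could use to extend the consensus.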

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Detection of true positive versus false positive SNVs in lobular breast cancer.**
TASR was run incrementally on up to 2 billion 51 nt and 76 nt whole-genome shotgun sequencing (WGSS) reads from a lobular breast cancer (LBC) sample, providing 5- to 36-fold coverage of the 3 Gbp human genome. The targets were 51 nt sequences of three kinds: sequences containing one of 31 SNVs detected by NGS read alignment and confirmed by Sanger sequencing (true positive), 31 matching sequences containing the reference base instead (reference), and sequences containing one of 31 SNVs detected by NGS read alignment but not confirmed by Sanger sequencing (false positive). Although close to twice as much WGSS data had been generated from the LBC sample, a fraction of it (∼19-fold coverage) is sufficient to confirm most (68%) true positive SNVs.

**Figure 2. De novo assembly of prostate carcinoma RNA-seq data.**
Using a *TMPRSS2:ERG* target sequence that differs from a *TMPRSS2* target by a single base (underlined), TASR generated a contig that captures 18 *ERG*-specific bases fused to exon 1 of *TMPRSS2* in a prostate adenocarcinoma sample (SRA accession SRX027125). These bases were not specified in the target sequence and were thus unknown under the original hypothesis. A total of 121 reads span the *TMPRSS2:ERG* fusion coordinate (underlined base). Higher base coverage is expected in the middle of the contig, where 15-mer read recruitment reaches a maximum for both strands and is unaffected by the limiting effects of the minimum overlap (-m) option at the edges of the target sequence. This highlights the importance of using a target sequence that is sufficiently long and at least the same length as the input reads. From this result, it is very likely that the prostate adenocarcinoma sample contains an admixture of *TMPRSS2* transcripts, including the TMPRSS2{NM_005656.2}:r.1_71_ERG{NM_004449.3}:r.226_3097 fusion, and that these vary in abundance, as reflected by the high depth of coverage.
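The expectation of higher coverage in the middle of the contig can be illustrated with a toy model in Python. The model is an assumption made for illustration and does not reproduce TASR's logic: it counts, for each position, how many distinct read placements on one strand both cover that position and overlap the target by at least one full 15-mer. The function name `recruitable_placements` is hypothetical, and the effect of the minimum overlap (-m) option is not modeled.

```python
def recruitable_placements(target_len, read_len, k=15):
    """Count, per position, the read placements that cover it and share
    at least one full k-mer with the target (single strand, toy model).

    Positions are relative to the target start; negative positions and
    positions >= target_len lie in the flanking sequence.
    Assumes target_len >= k and read_len >= k.
    """
    first_start = -(read_len - k)   # leftmost start overlapping the target by k bases
    last_start = target_len - k     # rightmost start overlapping the target by k bases
    depth = {}
    for start in range(first_start, last_start + 1):
        for pos in range(start, start + read_len):
            depth[pos] = depth.get(pos, 0) + 1
    return depth


# Illustrative values: a 51 nt target and 51 nt reads.
depth = recruitable_placements(target_len=51, read_len=51, k=15)
print("centre of target:", depth[25])
print("first target base:", depth[0])
print("outermost flank base:", depth[min(depth)])
```

Under these assumptions a central base can be covered by as many placements as the read is long, while a base at the target edge loses the placements that overlap the target by fewer than 15 bases, and coverage tapers further into the flanks.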
