2013 May 20;3(5):865-80.
doi: 10.1534/g3.113.005967.

PRICE: software for the targeted assembly of components of (Meta) genomic sequence data

J Graham Ruby et al. G3 (Bethesda).

Abstract

Low-cost DNA sequencing technologies have expanded the role for direct nucleic acid sequencing in the analysis of genomes, transcriptomes, and the metagenomes of whole ecosystems. Human and machine comprehension of such large datasets can be simplified via synthesis of sequence fragments into long, contiguous blocks of sequence (contigs), but most of the progress in the field of assembly has focused on genomes in isolation rather than metagenomes. Here, we present software for paired-read iterative contig extension (PRICE), a strategy for focused assembly of particular nucleic acid species using complex metagenomic data as input. We describe the assembly strategy implemented by PRICE and provide examples of its application to the sequence of particular genes, transcripts, and virus genomes from complex multicomponent datasets, including an assembly of the BCBL-1 strain of Kaposi's sarcoma-associated herpesvirus. PRICE is open-source and available for free download (derisilab.ucsf.edu/software/price/ or sourceforge.net/projects/pricedenovo/).

Keywords: KSHV; de novo genome assembly; high-throughput DNA sequencing; metagenomics.

Figures

Figure 1
Schematic views of the PRICE assembly strategy. (A) The three general steps of a PRICE assembly cycle: (1) retrieval of reads that are likely to derive from genomic regions proximal to the edges of initial input contigs; (2) localized assembly of each contig with its gathered proximal reads to yield larger, extended contigs; and (3) collapse of highly redundant contigs that were generated in the prior assembly step (meta-assembly). (B) A more detailed description of steps 1 and 2 from (A): (i) read mapping to the outward-facing edges of the input contigs; (ii) gathering of the paired-ends (green) of the mapped reads, along with other input contigs linked by mapped read pairs (orange); (iii) strand-specific assembly of the gathered reads and linked contigs; and (iv) repetition using the output contigs as input for a new assembly cycle. (C) Scaling strategy for assembly requirements (minimum overlap and minimum percent identity between aligned sequences). Both requirements were scaled in proportion to the log of the number of input sequences (y-axis), with the minimum overlap increasing linearly (left x-axis) and the minimum percent identity approaching 100% asymptotically (right x-axis). Global minimum values for both factors were defined and applied at and below a baseline number of input sequences (red dashed lines). (D) Hierarchical workflow for local assemblies using a series of different strategies, with subsequent steps increasing both sensitivity and computational demand. The same steps apply to meta-assembly, but the de Bruijn graph method is not applied in that case and the final gapped and ungapped alignments are limited to cases of extensively overlapping sequences.
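The scaling described in panel (C) can be illustrated with a minimal sketch. The caption gives only the shape of the curves (minimum overlap linear in the log of the input count, minimum identity approaching 100% asymptotically, global minima at and below a baseline), so the functional forms and every parameter value below (`baseline_n`, `min_overlap`, `min_identity`, `overlap_per_log`, `identity_rate`) are assumptions chosen for illustration, not PRICE's actual defaults or formulas.

```python
import math


def scaled_thresholds(n_seqs, baseline_n=100, min_overlap=35, min_identity=0.85,
                      overlap_per_log=10.0, identity_rate=0.5):
    """Return (minimum overlap in nt, minimum identity as a fraction).

    All default values are hypothetical; they only reproduce the curve
    shapes described in the caption.
    """
    # At or below the baseline input size, the global minima apply.
    if n_seqs <= baseline_n:
        return min_overlap, min_identity
    # Both requirements scale with the log of the number of input sequences.
    excess = math.log10(n_seqs / baseline_n)
    # Minimum overlap grows linearly in log space.
    overlap = round(min_overlap + overlap_per_log * excess)
    # The tolerated mismatch fraction decays exponentially in log space,
    # so identity approaches 1.0 without ever reaching it.
    identity = 1.0 - (1.0 - min_identity) * math.exp(-identity_rate * excess)
    return overlap, identity
```

Under these assumed parameters, an assembly job with ten times the baseline number of sequences would require 10 nt more overlap and a somewhat higher identity than the global minimum.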
Figure 2
De novo assembly of an RNA virus genome from a metagenomic dataset. (A) Scale and genic structure of LSV2, a positive-strand RNA genome encoding three ORFs: ORF1 (function unknown), RdRP, and capsid (Runckel et al. 2011). (B) Assembly of LSV2 by PRICE seeded with a single 65nt read. Contigs from each step of a 12-cycle PRICE assembly aligned to the single 12th-cycle output contig. (C) Percentage of reads from the full input dataset that could be aligned to the GenBank reference LSV1 (HQ871931) or LSV2 (HQ888865) genomes, requiring ≥90% identity across the entire read length. (D) Nucleotide % identity of nonoverlapping 50nt windows of the PRICE-assembled LSV2 vs. the reference LSV1 (orange) and LSV2 (blue) genomes. (E) Amino acid % identity of nonoverlapping 10aa windows across each of the three LSV ORFs (starts and ends defined by the reference LSV2 annotations) vs. the protein sequences for ORF1/RdRP/capsid from LSV1 (orange; AEH26192/AEH26193/AEH26194) and LSV2 (blue; AEH26187/AEH26189/AEH26188). (F) Read coverage across the PRICE-assembled LSV2 genome. Coverage values are averaged across nonoverlapping 10nt windows. (G) Contigs from assemblies performed on the same paired-read dataset as above (Dryad repository: doi:10.5061/dryad.9n8rh) using MetaVelvet (Zerbino and Birney 2008; Namiki et al. 2012) (blue), SOAPdenovo (Li et al. 2010b) (orange), IDBA-UD (Peng et al. 2012) (green), and Trinity (Grabherr et al. 2011) (red). Bars indicate alignments between contigs output by those assemblers and the PRICE-assembled LSV2 generated by BLASTn (Altschul et al. 1990) and covering ≥150nt on the PRICE LSV2. Analysis was limited to contigs ≥200nt. Contigs are marked whose lengths are >125% (~) or >200% (*) that of their aligned segments. (H) PRICE sensitivity: the number of nucleotides from LSV2 encompassed by the alignments shown in (G) for each assembly. (I) The N50 length for alignments shown in (G). (J) The redundancy of the aligned portions of each assembly shown in (G), calculated as the summed lengths of the aligned segments divided by the length of their total footprint on the LSV2 assembly from (H). (K) PRICE specificity: the % of nucleotides or contigs from each assembly that were aligned to LSV2 in (G). Chimeric contigs that only partially aligned to LSV2 were fully counted. Only contigs ≥200nt were considered for both the alignments and the total assembly size.
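The summary statistics in panels (H) through (J) can be computed from a list of alignment intervals on the reference. The sketch below is an illustration, not the authors' code: it assumes half-open `(start, end)` coordinates on the PRICE LSV2 contig, uses the conventional N50 definition (the largest length L such that segments of length ≥ L hold at least half the summed length), and all function names are hypothetical.

```python
def footprint_length(intervals):
    """Number of reference positions covered by at least one interval."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before opening a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def redundancy(intervals):
    """Summed aligned lengths divided by the length of their footprint."""
    footprint = footprint_length(intervals)
    if footprint == 0:
        raise ValueError("no aligned positions")
    return sum(end - start for start, end in intervals) / footprint


def n50(lengths):
    """Largest length L such that segments >= L hold half the total length."""
    if not lengths:
        raise ValueError("no lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


alignments = [(0, 300), (200, 500), (1000, 1200)]
print(footprint_length(alignments))             # sensitivity-style footprint
print(redundancy(alignments))                   # 1.0 means no overlap at all
print(n50([e - s for s, e in alignments]))
```

A redundancy of exactly 1.0 means that no reference position is covered by more than one aligned segment; larger values indicate that an assembler emitted overlapping contigs for the same region.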
Figure 3
De novo assembly of the keratin 6A mRNA from a transcriptome dataset. (A) Contigs from each step of a 10-cycle PRICE assembly aligned to the keratin 6A reference sequence (NM_005554.3) by BLASTn (Altschul et al. 1990). The seed sequence is a single 54nt read from the paired-end transcriptome dataset (Arron et al. 2011) (dark blue; see Materials and Methods); the later contigs (purple) include poly-A tail sequence not included in the reference sequence. Left: the total number of output contigs generated in each cycle, shown as a histogram of blue or green bars for cycles that were or were not explicitly targeted (using the –target flag; see Materials and Methods) to the seed sequence, respectively. (B) Read coverage from the 54nt paired-end read dataset determined by mapping to the keratin 6A reference. Units are the number of reads overlapping each nucleotide, averaged across nonoverlapping 10nt windows. Coverage is shown requiring 90% (red) or 100% (orange) nucleotide identity between the read and the reference. (C) Identity of the PRICE contig vs. the reference keratin 6A sequence, as well as the human keratin 6B and 6C isoforms (NM_005555.3 and NM_173086.4, respectively) for nonoverlapping 50nt windows and including the poly-A tail sequence. Bottom, the maximum % identity for each 50nt contig window to the three keratin 6 isoforms.
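The coverage units used in panel (B), reads overlapping each nucleotide averaged across nonoverlapping windows, can be sketched as follows. This is an illustrative reconstruction under stated assumptions: read placements are given as half-open `(start, end)` intervals that lie within the reference, a trailing partial window is dropped, and the function name is hypothetical. The mapping step itself (and its 90% or 100% identity filter) is not shown.

```python
def windowed_coverage(read_intervals, ref_length, window=10):
    """Mean per-base read depth in consecutive nonoverlapping windows."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (ref_length + 1)
    for start, end in read_intervals:
        delta[start] += 1
        delta[end] -= 1
    # A running sum of the difference array gives the depth at each base.
    per_base = []
    depth = 0
    for i in range(ref_length):
        depth += delta[i]
        per_base.append(depth)
    # Average the depth within each full window.
    return [sum(per_base[i:i + window]) / window
            for i in range(0, ref_length - window + 1, window)]


print(windowed_coverage([(0, 10), (5, 15)], ref_length=20))
```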
Figure 4
De novo assembly of the BCBL-1 strain of the KSHV. (A) Read coverage from the 65nt paired-end read dataset determined by mapping to the KSHV reference genome (Rezaee et al. 2006) (NC_009333.1) with BLASTn (Altschul et al. 1990), requiring 90% identity. Units are the number of reads overlapping each nucleotide, averaged across nonoverlapping 100nt windows. (B) Heat maps indicating the percent of nucleotides that are G/C in nonoverlapping 250nt windows (orange), LZW sequence complexity (Welch 1984) of nonoverlapping 250nt sequences (blue; see Materials and Methods), read mappability as determined by mapping every overlapping 65mer from the genome using the same method as in (A) and averaging the coverage over a 100nt window (red), and the minimum overlap between adjacently mapping reads across each 100nt window, measured as the minimum value across all reads with 3′ ends in the window, measuring the maximum overlap with all reads mapped 3′ of the given read (green). (C) Contigs from selected steps of a 65-cycle PRICE assembly aligned to the reference genome. Seed sequences of 65nt are shown as the innermost ring (dark blue), followed by intermediate contigs aligned to the reference genome by BLASTn (Altschul et al. 1990), with the final contigs aligned to the reference genome by the Smith-Waterman method (Smith and Waterman 1981) shown on the outer ring (purple).
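The two sequence-composition tracks in panel (B) can be approximated with a short sketch. The G/C percentage follows directly from the caption. The complexity measure is an assumption: the caption defers its exact definition to Materials and Methods, so the number of codes emitted by a plain LZW parse is used here as a stand-in proxy (fewer codes means a more repetitive window). Function names are hypothetical.

```python
def windows(seq, size):
    """Yield consecutive nonoverlapping windows; a short remainder is dropped."""
    for i in range(0, len(seq) - size + 1, size):
        yield seq[i:i + size]


def gc_percent(window):
    """Percentage of bases in the window that are G or C."""
    return 100.0 * sum(base in "GCgc" for base in window) / len(window)


def lzw_code_count(seq):
    """Number of codes an LZW parse emits for seq (assumed complexity proxy)."""
    dictionary = set(seq)   # start with the single symbols that occur
    codes = 0
    current = ""
    for symbol in seq:
        candidate = current + symbol
        if candidate in dictionary:
            # Keep extending the longest phrase already in the dictionary.
            current = candidate
        else:
            # Emit a code for the known phrase and learn the extended one.
            codes += 1
            dictionary.add(candidate)
            current = symbol
    if current:
        codes += 1          # flush the final phrase
    return codes


genome = "ACGTACGTAAAAAAAAGGGCCCGC"
for w in windows(genome, 8):
    print(w, gc_percent(w), lzw_code_count(w))
```

For the caption's tracks the window size would be 250 nt; the 8 nt windows above are only for a readable demonstration.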

References

    1. Aird D., Ross M. G., Chen W.-S., Danielsson M., Fennell T., et al., 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12: R18.
    2. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J., 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
    3. Ariyaratne P. N., Sung W.-K., 2011. PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27: 167–174.
    4. Arron S. T., Ruby J. G., Dybbro E., Ganem D., Derisi J. L., 2011. Transcriptome sequencing demonstrates that human papillomavirus is not active in cutaneous squamous cell carcinoma. J. Invest. Dermatol. 131: 1745–1753.
    5. Bechtel J. T., Liang Y., Hvidding J., Ganem D., 2003. Host range of Kaposi’s sarcoma-associated herpesvirus in cultured cells. J. Virol. 77: 6474–6481.
