G3 (Bethesda). 2022 Sep 30;12(10):jkac210.
doi: 10.1093/g3journal/jkac210.

Recovering individual haplotypes and a contiguous genome assembly from pooled long-read sequencing of the diamondback moth (Lepidoptera: Plutellidae)

Samuel Whiteford et al. G3 (Bethesda).

Abstract

The assembly of divergent haplotypes using noisy long-read data presents a challenge to the reconstruction of haploid genome assemblies, due to overlapping distributions of technical sequencing error, intralocus genetic variation, and interlocus similarity within these data. Here, we present a comparative analysis of assembly algorithms representing overlap-layout-consensus, repeat graph, and de Bruijn graph methods. We examine how postprocessing strategies attempting to reduce redundant heterozygosity interact with the choice of initial assembly algorithm and ultimately produce a series of chromosome-level assemblies for an agricultural pest, the diamondback moth, Plutella xylostella (L.). We compare evaluation methods and show that BUSCO analyses may overestimate haplotig removal processing in long-read draft genomes, in comparison to a k-mer method. We discuss the trade-offs inherent in assembly algorithm and curation choices and suggest that "best practice" is research question dependent. We demonstrate a link between allelic divergence and allele-derived contig redundancy in final genome assemblies and document the patterns of coding and noncoding diversity between redundant sequences. We also document a link between an excess of nonsynonymous polymorphism and haplotigs that are unresolved by assembly or postassembly algorithms. Finally, we discuss how this phenomenon may have relevance for the usage of noisy long-read genome assemblies in comparative genomics.

Keywords: Plutella xylostella; assembly; haplotype; pool-seq.
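
The k-mer evaluation mentioned in the abstract can be illustrated with a toy version of a copy-number spectrum: each distinct k-mer in the reads is binned by how often it occurs in the reads and by how many times it appears in the assembly. This is only a minimal sketch under stated assumptions, not the authors' pipeline. The function names, the tiny sequences, and the choice to count forward-strand k-mers only (dedicated k-mer tools typically count canonical k-mers) are all illustrative.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only; a simplification)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def spectra_cn(reads, contigs, k):
    """Bin distinct read k-mers by assembly copy number and read multiplicity.

    Returns {assembly_copy_number: Counter({read_multiplicity: n_distinct_kmers})}.
    """
    read_counts = Counter(km for r in reads for km in kmers(r, k))
    asm_counts = Counter(km for c in contigs for km in kmers(c, k))
    table = {}
    for km, multiplicity in read_counts.items():
        copy_number = asm_counts.get(km, 0)  # 0 means absent from the assembly
        table.setdefault(copy_number, Counter())[multiplicity] += 1
    return table


# Hypothetical toy data: three short reads and a single contig.
reads = ["ACGTAC", "ACGTAC", "CGTACG"]
contigs = ["ACGTA"]
table = spectra_cn(reads, contigs, k=3)
for copy_number in sorted(table):
    print(copy_number, dict(table[copy_number]))
```

In a real data set, k-mers at roughly half the homozygous read depth that appear twice in the assembly would suggest retained haplotigs, whereas k-mers at homozygous depth that are absent from the assembly would suggest over-purging.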

Figures

Fig. 1.
Contiguity and BUSCO content of alternative genome assembly methods and the effects of removing putative allelic redundancy. In each panel, “canu,” “flye,” and “wtdbg” refer to the preliminary assemblies produced by each algorithm. “+ purge_dups + HiC” refers to these same assemblies with the additional application of the purge_dups program followed by HiC scaffolding, or of Haplomerger2 followed by HiC scaffolding. a) Differences in overall contig size and contiguity between the different methods. The dotted curve describes a previously published reference genome (accession: GCA_000330985.1). The dashed straight line indicates the genome size from an independent flow cytometry estimate (Baxter et al. 2011). b) Overall BUSCO scores from a database of 5,286 genes. BUSCO scores from the aforementioned accession are also included. c) The relationships of genes within these sets. Groups of genes are colored by BUSCO score in the initial assembly. BUSCO genes that are single copy and complete in all assemblies are omitted to emphasize differences between assemblies.
Fig. 2.
A k-mer-based validation of the alternative genome assembly methods and the effects of removing putative allelic redundancy. a) An example of stacked k-mer distributions subdivided by assembly representation (spectra-cn plot) and an overlay of the modeled contributions of sequencing errors, heterozygous content, and homozygous content (dotted lines from left to right, respectively). b) The spectra-cn plots for each of the assembly versions. c) The number of k-mers present in the intersections between the modeled k-mer content distributions and the individual assembly coverage categories present in the spectra-cn plots.
Fig. 3.
Quantifying divergence between duplicated BUSCO genes. a) The distribution of πN/πS scores for duplicated (N. copies = 2) BUSCO genes remaining after the application of purge_dups or Haplomerger2. b) An alignment-free quantification of the dissimilarity of intronic and exonic sequence between the same duplicated BUSCO genes (see Methods for details). Panels labeled “Tandem” indicate that the BUSCO copies were found on the same assembly contig, whereas “Unique” indicates that the copies were found on different assembly contigs.
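
The “Tandem” versus “Unique” labels in Fig. 3 reduce to a simple rule: both copies on one contig, or on two different contigs. The sketch below applies that rule to a list of (gene, contig) hits. It is an illustration only: the input layout, the identifiers, and the function name are hypothetical and do not reflect the authors' code or the BUSCO output format.

```python
from collections import defaultdict


def classify_duplicates(hits):
    """Label genes with exactly two copies as 'Tandem' or 'Unique'.

    hits: iterable of (busco_id, contig_name) pairs, one pair per gene copy.
    Genes with any other number of copies are skipped.
    """
    contigs_by_gene = defaultdict(list)
    for busco_id, contig in hits:
        contigs_by_gene[busco_id].append(contig)

    labels = {}
    for busco_id, contigs in contigs_by_gene.items():
        if len(contigs) != 2:
            continue  # only N. copies = 2 is considered, as in the caption
        same_contig = contigs[0] == contigs[1]
        labels[busco_id] = "Tandem" if same_contig else "Unique"
    return labels


# Hypothetical hits: geneA duplicated on one contig, geneB split across two,
# geneC present as a single copy.
hits = [
    ("geneA", "ctg1"), ("geneA", "ctg1"),
    ("geneB", "ctg1"), ("geneB", "ctg7"),
    ("geneC", "ctg2"),
]
labels = classify_duplicates(hits)
print(labels)
```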

References

    1. Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature. 2020;587(7833):246–251. doi:10.1038/s41586-020-2871-y.
    2. Azevedo L, Serrano C, Amorim A, Cooper DN. Trans-species polymorphism in humans and the great apes is generally maintained by balancing selection that modulates the host immune response. Hum Genomics. 2015;9(1):4–9. doi:10.1186/s40246-015-0043-1.
    3. Baxter SW, Davey JW, Johnston JS, Shelton AM, Heckel DG, Jiggins CD, Blaxter ML. Linkage mapping and comparative genomics using next-generation RAD sequencing of a non-model organism. PLoS One. 2011;6(4):e19315. doi:10.1371/journal.pone.0019315.
    4. Charlesworth D, Willis JH. The genetics of inbreeding depression. Nat Rev Genet. 2009;10(11):783–796. doi:10.1038/nrg2664.
    5. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–175. doi:10.1038/s41592-020-01056-5.