BMC Genomics. 2018 Sep 4;19(1):651.
doi: 10.1186/s12864-018-5040-z.

Linked read technology for assembling large complex and polyploid genomes

Alina Ott et al. BMC Genomics. 2018.

Abstract

Background: Short read DNA sequencing technologies have revolutionized genome assembly by providing high accuracy and throughput data at low cost. But it remains challenging to assemble short read data, particularly for large, complex and polyploid genomes. The linked read strategy has the potential to enhance the value of short reads for genome assembly because all reads originating from a single long molecule of DNA share a common barcode. However, the majority of studies to date that have employed linked reads were focused on human haplotype phasing and genome assembly.

Results: Here we describe a de novo maize B73 genome assembly generated via linked read technology which contains ~172,000 scaffolds with an N50 of 89 kb that cover 50% of the genome. Based on comparisons to the B73 reference genome, 91% of linked read contigs are accurately assembled. Because it was possible to identify errors with >76% accuracy using machine learning, it may be possible to identify and potentially correct systematic errors. Complex polyploids represent one of the last grand challenges in genome assembly. Linked read technology was able to successfully resolve the two subgenomes of the recent allopolyploid, proso millet (Panicum miliaceum). Our assembly covers ~83% of the 1 Gb genome and consists of 30,819 scaffolds with an N50 of 912 kb.
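
The scaffold N50 values reported above follow the usual assembly statistic: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation is given below; the function name and the example lengths are illustrative assumptions, not code or data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in bp (not data from the study).
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))
```

In practice the lengths would be read from the scaffold FASTA file; note that some tools compute N50 against an estimated genome size rather than the assembly total (often called NG50).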

Conclusions: Our analysis provides a framework for future de novo genome assemblies using linked reads, and we suggest computational strategies that if implemented have the potential to further improve linked read assemblies, particularly for repetitive genomes.

Keywords: Genome assembly; Long molecule sequencing; Polyploid assembly.

Conflict of interest statement

Ethics approval and consent to participate

The maize B73 stock used in this study is derived from that originally obtained from Don Robertson (Iowa State University) as Schnable Lab Ac # 660; the B73 inbred is available from USDA’s National Plant Germplasm service as PI 550473. The Huntsman proso millet variety is available from the USDA NPGS as PI 578074. No permission was necessary to collect the plant samples and no specimens were deposited as vouchers.

Consent for publication

Not applicable.

Competing interests

J.C.S, C.-T.Y., and P.S.S. have equity interests in Data2Bio, LLC and Dryland Genetics, LLC.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Types of assembled contigs and alignments to REF contigs. (a) A contig pair is a pair of contigs which are the only contigs originating from a single scaffold. (b) Some scaffolds contain “N”s that denote scaffolding of contigs from pairs of reads or linked reads with common barcodes. After removal of “N”s, the remaining sequences are termed LR contigs or REF contigs, depending on the origin of the scaffold. Removal of 40 bases from both ends of an LR contig results in a trimmed LR contig. (c) Trimmed or untrimmed LR contigs are aligned to the REF contigs. Alignments are categorized as fully aligned, where the entire contig aligns to a REF contig; alignments with tails, where a region of the LR contig aligns to a REF contig but a region at either or both ends of the LR contig does not align to the REF contig; or uncategorized, where the LR contig extends past the edge of a REF contig. (d) LR contigs with tails are divided into two regions: the aligned region and the tail region. Tails can be removed in silico to generate a set of tail-derived contigs. (e) LR contigs with tails that fully align to a unique location in the genome on the same or a different REF contig are termed chimeric LR contigs.
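
The contig preparation described in the Fig. 1 caption (removing “N”s from scaffolds, then trimming 40 bases from both ends) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, any run of “N” is treated as a single gap, and contigs too short to survive trimming are returned as empty strings.

```python
import re

TRIM = 40  # bases removed from each end, as stated in the Fig. 1 caption


def split_scaffold(scaffold):
    """Split a scaffold into contigs at runs of 'N' (gap) characters."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def trim_contig(contig, trim=TRIM):
    """Remove `trim` bases from both ends of a contig.

    Assumption: a contig of length <= 2 * trim has nothing left after
    trimming, so an empty string is returned for it.
    """
    if len(contig) <= 2 * trim:
        return ""
    return contig[trim:-trim]


# Hypothetical scaffold: two contigs joined by a gap of four Ns.
scaffold = "A" * 50 + "NNNN" + "C" * 100
contigs = split_scaffold(scaffold)
trimmed = [trim_contig(c) for c in contigs]
print([len(c) for c in contigs], [len(t) for t in trimmed])
```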
Fig. 2
Illustration of machine learning methodology. A gene sequence is converted to a state sequence that forms a Markov chain; the Markov chain is encoded using a Probabilistic Finite State Automaton (PFSA); the transition matrix of the PFSA is used as an input to the deep convolutional neural network (CNN) for classifying the gene sequence.
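
To make the first steps of the Fig. 2 caption concrete, the sketch below estimates a first-order Markov transition matrix from a DNA sequence. It rests on assumptions the caption does not confirm: each nucleotide is treated as one state, transitions are counted between adjacent bases, and rows are normalized to probabilities. The paper's actual state definition and PFSA construction may differ, and the CNN step is not shown.

```python
STATES = "ACGT"  # assumed state alphabet; the paper's states may differ


def transition_matrix(sequence, states=STATES):
    """Estimate row-normalized transition probabilities between states.

    Entry [i][j] is the fraction of occurrences of state i that are
    immediately followed by state j. Rows with no observations are
    left as zeros.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    seq = sequence.upper()
    for a, b in zip(seq, seq[1:]):
        # Skip pairs containing symbols outside the alphabet (e.g. N).
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical input sequence, for illustration only.
for row in transition_matrix("ACGTACGGTTAC"):
    print(row)
```

A matrix of this kind has a fixed size regardless of sequence length, which is one plausible reason to use it as the input to a convolutional classifier.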
Fig. 3
Conservation of gene order between the foxtail millet reference genome and pairs of scaffolds from the proso millet linked read assembly spanning the same region. The foxtail millet reference genome is shown in the center panel with genes indicated by gray arrows and protein coding exons by green squares. Proso millet scaffolds are shown above and below the foxtail millet genome. Red and blue lines connect gene regions from the foxtail millet genome with homologous sequence in the respective proso millet scaffolds
Fig. 4
Coverage of the pseudomolecule level assembly of foxtail millet by syntenic proso millet scaffolds. Green horizontal lines indicate each of the nine foxtail millet chromosomes. Boxes in red and blue indicate syntenic regions from individual proso millet scaffolds. Boxes are tiled above (blue) and below (red) in such a way as to avoid double coverage of the foxtail millet genome by multiple scaffolds on the same side (Methods)
