Review
Genomics Proteomics Bioinformatics. 2023 Jun;21(3):427-439.
doi: 10.1016/j.gpb.2023.04.004. Epub 2023 Apr 25.

Recent Advances in Assembly of Complex Plant Genomes

Weilong Kong et al. Genomics Proteomics Bioinformatics. 2023 Jun.

Abstract

Over the past 20 years, tremendous advances in sequencing technologies and computational algorithms have spurred plant genomic research into a thriving era with hundreds of genomes decoded already, ranging from those of nonvascular plants to those of flowering plants. However, complex plant genome assembly is still challenging and remains difficult to fully resolve with conventional sequencing and assembly methods due to high heterozygosity, highly repetitive sequences, or high ploidy characteristics of complex genomes. Herein, we summarize the challenges of and advances in complex plant genome assembly, including feasible experimental strategies, upgrades to sequencing technology, existing assembly methods, and different phasing algorithms. Moreover, we list actual cases of complex genome projects for readers to refer to and draw upon to solve future problems related to complex genomes. Finally, we expect that the accurate, gapless, telomere-to-telomere, and fully phased assembly of complex plant genomes could soon become routine.

Keywords: Assembly algorithm; Complex plant genome; Haplotype-resolved assembly; Sequencing technology; Telomere-to-telomere genome.

Conflict of interest statement

The authors have declared no competing interests.

Figures

Figure 1
The assembly of highly repetitive sequences. A. A collapsed assembly error example in tandem repeats. A tandem repeat containing two copies (R1 and R2) separates unique sequences S1 and S2. B. Chimeric or fragmented assembly errors in long segmental repeats among different chromosomal regions. S1, S2, S3, and S4 indicate unique sequences, and R1 and R2 represent two identical long segmental repeats. C. The impact of sequencing errors on the assembly of highly similar repeats.
Figure 2
The impact of high heterozygosity on genome assembly. A. Consensus sequence assembly of a low-heterozygosity genome. Small-scale variations (such as SNPs) in different haplotype sequences can be aligned during assembly and then assembled into consensus sequences. B. Bubble structures of highly heterozygous genomes. Large-scale structural variations between haplotype sequences disrupt the sequence alignment, forming ‘bubbles’ that represent redundant allelic sequences and fail to collapse into consensus sequences. SNP, single-nucleotide polymorphism.
Figure 3
Challenges of polyploid genome assembly. A. Illustration of chimeric contig assembly errors in an autotetraploid genome, including switch errors and false duplications. B. Incorrect Hi-C clustering of chimeric contigs leads to multiple misassemblies. C. Illustration of collapsed contig assembly errors in an autotetraploid genome. D. The collapsed contig generates Hi-C links with all contigs belonging to four haplotypes, resulting in a superlong and erroneous scaffold. Hap, haplotype; Hi-C, high-throughput/resolution chromosome conformation capture.
Figure 4
Three strategies for identifying redundant contigs. A. With the RD-based strategy, redundant (phased) contigs show approximately one-half of the mapped RD of collapsed or haplotype-fused contigs, because reads are split between the two highly similar redundant contigs. Based on the RD of contigs, phased contigs and collapsed contigs can be accurately identified, and the redundant phased contigs are then filtered out. B. With the WGAC-based strategy, contigs with long-scale alignments to each other are identified as redundant contigs, and only the longer one is retained in the monoploid genome. C. In the K-mer-based strategy, more than 40× Illumina or BGI short reads are first used to build the K-mer data pool. Then, low- and medium-frequency K-mers are mapped to assembled contigs. Redundant contigs share a high proportion of low- and medium-frequency K-mers, and the relatively long contigs are retained in the monoploid genome. RD, read depth; WGAC, whole genome alignment comparison.
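
The RD-based strategy in panel A can be illustrated with a minimal Python sketch. It is not the method of any particular tool: the function name `classify_contigs_by_depth`, the `collapsed_depth` parameter (the expected depth of a haplotype-fused contig), the 25% `tolerance`, and the example depths are all assumptions made for illustration. The sketch only labels contigs by depth. Deciding which phased contigs are redundant copies of one another would still require pairing them, for example by alignment.

```python
def classify_contigs_by_depth(depths, collapsed_depth, tolerance=0.25):
    """Label contigs from their mean mapped read depth (RD).

    depths          -- dict mapping contig name to mean read depth
    collapsed_depth -- assumed RD of a collapsed (haplotype-fused) contig
    tolerance       -- assumed relative window around each expected depth
    """
    half_depth = collapsed_depth / 2.0  # expected RD of a phased contig
    labels = {}
    for name, depth in depths.items():
        if abs(depth - collapsed_depth) <= tolerance * collapsed_depth:
            labels[name] = "collapsed"
        elif abs(depth - half_depth) <= tolerance * half_depth:
            labels[name] = "phased"
        else:
            # Depth fits neither peak (e.g. repeats or low-coverage regions).
            labels[name] = "ambiguous"
    return labels


# Hypothetical depths for four contigs; 60x is the assumed collapsed depth.
example_depths = {"ctg1": 60.0, "ctg2": 31.0, "ctg3": 29.0, "ctg4": 100.0}
labels = classify_contigs_by_depth(example_depths, collapsed_depth=60.0)
print(labels)
```

In practice, the collapsed and half-depth peaks would be estimated from the depth distribution of the assembly rather than supplied by hand, and the tolerance would be tuned to the data.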
