Genome Biol Evol. 2021 Aug 3;13(8):evab138.
doi: 10.1093/gbe/evab138.

Long Reads Are Revolutionizing 20 Years of Insect Genome Sequencing


Scott Hotaling et al. Genome Biol Evol.

Abstract

The first insect genome assembly (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 insect species representing 20 orders. In this study, we analyzed the most-contiguous assembly for each species and provide a "state-of-the-field" perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technologies. Relative to species richness, genomic efforts have been biased toward four orders (Diptera, Hymenoptera, Collembola, and Phasmatodea), Coleoptera are underrepresented, and 11 orders still lack a publicly available genome assembly. The average insect genome assembly is 439.2 Mb in length with 87.5% of single-copy benchmarking genes intact. Most notable has been the impact of long-read sequencing; assemblies that incorporate long reads are ∼48× more contiguous than those that do not. We offer four recommendations as we collectively continue building insect genome resources: 1) seek better integration between independent research groups and consortia, 2) balance future sampling between filling taxonomic gaps and generating data for targeted questions, 3) take advantage of long-read sequencing technologies, and 4) expand and improve gene annotations.

Keywords: Arthropoda; Insecta; Oxford Nanopore; Pacific Biosciences; arthropod genomics; long-read sequencing.

Figures

Fig. 1.
Taxonomic representation, contiguity, and the timeline of availability for the most-contiguous nuclear genome assembly for 601 insect species in GenBank as of November 2020. Only one assembly per named species or subspecies is included. (a) The taxonomic diversity of available insect genome assemblies. Observed versus expected numbers of genome assemblies represent the total number of available assemblies versus the number that would be expected given the proportion of all described insect diversity that each order comprises. Significance was assessed with Fisher’s exact tests. One order is underrepresented (Coleoptera) whereas four orders are overrepresented (Diptera, Hymenoptera, Collembola, Phasmatodea). Eleven orders (light red silhouettes) have no publicly available genome assembly. A breakdown of sequencing technology by order is shown in supplementary figure S1, Supplementary Material online. (b) Genome contiguity versus total assembly length. Contiguity was assessed with contig N50, the mid-point of the contig distribution where 50% of the genome is assembled into contigs of a given length or longer. The inset plot shows a comparison of contig N50 distributions for short-read (n = 365) versus long-read (n = 126) assemblies. Significance was assessed with a Welch’s t-test. A finer-scale breakdown by sequencing technology is shown in supplementary figure S2, Supplementary Material online. (c) The timeline of genome assembly availability for insects according to the GenBank publication date. A steady increase in contiguity is largely precipitated by the rise of long-read sequencing. Labeled in (b) and (c): well-known or outlier genome assemblies in terms of either model status, assembly size, or contiguity. Groups of species in the same genus are labeled with black circles. (d) Contig N50 by taxonomic group. Generally, taxa were grouped into orders except when 10 or more assemblies were available for a lower taxonomic level (family or genus). As in (b) and (c), each point represents a single insect genome assembly.
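The contig N50 definition in the caption can be illustrated with a short sketch. This is not the authors' pipeline; the function name `contig_n50` and the toy contig lengths are hypothetical and chosen only to show the calculation.

```python
def contig_n50(lengths):
    """Return the contig N50: the length L such that contigs of
    length >= L together cover at least half of the assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating assembled bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths in bp (hypothetical, not taken from the paper).
toy_contigs = [100, 200, 300, 400, 500]
print(contig_n50(toy_contigs))  # 400: 500 + 400 = 900 >= 1500 / 2
```

Because N50 depends only on the sorted contig lengths, one very long contig can raise it sharply, which is one reason long-read assemblies tend to score much higher on this metric.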
Fig. 2.
Variation in assembly size and BUSCO gene completeness across Insecta. (a) Assembly size for all insects, grouped by order then family. To improve visualization, the upper display limit was set to 2.8 Gb. Four genome assemblies exceeded this value and are labeled with gray text (in Gb). Taxa silhouettes were either handmade or taken from PhyloPic (http://phylopic.org, last accessed July 15, 2021). (b) BUSCO results for each insect genome assembly. Each horizontal bar represents one assembly (n = 601 species) and corresponds to the same taxon in the assembly size plot to the left in (a). (c–e) Long-read versus short-read genome assembly comparisons of (c) complete BUSCOs (single and duplicated combined), (d) fragmented BUSCOs, and (e) duplicated BUSCOs only. Significance was assessed with Welch’s t-tests. (f) A comparison of BUSCO completeness versus contig N50. Each point represents the best available assembly for one taxon and groups of taxa in the same genus are labeled with black circles. Unsurprisingly, more contiguous genome assemblies also exhibit greater gene completeness. (g) Longer genes are more likely to be fragmented in insect genome assemblies, regardless of the technology used. However, the correlation between gene length and fragmentation is much stronger for short-read assemblies (Spearman’s ρ: 0.24, P < 2.2e-16) than for long-read assemblies (Spearman’s ρ: 0.08, P = 0.002). Unlike in (c–e), each circle in (g) represents the percent of fragmentation for that BUSCO gene across all long- or short-read assemblies. Thus, each gene is included twice (once for each technology). All 1,367 BUSCO genes in the OrthoDB v.10 Insecta gene set (Kriventseva et al. 2019) were used except one 2.02 kb gene that was missing in >70% of assemblies and subsequently removed from analysis and visualization. BUSCO gene lengths varied from 198 bp to 9.01 kb.
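The captions report Welch’s t-tests for the long-read versus short-read comparisons. As a rough sketch of what that test computes, the snippet below derives the Welch t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t` and the sample values are hypothetical, and any transformation the authors may have applied to the data before testing is not specified here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    # Squared standard error contributed by each sample.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy values (hypothetical, not data from the paper).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 2))
```

Unlike Student’s t-test, this form does not assume the two groups share a variance, which suits comparisons between groups of very different size and spread, such as 365 short-read and 126 long-read assemblies.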
