Comparative Study
Sci Rep. 2019 Apr 24;9(1):6480.
doi: 10.1038/s41598-019-42795-6.

A comparative analysis of methods for de novo assembly of hymenopteran genomes using either haploid or diploid samples

Tal Yahav et al. Sci Rep. 2019.

Abstract

Diverse invertebrate taxa, including all 200,000 species of Hymenoptera (ants, bees, wasps and sawflies), have a haplodiploid sex determination system, in which females are diploid and males are haploid. Thus, hymenopteran genome projects can make use of DNA from a single haploid male sample, which is assumed to be advantageous for genome assembly. For the purpose of gene annotation, transcriptome sequencing is usually conducted using RNA from a pool of individuals. We conducted a comparative analysis of genome and transcriptome assembly and annotation methods, using genetic sources of different ploidy: (1) DNA from a haploid male or a diploid female; (2) RNA from the same haploid male or a pool of individuals. We predicted that the use of a haploid male, as opposed to a diploid female, would simplify the genome assembly and gene annotation thanks to the lack of heterozygosity. Using DNA and RNA from the same haploid individual was expected to provide better confidence in transcript-to-genome alignment and to improve the annotation of gene structure in terms of exon/intron boundaries. The haploid genome assemblies proved to be more contiguous, with both contig and scaffold N50 sizes at least threefold greater than those of their diploid counterparts. Completeness evaluation showed mixed results. The SOAPdenovo2 diploid assembly was missing more genes than the haploid assembly. The SPAdes diploid assembly had more complete genes, but a higher level of duplicates and a greatly overestimated genome size. When aligning the two transcriptomes against the male genome, the male transcriptome gave 2-3% more complete transcripts than the pool transcriptome for genes with comparable expression levels in both transcriptomes. However, this advantage disappears in the final results of the gene annotation pipeline that incorporates evidence from homologous proteins. The RNA pool is still required to obtain the full transcriptome, including genes that are expressed in other life stages and castes.
In conclusion, the use of haploid source material for a de novo genome project provides a substantial advantage to the quality of the genome draft, and the use of RNA from the same haploid individual for transcriptome-to-genome alignment provides a minor advantage for genes that are expressed in the adult male.
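
The contiguity comparison above rests on the N50 statistic: the length of the contig (or scaffold) at which the cumulative length, counted from the longest sequence downward, first reaches half of the total assembly length. The following is a minimal sketch of that calculation, not the authors' pipeline. The function name `n50` and the two lists of lengths are hypothetical and chosen only for illustration; they are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # first length that reaches half the assembly
            return length


# Hypothetical lengths (illustrative only, not values from the paper).
toy_haploid = [900, 700, 400, 200, 100]
toy_diploid = [300, 250, 200, 200, 150, 150, 100, 100, 50]

print("toy haploid N50:", n50(toy_haploid))
print("toy diploid N50:", n50(toy_diploid))
print("ratio:", n50(toy_haploid) / n50(toy_diploid))
```

In this toy case the more fragmented list yields the smaller N50, which is the direction of the difference the abstract reports between diploid and haploid assemblies. N50 measures contiguity only; it says nothing about completeness or duplication, which is why the abstract evaluates those separately.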

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Example of SNPs and indels in the worker (diploid) DNA sequence reads mapped against the male (haploid) genome assembly. Each colored row is a read from the diploid sample (red rectangle labeled W). The sequence of the male assembly is shown at the bottom of the figure (green rectangle labeled M). Examples are shown for base substitutions (a,b), deletion (c), and insertion (d).
Figure 2
Fragmented vs. complete transcripts in the comparison of the male and pool transcriptomes. (a) An example of a gene classified by BUSCO as complete in the male and fragmented in the pool (BUSCO gene EOG09370DXT). The coverage data range is normalized to a range of 0–2500 reads per position. (b) An example of a gene classified by BUSCO as complete in the pool and fragmented in the male (BUSCO gene EOG093706PM). The coverage data range is normalized to a range of 0–1000 reads per position.
Figure 3
An example of SNPs in the pool transcriptome. The male genome and transcriptome both have a G at this position, while the pool RNAseq reads have either A or G. The black rectangles highlight multiple additional SNPs.
Figure 4
An example of alternative splicing in the pool and male transcriptomes. Splice junctions are visualized using an IGV Sashimi plot of male and pool transcripts of the same gene. All the splice junctions are on the negative strand of gene EOG093710JH. Arcs represent splicing events. The orange circles show the number of reads split across each splice junction. The height of the bars between arcs represents exon coverage (reads per position).
