Gigascience. 2017 Oct 1;6(10):1-16.
doi: 10.1093/gigascience/gix085.

De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads

Jonas Korlach et al. Gigascience.

Abstract

Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.

Keywords: SMRT Sequencing; brain; de novo genome assembly; language; long reads.

Figures

Figure 1:
Gene completeness within assemblies. (A) Comparison to a set of 248 highly conserved core CEGMA eukaryote genes, using human genes [23], between the Sanger-based zebra finch and Illumina-based Anna's hummingbird references and their respective PacBio-based assemblies. We used a more stringent cut-off for completeness (>95%) than is usually applied (>90%) because we felt 90% was too permissive: it could allow entire exons to be missing and still call a gene complete. Gene count is the percentage of genes in each of the assemblies that met this criterion. (B) Comparison to a set of 303 single-copy conserved eukaryotic BUSCO genes [26]. Complete is ≥95% complete; fragmented is <95% complete; missing is not found. (C) Comparison to 4915 single-copy conserved genes from the avian BUSCO gene set [26].
Figure 2:
Transcriptome and regulome representation within assemblies. (A) Percentage of RNA-Seq and H3K27Ac ChIP-Seq reads from the zebra finch RA song nucleus mapped back to the zebra finch Sanger-based and PacBio-based genome assemblies. (B) Pie charts of the distributions of the RNA-Seq reads mapped to the zebra finch genome assemblies. (C) Pie charts of the distribution of ChIP-Seq reads mapped to the zebra finch genome assemblies. *P < 0.05; **P < 0.002; ***P < 0.0001; paired t test within animals between assemblies; n = 5 RNA-Seq and n = 3 ChIP-Seq independent replicates from different animals.
Figure 3:
Comparison of EGR1 assemblies. (A) UCSC genome browser view of the Sanger-based zebra finch EGR1 assembly, highlighting (from top to bottom) 4 contigs (light and dark brown) with 3 gaps, GC percent, nucleotide quality score (blue), RefSeq gene prediction (purple), and areas of repeat sequences. (B) Summary comparison of the Sanger-based and PacBio-based zebra finch assemblies, showing that the latter fills the gaps (black) and corrects erroneous reference sequences surrounding the gaps (red). Tick mark is a synonymous heterozygous SNP in the coding region between the primary (1) and secondary (2) haplotypes. Panels (A) and (B) are of the same scale. (C) Comparison of the hummingbird Illumina- and PacBio-based assemblies, showing similar corrections that further lead to a correction in the protein coding sequence prediction (blue). (D) Multiple sequence alignment of the EGR1 protein for the 4 assemblies (2 zebra finch and 2 hummingbird) in (B) and (C), showing corrections to the Illumina-based hummingbird protein prediction by the PacBio-based assembly.
Figure 4:
Comparison of DUSP1 assemblies. (A) UCSC genome browser view of the Sanger-based zebra finch DUSP1 assembly, highlighting 4 contigs with 3 gaps, GC percent, nucleotide quality score, Blat alignment of the NCBI gene prediction (XP_002193168.1, blue), and repeat sequences. (B) Resolution of the region by the PacBio-based zebra finch assembly, filling the gaps (black) and correcting erroneous reference sequences in repeat regions (red) and gene predictions (blue). Panels (A) and (B) are of the same scale. (C) Resolution and correction to the hummingbird Illumina-based assembly with the PacBio-based assembly (same color scheme as in (B)). (D) Multiple sequence alignment of the DUSP1 protein for the 4 assemblies in (B) and (C), showing numerous corrections to the Sanger-based and Illumina-based protein predictions by both PacBio-based assemblies.
Figure 5:
Comparison of FOXP2 assemblies. (A) UCSC genome browser view of the Sanger-based zebra finch FOXP2 assembly, highlighting 10 contigs with 9 gaps, GC percent, nucleotide quality score, RefSeq gene prediction, and repeat sequences. (B) Table showing the number of resolved and corrected erroneous base pairs in the gaps by the PacBio-based primary and secondary haplotype assemblies; the asterisk indicates differences between haplotypes. (C) Dot plot of the Sanger-based reference (x-axis) and the PacBio-based primary assembly (y-axis) corresponding to the 3 GC-rich region gaps immediately upstream and surrounding the first exon of the FOXP2 gene. (D) Schematic summary of corrections to the 3 gaps shown in (C) in the 2 haplotypes of the PacBio-based assembly. The protein coding sequence alignments are in Figure S13A.
Figure 6:
Comparison of SLIT1 assemblies. (A) UCSC genome browser view of the Sanger-based zebra finch SLIT1 assembly, highlighting 15 contigs with 14 gaps, GC percent, nucleotide quality score, NCBI SLIT1 gene prediction (XP_012430014.1, blue), and repeat sequences. Red circles are gaps that correspond to the missing exon 1 and part of the missing exon 35, respectively. (B) Multiple sequence alignment comparison of the SLIT1 protein for the 4 assemblies compared, including the 2 different haplotypes from the PacBio-based zebra finch assembly (rows 2 and 3).

References

    1. Hillier LW, Miller W, Birney E et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004;432:695–716.
    2. Warren WC, Clayton DF, Ellegren H et al. The genome of a songbird. Nature 2010;464:757–62.
    3. Shi Z, Luo G, Fu L et al. miR-9 and miR-140-5p target FoxP2 and are regulated as a function of the social context of singing behavior in zebra finches. Journal of Neuroscience 2013;33:16510–21.
    4. Pfenning AR, Hara E, Whitney O et al. Convergent transcriptional specializations in the brains of humans and song-learning birds. Science 2014;346:1256846.
    5. Koepfli K-P, Paten B, O’Brien SJ et al. The genome 10K project: a way forward. Ann Rev Anim Biosci 2015;3:57–111.
