[Preprint]. 2025 Oct 14:2025.10.14.682431.
doi: 10.1101/2025.10.14.682431.

The complete genome of a songbird

Giulio Formenti et al. bioRxiv.

Abstract

Bird genomes are the smallest among amniotes, but remain challenging to assemble due to their structural complexity. This study presents the first fully phased, diploid, telomere-to-telomere (T2T) reference genome for the zebra finch (Taeniopygia guttata), a model organism for neuroscience and evolutionary genomics. Combining multiple sequencing strategies resulted in closing nearly all gaps, adding ~90 Mbp of previously missing sequence (7.8%). This includes T2T assemblies for all microchromosomes, including dot chromosomes, and the previously almost entirely missing chr16. The T2T genome is comprehensively annotated for genes, repeats, structural variants, and long-read methylation calls. Complete centromeric structures were assembled and annotated along with kinetochore binding sites. Relative to the previous high-quality reference of the Vertebrate Genomes Project, 2,778 (8.51%) previously unassembled or unannotated genes were identified, of which 9% overlap with segmental duplications. This first complete genome of a songbird, now the new public reference, illuminates avian genome architecture and function.

Conflict of interest statement

S.K. has received travel funds to speak at events hosted by Oxford Nanopore Technologies. E.D.J. is on the Cell Press Advisory Board.

Figures

Figure 1. Assembly of a complete zebra finch genome.
(A) Final assembly strategy used for the bTaeGut7 reference diploid genome. (B) Diploid Circos plot illustrating the gaps in the previous bTaeGut4 reference (track A), repeat distribution with telomere and centromere positions (track B), minisatellite repeat distribution (track C), GC content (track D), methylation levels (track E), gene distribution (track F, with coding genes and long non-coding RNAs coloured differently), and HiFi and ONT coverage distribution (track G). Synteny between the paternal (pat) and maternal (mat) haplotypes is shown as links in the center of the plot.
Figure 2. Improvements over bTaeGut1.4 and other previous references.
(A) Telomere and gap completeness across zebra finch references (Supplementary Table 6). Displayed are improvements in the total number of telomeres found and in the number of chromosomes by telomere and gap completeness. (B) Progression of zebra finch reference genome completeness over time relative to the T2T assembly. (C) Distribution of genomic features within previously unassembled regions across chromosomes, including for each haplotype, annotated as protein-coding genes, lncRNAs, intergenic regions, overlapping features (two or more overlapping features), and pseudogenes. (D) Total number of bases inside previously unassembled regions for the same annotated features. (E,F) Stacked bar plot showing the composition of repeat classes annotated within previously unassembled regions, and enrichment plot indicating the relative abundance of repeats on each chromosome. (G,H) Distribution of abundance and divergence of satDNA families in the maternal and paternal haplotypes, respectively. Divergence is calculated as the Kimura two-parameter distance to a consensus sequence. (I) Satellite monomer length in relation to its proportion in the genome.
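The Kimura two-parameter divergence mentioned in panels G and H can be illustrated with a minimal Python sketch. It assumes that each repeat copy has already been aligned to its consensus at equal length; the function name `k2p_distance` and the handling of gaps and saturated comparisons are illustrative choices, not the authors' pipeline.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq: str, consensus: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Hypothetical helper: assumes both strings are already aligned
    and have the same length. Sites with gaps or ambiguous bases
    are skipped.
    """
    if len(seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        return math.inf  # saturated: the K2P correction is undefined
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For example, `k2p_distance("AAAAAAAAAA", "AAAAAAAAAG")` has one transition in ten sites, so P = 0.1 and Q = 0, and the distance is -0.5 * ln(0.8).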
Figure 3. Characterization of centromeric repeats.
(A) Overview of centromeric satellite localizations. (B) Tgut716A and Tgut191A presence and length in all chromosomes. (C) Colocalization of putative centromeres with cytogenetic maps. The axes represent the distance of centromeric and distal PCR primers from a previous study to Tgut716A and Tgut191A. (D) Methylation colocalization. Points show the average methylation for individual repeat arrays. For each chromosome, the “best” array, i.e. the one with the lowest average methylation, is also highlighted. (E) Kimura 2-parameter (K2P) distances of Tgut716A and Tgut191A with respect to their chromosome consensus. The filled data points represent repeats within the centromere (Tgut716A) and PAR (Tgut191A). (F) Side-by-side comparison of PAR regions in chrZ and chrW. The image shows an annotated sequence identity heatmap (top) and the corresponding location in the Hi-C map (bottom), using StainedGlass and PretextView, respectively. Tgut716A, Tgut191A, and unique sequences are indicated using different colors. (G) Sequence identity heatmaps of 600 kbp in the p-arm of the maternal chr35 containing multiple ITS tandem arrays interleaved with duplicated segments and Tgut191A repeats.
Figure 4. Chromosome architecture in the zebra finch.
(A) In the TCHEST model, centromere satellite repeats (Tgut716A) are directly adjacent to the telomere and are followed by heterochromatin. Heterochromatin and euchromatin can either alternate in the chromosome body (relaxed TCHEST) or exist in only two organized sections (dot chromosomes). A subtelomeric secondary satellite array (Tgut191A) can optionally be present flanking euchromatin. Macrochromosomes often show the two satellite arrays juxtaposed or interleaved, which may be suggestive of different patterns of chromosome fusion followed by rearrangements. (B) Enrichment patterns in 10 kb non-overlapping windows of A and B compartments on dot chromosomes for genes, GC content, methylation, repeats, satellites and non-B DNA motifs. ‘***’ denotes p<0.001 adjusted with FDR. (C) Correlation between minisatellite repeats and HiFi coverage in the A and B compartments of dot chromosomes. Each point represents a 10 kb window coloured by its A/B assignment. Density contours are shown for each compartment to highlight data distribution. The black line represents a regression fit across all data points. (D) Non-B DNA motif density in relation to chromosome size. (E) Non-B DNA motif enrichment in and between genes in A and B compartments of dot chromosomes compared to the genome-wide average. The dashed red line at y=1 indicates no enrichment relative to the average motif densities in the genome. “ALL” denotes all non-B motifs considered together. (F) Scatter plot of chromosome size (Mbp) and telomere length (kbp). The box plots compare telomere lengths between chromosome types based on their TCHEST model categories. Hollow data points represent outliers.
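The enrichment scale in panel E, where y=1 means no enrichment relative to the genome-wide average, can be read as a ratio of densities. The following Python sketch shows one plausible way to compute such a ratio; the function name `fold_enrichment`, its parameters, and the numbers in the usage note are hypothetical, and the authors' exact normalization is not stated in the caption.

```python
def fold_enrichment(region_motif_bp: int, region_bp: int,
                    genome_motif_bp: int, genome_bp: int) -> float:
    """Ratio of motif density in a region to the genome-wide motif density.

    Hypothetical helper: a value of 1.0 means the region carries motifs
    at the genome-wide average density; values above 1.0 mean enrichment,
    values below 1.0 mean depletion.
    """
    if region_bp <= 0 or genome_bp <= 0:
        raise ValueError("region and genome sizes must be positive")
    if genome_motif_bp <= 0:
        raise ValueError("genome-wide motif coverage must be positive")

    region_density = region_motif_bp / region_bp
    genome_density = genome_motif_bp / genome_bp
    return region_density / genome_density
```

With made-up numbers, a 10 kb window containing 200 bp of motif sequence in a 100 Mbp genome with 1 Mbp of motif sequence gives `fold_enrichment(200, 10_000, 1_000_000, 100_000_000)`, which is approximately 2, i.e. twice the genome-wide density.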
Figure 5. Characterization of structural variants.
(A) Heterozygosity, measured as the total number of SNPs per 1 kbp, across all chromosomes, with centromeres labelled as red triangles. (B) Maternal and paternal haplotype size difference for each chromosome, colored by chromosome type. (C) Sequence length, as a percentage of the maternal chromosome size, identified as syntenic regions (SYN), not aligned between the two haplotypes (NOTAL), inversions (INV), inverted translocations (INVTR), translocations (TRANS) and duplications (DUP). Each chromosome is colored by chromosome type. (D) Chr5 for the maternal and paternal haplotypes, as displayed in PretextView and SVbyEye. (E) Schematic of translocations and inversions between the maternal and paternal haplotypes. The left side shows one inversion and three translocations; the right side shows two inversions and two translocations. (F) Comparison of chrZ between bTaeGut1.4 and bTaeGut7, displayed using SVbyEye and NCBI’s Comparative Genome Browser to highlight the genes present in the inversions. Red stars indicate the location of the tangles present on chrZ.
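The heterozygosity measure in panel A, SNPs per 1 kbp, can be sketched as a simple windowed count. This Python example assumes 0-based SNP positions and consecutive non-overlapping windows; the function name `snps_per_kbp` and the window layout are assumptions for illustration, since the caption does not specify how windows were defined.

```python
from collections import Counter


def snps_per_kbp(snp_positions, chrom_length: int, window: int = 1000):
    """SNP density per 1 kbp in consecutive non-overlapping windows.

    Hypothetical helper: positions are assumed to be 0-based and
    positions outside [0, chrom_length) are ignored. The last window
    may be shorter than `window`, so its count is rescaled to 1 kbp.
    """
    if chrom_length <= 0 or window <= 0:
        raise ValueError("chrom_length and window must be positive")

    n_windows = -(-chrom_length // window)  # ceiling division
    counts = Counter(
        pos // window for pos in snp_positions if 0 <= pos < chrom_length
    )

    densities = []
    for i in range(n_windows):
        start = i * window
        end = min(start + window, chrom_length)
        densities.append(counts[i] * 1000 / (end - start))
    return densities
```

For a 3,000 bp sequence with SNPs at positions 10, 500, 1500 and 2999, the three 1 kbp windows contain two, one and one SNPs, respectively.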
Figure 6. Zebra finch genome annotation.
(A) Gene count versus chromosome size for protein-coding genes, pseudogenes, lncRNAs, and retrocopies. Scatter plots show the relationship between chromosome size (log-transformed) and the number of annotated genes for each category. Chromosomes are colored by their classification as micro-, macro-, or dot chromosomes. Each point represents a chromosome, with labels indicating chromosome IDs. Dashed lines represent linear regression trends with 95% confidence intervals. (B) Distribution of gene lengths, mean intron lengths, and mean exon lengths across all chromosomes in the diploid assembly. Y-axis values are plotted on a log10 scale and measured in base pairs (bp). (C) PCA plot showing the clustering of chromosomes based on their codon usage patterns. (D) Heatmap of the codon usage distribution across the chromosomes, with color intensity indicating the frequency of each codon. Lighter shades (green/yellow) correspond to higher usage, while darker shades reflect lower usage. (E) The percentage of segmental duplications (SDs) for each chromosome, with chromosomes grouped by their classification. (F) Bar plots showing the significantly enriched Gene Ontology (GO) terms in the duplicated regions of the genome for each category: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The x-axis represents the statistical significance of enrichment (−log(FDR p-value)), while the bar color reflects the magnitude of observed-minus-expected gene counts (Obs–Exp), with warmer colors indicating greater enrichment.
