This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2025 Oct 14:2025.10.14.682431.

doi: 10.1101/2025.10.14.682431.

The complete genome of a songbird

Giulio Formenti¹, Nivesh Jain¹, Jack A Medico¹, Marco Sollitto¹,², Dmitry Antipov³, Suziane Barcellos⁴, Matthew Biegler⁴, Inês Borges⁵, J King Chang⁶, Ying Chen¹, Haoyu Cheng⁷, Helena Conceição⁸, Matthew Davenport⁴, Lorraine De Oliveira⁸, Erick Duarte¹, Gillian Durham⁹, Jonathan Fenn¹⁰,¹¹, Niamh Forde¹²,¹³, Pedro A Galante⁸, Kenji Gerhardt¹⁴, Alice M Giani¹⁵,¹⁶, Simona Giunta¹⁷, Juhyun Kim³, Aleksey Komissarov¹⁸, Bonhwang Koo¹, Sergey Koren³, Denis Larkin¹⁹, Chul Lee⁴, Heng Li²⁰,²¹, Kateryna Makova²², Patrick Masterson²³, Terence Murphy²³, Kirsty McCaffrey¹, Rafael L V Mercuri⁷, Yeojung Na¹⁴, Mary J O'Connell¹⁰,¹¹, Shujun Ou¹⁴, Adam Phillippy³, Marina Popova¹⁸, Arang Rhie³, Francisco J Ruiz-Ruano⁵,²⁴, Simona Secomandi¹,⁴, Linnéa Smeds²², Alexander Suh⁵,²⁴, Tatiana Tilley¹, Niki Vontzou⁵,²⁵, Paul D Waters⁶, Jennifer Balacco¹, Erich D Jarvis¹,⁴,⁹

Affiliations

¹ The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, 10065, USA.
² Department of Biology, University of Florence, 50019, Sesto Fiorentino, FI, Italy.
³ Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA.
⁴ The Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, 10065, USA.
⁵ Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change, Zoologisches Forschungsmuseum A. Koenig, Adenauerallee 160, D-53113 Bonn, Germany.
⁶ Evolution & Ecology Research Centre, School of Biotechnology and Biomolecular Sciences, Faculty of Science, University of New South Wales, Sydney, NSW 2052, Australia.
⁷ Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, 06520, USA.
⁸ Centro de Oncologia Molecular, Hospital Sirio Libanes, Sao Paulo, 01308-050, Brazil.
⁹ Field Research Center, The Rockefeller University, Millbrook, NY, 12545, USA.
¹⁰ Computational and Molecular Evolutionary Biology Group, School of Life Sciences, Faculty of Medicine and Health Sciences, University of Nottingham, Nottingham, NG7 2RD, UK.
¹¹ Division of Evolution, Infection and Genomics, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, M13 9PT, UK.
¹² Discovery and Translational Sciences Department, Leeds Institute of Cardiovascular and Metabolic Medicine, School of Medicine, University of Leeds, Leeds, UK.
¹³ Centre for Reproductive Health, Institute for Regeneration and Repair, University of Edinburgh, 5 Little France Drive, Edinburgh EH16 4UU, UK.
¹⁴ Department of Molecular Genetics, The Ohio State University, Columbus, OH, 43210, USA.
¹⁵ Nash Family Department of Neuroscience & Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.
¹⁶ Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.
¹⁷ Laboratory of Genome Evolution, Department of Biology & Biotechnologies Charles Darwin, University of Rome "La Sapienza", Rome, 00185, Italy.
¹⁸ Aglabx, Paphos, Cyprus.
¹⁹ Department of Comparative Biomedical Sciences, Royal Veterinary College, University of London, 4 Royal College St, London, NW1 0TU, UK.
²⁰ Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.
²¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.
²² Department of Biology, Penn State University, University Park, Pennsylvania, 16802, USA.
²³ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
²⁴ Bonn Institute for Organismal Biology - Animal Biodiversity, University of Bonn, Germany.
²⁵ School of Biological Sciences, University of East Anglia, Norwich, UK.

PMID: 41480146
PMCID: PMC12754717
DOI: 10.1101/2025.10.14.682431

Giulio Formenti et al. bioRxiv. 2025.

Abstract

Bird genomes are the smallest among amniotes, but remain challenging to assemble due to their structural complexity. This study presents the first fully phased, diploid, telomere-to-telomere (T2T) reference genome for the zebra finch (Taeniopygia guttata), a model organism for neuroscience and evolutionary genomics. Combining multiple sequencing strategies resulted in closing nearly all gaps, adding ~90 Mbp of previously missing sequence (7.8%). This includes T2T assemblies for all microchromosomes, including dot chromosomes, and the previously almost entirely missing chr16. The T2T genome is comprehensively annotated for genes, repeats, structural variants, and long-read methylation calls. Complete centromeric structures were assembled and annotated along with kinetochore binding sites. Relative to the previous high-quality reference of the Vertebrate Genomes Project, 2,778 (8.51%) previously unassembled or unannotated genes were identified, of which 9% overlap with segmental duplications. This first complete genome of a songbird, now the new public reference, illuminates avian genome architecture and function.

Conflict of interest statement

S.K. has received travel funds to speak at events hosted by Oxford Nanopore Technologies. E.D.J. is on the Cell Press Advisory Board.

Figures

**Figure 1. Assembly of a complete zebra finch genome.**
(A) Final assembly strategy used for the bTaeGut7 reference diploid genome. (B) Diploid Circos plot, illustrating the gaps in the previous bTaeGut4 reference (track A), repeat distribution with telomere and centromere positions (track B), minisatellite repeat distribution (track C), GC content (track D), methylation levels (track E), gene distribution (track F, coding genes and long non-coding RNAs coloured differently), and HiFi and ONT coverage distribution (track G). Synteny between the paternal (pat) and maternal (mat) haplotypes is shown as links in the center of the plot.

**Figure 2. Improvements over bTaeGut1.4 and other previous references.**
(A) Telomere and gap completeness across zebra finch references (Supplementary Table 6). Displayed are improvements in total telomeres found and in chromosomes classified by telomere and gap completeness. (B) Progression of zebra finch reference genome completeness over time relative to the T2T assembly. (C) Distribution of genomic features within previously unassembled regions across chromosomes, shown for each haplotype and annotated as protein-coding genes, lncRNAs, intergenic regions, overlapping features (regions with 2 or more overlapping features), and pseudogenes. (D) Total number of bases inside previously unassembled regions for the same annotated features. (E,F) Stacked bar plot showing the composition of repeat classes annotated within previously unassembled regions, and enrichment plot indicating the relative abundance of repeats on each chromosome. (G,H) Distribution of abundance and divergence of satDNA families in the maternal and paternal haplotypes, respectively. Divergence is calculated as the Kimura two-parameter distance to a consensus sequence. (I) Satellite monomer length in relation to its proportion in the genome.
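
The Kimura two-parameter divergence used in panels G and H can be illustrated with a minimal sketch. It assumes the repeat copy and its consensus are already aligned to the same length; the function name `k2p_distance` and the choice to skip gaps and ambiguous bases are illustrative assumptions, not the authors' pipeline.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq, consensus):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Hypothetical helper for illustration only. Columns containing gaps or
    ambiguous bases are skipped (an assumption of this sketch).
    """
    if len(seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A<->G or C<->T is a transition; any other mismatch is a transversion
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return math.inf  # too divergent for the model to give a finite value
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

Under this model transitions and transversions are weighted separately, so a copy with a single transition in ten sites gets a slightly different distance than one with a single transversion.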

**Figure 3. Characterization of centromeric repeats.**
(A) Overview of centromeric satellite localizations. (B) Tgut716A and Tgut191A presence and length in all chromosomes. (C) Colocalization of putative centromeres with cytogenetic maps. The axes represent the distance from Tgut716A and Tgut191A of centromeric and distal PCR primers from a previous study. (D) Methylation colocalization. Points show the average methylation for individual repeat arrays. For each chromosome, the “best” array, i.e. the one with the lowest average methylation, is also highlighted. (E) Kimura 2-parameter (K2P) distances of Tgut716A and Tgut191A with respect to their chromosome consensus. The filled data points represent repeats within the centromere (Tgut716A) and PAR (Tgut191A). (F) Side-by-side comparison of PAR regions in chrZ and chrW. The image shows an annotated sequence identity heatmap (top) and the corresponding location in the Hi-C map (bottom), using StainedGlass and PretextView, respectively. Tgut716A, Tgut191A, and unique sequences are indicated using different colors. (G) Sequence identity heatmaps of 600 kbp in the p-arm of the maternal chr35 containing multiple ITS tandem arrays interleaved with duplicated segments and Tgut191A repeats.

**Figure 4. Chromosome architecture in the zebra finch.**
(A) In the TCHEST model, centromere satellite repeats (Tgut716A) are directly adjacent to the telomere, and are followed by heterochromatin. Heterochromatin and euchromatin can either alternate in the chromosome body (relaxed TCHEST) or exist in only two organized sections (dot chromosomes). A subtelomeric secondary satellite array (Tgut191A) can optionally be present flanking euchromatin. Macrochromosomes often show the two satellite arrays juxtaposed or interleaved, which may be suggestive of different patterns of chromosome fusion followed by rearrangements. (B) Enrichment patterns in 10 kb non-overlapping windows of A and B compartments on dot chromosomes for genes, GC content, methylation, repeats, satellites and non-B DNA motifs. ‘***’ denotes p<0.001 adjusted with FDR. (C) Correlation between minisatellite repeats and HiFi coverage in dot chromosomes’ A and B compartments. Each point represents a 10 kb window coloured by its A/B assignment. Density contours are shown for each compartment to highlight data distribution. The black line represents a regression fit across all data points. (D) Non-B DNA motif density in relation to chromosome size. (E) Non-B DNA motif enrichment in and between genes in A and B compartments of dot chromosomes compared to the genome-wide average. The dashed red line at y=1 indicates no enrichment relative to the average motif densities in the genome. “ALL” is all non-B motifs considered together. (F) Scatter plot of chromosome size (Mbp) and telomere length (kbp). The box plots compare telomere lengths between chromosome types based on their TCHEST model categories. Hollow datapoints represent outliers.

**Figure 5. Characterization of structural variants.**
(A) Heterozygosity, measured as the total number of SNPs per 1 kbp, across all chromosomes with centromeres labelled as red triangles. (B) Maternal and paternal haplotype size difference for each chromosome, colored by chromosome type. (C) Sequence length, measured as a percentage of the size of the maternal chromosome, identified as syntenic regions (SYN), not aligned between the two haplotypes (NOTAL), inversions (INV), inverted translocations (INVTR), translocations (TRANS) and duplications (DUP). Each chromosome is colored by chromosome type. (D) Chr5 for the maternal and paternal haplotypes, as displayed in PretextView and SVbyEye. (E) Schematic of translocations and inversions between the maternal and paternal haplotypes. The left side shows 1 inversion and 3 translocations and the right side 2 inversions and 2 translocations. (F) Comparison of chrZ between bTaeGut1.4 and bTaeGut7, displayed using SVbyEye and NCBI’s Comparative Genome Browser to highlight the genes present in the inversions. Red stars indicate the location of the tangles present on chrZ.
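
The heterozygosity measure in panel A (SNPs per 1 kbp) amounts to counting variants in fixed windows along each chromosome. The sketch below shows one way to do this, assuming 0-based SNP coordinates and non-overlapping windows; the function name `snps_per_kbp` and the rescaling of a shorter final window are illustrative assumptions, not a description of how the figure was produced.

```python
def snps_per_kbp(snp_positions, chrom_length, window=1000):
    """SNP density (SNPs per 1 kbp) in non-overlapping windows.

    Hypothetical helper: positions are 0-based, and the last window may be
    shorter than `window`, in which case its count is rescaled to 1 kbp.
    """
    if chrom_length <= 0 or window <= 0:
        raise ValueError("chrom_length and window must be positive")

    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in snp_positions:
        if not 0 <= pos < chrom_length:
            raise ValueError(f"position {pos} is outside the chromosome")
        counts[pos // window] += 1

    densities = []
    for i, count in enumerate(counts):
        start = i * window
        end = min(start + window, chrom_length)
        densities.append(count * 1000 / (end - start))
    return densities


# Toy example: four SNPs on a 2.5 kbp sequence
print(snps_per_kbp([10, 500, 1500, 2400], 2500))
```

The same windowing idea applies to any per-window track, with only the counted feature changing.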

**Figure 6. Zebra finch genome annotation.**
(A) Gene count versus chromosome size for protein-coding genes, pseudogenes, lncRNAs, and retrocopies. Scatter plots show the relationship between chromosome size (log-transformed) and the number of annotated genes for each category. Chromosomes are colored by their classification as micro/macro/dot chromosomes. Each point represents a chromosome, with labels indicating chromosome IDs. Dashed lines represent linear regression trends with 95% confidence intervals. (B) Distribution of gene lengths, mean intron lengths, and mean exon lengths across all chromosomes in the diploid assembly. Y-axis values are plotted on a log10 scale and measured in base pairs (bp). (C) PCA plot showing the clustering of chromosomes based on their codon usage patterns. (D) Heatmap of the codon usage distribution across the chromosomes, with color intensity indicating the frequency of each codon. Lighter shades (green/yellow) correspond to higher usage, while darker shades reflect lower usage. (E) The percentage of segmental duplications (SDs) for each chromosome, with chromosomes grouped by their classification. (F) Barplots show the significantly enriched Gene Ontology (GO) terms in the duplicated regions of the genome for each category: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The x-axis represents the statistical significance of enrichment (−log(FDR p-value)), while the bar color reflects the magnitude of observed-minus-expected gene counts (Obs–Exp), with warmer colors indicating greater enrichment.
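
The x-axis in panel F is a negative log of an FDR-adjusted p-value. The caption does not say which FDR procedure or log base was used, so the sketch below assumes the Benjamini-Hochberg adjustment and base 10 purely for illustration; `benjamini_hochberg` and `neg_log10` are hypothetical helper names.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: the source only says "FDR"; BH is used here as a common
    choice, not as the authors' confirmed method.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p):
    """-log10 of a p-value (base 10 is an assumption of this sketch)."""
    return -math.log10(p)


raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p={p:.3f}  FDR={q:.3f}  -log10(FDR)={neg_log10(q):.2f}")
```

Larger values on this scale correspond to smaller adjusted p-values, which is why the most significant GO terms appear as the longest bars.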
