Nat Ecol Evol. 2022 Jul;6(7):1007-1023.
doi: 10.1038/s41559-022-01771-6. Epub 2022 Jun 9.

A phylogenetic and proteomic reconstruction of eukaryotic chromatin evolution

Xavier Grau-Bové et al. Nat Ecol Evol. 2022 Jul.

Abstract

Histones and associated chromatin proteins have essential functions in eukaryotic genome organization and regulation. Despite this fundamental role in eukaryotic cell biology, we lack a phylogenetically comprehensive understanding of chromatin evolution. Here, we combine comparative proteomics and genomics analysis of chromatin in eukaryotes and archaea. Proteomics uncovers the existence of histone post-translational modifications in archaea. However, archaeal histone modifications are scarce, in contrast with the highly conserved and abundant marks we identify across eukaryotes. Phylogenetic analysis reveals that chromatin-associated catalytic functions (for example, methyltransferases) have pre-eukaryotic origins, whereas histone mark readers and chaperones are eukaryotic innovations. We show that further chromatin evolution is characterized by expansion of readers, including capture by transposable elements and viruses. Overall, our study infers detailed evolutionary history of eukaryotic chromatin: from its archaeal roots, through the emergence of nucleosome-based regulation in the eukaryotic ancestor, to the diversification of chromatin regulators and their hijacking by genomic parasites.

Conflict of interest statement

Declaration of interests

The authors declare no competing interests.

Figures

Extended Data Fig. 1. Histone classification and evolution.
a, Primary and secondary alignments of histone-fold-containing proteins classified as canonical H2A, H2B, H3 and H4, based on identity to reference sequences in HistoneDB. Pie plots represent the number of alignments to HistoneDB-annotated sequences, for the entire dataset (prokaryotic, eukaryotic and viral sequences, large pie plots in the inset) and the eukaryotic subset (smaller plots in the inset). For those proteins that align to more than one canonical histone or major variant (macroH2A, H2A.Z or cenH3), the scatter plots represent the relative identity between the primary (horizontal axis) and secondary alignment(s) (vertical axis). b, Aggregated counts of histone gene pairs, classified according to histone type and orientation. c, Presence of histone variants (left) and number of collinear pairs of histone-encoding genes (right) per species, classified according to their histone types and relative orientation (head-to-head, hh; head-to-tail, ht; and tail-to-tail, tt). Source data available in Supplementary Data 2. Histone variant classification is based on the highest-scoring HMM profile from HistoneDB. Asterisk colors in the macroH2A column indicate species where histone-less Macro domains orthologous to the macroH2A genes are found (see panel e). Lighter colors in the variant classification indicate ambiguously classified histones (i.e. cases in which the highest-scoring HMM profile exhibited a low bitscore, defined as a probability below 0.05 in the profile-wise distribution function of scaled bitscores; or cases in which the first-to-second ratio between high-scoring profiles was below 1.01). d, Alignments of putatively conserved histone N-tails in archaea. Conserved amino acids are color-coded according to chemical properties. Dots next to species names are color-coded according to taxonomy (same as Fig. 2c). e, Phylogenetic analysis of the Macro motif of macroH2A histones across eukaryotes, highlighting the macroH2A ortholog group (green), and, within this group, Macro-containing genes lacking histone domains (orange), and their protein domain architectures.
Extended Data Fig. 2. Histone post-translational modifications.
a, Proteomics detection coverage (% of amino acids), number of hPTMs and number of hPTMs per covered position, for the best-covered histone in each species in our proteomics survey. b, Number of samples in which each histone-matching peptide with post-translational modifications (peptide spectral matches defined by Proteome Discoverer) has been identified, per species. For each species, we report the percentage of modified peptides found in more than one replicate. c, Number of samples in which each histone-matching modified peptide has been identified, across all the samples from this study. The three pie charts represent these distributions for all hPTMs, acetylations, and methylations. d, Evidence of hPTM conservation in the major histone variants H2A.Z and macroH2A (conserved positions only), as well as any position in the linker histones H1.
Extended Data Fig. 3. Gene family counts.
a-c, Number of taxa within each lineage that contain chromatin-associated genes, for archaeal, bacterial (per phylum) or viral (per family) genomes. Numbers indicate the exact number of taxa. d, Number of genes encoding core domains that define chromatin-associated gene families per eukaryotic genome/transcriptome. Numbers indicate the exact number of proteins.
Extended Data Fig. 4. Evolutionary reconstruction and domain architecture conservation.
a, Species tree of eukaryotes used in the ancestral reconstruction analysis, with branch lengths calibrated to the gain/loss rates of Pfam domains (see Methods). Available in Supplementary Data 1. b, Conservation of archetypical protein domain architectures across orthogroups, in acetylases, deacetylases, methyltransferases, demethylases, remodellers and chaperones. In each heatmap, we indicate the fraction of genes within an orthogroup (rows) that contain a specific protein domain (columns). Domains in bold are catalytic (black) or reader (purple) functions. At the right of each heatmap, we summarize the presence/absence profile of each orthogroup across eukaryotic lineages (as listed in Fig. 1a).
Extended Data Fig. 5. Evolution of the hPTM reader toolkit.
a, Pie plot representing the number of genes classified as part of the catalytic (acetylases, deacetylases, methyltransferases, demethylases, remodellers or chaperones) or reader families, or as both. The barplot at the right shows the most common reader domains in genes classified with both reader and catalytic functions. b, Pie plot representing the number of reader domain-encoding genes classified according to whether they contain one type of reader domain (e.g., PHD) or more than one (e.g., PHD + PWWP). The barplot at the right shows the most common combinations of reader domains among genes with multiple reader domains. c, Summary of gene family gains per reader family, with example cases highlighted in selected nodes. Node size is proportional to number of gains at 90% probability.
Extended Data Fig. 6. Transposon-chromatin gene fusions.
a, Number of candidate fusion genes classified by the level of gene model validation evidence, based on contiguity of the gene model over the genome assembly (i.e. lack of poly-N stretches in the genomic region between the TE- and chromatin-associated domains), evidence of expression, and evidence of contiguous expression (see inset at the right). b, Summary of candidate gene fusions, divided by chromatin-associated gene family. For each gene, we indicate its similarity to known TE families, the presence of TE-associated domains, the evidence of gene model validity, and information on its gene structure (whether it is monoexonic or located in a cluster with other fusion genes). Source data available in Supplementary Data 6. c, Number of species with at least one valid fusion, divided by gene family. d, Mapping positions of RNA-seq reads supporting candidate gene-transposon fusions (selected examples from Fig. 5e). For each fusion, we show reads spanning the region along the spliced transcript that fully covers the transposon-associated domains (highlighted in green), the chromatin-associated domains, and the inter-domain region. Uninterrupted stretches of mapped positions between domains indicate the validity of a domain co-occurrence. For clarity, reads mapping entirely within a single domain have been excluded from this visualization.
Extended Data Fig. 7. Chromatin proteins in viruses.
a-c, Selected gene trees highlighting examples of eukaryotic- and prokaryotic-like viral homologs. d, Number of viral genes of each chromatin-associated gene family, classified according to their closest neighbours from cellular clades in gene tree analyses based on phylogenetic affinity scores (see Methods). Within each gene family, viral sequences are classified according to their Pfam domain architecture; the most common architecture is single-domain in most gene families, except for remodellers and BIR readers. e, As in d, but classifying viral genes according to their phylogenetic affinity to eukaryotic orthology groups. Source data available in Supplementary Data 6.
Figure 1. Diversity of post-translational modifications in eukaryotic canonical and variant histones.
a, Eukaryotic taxon sampling used in this study. Colored dots indicate the number of species used in the comparative histone proteomics reconstruction, with solid dots indicating new species added in this analysis. Numbers in brackets indicate the number of genomes/transcriptomes used in the comparative genomics analyses. Dashed lines indicate uncertain phylogenetic relationships. The complete list of sampled species is available in Supplementary Data 1. Silhouettes adapted from http://phylopic.org/. b, Networks of pairwise protein similarity between histone protein domains in eukaryotes, archaea and viruses. Each node represents one histone domain, colored according to its best alignment in the HistoneDB database (see Methods). Edges represent local alignments (bitscore ≥ 20). c, Schematic representation of the hPTM proteomics strategy employed in this study. d, Conservation of hPTMs in eukaryotic histones. hPTM coordinates are reported according to the amino-acid position in human orthologs (if conserved). In H2A and H2B, question marks indicate the presence of hPTMs in stretches of lysine residues of uncertain homology. In species with previously reported hPTMs, we further indicate which variants were also identified in our reanalysis. Only positions with hPTMs conserved in more than one species are reported (full table and consensus alignments available in Supplementary Data 3). e, Maximum likelihood phylogenetic trees of the connected components in panel b, corresponding to eukaryotic histones (H3, H4, H2A, H2B). Canonical histones included in panel d and variant histones detected are highlighted in red. hPTMs detected in non-canonical histones are indicated. Bottom, distributions of pairwise phylogenetic distances between all proteins in each gene tree. Violin plots above each distribution represent the distribution of distances between reference histones present in the HistoneDB database and histones with proteomic evidence included in our study, for each of the main canonical (H3, H4, H2A, and H2B) and variant histones (H2A.Z and macroH2A). Dots in the violin plot distributions represent the median.
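The network construction described in panels b and e (edges kept for local alignments with bitscore ≥ 20, then trees built per connected component) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name, the input format (tuples of two domain identifiers and a bitscore) and the toy identifiers are assumptions made for illustration only.

```python
from collections import defaultdict

BITSCORE_THRESHOLD = 20  # edge cutoff stated in the caption of panel b


def connected_components(alignments, threshold=BITSCORE_THRESHOLD):
    """Group histone domains into connected components.

    `alignments` is an iterable of (domain_a, domain_b, bitscore) tuples;
    this input format is a hypothetical choice for the sketch.
    """
    graph = defaultdict(set)
    for domain_a, domain_b, bitscore in alignments:
        # Register both nodes even when the alignment is below the cutoff.
        graph[domain_a]
        graph[domain_b]
        if bitscore >= threshold and domain_a != domain_b:
            graph[domain_a].add(domain_b)
            graph[domain_b].add(domain_a)

    seen = set()
    components = []
    for start in list(graph):
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from `start`.
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Toy alignments with made-up identifiers and scores.
toy_alignments = [
    ("h3_1", "h3_2", 55.0),
    ("h3_2", "h3_3", 21.0),
    ("h4_1", "h4_2", 40.0),
    ("h3_3", "h4_1", 12.0),  # below the cutoff, so no edge is drawn
]
toy_components = sorted(sorted(c) for c in connected_components(toy_alignments))
```

In this toy input the weak alignment between `h3_3` and `h4_1` is discarded, so the H3-like and H4-like domains fall into separate components.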
Figure 2. Archaeal histone diversity and post-translational modifications.
a, Distribution of histones (fraction of taxa in each lineage) and histone tails (presence/absence) across archaeal phyla. b, Summary of proteomics evidence of archaeal histones, including the presence of modifications, tails, coverage, fraction of lysines identified, and isoelectric points. Human histones H3 and H4 are included for reference. The alignments at the bottom depict the position of lysine modifications in the globular part of Methanospirillum stamsii and Methanobrevibacter cuticularis HMfB histones (modified residues in bold). c, Archaeal HMfB histones with N-terminal tails (at least 10 aa before a complete globular domain), sorted by frequency of lysine residues in the tail and color-coded according to taxonomy (same as panel a). Amino-acid sequences shown for selected examples. The dotted line indicates the median frequency of lysines in canonical eukaryotic H3 and H4 histone tails. Source data available in Supplementary Data 2. d, Mass spectra of three modified archaeal peptides, representing the relative abundance of fragments at various mass-to-charge ratios (m/z). Spectra were annotated using IPSA. b and y ions and their losses of H2O are marked in green and purple, respectively; precursor ions are marked in dark grey. Unassigned peaks are marked in light grey. Some labels have been omitted to facilitate readability.
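The tail criterion in panel c (at least 10 aa before the globular domain, ranked by lysine frequency) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the assumption that the start of the globular domain is already known as a 0-based index, and the toy sequences are all hypothetical.

```python
MIN_TAIL_LENGTH = 10  # caption: at least 10 aa before a complete globular domain


def tail_lysine_frequency(sequence, globular_start):
    """Return the fraction of lysines (K) in the N-terminal tail.

    `globular_start` is the 0-based index of the first residue of the
    globular domain (assumed to be known from a prior domain search).
    Returns None when the tail is shorter than MIN_TAIL_LENGTH.
    """
    tail = sequence[:globular_start].upper()
    if len(tail) < MIN_TAIL_LENGTH:
        return None
    return tail.count("K") / len(tail)


# Made-up sequences for illustration; these are not real archaeal histones.
toy_histones = {
    "toy_A": ("MSKKGKAAKK" + "ELPIAPIGRI", 10),
    "toy_B": ("MSAAGTAAKAEE" + "ELPIAPIGRI", 12),
    "toy_C": ("MSK" + "ELPIAPIGRI", 3),  # tail too short, excluded below
}

tail_frequencies = {
    name: tail_lysine_frequency(seq, start)
    for name, (seq, start) in toy_histones.items()
}
# Keep only histones with a qualifying tail, sorted by lysine frequency.
ranked = sorted(
    ((name, freq) for name, freq in tail_frequencies.items() if freq is not None),
    key=lambda item: item[1],
    reverse=True,
)
```

A reference value, such as the median lysine frequency of eukaryotic H3 and H4 tails, could be computed with the same function and drawn as a threshold line.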
Figure 3. Taxonomic distribution of chromatin-associated gene classes.
a, Summary of the seven classes of genes with chromatin-related activity covered in our survey: histone-specific hPTM writers (acetylases and methyltransferases), erasers (deacetylases and demethylases), readers, remodellers, and chaperones. b, Percentage of surveyed taxa containing homologs from each chromatin-associated gene class, for eukaryotes (top), archaea, bacteria, and viruses (bottom). Species-level tables are available in Extended Data Fig. 3. c, Number of eukaryotic genes classified in each of the chromatin-associated modification enzymes, readers, remodellers, and chaperones. d, Overlap between the taxon-level phylogenetic distribution of histones and chromatin-associated domains in archaea and four bacterial phyla, measured using the Jaccard index. e, Number of genes encoding writer, eraser, reader and remodeller domains, per species.
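The Jaccard index used in panel d is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch, assuming each distribution is represented as a set of taxon names (the function name and the toy taxa are hypothetical):

```python
def jaccard_index(taxa_a, taxa_b):
    """Jaccard index between two collections of taxon names.

    Returns 0.0 when both collections are empty (a convention chosen
    for this sketch; the index is undefined in that case).
    """
    set_a, set_b = set(taxa_a), set(taxa_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy presence sets with made-up taxon names.
taxa_with_histones = {"taxon_1", "taxon_2", "taxon_3", "taxon_4"}
taxa_with_domain = {"taxon_3", "taxon_4", "taxon_5"}
overlap = jaccard_index(taxa_with_histones, taxa_with_domain)  # 2 shared / 5 total
```

A value of 1 means the two features occur in exactly the same taxa, and 0 means they never co-occur.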
Figure 4. Origin and evolution of chromatin-associated gene families.
a, Summary of phylogenetic affinities of the eukaryotic homologs of gene classes that are also present in prokaryotes. For each gene family, we evaluate whether it is phylogenetically closer to a majority (≥50%) of eukaryotic sequences from a different orthogroup (indicating intra-eukaryotic diversification), or to sequences from Bacteria or Archaea. b, Left, gene tree of eukaryotic and prokaryotic Sirtuin deacetylases, showcasing an example of a eukaryotic family that diversified within eukaryotes (SIRT6) and another one with close relatives in Asgard archaea (SIRT7). Right, gene tree of KAT14 acetylase, a eukaryotic orthogroup with bacterial origins. Statistical supports (UF bootstrap) are shown at selected internal nodes of the highlighted clades. c, Evolutionary reconstruction of hPTM writer and eraser gene families, remodellers, and histone chaperones along the eukaryotic phylogeny, including the number of genes present in the last eukaryotic common ancestor (LECA). Barplots indicate the number of orthologs of each gene family present at the LECA (at 90% posterior probability; see Methods) and whether the presence of a given orthogroup at LECA is supported by its conservation in various early-branching eukaryotic lineages (Amorphea, Discoba, Diaphoretickes and others). The list of ancestral gene families below each plot is non-exhaustive. Two ancestral gene counts are provided: all families at presence probability above 90%, and, in brackets, the subset of these that is present in at least two of the main eukaryotic early-branching lineages (Amorphea, Diaphoretickes, and Discoba). Source data in Supplementary Data 5. d-e, Reconstructed evolutionary origins of the different subunits of the Polycomb repressive complexes (PRC2 and PRC1) and Trithorax-group complexes (KMT1 to 5). f-h, Side-by-side comparison of the presence of individual hPTM marks and various subunits of the Polycomb and Trithorax complexes, as well as other hPTM writers, responsible for their deposition.
Figure 5. Evolution of chromatin readers and capture of chromatin proteins by transposable elements and viruses.
a, Evolutionary reconstruction of reader gene families along the eukaryotic phylogeny, highlighting the number of gains (at 90% posterior probability). The Euler diagram at the top shows the overlap between presence of chromatin-associated catalytic domains and readers. The barplot at the left indicates the number of orthologs of each gene family present at the LECA and whether their presence is supported by their conservation in various early-branching eukaryotic lineages (Amorphea, Discoba, Diaphoretickes, and others). Pie plots at the right summarize the number of orthogroups from each gene family gained within selected lineages: Metazoa, Holomycota, Viridiplantae and SAR+Haptophyta. b, Number of reader or catalytic orthogroups gained at each node in the species tree, for selected species. Source data in Supplementary Data 5. c, Networks of protein domain co-occurrence for Chromo and PHD readers. Each node represents a protein domain that co-occurs with Chromo or PHD domains, and node size denotes the number of co-occurrences with either Chromo or PHD. Edges represent co-occurrences between domains. Groups of frequently co-occurring protein domains have been manually annotated and color-coded, which has revealed subsets of retrotransposon and DNA transposon-associated domains. d, Number of chromatin-related eukaryotic genes fused with transposons grouped by gene family (left), including the fraction that are classified as valid gene models based on expression and assembly data (centre); and the number of species where each type of fusion is found (right). Fusion events are colored according to their similarity with known DNA transposons (red) or retrotransposons (orange) from the Dfam database (see Methods). (*) The ‘Chromo’ category excludes genes containing other chromatin-associated protein domains such as SNF2_N (listed separately as ‘Chromo+SNF2_N’, which includes remodellers with the domain of unknown function DUF1087, which is also common in DNA transposons). e, Selected examples of transposon fusion domains classified by orthogroup, including their archetypical protein domain architecture, homology to transposon class, their phylogenetic distribution, and number of fusion genes. Only orthogroups with at least one valid gene model are listed. Source data available in Supplementary Data 6. f, Example tree of Chromo readers, highlighting genes with fused TE-associated domains and their consensus domain architectures. g, Fraction of viral genomes containing homologs from each chromatin gene family, for nucleocytoplasmic giant DNA virus families (top) and other taxa containing histone domains (Nudiviridae, Polydnaviridae; bottom). h, Phylogenetic analysis of histone domains, with a focus on viral homologs. Statistical supports (approximate Bayes posterior probabilities) are shown for the deepest node of each canonical eukaryotic or archaeal histone clade. The inset table summarizes the presence of doublet histone genes per lineage. i, Number of viral homologs in each chromatin-associated gene family, classified according to their closest cellular homologs (eukaryotes, bacteria or archaea) in phylogenetic analyses (see Methods). Source data available in Supplementary Data 6.
Figure 6. Chromatin evolution and eukaryogenesis.
a, Summary of events in chromatin evolution prior to, during and after the origin of eukaryotes. b, Number of chromatin-related gene families and hPTM marks inferred to have been present at the LECA. Ancestral gene counts are indicated at >90% probability. For gene counts, numbers within bars indicate the subset of families present in at least two of the most deeply sampled early-branching eukaryotic lineages (Amorphea, Diaphoretickes, and Discoba). For hPTMs, the ancestral counts have been inferred using Dollo parsimony assuming a Diaphoretickes–Amorphea split at the root of eukaryotes, and numbers within bars indicate the number of hPTMs whose ancestral presence is supported by more than one species on both sides of the root. c, hPTMs inferred to be present in the last eukaryotic common ancestor (LECA) based on Dollo parsimony. Only amino-acid positions conserved in all eukaryotes in our dataset are shown. Asterisks indicate modifications whose presence at the LECA is supported by just one species on either side of the root. The inferred LECA presence of known writing/erasing enzymes associated with these hPTMs is indicated.
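Under Dollo parsimony a trait is gained only once, so a mark observed on both sides of the root is inferred to have been present at the root. The root-level rule described in panels b and c can be sketched as below. This is a simplified illustration and not the authors' implementation: the function name, the status labels, the species identifiers and the assignment of species to either side of the assumed Diaphoretickes–Amorphea root are all hypothetical.

```python
ROOT_SIDES = ("Amorphea", "Diaphoretickes")  # root split assumed in the caption


def classify_leca_support(species_with_mark, side_of_root):
    """Classify the evidence for a mark being present at the LECA.

    `species_with_mark` lists the species where the mark was detected;
    `side_of_root` maps each species to one of ROOT_SIDES (a hypothetical
    input for this sketch).
    """
    counts = {side: 0 for side in ROOT_SIDES}
    for species in set(species_with_mark):
        side = side_of_root.get(species)
        if side in counts:
            counts[side] += 1

    weakest = min(counts.values())
    if weakest == 0:
        # Seen on one side only: the single gain may postdate the root.
        return "not_inferred"
    if weakest == 1:
        # Present on both sides, but one side rests on a single species.
        return "single_species_support"
    return "multi_species_support"


# Toy data with made-up species and marks.
toy_sides = {
    "sp_a1": "Amorphea",
    "sp_a2": "Amorphea",
    "sp_d1": "Diaphoretickes",
    "sp_d2": "Diaphoretickes",
}
toy_marks = {
    "mark_x": ["sp_a1", "sp_a2", "sp_d1", "sp_d2"],
    "mark_y": ["sp_a1", "sp_d1", "sp_d2"],
    "mark_z": ["sp_a1", "sp_a2"],
}
leca_status = {
    mark: classify_leca_support(species, toy_sides)
    for mark, species in toy_marks.items()
}
```

In this sketch, `single_species_support` corresponds to the asterisk-marked modifications in panel c, and `multi_species_support` to the counts shown within bars in panel b.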
