Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2025 Sep;57(9):2289-2301.
doi: 10.1038/s41588-025-02293-0. Epub 2025 Aug 19.

A comparison of 27 Arabidopsis thaliana genomes and the path toward an unbiased characterization of genetic polymorphism

Affiliations
Comparative Study

A comparison of 27 Arabidopsis thaliana genomes and the path toward an unbiased characterization of genetic polymorphism

Anna A Igolkina et al. Nat Genet. 2025 Sep.

Abstract

Making sense of whole-genome polymorphism data is challenging, but it is essential for overcoming the biases in SNP data. Here we analyze 27 genomes of Arabidopsis thaliana to illustrate these issues. Genome size variation is mostly due to tandem repeat regions that are difficult to assemble. However, while the rest of the genome varies little in length, it is full of structural variants, mostly due to transposon insertions. Because of this, the pangenome coordinate system grows rapidly with sample size and ultimately becomes 70% larger than the size of any single genome, even for n = 27. Finally, we show how short-read data are biased by read mapping. SNP calling is biased by the choice of reference genome, and both transcriptome and methylome profiling results are affected by mapping reads to a reference genome rather than to the genome of the assayed individual.

PubMed Disclaimer

Conflict of interest statement

Competing interests: D.W. holds equity in Computomics, which advises plant breeders. D.W. also consults for KWS SE, a plant breeder and seed producer with activities throughout the world. J.F. is an employee of Tropic TI, Lda. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Genome assemblies and size variation across 27 A. thaliana accessions.
a, Origin of the sequenced accessions with colors indicating ‘Admixture’ group. The reference accession Col-0 (6909) lacks reliable collection data (but hails from Central Europe based on genotype as well as historical records). Please note that 22005 and 22006 appear as a single point as they are geographically close and belong to the same admixture group. Maps were generated using public domain data from the Natural Earth project via the R package rnaturalearth. b, Histogram of genome sizes estimated from k-mers in PCR-free short reads with total assembly sizes superimposed (values to the left of ‘zero’). Most of the variation in genome size can be attributed to unassembled regions (values to the right of ‘zero’). c, The amount of centromeric, 5S rDNA and 45S rDNA repeats, estimated from PCR-free short reads with a BLAST-based approach. d, Correlation between genome size and each of the three major satellite repeats. Their estimated amounts jointly explain up to 92% (P < 2.2 × 10−16) of total genome size variation (for details, see Supplementary Note 3—‘Estimation of satellite repeats’). The linear regression lines with shaded 95% confidence intervals exclude accession 6981 (Ws-2; indicated with a triangle) because of its very high centromeric repeat content.
Fig. 2
Fig. 2. Distribution of variation on chromosome 1.
a, WGA illustrates that the genomes are structurally conserved away from centromeric regions. Chromosomes are aligned from both ends to emphasize the contiguity of the arms, with the unassembled centromeric regions indicated in yellow and inversions in pink. Oscillating gray shades highlight homologous regions (the periodicity of the gradients reflects repeated use of the color scheme and has no biological meaning). b, The density of pangenome graph bubbles, reflecting higher diversity in pericentromeric regions and at ~20 Mb, where an ancestral centromere was lost through chromosome fusion. Synteny level refers to the average sharing of links between nodes across the 27 genomes in 300-kb blocks. c, Distribution of nucleotide diversity based on SNPs called from a multiple alignment. Blue dots represent values of diversity in 200-kb windows. The red line is a smoothed fit to these points, with the standard error shown. The dark blue area corresponds to the centromeric region, with the lighter blue area highlighting the pericentromeric area and the lightest blue the ancestral lost centromere. For chromosomes 2–5, see Extended Data Fig. 1.
Fig. 3
Fig. 3. The allele frequency distribution of SVs.
a, The frequency distribution of the presence alleles of all sSVs grouped by variant length (in bp). The height of the gray block (left) is equivalent to the height of the complete panel (right). b, The proportion of length classes of sSVs in each frequency bin. Colors correspond to those in a.
Fig. 4
Fig. 4. The role of TEs in bi-allelic indels.
a, Categories of overlap between the presence allele of sSVs and annotated TE sequences. b, Number of sSVs in each TE-content category described in a. Colors correspond to a—gray, no TE content. c,d, Histograms of presence frequency showing annotated TE content as a function of frequency, either raw counts (c) or relative to TE-related SVs (d). e, Distribution of the length of presence alleles in different categories. For an analysis broken down by TE superfamily (Supplementary Fig. 12).
Fig. 5
Fig. 5. The graph structure of sSVs.
a, The number of sSVs included (or not) in the graph of nestedness as a function of TE content. b, Graph illustrating the nestedness of sSVs, including only presence alleles with nonzero TE content. Each node represents a set of mutually nested sequences, using a similarity threshold of 0.85, with node size reflecting the number of included sequences. Nodes are colored by TE content as shown in a. c, Same graph as in b, but colored by TE superfamilies, as shown in the center on top. Note that the algorithm used no prior knowledge of TE superfamilies. A large, dominant component connecting all TE superfamilies is also seen in a graph built using the entire A. thaliana TE annotation, suggesting that our sSVs and the reference annotation have comparable properties (Extended Data Fig. 4). d, A graph of nestedness for sSVs without TE content, colored by open reading frame content based on protein BLAST, as shown on the right.
Fig. 6
Fig. 6. Analysis of the gene-ome.
a, The 34,153 annotated genes categorized based on whether they correspond to the TAIR10 annotation and whether they are TEs or PC (based on sequence similarity and TE-like protein domains; Extended Data Fig. 5a). b, Genes categorized based on locus presence frequency across the 27 genomes. For TE presence frequency distribution and a comparison with published results, please see Extended Data Fig. 5b,c. c, Synteny comparison with A. lyrata. For TEs, see Extended Data Fig. 5d. d, Ancestral status of genes of TAIR10 and newly annotated genes (‘position + sequence’ means that both gene sequence and syntenic position are conserved). e, Ancestral status of genes by presence frequency. f, Distribution of genes and TEs along chromosome 5, grouped by frequency (top) and ancestral status (bottom). For all chromosomes, see Extended Data Fig. 5h,i. g, Functional domains of annotated genes, based on comparisons with UniProtKB. h, Functional domains of PC genes grouped by ancestral status. In g and h, genes not matching a functional category have been excluded; for plots including these and breakdowns by presence frequency, see Extended Data Fig. 6e–g. i,j, Expression (in 9-leaf rosettes) and H3K9me2/H3K27me3 levels (in mature leaves) by gene frequency (i) and ancestral status (j). Data shown are for accession 6024; for other accessions, as well as DNA methylation and 24-nt sRNA coverage, see Extended Data Fig. 6. Box plots show the median (center line), the 25th and 75th percentiles (box bounds) and the smallest and largest values within 1.5× the interquartile range (whiskers); outliers beyond this range are not plotted. Expression and H3K9me2/H3K27me3 were compared between frequency/ancestry gene categories using a two-sided Wilcoxon rank-sum test. The number of genes in each category is shown above the top box plot. No multiple-comparison corrections were performed.
Fig. 7
Fig. 7. The growth of the pangenome.
a, The dependence on sample size for the union (‘pan’) and intersection (‘core’) sequence length, separately for the full genome, the mobile-ome and the gene-ome (for saturation by sSV length and normalized views, see Extended Data Fig. 7). b, Pangenome versus reference genome coordinates. The pericentromeric region (light blue) shows a higher ‘dilution’ of the spatial coordinates due to the increased number of SVs in this region, but please note that the slope is never 1, reflecting their ubiquity. The centromeric region is dark blue (for other chromosomes, see Extended Data Fig. 7). c, The contribution of sSVs of different lengths to differences between pairs of genomes. Numbers are counts of sSVs. Shorter sSVs are more numerous, but longer and rarer sSVs contribute more to total length variation.
Fig. 8
Fig. 8. Read mapping and reference bias.
a, To investigate SNP-calling errors, SNPs were identified from pairwise WGA as well as by designating one genome as a reference and calling SNPs using short reads from the other (next generation sequencing (NGS); see Supplementary Note 7 for details). Numbers are averages across all accession pairs (±s.d.). From the point of view of WGA, each site can either be aligned or not. Conversely, from the point of view of NGS, each site in the reference genome can be covered by sample reads or not. Arrows pointing right refer to WGA SNPs, which we assume to be correct; arrows pointing left refer to NGS SNPs. WGA SNPs are TPs if also called by NGS, FNs if missed by NGS and uncalled if in regions not covered by NGS reads. Conversely, NGS SNPs are TPs if also found in the WGA and FPs if not. FPs come in two flavors: those that correspond to bona fide non-SNPs in the WGA and those that correspond to regions that were not aligned. Heterozygous calls in completely inbred lines are obviously FPs and are treated separately. b, SNP-calling error rates depend on the relationship between the reference and the sampled genome. Each line is the regression of SNP-calling errors for a different choice of reference genome (identified by color). FNR, false negative rate. c, Correlation between expression- and methylation-level estimates derived from mapping reads to the TAIR10 genome and the genome of the sampled individual. Sample sizes for each data type also apply to d and e. d, Percentage of genes for which expression or methylation levels differ by more than 30% or 50%, respectively, when mapping reads to the TAIR10 reference rather than their ‘own’ genome. e, Enrichment of copy-number variable genes among those with discordant expression- or methylation-level estimates. Box plots show the median plus the 25th and 75th percentiles, with whiskers covering the smallest and largest values within 1.5× the interquartile range. The percentage of copy-number variable (CNV) genes was compared between wrongly and correctly estimated genes across available samples using a two-sided Wilcoxon test. TPs, true positives; FPs, false positives; FNs, false negatives.
Extended Data Fig. 1
Extended Data Fig. 1. The pattern of polymorphism on chromosomes 2–5.
ad, Whole-genome alignments. eh, Density of PGGB graph bubbles. il, Nucleotide diversity. For chromosome 1 and further explanation, see Fig. 2a–c.
Extended Data Fig. 2
Extended Data Fig. 2. SV type and gene-relative location.
a, Cartoons illustrating our classification of length variants into sSVs and cSVs in the whole-genome alignment and graph representations. b, The distribution of sSVs and cSVs across chromosomes—cSVs are more common in pericentromeric regions. c, Fraction of sSVs and cSVs by length. The prevalence of short sSVs is at least partly explained by the low probability of the multiple events required for cSVs to occur in short DNA segments. d, The number of putative insertions (presence allele in 1–3 accessions) and deletions (absence allele in 1–3 accessions) of varying lengths in genes and intergenic regions. e, Same for exons and introns.
Extended Data Fig. 3
Extended Data Fig. 3. Size distribution of graph nodes by TE-content.
Large nodes are found not only for presence alleles that correspond to complete annotated TE but also for other categories, demonstrating that these are also part of the mobile-ome.
Extended Data Fig. 4
Extended Data Fig. 4. The graph of the nestedness of existing A. thaliana TE annotations.
Each node is a cluster of similar sequences (with length and identity thresholds of 0.85), where the size of the node indicates its relative abundance. The graph can be decomposed into one dominant connected component and several smaller ones, as shown on the right.
Extended Data Fig. 5
Extended Data Fig. 5. Details of gene-ome analysis.
a, TE genes were defined as the union of two sets of genes: (1) those with >50% annotated TE sequence in at least one accession, and (2) those with ORFs similar to proteins known to be involved in TE function. b, Comparison of gene presence variation in this study to a recent analysis in 69 A. thaliana accessions 23. For both PC genes and TE genes, our presence–absence variability is higher for gene models compared to gene loci, probably because our gene annotation pipeline misses lowly expressed genes in the four tissues we used. The PC gene model variability in the study from the Mercier lab 23 is substantially higher than in our data. This likely reflects the fact that while we used pan-genome coordinates to match genes between accessions, they used an orthogroup search approach. Consider this case: a gene is both present and has a gene model in every accession, but in three accessions, a shorter gene model was annotated. The orthogroup approach would identify this case as two gene families residing in the locus and call both gene models dispensable, while our approach would treat this locus as one and identify both gene and gene model as core. c, Gene presence frequency distribution for TE genes. d, A. lyrata synteny and sequence similarity analysis for TE genes (TAIR10 and new) with the same 80% sequence-similarity stringency as in Fig. 6. TE genes are plotted by their gene IDs along the chromosomes. The small number of data points compared to PC genes is because most TE genes were not found in the A. lyrata genome. e, Functional domains (‘NA’ indicates genes with no match) for genes grouped by new vs. old and PC vs. TE. f, The same for PC genes grouped by frequency of presence. g, The same PC and TE genes grouped by ancestral status. h, The distribution of PC and TE genes along all five chromosomes grouped by frequency of presence. i, The same grouped by ancestral status.
Extended Data Fig. 6
Extended Data Fig. 6. Gene silencing and expression by gene presence frequency and ancestral status.
a, Median expression vs. frequency for all accessions. b, H3K9me2 and H3K27me3 54 genes vs. their frequency. Values plotted are medians of quantile- and input-normalized ChIP-seq signals. c, 24-nt sRNAs coverage of genes of different frequencies. Left: distribution in flowers of accession 6024. Right: median coverage. d, Locus-wide CHG, CG and CHH methylation levels in leaves 53 for genes of different frequency. Left: distribution in accession 6024. Right: median levels. e, Expression (median TPM) vs. ancestral status for all accessions. f, H3K9me2 and H3K27me3 54 vs. ancestral status. Values plotted are medians of quantile- and input-normalized ChIP-seq signals. g, 24-nt sRNAs locus-wide coverage in accession 6024, flowers, for genes of different ancestral status. h, Methylation levels in leaves 53 for genes of different ancestral status. Left: accession 6024. Right: median methylation levels. All heatmaps were created using the pheatmap package in R, and all, except for expression, are scaled by row to facilitate within- and between-accession comparison. Box plots show the median and the 25th and 75th percentiles (box bounds), and the smallest and largest values within 1.5× the interquartile range (whiskers); outliers are not plotted. Significance of median differences was assessed using two-tailed Wilcoxon tests. Sample sizes are shown above the top boxplot in Fig. 6i–j.
Extended Data Fig. 7
Extended Data Fig. 7. Extended pan-genome analysis.
a, Saturation curves for sSVs for different lengths. Each curve is drawn through points that are the averaged values from 20 repetitions. The curves were normalized to start and end at the same points. Larger sSVs saturate more quickly. b, Normalized saturation curves for different genome components: the dependence on sample size for the union (‘pan’) and intersection (‘core’) sequence length, separately for the full genome, the mobile-ome and the gene-ome. Within the ‘pan’ curves, the mobile-ome saturates the fastest, the gene-ome the slowest and the entire genome at an intermediate rate. Gene-ome and whole genome ‘core’ curves show similar trends. cf, Pan-genome vs. reference genome coordinates for chromosomes 2–5. The pericentromeric region (light blue) shows a higher ‘dilution’ of the spatial coordinates due to the increased number of SVs in this region. The centromeric region is dark blue.

References

    1. Gallone, B. et al. Domestication and divergence of Saccharomyces cerevisiae beer yeasts. Cell166, 1397–1410 (2016). - PMC - PubMed
    1. Istace, B. et al. De novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer. GigaScience6, 1–13 (2017). - PMC - PubMed
    1. Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature556, 339–344 (2018). - PMC - PubMed
    1. Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature588, 277–283 (2020). - PMC - PubMed
    1. Alonge, M. et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell182, 145–161 (2020). - PMC - PubMed

Publication types

Substances

LinkOut - more resources