Cell. 2020 Jul 9;182(1):145-161.e23.
doi: 10.1016/j.cell.2020.05.021. Epub 2020 Jun 17.

Major Impacts of Widespread Structural Variation on Gene Expression and Crop Improvement in Tomato

Michael Alonge et al. Cell. 2020.

Abstract

Structural variants (SVs) underlie important crop improvement and domestication traits. However, resolving the extent, diversity, and quantitative impact of SVs has been challenging. We used long-read nanopore sequencing to capture 238,490 SVs in 100 diverse tomato lines. This panSV genome, along with 14 new reference assemblies, revealed large-scale intermixing of diverse genotypes, as well as thousands of SVs intersecting genes and cis-regulatory regions. Hundreds of SV-gene pairs exhibit subtle and significant expression changes, which could broadly influence quantitative trait variation. By combining quantitative genetics with genome editing, we show how multiple SVs that changed gene dosage and expression levels modified fruit flavor, size, and production. In the last example, higher order epistasis among four SVs affecting three related transcription factors allowed introduction of an important harvesting trait in modern tomato. Our findings highlight the underexplored role of SVs in genotype-to-phenotype relationships and their widespread importance and utility in crop improvement.

Keywords: breeding; cis-regulatory; copy number variation; cryptic variation; domestication; dosage; epistasis; long-read sequencing; structural variation; tomato.

Conflict of interest statement

Declaration of Interests W.R.M. is a founder and shareholder of Orion Genomics, a plant genetics company. Z.B.L. is a consultant for and a member of the Scientific Strategy Board of Inari Agriculture. Orion Genomics and Inari Agriculture had no role in the planning, execution, or analysis of the experiments described here.

Figures

Figure 1. The tomato panSV-genome
(A) SNP-based phylogenetic tree based on short-read sequencing of more than 800 tomato accessions. Major taxonomic groups are marked by colored lines along the circumference. Colored dots indicate a subset of the 100 accessions selected for long-read sequencing. (B) Stacked bar graph showing SV number and type from the 100 accessions. Colored dots indicate the taxonomic group of each accession, corresponding to colors in (A). (C) Hierarchical clustering dendrogram of the SV presence/absence matrix across the 100 accessions, with colors corresponding to (A). Bold branches and names highlight an outgroup of two SLL processing tomato accessions. (D) SVCollector curves of SVs in the three major taxonomic groups. The “greedy” algorithm determines the order of accessions and depicts the cumulative number of SVs as a function of the number of accessions included. (E) Graph showing the number of SVs (y-axis) in “no more than” or “at least” the number of accessions indicated on the x-axis. (F) Histograms of detection frequencies for different SV sizes. (G) Histogram of SV sizes for insertions and deletions. (H) Annotation of the panSV-genome. The proportion of repeat types for all insertions and deletions annotations is shown in stacked bar graphs. “Count” shows the proportion of individual repeat annotations, and “bp” shows the proportion of cumulative repeat (not indel) sequence length. “Other” refers to other repeat types. Only indels at least 100 bp in size were considered. See also Figure S1.
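The greedy ordering described in (D) can be illustrated with a small Python sketch. This is not SVCollector's implementation; it only assumes that each accession is represented as a set of SV identifiers, and the function name `greedy_sv_curve` and the toy data are hypothetical.

```python
def greedy_sv_curve(sv_sets):
    """Order accessions greedily by how many not-yet-seen SVs each adds.

    sv_sets: dict mapping accession name -> set of SV identifiers.
    Returns a list of (accession, cumulative_sv_count) in greedy order.
    """
    remaining = dict(sv_sets)
    seen = set()
    curve = []
    while remaining:
        # Pick the accession contributing the most unseen SVs.
        # Iterating over sorted names makes ties deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - seen))
        seen |= remaining.pop(best)
        curve.append((best, len(seen)))
    return curve


# Toy example with three hypothetical accessions.
toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}
curve = greedy_sv_curve(toy)
```

Plotting the second element of each pair against its position gives a cumulative curve of the kind shown in the panel.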
Figure 2. SV distribution reveals large-scale admixture and introgression between wild and domesticated genotypes
(A) Heatmap (top) showing SV frequency in 1 Mbp windows (columns) of chromosome 4 relative to the reference genome. Accessions (rows) are grouped by taxonomic group (colored bars). Dotted colored lines mark three notable regions: black, a large SV hotspot for 5 SLLs; red, a small hotspot shared by most UFL SLL lines; yellow, an SP group with reduced SV frequency, reflecting a small SP introgression in the reference genome. Circos plot (bottom) depicting genome-wide SV frequency for five notable accessions. Rings depict line plots showing the SV number in successive 1 Mbp windows (y-axes are not shared between rings). Chromosomes 4, 5, 7 and 11 are highlighted to show regions of high SV frequency. (B) Heatmaps showing admixture and introgressions on chromosome 4 measured by Jaccard similarity between accessions of SLL and SP (top) and SLC (bottom) in the same row-order as (A, top). For each 1 Mbp window, the SVs for a given SLL accession are compared to the SVs for all SP (top) or SLC (bottom) accessions, and the maximum Jaccard similarity is reported. Windows with fewer than 5 SVs in the SLL set are excluded and colored grey. Black and red dotted regions correlate with marked SV hotspots in (A, top). (C) Timeline of UFL fresh market variety release over the last century. Approximate periods of introgression of key disease resistance genes are shown in red, along with major donor genotypes for Fusarium wilt (I, I2, I3) and grey leaf-spot (Sm). (D) Jaccard similarity for chromosome 11 between the UFL lines (ordered chronologically) and LA1589, the closest SP to this introgression. Locations of I, Sm and I2 are shown in red. (E) The UFL varieties on chromosome 7 showing a small SP introgression in all but two accessions; Fla.7481 and Fla.7907B carry a unique SV hotspot (left) due to introgression of the I3 resistance gene (red) from S. pennellii. See also Figure S3.
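The per-window comparison described in (B) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: SVs are assumed to be pre-binned into windows as sets of identifiers, and the names `jaccard`, `max_jaccard_by_window`, and `min_svs` are hypothetical. Only the threshold of 5 SVs comes from the caption.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_jaccard_by_window(query, comparison, min_svs=5):
    """Report, per window, the best Jaccard match of one accession against a group.

    query:      dict window index -> set of SV ids for one SLL accession.
    comparison: dict accession name -> (dict window index -> set of SV ids).
    Windows where the query has fewer than min_svs SVs are excluded (None).
    """
    result = {}
    for window, svs in query.items():
        if len(svs) < min_svs:
            result[window] = None  # excluded window (drawn grey in the heatmap)
            continue
        result[window] = max(
            (jaccard(svs, other.get(window, set())) for other in comparison.values()),
            default=0.0,
        )
    return result


# Toy example: one SLL accession against two hypothetical SP accessions.
sll = {0: {1, 2, 3, 4, 5}, 1: {6, 7}}
sp_group = {"sp1": {0: {1, 2, 3}}, "sp2": {0: {1, 2, 3, 4, 5, 9}}}
similarity = max_jaccard_by_window(sll, sp_group)
```

In the toy data, window 0 is scored by its closest match (`sp2`), and window 1 is excluded because it holds only two SVs.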
Figure 3. Gene associated SVs impact expression
(A) Stacked bar chart showing total counts of SVs overlapping different genomic features in major taxonomic groups. N represents the number of accessions in each taxonomic group. (B) Percentage of SVs overlapping different genomic features in 100 accessions. Each point is one sample. Fewer SVs are found within genes compared to surrounding regulatory regions. (C) Stacked bar charts showing numbers of differentially expressed genes affected by insertion, deletion, and duplication SVs overlapping coding sequences (left) and regulatory regions (right)*. Differential expression was tested on common SVs in the 23 accessions used for RNA-sequencing (frequency between 0.2 and 0.8) (see STAR Methods). (D) ROC curves for the top three SV annotation types, with high AUROC (Area Under the Receiver Operating Characteristics) scores across the three tissues demonstrating the ability to identify genes containing SVs using changes in expression across the accession split. The AUROC is specified within the ROC curve in each case. The steep rise of the curves in the top panel corresponds to a near-perfect identification of a large fraction of the genes containing SVs based on differential expression. CDS, coding sequence. (E) Differential expression significantly predicts genes with SVs. Overall performance of using “SV splits” and differential expression to predict associated gene(s) (see STAR Methods). Analyses are broken down into 9 categories across three tissues. Each category is defined based on SV type and relative position to genes. Circle sizes and colors represent the significance of performance (−log10 p-value) and the magnitude of AUROC, respectively. SV categories are ranked in decreasing order of average AUC (Area Under the Curve) across the three tissues. Note that the significance of performance for each SV type is enhanced by the number of annotated SV-gene pairs (for example, p < 1×10−4 for ≈ 16 duplications, while p < 1×10−4 for ≈ 468 insertions in introns).
(F) Volcano plots for four regulatory SV-gene pair examples with the highest AUROC score highlight the extent of differential expression of SV-containing genes (marked in orange circles), compared to all expressed genes (black dots). Additional examples are presented in Figure S4F. p-values and expression fold changes are computed across two groups of accessions (with and without the indicated SV). Data shown for apex tissue. Exons (orange), UTRs (yellow), and SVs (red) are not drawn to scale. Distances between genes and SVs are shown. * Significance is defined as an adjusted p-value less than 0.05. See also Figure S4.
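The AUROC reported in (D) and (E) can be read as the probability that a randomly chosen SV-containing gene receives a higher differential-expression score than a randomly chosen gene without the SV. A minimal Python sketch of that rank-based definition follows; the function name `auroc`, the scores, and the labels are hypothetical, and the authors' actual procedure is described in their STAR Methods.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: per-gene differential-expression scores (higher = stronger signal).
    labels: 1 if the gene is annotated with the SV, else 0.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four genes, two of which carry the SV.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of SV-containing genes from the rest.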
Figure 4. New reference genomes anchor candidate genes and resolve multiple SV and coding sequence haplotypes for the “smoky” volatile GWAS locus
(A) Schematic showing a key step of the metabolic pathway underlying the “smoky” aroma trait. During fruit ripening, activation of glycosyltransferase NSGT1 prevents release of smoky-related volatiles by converting them into non-cleavable triglycosides (top). nsgt1 mutations result in the release of the smoky volatile guaiacol. (B) Genomic resources used to resolve the GWAS locus for guaiacol (top) and summary of haplotypes (bottom). The published locus mapped to a region of chromosome 9 with one candidate gene and multiple gaps, and also to an unanchored contig with a fragment of an NSGT gene (top). MAS2.0 assemblies revealed multiple haplotypes that include copy number variation for the NSGT1 and NSGT2 paralogs and loss-of-function mutations (bottom). A local assembly revealed haplotype V (asterisk) (see STAR Methods). (C) Schematics depicting the five resolved haplotypes. The assemblies and major taxonomic groups from which the haplotypes were identified are shown below. Red “X”s mark coding sequence (CDS) mutations. Grey bars mark duplication in haplotype IV. Red rectangle marks a large deletion in haplotype V. (D) PCR confirmation of the deletion in haplotype V. Primers (F1, F2, R1) are shown in (C). (E) Quantification of NSGT1/2 expression by RNA-sequencing. Haplotypes are grouped according to functional NSGT1 (I, II, III), nsgt1 CDS mutation (IV) and nsgt1 deletion (V) (see STAR Methods). Expression data are from pericarp tissue of ripe fruit (Zhu et al., 2018). (F-G) Guaiacol content of fruits from a previous GWAS study (F) (Tieman et al., 2017) and a new GWAS analysis using a collection of 155 SP and SLC accessions (G). Mutations in NSGT1 are associated with guaiacol accumulation. Accessions are grouped as in (E). (H) Quantification of guaiacol and methylsalicylate content in an SLC x SLC F2 population segregating for the haplotype V 23 kbp deletion. In (E-H), n represents sample size in each group. All p-values are based on two-tailed, two-sample t-tests.
Figure 5. The fruit weight QTL fw3.2 resulted from a tandem duplication that increased expression of a cytochrome P450 gene
(A) Published mechanism for fw3.2 positing that a SNP in the promoter of the cytochrome P450 gene SlKLUH increased expression ~2-fold, resulting in larger fruits. (B) SV analyses revealed a 50 kb tandem duplication at the fw3.2 locus that included SlKLUH (left). PCR validation of the duplication (right). Primers (F1, F2, R1) are labeled on the left. “No duplication” refers to the accession without this duplication and “fw3.2dup” refers to the accession that carries the duplicated copy of fw3.2 as shown by the PCR product across the duplication junction (F2 + R1). (C) Expression of genes within the fw3.2 duplication is ~2-fold higher. Gene coordinates and the duplication region (top), and RNA-seq box plots of duplicated and flanking genes (bottom) are shown. Each point is one biological replicate from one accession (see STAR Methods). n, number of accessions. (D) An SLC x SLC F2 population segregating for the fw3.2 duplication, but fixed for the promoter SNP (see STAR Methods). Increased fruit weight is associated with the duplication. (E) CRISPR-Cas9 mutagenesis of SlKLUH in the M82 background. SlKLUH gene model with gRNA targets (top), PCR genotyping (middle) and representative inflorescences (bottom) of slkluhCR T0 plants. The three slkluhCR T0 plants shown have mutations in all four copies of SlKLUH and exhibit similar tiny inflorescences, suggesting a null phenotype. Strong phenotypes were also observed for other T0 plants with sequenced indels (red font) except T0–1, which showed a weaker phenotype and was fertile, allowing a genetic test of dosage. (F) Altering tomato KLUH gene dosage shows that copy number variation explains fw3.2. Schematic showing the M82/M82CR slkluh T0–1 (SL) x LA1589 (SP) crossing scheme used to test the phenotypic effects of altering tomato KLUH functional copy number in an F1 hybrid isogenic background.
Genotypic groups A and B are isogenic for M82 x LA1589 genome-wide heterozygosity and differ only in having 3 or 1 functional copies of tomato KLUH, respectively. Genotypic group C effectively has 0 functional copies due to inheritance of the single insertion Cas9 transgene that targets the single SpKLUH allele in trans. (G) Mutated slkluh alleles and the SpKLUH allele in genotypic group B. Red font, guide RNA targets. Cyan font, mutations. An LA1589 SNP (blue font) permits distinction of KLUH allele parent-of-origin. All SpKLUH sequences in genotypic group B are wild type. (H) Decreasing tomato KLUH functional copy number reduces flower organ size. Representative inflorescences (left) and quantifications of flower and sepal length (right) from all three genotypic groups. (I) Decreasing tomato KLUH functional copy number reduces fruit weight. Representative fruits (left) and fruit weight quantification (right) from genotypic groups A and B. Reducing tomato KLUH copy number from three to one reduces fruit size by 30%. Genotypic group C plants with mutated SpKLUH alleles fail to produce fruits. Scale bar is 1 cm in (E and H) and is 2 cm in (I). In (H and I), N indicates plant number; n indicates flower/fruit number. All p-values are based on two-tailed, two-sample t-tests. See also Figure S5.
Figure 6. Four SVs in three MADS-box genes were required to breed for the jointless trait
(A) Genetic suppressors were selected to overcome a negative epistatic interaction on yield caused by mutations in two MADS-box genes. The SV mutation j2TE causes a desirable jointless pedicel that facilitates harvesting. Introducing j2TE in backgrounds carrying the cryptic SV mutation ej2w results in excessive inflorescence branching and low fertility. The sb1 and sb3 QTLs were selected to suppress j2TE ej2w negative epistasis. sb3 is an 83 kb duplication harboring ej2w. sb1 is cloned in this study. (B) Quantification of sb1 partial suppression of branching in the j2TE ej2w background. The SB1 j2TE ej2w and sb1 j2TE ej2w genotypes were derived from F3 families. Each data point is one inflorescence from F4 plants (n). (C) Delta SNP index (deltaSNPi, QTL-seq) plot shows the sb1 locus contains the TM3-STM3 MADS-box gene cluster (see STAR Methods). (D) Schematic of the TM3-STM3 locus in the SLL genotypes M82 and Fla.8924, with M82 having a ~22 kb tandem duplication (designated SB1) containing STM3. (E) RNA-seq showing increased expression of STM3 from the SB1 duplication compared to sb1. (F) CRISPR-Cas9 mutagenesis of the TM3-STM3 cluster (sb1CR) suppresses branching in the j2TE ej2w background. Schematics at top depict two CRISPR lines with indel mutations in the STM3 and TM3 genes (sb1CR−1) and a large deletion spanning all three genes (sb1CR-del). Representative inflorescences from the indicated genotypes (bottom). Arrowheads mark branch points. (G) Quantification and comparison of suppression of inflorescence branching by homozygous and heterozygous sb1CR−1 and sb1CR-del mutations in the background of j2TE ej2w. Genotypes were derived from F2 populations (see STAR Methods). N, plant number. n, inflorescence number. (H) STM3 duplication allele frequency in wild tomato species (distant relatives, SP), early domesticates and cultivars (SLC, SLL vintage) and modern cultivars (SLL fresh market and processing).
(I) Distribution of J2 EJ2 SB1 genotypes in fresh market and processing/roma tomato types. All j2 fresh market genotypes carry sb1 and sb3, whereas processing/roma genotypes have SB1 or sb1, because EJ2 is functional. (J) Schematic showing the history of breeding for the jointless trait, including when SVs in EJ2 and STM3 arose. The pre-existing sb1 cryptic variant (single copy STM3) mitigated the severity of branching caused by introduction of j2TE in varieties carrying the cryptic variant ej2w. Selection of the sb3 cryptic variant (two copies of ej2w) resulted in the complete suppression of branching and restoration of normal yield. Gradient colored bar represents timeline. The table summarizes genotypic combinations. Blue and black bold fonts indicate solutions for jointless breeding in fresh market and processing/roma types, respectively (I and J). In (B, E, H and I), n represents sample size. P-values in (B and G) are based on two-tailed, two-sample t-tests. See also Figure S6.
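The delta SNP index plotted in (C) is, in the usual QTL-seq formulation, the difference between two bulks in the fraction of reads carrying the non-reference allele at a SNP. The Python sketch below shows only that arithmetic. It assumes per-SNP read counts are already available; the function names and the example counts are hypothetical and do not come from the study.

```python
def snp_index(alt_reads, total_reads):
    """Fraction of reads at a SNP that carry the non-reference allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def delta_snp_index(bulk_a, bulk_b):
    """Difference in SNP index between two bulks at the same SNP.

    Each bulk is an (alt_reads, total_reads) pair. Values near +1 or -1
    suggest the SNP is linked to the trait separating the two bulks;
    values near 0 suggest no linkage.
    """
    return snp_index(*bulk_a) - snp_index(*bulk_b)


# Hypothetical read counts for one SNP in two phenotypic bulks.
delta = delta_snp_index((18, 20), (4, 20))
```

In practice the per-SNP values are usually averaged in sliding windows along each chromosome before plotting, which this sketch omits.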
