Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Apr;57(4):973-985.
doi: 10.1038/s41588-025-02113-5. Epub 2025 Mar 3.

The genomic landscape of gene-level structural variations in Japanese and global soybean Glycine max cultivars

Affiliations

The genomic landscape of gene-level structural variations in Japanese and global soybean Glycine max cultivars

Ryoichi Yano et al. Nat Genet. 2025 Apr.

Erratum in

Abstract

Japanese soybeans are traditionally bred to produce soy foods such as tofu, miso and boiled soybeans. Here, to investigate their distinctive genomic features, including genomic structural variations (SVs), we constructed 11 nanopore-based genome references for Japanese and other soybean lines. Our assembly-based comparative method, designated 'Asm2sv', identified gene-level SVs comprehensively, enabling pangenome analysis of 462 worldwide cultivars and varieties. Based on these, we identified selective sweeps between Japanese and US soybeans, one of which was the pod-shattering resistance gene PDH1. Genome-wide association studies further identified several quantitative trait loci that accounted for large-seed phenotypes of Japanese soybean lines, some of which were also close to regions of the selective sweeps, including PDH1. Notably, specific combinations of alleles, including SVs, were found to increase the seed size of some Japanese landraces. In addition to the differences in cultivation environments, distinct food processing usages might result in changes in Japanese soybean genomes.

PubMed Disclaimer

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Nanopore-based genome references for Japanese and other soybean lines.
a, Phylogenetic tree of 462 worldwide soybean cultivars and varieties. The analysis was conducted using a resequencing (SNPs and short indels) dataset based on the Williams 82 Gmax_508 v4.0 reference. One wild soybean accession was used as an outgroup taxon. b, Principal component analysis of 462 soybeans based on the dataset used in a. c, Genome assembly and annotation statistics in newly constructed genomes. The ‘Num. gaps’ values indicate the number of gaps with ≥100 bp on 20 chromosomes. BUSCO5 benchmark was used to evaluate the completeness of the protein sequence datasets. chrs., chromosomes; num., number; seqs., sequences. d, Circos plots comparing the Enrei and Williams 82 Gmax_508 genome references. One-to-one orthologous gene partners (48,518 links) are indicated by blue (genes are located in the same direction) and gray (genes are located in the opposite direction) lines. Purple, red and orange links indicate 719 tandem copy number variations (CNV), 1,944 other SV types and 114 genes that were unanchored to either genome, respectively. e, Gene ID nomenclature in the Enrei v3.31 genome reference. The consensus gene ID started with a ‘Glyma.’ string attached to Enrei genes that have one-to-one orthologs in either version of Williams 82 v4.0, v2.0 or v1.0 (see also Supplementary Fig. 5).
Fig. 2
Fig. 2. Gene-SV analysis using the Asm2sv pipeline.
a, Schematic illustration of the flexible SV that may be detected between distinct genomes. b, Overview of the Asm2sv pipeline. This pipeline was designed to capture flexible SVs in each gene region based on algorithmic genomic alignment. The degree of gene conservation, disruption or extension in the target genome is calculated as a numeric score. The gene-SV scores can be combined to compare multiple genomes. Asm2sv can also output VCF data to use for pangenome construction. The detailed principles and usage of this pipeline are described in Supplementary Fig. 6. c, Examples of gene-SVs detected by Asm2sv. Gene-SV scores and genomic alignments are shown for the GmFT2b and CG-α-2 genes that carry insertion or deletion SVs, respectively. The left panels indicate bar charts showing the gene-SV scores obtained by Asm2sv; the center and right panels indicate examples of genomic alignment and microsynteny, respectively. In the left panel, the cultivar Enrei is highlighted in red text as its genome was used as the base reference.
Fig. 3
Fig. 3. An overview of gene-SV in 42 global soybean genomes.
a, Cumulative gene-SV scores obtained by Asm2sv in 42 soybean genomes. The Enrei v3.31 genome was used as the base reference. In the cumulative bar chart shown in the top panel, soybean genomes are indicated by different colors. The dot plot in the bottom panel is presented to emphasize the genes or elements that are less conserved between these genomes (the red box). Some examples of SV-prone regions were marked by asterisks as they exhibited a sharp downward peaks. b, Overrepresented terms in the less-conserved genes and elements shown in a. TEA was performed based on the InterPro ID. The q values were calculated from P values obtained by two-tailed Fisher’s exact test. Terms relating to three different categories (retrotransposon, disease response and signal transduction) are indicated by light blue, red and green arrows, respectively. AP, apoptotic protease; RTase, reverse transcriptase; TIR, toll/interleukin-1 receptor. c, Hierarchical clustering of 42 soybean genomes based on gene-SV scores. Cultivated and wild soybeans are shown in different colors; Japanese soybeans are indicated by red characters with a yellow background.
Fig. 4
Fig. 4. Tissue-wide RNA-seq atlas and co-expression analysis in Enrei and Williams 82.
a, Schematic illustration of tissue-wide RNA-seq dataset for Enrei and Williams 82. b, Heatmap illustrating the co-expression groups identified by WGCNA. c DEG groups identified by comparison between Enrei and Williams 82. The red lines indicate the expression patterns of computationally calculated eigengenes that represent the co-expression group. Genes were categorized into a co-expression group according to P values that were calculated by WGCNA (P < 1 × 10−6, n = 44). d, Volcano plot showing the significance levels of each DEG. To identify genes predominantly expressed in either genotype, P values were calculated with two-tailed Student’s t-test based on the all-expression levels in 22 tissues of Enrei and Williams 82 (P value < 0.01 and fold change of ≥2). Genes belonging to the co-expression groups shown in c were analyzed. The red and blue dots indicate Enrei-upregulated or Williams 82-upregulated genes, respectively. e, Overrepresented terms in Enrei-upregulated or Williams 82-upregulated gene groups identified in d. TEA was performed based on the InterPro ID. The q values were calculated from P values obtained by two-tailed Fisher’s exact test. Terms relating to three different categories (retrotransposon, disease response and signal transduction) are indicated by light blue, red and green arrows, respectively.
Fig. 5
Fig. 5. Genome-wide survey of genetic variants that distinguish Japanese and other world soybeans.
a, Manhattan plot of gene-SVs that distinguish G.max I and II members. Two-tailed Student’s t-test was performed based on gene-SV scores between 10 G.max I and 28 G.max II soybean lines, as described in the right panel. The Bonferroni-corrected threshold for P = 0.05 is indicated by a red dashed line. Positions of significant −log10(P value) peaks are indicated by square braces, and lines which overlap with XP-CLR peaks in b are shown in light blue. b, Genome-wide selective sweep scan using XP-CLR. The top panel indicates the comparison between 140 Japanese and 176 US soybeans; the bottom panel compares 108 East Asian (except Japanese) and 176 US soybeans. The XP-CLR values were calculated based on the VCF data that integrated conventional resequencing datasets (SNPs and short indels) and pangenome SV datasets. For clarity, primitive soybean accessions were not included in these analyses. Positions of significant XP-CLR peaks are indicated by square braces, and lines which overlap with gene-SV peaks in a are in light blue. c, The frequency of the stop-codon mutation (K31*) of the PDH1 gene in Japanese and other soybeans. Note that this mutation leads to increased pod-shattering resistance.
Fig. 6
Fig. 6. Phenotypic variations in seed size and shape in Japanese and other world soybeans.
a, Photographs of the seeds of Enrei and Williams 82. b, Cartoon illustrating the definition of seed area size, length and width. These values were quantified with SmartGrain software, both horizontally and vertically. c, Histograms showing the frequencies of seed area size, length and width in Japanese and other world soybeans. The ratios of length to width or horizontal width to vertical width were calculated to represent seed roundness. Seeds were obtained from 221 soybean lines cultivated in 2021, 2022 and 2023 (n ≥ 536). The positions of soybean lines whose genome references were constructed in this study are shown by dark blue numbers. d, Dot plots showing correlations between seed phenotype parameters. The coefficients of determination described in the plots were calculated by linear regression (n = 520).
Fig. 7
Fig. 7. GWAS of seed size parameters.
a, Manhattan plots showing the results of the GWAS. Seed area size, length and width in 221 soybean lines (Fig. 6) were used as the phenotype data. The −log10(P values) were separately calculated using the rMVP tool with the VCF data from single-reference (SNPs and short indels) or pangenome (SV datasets) analyses. The corrected significance thresholds that were calculated using rMVP in SNP-based or SV-based GWAS are indicated by red dotted and dashed lines, respectively, in the plots. For each phenotype, a plot was created by combining the results of independent GWAS that were conducted based on the phenotype data of three years (2021, 2022 and 2023) and their averages. In rMVP, the general linear model (GLM), mixed linear model (MLM) and FarmCPU were used, but the −log10(P values) from the GLM are presented in the plots only when those of MLM and/or FarmCPU tests for the variant were above the threshold level in at least two tests. The positions of significant −log10(P value) peaks are indicated by square braces and lines together with candidate gene IDs (if present). The −log10(P value) peaks that were commonly observed between distinct phenotype parameters are highlighted in light blue. bf, Box plots showing the effect of variants on seed size. The combined effects of two variants are analyzed in bd. The frequencies of Japanese and other world soybeans are indicated separately by different colors. For box plots, the center line denotes the median value; the box contains the 25th to 75th percentiles of the dataset; whiskers extend 1.5 times the interquartile range; and dots represent individual phenotype values. P values calculated by two-tailed Student’s t-test indicate significant differences between groups. SV10, SV50, SV100 and SV3000 indicate 10–50 bp, 50–100 bp, 100–1,000 bp and 3–5 kb differences between reference and target genomes, respectively. REF and ALT indicate reference and non-reference genotypes, respectively. SVs obtained through pangenome analysis are indicated by ‘[pan]’. ‘Large seed’ indicates eight Japanese landraces with particularly large seeds that were selected by the following criteria: averaged area_size (H) > 80 mm2 and averaged area_size (V) > 60 mm2 (see also Supplementary Data 7).
Fig. 8
Fig. 8. GWAS of seed shape parameters.
a, Manhattan plots showing the results of the GWAS. Ratios of length to width and horizontal width to vertical width in 221 soybean lines (Fig. 6) were used as seed shape (roundness) parameters. The −log10(P values) were calculated by the same method described in Fig. 7. The positions of significant −log10(P value) peaks are indicated by square braces and lines together with candidate gene IDs (if present). b,c, Box plots showing the effect of the candidate variant within Glyma.08G168400.v4.JE1 (b) and Glyma.09G048300.v4.JE1 (c), which are soybean homologs of the Arabidopsis genes EPIDERMAL PATTERNING FACTOR 2 (EPF2) and GSO1/SGN3, respectively. The frequencies of Japanese and other world soybeans are indicated separately by different colors. P values calculated by two-tailed Student’s t-test indicate significant differences between groups. ‘Large seed’ indicates eight Japanese landraces with particularly large seeds that were selected by the following criteria: averaged area_size (H) > 80 mm2 and averaged area_size (V) > 60 mm2 (see also Supplementary Data 7). d, Photographs of the seeds of induced soybean mutants. Unlike wild-type Enrei (WT), the EnT-6455 mutant carried the A560T amino acid substitution mutation within Glyma.09G048300.v4.JE1. e, A boxplot showing the frequencies of seed shape in the segregating progenies of the EnT-6455 mutant. Genotypes of each segregant are also shown in the plots; their parent is homozygous ALT (ALT), heterozygote (Htz) or no mutation (REF). The ratio of horizontal length to width was quantified in biologically independent individual seeds with SmartGrain software (n = 30 for wild-type Enrei, n ≥ 15 for ALT and Htz genotypes, n = 10 for REF genotype). f, Cartoon illustrating a possible member of the seed-specific TWS1-GSO1/SGN3 signaling component in soybeans. Four genes are described as possible candidates that may have a role in the regulation of seed shape in soybeans. The 3D structure images of soybean GSO1/SGN3 and TSW1 homologs were obtained using the AlphaFold 3 (ref. ). SERK, SOMATIC EMBRYOGENESIS RECEPTOR-LIKE KINASE. For box plots in b, c and e, the center line denotes the median value; box contains the 25th to 75th percentiles of the dataset; whiskers extend 1.5 times the interquartile range; and dots represent individual phenotype values.

References

    1. Food & Agriculture Organization of the United Nations. FAOSTAT Statistical Database (FAO, 2021).
    1. Ray, D. K., Mueller, N. D., West, P. C. & Foley, J. A. Yield trends are insufficient to double global crop production by 2050. PLoS ONE8, e66428 (2013). - PMC - PubMed
    1. Schmutz, J. et al. Genome sequence of the palaeopolyploid soybean. Nature463, 178–183 (2010). - PubMed
    1. Cannon, S. B. & Shoemaker, R. C. Evolutionary and comparative analyses of the soybean genome. Breed. Sci.61, 437–444 (2012). - PMC - PubMed
    1. Schmutz, J. et al. A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet.46, 707–713 (2014). - PMC - PubMed

LinkOut - more resources