Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Sep 3:8:e46497.
doi: 10.7554/eLife.46497.

Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 subclade

Affiliations

Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 subclade

Tom O Delmont et al. Elife. .

Abstract

Members of the SAR11 order Pelagibacterales dominate the surface oceans. Their extensive diversity challenges emerging operational boundaries defined for microbial 'species' and complicates efforts of population genetics to study their evolution. Here, we employed single-amino acid variants (SAAVs) to investigate ecological and evolutionary forces that maintain the genomic heterogeneity within ubiquitous SAR11 populations we accessed through metagenomic read recruitment using a single isolate genome. Integrating amino acid and protein biochemistry with metagenomics revealed that systematic purifying selection against deleterious variants governs non-synonymous variation among very closely related populations of SAR11. SAAVs partitioned metagenomes into two main groups matching large-scale oceanic current temperatures, and six finer proteotypes that connect distant oceanic regions. These findings suggest that environmentally-mediated selection plays a critical role in the journey of cosmopolitan surface ocean microbial populations, and the idea 'everything is everywhere but the environment selects' has credence even at the finest resolutions.

Keywords: SAR11; biogeography; evolution; genetics; genomics; metagenomics; metapangenomics; single-amino acid variants.

PubMed Disclaimer

Conflict of interest statement

TD, EK, OK, OE, IU, MR, SG, AE No competing interests declared

Figures

Figure 1.
Figure 1.. The SAR11 metapangenome.
Panel A describes the pangenome of 21 SAR11 isolate genomes based on the occurrence of 6175 gene clusters, in conjunction with their phylogeny (clade level) and relative distribution of recruited reads in 103 metagenomes ordered by latitude from the North Pole to the South Pole (top right heat map). The relative distributions were displayed for a minimum value of 0.1% and a maximum value of 1%. The layer named ‘Core 1a.3.V genes’ displays the occurrence of the 799 core 1a.3.V genes (in green) and those found in HIMB83 but not in the 1a.3.V lineage (in purple). Panel B describes the relative distribution of reads the 799 core 1a.3.V genes recruited across surface metagenomes from TARA Oceans.
Figure 1—figure supplement 1.
Figure 1—figure supplement 1.. Distribution and diversity of the core 1a.3.
V genes. Panel A displays the relative distribution of HIMB83 genes across 78 metagenomes, along with their coefficient of variation and the selection of 799 core 1a.3.V genes (blue outer circle). A world map provides the location of 74 metagenomes corresponding to the main ecological niche of 1a.3.V metagenomes (four metagenomes were disregarded due to high coefficients of variation). Panel B shows the number of metagenomes in which genes were consistently present. Genes were considered to be present in a metagenome consistently only if their coverage was within a factor of 5 of the average coverage of all genes for that metagenome. Those that passed the filter criteria in all 74 metagenomes (far right) were defined as the core 1a.3.V genes. Panel C displays the SNV density of core 1a.3.V genes across these 74 metagenomes. SNV density varied between 2.9% and 37.3% across genes and metagenomes. Panel D summarizes the heterogeneity extent of the core 1a.3.V genes within the population main ecological niche. Specifically, the panel displays the density of single nucleotide variants (>1% from consensus), environmentally disconnected nucleotide position (i.e., positions stable in the environment but differing from the reference,<1% from consensus) and single amino acid variants (>10% from consensus) within the core genes of 1a.3.V across 103 metagenomes as a function of the mean coverage of HIMB83.
Figure 2.
Figure 2.. Statistics of recruited reads.
Left panel shows percent identity distributions in each of the 74 metagenomes. Curves are colored based on height. Metagenomes are ordered according to how the percent identity distributions hierarchically cluster based on Euclidean distance (dendrogram). Right panels display a summary of distribution statistics for each percent identity distribution compared against in situ temperature in a linear regression (correlations to all other available parameters are summarized in Figure 2—figure supplement 2). Each point is a metagenome and black lines are lines of best fit. For visual clarity, the data in left panel considers only the median read length and interpolates between data points, whereas the data in right panels consider all read lengths with no interpolation.
Figure 2—figure supplement 1.
Figure 2—figure supplement 1.. Percent identity distributions resulting from the competitive mapping experiment of the metagenomic short reads onto the 21 SAR11 reference genomes.
For each reference genome, metagenomes were only included if they recruited at least 50X coverage. 10 references failed to recruit 50X coverage in any metagenome and were excluded from the plot. Curves were colored according to N, the number of metagenomes passing the 50X threshold, and each curve represents the mean distribution of these metagenomes, where the shaded area reflects the ± standard deviation. For visual clarity, the data only include reads equal to the median read length.
Figure 2—figure supplement 2.
Figure 2—figure supplement 2.. A matrix illustrating the degree of correlation (via linear regression) between oceanic metadata and the statistics (mean, standard deviation, skewness) of the percent read identity distributions of reads recruited by HIMB83 for the 74 metagenomes in which HIMB83 was covered at least 50X.
For example, cells in the temperature column of this matrix quantify the linear correlation coefficients of the scatterplots shown in Figure 2B. Cell colors correspond to their linear correlation coefficient and sizes are proportional to R-squared values. We did not correct for multiple testing due to the potential for strong inter-dependence of the parameters. Supplementary file 1k reports full numerical statistics including p-values.
Figure 3.
Figure 3.. Physico-chemical properties of amino acid variants.
The top panel describes the structure of 20 amino acids grouped by their main chemical properties. Panel A describes the solvent accessibility of amino acids, their relative distribution in both the core 1a.3.V genes and SAAVs, and their percentage increase in SAAVs as compared to the core 1a.3.V genes. The solvent accessibility of amino acids derives from the analysis of 55 proteins (Bordo and Argos, 1991). Panel B describes the relative abundance of the top 25 most prevalent amino acid substitution types (AASTs) across 74 metagenomes (boxplots), along with the classes their amino acids belong to and the correlation coefficient between AAST prevalence and in situ temperature calculated via linear regression (see Figure 3—figure supplement 2 for p-values). The area shaded in light gray shows bounds for the expected frequency distribution given strictly neutral processes. The upper bound is Model one and the lower bound is Model 2 (see Materials and methods). The four insets example the relationship between AAST prevalence and in situ temperature for the AASTs 'aspartic/glutamic acid', 'isoleucine/threonine', 'alanine/serine', and 'leucine/phenylalanine' (Figure 3—figure supplement 2 illustrate similar plots for all 25 of the most prevalent AASTs). The 25 AASTs included in the analysis cover 87.1% of all SAAVs. Panel C displays SAAVs on the predicted protein structures of four core 1a.3.V genes across six metagenomes from distant locations.
Figure 3—figure supplement 1.
Figure 3—figure supplement 1.. Panel A shows a direct comparison between the amino acid composition in all positions compared to the amino acid composition within SAAVs.
The amino acid composition of all positions (y-axis) was quantified by calculating the frequency that amino acids appeared within the core 1a.3.V genes of the reference sequence, HIMB83, whereas amino acid composition in SAAVs (x-axis) was quantified by calculating the frequency that an amino acid was one of the two dominant alleles across the 738,324 SAAVs. The black diagonal line signifies a one-to-one correspondence between these two variables, and black vertical lines illustrate the deviation of amino acids from this null expectation. Vertical lines are labeled with percent differences between frequencies of amino acids in SAAVs relative to all positions. The linear correlation coefficient of panel A is 0.81, with an R-squared of 0.65, and a probability that no correlation exists of 9.8×10-6. Panel B shows a comparison between the average solvent accessibility of amino acids (x-axis) to the percent difference between frequencies of amino acids in SAAVs relative to all positions (y-axis). Average solvent accessibilities for amino acids were taken from Table 2 of Bordo and Argos (1991). The linear correlation coefficient of panel B excluding isoleucine and valine (in red) is 0.64, with an R-squared of 0.41, and a probability that no correlation exists of 0.004. Including isoleucine and valine, these values are 0.40, 0.16, and 0.08, respectively. The blue line shows the line of best fit excluding isoleucine and valine, with shaded in regions representing 95% confidence intervals.
Figure 3—figure supplement 2.
Figure 3—figure supplement 2.. The top 25 most abundant amino acid substitution types (AASTs) and their relationship with in situ temperature.
Each dot represents a metagenome, the x-axis is in situ temperature, and the y-axis is the percentage of SAAVs that were a given AAST. The red line is the line of best fit and the shaded-in region illustrates the 95% confidence interval. The linear correlation coefficient is given as r, and the probability of no relationship with temperature is given as p (uncorrected for multiple testing). Corrected p-values can be obtained by dividing p by 25, the number of linear regressions performed (Bonferonni multiple testing).
Figure 3—figure supplement 3.
Figure 3—figure supplement 3.. Allele frequency trajectories and in situ temperature.
Panel A illustrates allele frequency trajectories across 74 metagenomes with respect to temperature for 16 manually chosen SAAV positions in the context of protein structures predicted from the core 1a.3.V genes. Only the two most abundant amino acids were considered for each SAAV. The linear correlation coefficient is denoted as r and the probability that no correlation exists is denoted as p(Benjamini–Hochberg corrected p-values <0.05 (i.e., false discovery rate of 5%)). r is defined such that a positive value refers to the first amino acid (in dark red) positively correlating with temperature. Panel B shows the 4,592 positions within the core 1a.3.V genes that had temperature-correlated allele frequency trajectories (Benjamini–Hochberg corrected p-value <0.05), and the number of times these positions corresponded to the top 25 most prevalent AASTs. On the x-axis are the 25 most prevalent AASTs, where red and blue bars indicate the number of times the first and second amino acid, respectively, had allele frequency trajectories that positively correlated with temperature. The insets illustrate two example allele frequency trajectories for the AAST 'alanine/serine'.
Figure 3—figure supplement 4.
Figure 3—figure supplement 4.. Analysis of how temperature-correlated variant positions distribute within Gene 1727, a glycine betaine ATP-binding cassette permease subunit identified for its rare proportion of temperature-correlated variant positions.
Panel A illustrates the membrane topology predicted by Phobius (See Materials and methods), which associates to each position a probability it is periplasmic, cytosolic, transmembrane, or within the signaling peptide (colored lines). We categorized each position according to the maximum probability (shaded regions). Temperature-correlated and uncorrelated variant positions are denoted by solid red and dashed red vertical lines, respectively. Panel B illustrates the frequency that variant positions are found in the membrane, cytosol, and periplasm, based on whether or not they correlate with temperature. The y-axis is the fraction of variant positions observed (excluding positions observed in the signaling peptide), and numbers within each bar denote the number of variant positions observed within each class. The probability that positions are distributed independent of temperature correlation was 0.034. Formally, this is the probability that temperature-correlated positions were distributed according to a trinomial distribution with a probability vector equal to the empirical distribution observed in temperature-uncorrelated positions.
Figure 4.
Figure 4.. Biogeography of SAR11 subclade 1a.3.V based on single amino acid variants.
Panel A describes the organization of 74 metagenomes based on 57,277 codon position-specific AASTs affiliated with 37,415 unique codon positions and summarizes the number of detected SAAVs and percent identity of reads HIMB83 recruited for each metagenome. The world map in panel B displays the geographic partitioning of the two main metagenomic groups and six proteotypes. Panel B also describes the relative abundance of 1a.3.V and the number of invariant proteins across the six proteotypes.
Figure 4—figure supplement 1.
Figure 4—figure supplement 1.. A comparison of the geographic partitioning of the 1a.3.
V groups and proteotypes between metagenomes sampled from the surface versus those sampled from the deep maximal chlorophyll layer. Panels A and B contain the same information as shown in Figure 4, and panel C shows the locations of deep maximal chlorophyll layer metagenomes and the proteotypes to which they belong.
Figure 4—figure supplement 2.
Figure 4—figure supplement 2.. K-means clustering results (250 iterations) of the Deep Learning distance metric of 74 metagenomes based on the coordinates and identity of 738,324 SAAVs.
The elbow of this curve is k = 6, which informed our decision to define six proteotypes.
Figure 4—figure supplement 3.
Figure 4—figure supplement 3.. A comparison of dendrograms that organize metagenomes based on the genomic variability observed in the core 1a.3.
V genes. Above, the metagenomes are organized by applying a novel graph-based activity regularization technique for competitive learning from hyper-dimensional data to the 738,324 SAAVs (see Materials and methods). Below, the metagenomes are organized by a less comprehensive approach in which the percent identity distributions are hierarchically clustered via Euclidean distance. The first method therefore is sensitive to the identity and positions of amino acid variability, whereas the second method is based solely on a summary statistic that quantifies the degree of sequence divergence at the DNA level and is agnostic to position and identity. Each metagenome is connected to itself through a straight line and is colored according to the proteotype to which it belongs.
Figure 4—figure supplement 4.
Figure 4—figure supplement 4.. Biogeography of SAR11 subclade 1a.3.
V based on single amino acid variants using Deep Learning (left) and Fixation Index (right) (see Material and methods). The world maps display the geographic partitioning of six proteotypes based on the two methodologies. Finally, ANOVA tests determine whether there are statistically significant differences between the means of either in situ temperature or SAAV density across (1) the two main groups and (2) the six proteotypes as inferred from the two methodologies.
Figure 4—figure supplement 5.
Figure 4—figure supplement 5.. Geographic partitioning of SAR11 by matching surface metagenomes analyzed in our study to simulated results determined using a neutral-agent based model (Hellweger et al., 2014).
The figure emphasizes biogeographic differences between this simulation focused on neutral evolution and our large-scale empirical analysis.
Author response image 1.
Author response image 1.
Author response image 2.
Author response image 2.
Author response image 3.
Author response image 3.

References

    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. - DOI - PubMed
    1. Anantharaman K, Brown CT, Hug LA, Sharon I, Castelle CJ, Probst AJ, Thomas BC, Singh A, Wilkins MJ, Karaoz U, Brodie EL, Williams KH, Hubbard SS, Banfield JF. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nature Communications. 2016;7:13219. doi: 10.1038/ncomms13219. - DOI - PMC - PubMed
    1. Anderson RE, Reveillaud J, Reddington E, Delmont TO, Eren AM, McDermott JM, Seewald JS, Huber JA. Genomic variation in microbial populations inhabiting the marine subseafloor at deep-sea hydrothermal vents. Nature Communications. 2017;8:1114. doi: 10.1038/s41467-017-01228-6. - DOI - PMC - PubMed
    1. Baas-Becking LGM. Geobiologie of Inleiding Tot De Milieukunde. Den Haag: W.P. Van Stockum & Zoon; 1934.
    1. Bendall ML, Stevens SL, Chan LK, Malfatti S, Schwientek P, Tremblay J, Schackwitz W, Martin J, Pati A, Bushnell B, Froula J, Kang D, Tringe SG, Bertilsson S, Moran MA, Shade A, Newton RJ, McMahon KD, Malmstrom RR. Genome-wide selective sweeps and gene-specific sweeps in natural bacterial populations. The ISME Journal. 2016;10:1589–1601. doi: 10.1038/ismej.2015.241. - DOI - PMC - PubMed

Publication types

Substances