Syst Biol. 2021 Dec 16;71(1):190-207.
doi: 10.1093/sysbio/syab032.

Analysis of Paralogs in Target Enrichment Data Pinpoints Multiple Ancient Polyploidy Events in Alchemilla s.l. (Rosaceae)

Diego F Morales-Briones et al. Syst Biol.

Abstract

Target enrichment is becoming increasingly popular for phylogenomic studies. Although baits for enrichment are typically designed to target single-copy genes, paralogs are often recovered with increased sequencing depth, sometimes from a significant proportion of loci, especially in groups that have experienced whole-genome duplication (WGD) events. Common approaches for processing paralogs in target enrichment data sets include random selection, manual pruning, and, most commonly, the removal of entire genes that show any evidence of paralogy. These approaches are prone to errors in orthology inference or to removing large numbers of genes. Removing entire genes also discards valuable information that could be used to detect and place WGD events. Here, we used an automated approach for orthology inference in a target enrichment data set of 68 species of Alchemilla s.l. (Rosaceae), a widely distributed clade of plants primarily from temperate climate regions. Previous molecular phylogenetic studies and chromosome numbers both suggested ancient WGDs in the group. However, both the phylogenetic location and the putative parental lineages of these WGD events remain unknown. By taking paralogs into consideration and inferring orthologs from target enrichment data, we identified four nodes in the backbone of Alchemilla s.l. with an elevated proportion of gene duplication. Furthermore, using a gene-tree reconciliation approach, we established the autopolyploid origin of the entire Alchemilla s.l. and the nested allopolyploid origin of four major clades within the group. We showed the utility of automated tree-based orthology inference methods, previously designed for genomic or transcriptomic data sets, to study complex scenarios of polyploidy and reticulate evolution from target enrichment data sets. [Alchemilla; allopolyploidy; autopolyploidy; gene tree discordance; orthology inference; paralogs; Rosaceae; target enrichment; whole genome duplication.]

Figures

Figure 1.
Homolog and ortholog inference workflow used in this study. a) Flow chart of paralog processing and homolog tree inference. b) Only homologs in which the outgroup was present and monophyletic were used for orthology inference. Monophyletic outgroups (MO) prunes single-copy genes, keeping clades with at least a user-defined minimum number of ingroup taxa. Rooted ingroups (RT) keeps all subtrees with at least a user-defined minimum number of ingroup taxa. If a homolog tree can be pruned using both MO and RT, then RT orthologs are added to the same root. Homologs that lack monophyletic outgroups were excluded from further consideration.
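The outgroup filter described in panel b) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes a homolog tree is given as nested tuples of tip labels, that each label has the hypothetical form `taxon@copy` so that paralogous copies of one taxon stay distinct, and that the tree is treated as unrooted, so the outgroup counts as monophyletic when a single branch separates all outgroup tips from all ingroup tips. The function names are illustrative only.

```python
def tips(tree):
    """Return the set of tip labels under a node (a str tip or a tuple of children)."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= tips(child)
    return labels


def clade_tip_sets(tree):
    """Yield the tip set of every node in the tree, the root included."""
    yield frozenset(tips(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clade_tip_sets(child)


def outgroup_is_monophyletic(tree, outgroup_taxa):
    """True if outgroup tips are present and split from the ingroup by one branch.

    Assumption: tip labels look like 'taxon@copy'; the taxon is the part before '@'.
    """
    all_tips = tips(tree)
    out_tips = {t for t in all_tips if t.split("@")[0] in outgroup_taxa}
    in_tips = all_tips - out_tips
    if not out_tips or not in_tips:
        return False
    # Unrooted view: either side of the separating branch may appear as a clade.
    return any(c == out_tips or c == in_tips for c in clade_tip_sets(tree))


homolog = ((("A@1", "A@2"), "B@1"), ("Out1@1", "Out2@1"))
keep = outgroup_is_monophyletic(homolog, {"Out1", "Out2"})
```

Under these assumptions, a homolog for which the function returns `False` would be set aside, mirroring the exclusion of homologs that lack monophyletic outgroups.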
Figure 2.
a) Maximum likelihood phylogeny of Alchemilla s.l. inferred from RAxML analysis of the concatenated 910-nuclear exon supermatrix from the “monophyletic outgroup” (MO) orthologs. Bootstrap support (BS) and local posterior probability (LPP) are shown above branches. Nodes with full support (BS = 100/LPP = 1) are noted with an asterisk (*). Em dashes (—) denote an alternative topology compared to the ASTRAL tree (not shown). Quartet Sampling (QS) scores for major clades are shown below branches. QS scores in blue indicate strong support and red scores indicate weak support. QS scores: Quartet concordance/Quartet differential/Quartet informativeness. A QS score of 1/—/1 denotes maximum support. Pie charts for major clades represent the proportion of exon ortholog trees that support that clade (blue), the proportion that support the main alternative bifurcation (green), the proportion that support the remaining alternatives (red), and the proportion (conflict or support) that have < 50% bootstrap support (gray). Gene trees with missing data that were uninformative for the node were ignored. Branch lengths are in number of substitutions per site (scale bar on the bottom). Inset: b) Summary maximum likelihood phylogeny inferred from RAxML analysis of the concatenated 1,894-nuclear exon supermatrix from the “rooted ingroup” (RT) orthologs. BS and LPP are shown above branches and QS scores below the branches. Branch lengths are in number of substitutions per site. See Supplementary Fig. S1 available on Dryad for the expanded tree. c) Summary ASTRAL-Pro tree inferred from 923 multilabeled exon homolog trees. LPP are shown next to nodes. Branch lengths are in coalescent units. See Supplementary Fig. S2 available on Dryad for the expanded tree. d) Summary maximum likelihood phylogeny inferred from RAxML analysis of concatenated partial plastomes. BS and LPP are shown above branches and QS scores below the branches. Branch lengths are in number of substitutions per site. See Supplementary Fig. S4 available on Dryad for the expanded tree.
Figure 3.
Orthogroup gene duplication mapping results. a) Summarized cladogram of Alchemilla s.l. from the ASTRAL analysis of “monophyletic outgroup” (MO) ortholog trees. Percentages next to nodes denote the proportion of duplicated genes when using orthogroups from the longest homologs (250 after orthogroup inference and filtering). Nodes with elevated proportions of gene duplications are numbered 1–4 as referenced in the main text. See Supplementary Fig. S5 available on Dryad for the full tree. b) Histogram of percentages of gene duplication per branch. c) Number of paralogs per taxon in the final homolog trees. In the final homologs, clades and paraphyletic grades of the same species were pruned, leaving only one tip per species. Each locus is represented by the longest homolog (the single longest aligned exon per gene; 256 total).
Figure 4.
Summary of the optimal multilabeled trees (MUL-trees) inferred from GRAMPA analyses. a) MUL-tree from reconciliations of homologs against the ASTRAL tree inferred from “monophyletic outgroup” (MO) orthologs including all taxa. Red branches denote the allopolyploid origin of the “lobed” clade of Eualchemilla. b) MUL-tree after removing the “lobed” clade of Eualchemilla as in a). Green branches denote the allopolyploid origin of Afromilla. c) MUL-tree after removing Afromilla as in b). Blue branches denote the allopolyploid origin of the “dissected” clade of Eualchemilla. d) MUL-tree after further removing the “dissected” clade as in c). Yellow branches denote the allopolyploid origin of Lachemilla. e) MUL-tree from reconciliations of constrained homologs on the MRCA of Alchemilla s.l. against the cpDNA tree. Orange branches denote the autopolyploid origin of Alchemilla s.l. f) Putative summary network of all reticulation events in Alchemilla s.l. Colored curved branches denote different polyploid events as in (a–e). Dashed curved lines represent the chloroplast donor (cpDNA) in allopolyploid events.

