PeerJ. 2022 Dec 6:10:e14525.

doi: 10.7717/peerj.14525. eCollection 2022.

A pipeline for assembling low copy nuclear markers from plant genome skimming data for phylogenetic use

Marcelo Reginato¹

Affiliation

¹ Departamento de Botânica, Instituto de Biociências, Universidade Federal do Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil.

PMID: 36523475
PMCID: PMC9745922
DOI: 10.7717/peerj.14525

Abstract

Background: Genome skimming is a popular method in plant phylogenomics that does not include a biased enrichment step, relying instead on random shallow sequencing of total genomic DNA. From these data the plastome is usually readily assembled, and it constitutes the bulk of the phylogenetic information generated in these studies. Despite a few attempts to use genome skims to recover low copy nuclear loci for direct phylogenetic use, such endeavors remain neglected. Causes might include the trade-off between libraries with few reads and species with large genomes (i.e., missing data caused by low coverage), but might also relate to the lack of pipelines for data assembly.

Methods: A pipeline and its companion R package, designed to automate the recovery of low copy nuclear markers from genome skimming libraries, are presented. Additionally, a series of analyses evaluating the impact of key assembly parameters, reference selection and missing data is presented.

Results: A substantial number of putative low copy nuclear loci were assembled and proved useful for phylogenetic inference across the libraries tested (4 to 11 times more data than the plastomes previously assembled from the same libraries).

Discussion: Critical aspects of assembling low copy nuclear markers from genome skims include the minimum coverage and depth of a sequence to be used. More stringent values of these parameters reduce the amount of assembled data and increase the relative amount of missing data, which can compromise phylogenetic inference; in turn, relaxing the same parameters might increase sequence error. These issues are discussed in the text, and parameter tuning through multiple comparisons that track the effects on support and congruence is highly recommended when using this pipeline. The skimmingLoci pipeline (https://github.com/mreginato/skimmingLoci) might stimulate the use of genome skims to recover nuclear loci for direct phylogenetic use, increasing the power of genome skimming data to resolve phylogenetic relationships, while reducing the amount of sequenced DNA that is commonly wasted.

Keywords: Genome skimming; High-throughput sequencing; Low copy; Mapping reads; Phylogenetics; Pipeline; R package; Systematics.
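The trade-off between the minimum depth and minimum coverage thresholds described in the Discussion can be illustrated with a minimal sketch. It is not the skimmingLoci implementation: the function names, the toy data, and the definition of coverage as the fraction of positions with a base call are all assumptions made for illustration only.

```python
from typing import Dict, List

MISSING = "N"  # symbol used here for a masked (missing) base call


def mask_low_depth(consensus: str, depths: List[int], min_depth: int) -> str:
    """Replace base calls supported by fewer than min_depth reads with N."""
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_depth else MISSING
        for base, depth in zip(consensus, depths)
    )


def coverage(sequence: str) -> float:
    """Fraction of positions with a base call (assumed definition of coverage)."""
    if not sequence:
        return 0.0
    called = sum(1 for base in sequence if base not in (MISSING, "-"))
    return called / len(sequence)


def filter_locus(
    sequences: Dict[str, str],
    depths: Dict[str, List[int]],
    min_depth: int,
    min_coverage: float,
) -> Dict[str, str]:
    """Mask low-depth bases, then drop sequences below the coverage threshold."""
    kept = {}
    for sample, seq in sequences.items():
        masked = mask_low_depth(seq, depths[sample], min_depth)
        if coverage(masked) >= min_coverage:
            kept[sample] = masked
    return kept


# Hypothetical toy locus: one well-sequenced sample and one shallow sample.
locus = {"sp_A": "ACGTACGTAC", "sp_B": "ACGTACGTAC"}
depth = {
    "sp_A": [5, 4, 3, 3, 2, 2, 4, 5, 6, 3],
    "sp_B": [1, 1, 0, 2, 1, 0, 0, 1, 3, 1],
}

relaxed = filter_locus(locus, depth, min_depth=2, min_coverage=0.1)
strict = filter_locus(locus, depth, min_depth=3, min_coverage=0.5)
print(relaxed)
print(strict)
```

With the relaxed thresholds both samples are retained, although the shallow one is mostly missing data; with the stricter thresholds the shallow sample is dropped from the locus entirely. This mirrors, on a toy scale, the tension between missing data and potential sequence error noted above.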

Conflict of interest statement

The author declares he has no competing interests.

Figures

**Figure 1. Flowchart illustrating key steps and software used in the skimmingLoci pipeline, as well as in downstream and upstream major steps.**

**Figure 2. Comparison of key parameters, including the minimum depth to keep a base call in the consensus sequence (-d parameter) and the minimum coverage of a sequence to be included in the final locus alignment (-C parameter).**
(A) Minimum coverage (-C) vs. aligned base pairs (bp), variable sites (Variable), parsimony informative sites (PIS), and missing data (Missing). (B) Minimum coverage (-C) vs. number of loci (Loci n), mean bootstrap support (Bootstrap mean) and percent of missing data (Missing data %). (C) Minimum depth (-d) vs. aligned base pairs (bp), variable sites (Variable), parsimony informative sites (PIS), and missing data (Missing). (D) Minimum depth (-d) vs. number of loci (Loci n), mean bootstrap support (Bootstrap mean) and percent of missing data (Missing data %).

**Figure 3. Treespace and comparative descriptors of outlier loci and the remaining ones.**
(A) Treespace analysis indicating the putative outlier loci identified (63 out of 683 loci were flagged as outliers). (B–G) Comparison of descriptor distributions between outlier loci and the remaining loci (violin plots). (B) Total base pairs. (C) Median depth. (D) Mean coverage. (E) Missing data. (F) Mean bootstrap. (G) Distance (RF) to the concatenated tree. (H–K) Biplots of selected descriptors vs. mean bootstrap support. (H) Total base pairs. (I) Missing data percent. (J) Coverage standard deviation. (K) Median depth. In all plots outliers are shown in gray and the remaining loci in black. The asterisk (*) indicates a significant difference between groups.

**Figure 4. The species tree inferred with Astral (A) and the maximum likelihood tree of the concatenated alignment (B).**
Both trees are from the “Full” assembly (-d 2, -C 0.1). Support values are depicted following the legend (A, Gene bootstrap; B, Bootstrap). Terminals with distinct phylogenetic positions are shown in boldface.

**Figure 5. The maximum likelihood tree of the target enrichment data set (Myrtales, Angiosperms353 probe set) including the published terminals along with the skimmingLoci assemblies (in blue).**
The total number of loci and median coverage for each terminal are plotted on the right side. Bootstrap support is depicted at the nodes following the legend.
