Syst Biol. 2019 Jul 1;68(4):594-606.

doi: 10.1093/sysbio/syy086.

A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids Clustering

Matthew G Johnson^{1,2}, Lisa Pokorny³, Steven Dodsworth^{3,4}, Laura R Botigué^{3,5}, Robyn S Cowan³, Alison Devault⁶, Wolf L Eiserhardt^{3,7}, Niroshini Epitawalage³, Félix Forest³, Jan T Kim³, James H Leebens-Mack⁸, Ilia J Leitch³, Olivier Maurin³, Douglas E Soltis^{9,10}, Pamela S Soltis^{9,10}, Gane Ka-Shu Wong^{11,12,13}, William J Baker³, Norman J Wickett^{2,14}

Affiliations

¹ Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA.
² Plant Science and Conservation, Chicago Botanic Garden, 1000 Lake Cook Road, Glencoe, IL 60022, USA.
³ Department of Comparative Plant and Fungal Biology, Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK.
⁴ School of Life Sciences, University of Bedfordshire, University Square, Luton LU1 3JU, UK.
⁵ Centre for Research in Agricultural Genomics, Campus UAB, Edifici CRAG, Bellaterra Cerdanyola del Vallès, 08193 Barcelona, Spain.
⁶ Arbor Biosciences, 5840 Interface Dr, Suite 101, Ann Arbor, MI 48103, USA.
⁷ Department of Bioscience, Aarhus University, 8000 Aarhus C, Denmark.
⁸ Department of Plant Biology, University of Georgia, 2502 Miller Plant Sciences, Athens, GA 30602, USA.
⁹ Department of Biology, University of Florida, 220 Bartram Hall, Gainesville, FL 32611-8525, USA.
¹⁰ Florida Museum of Natural History, University of Florida, 3215 Hull Road, Gainesville, FL 32611-2710, USA.
¹¹ BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China.
¹² Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada.
¹³ Department of Medicine, University of Alberta, Edmonton, AB T6G 2E1, Canada.
¹⁴ Program in Plant Biology and Conservation, Northwestern University, 2205 Tech Drive, Evanston, IL 60208, USA.

PMID: 30535394
PMCID: PMC6568016
DOI: 10.1093/sysbio/syy086

Matthew G Johnson et al. Syst Biol. 2019.

Abstract

Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes, while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, 5-15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order groups of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set. 
Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order groups, including the entire angiosperm clade itself.
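The representative-selection idea in the abstract can be illustrated with a minimal sketch. It assumes an uncorrected p-distance between aligned sequences and a simple alternating k-medoids loop; the abstract does not specify either, and the function names `p_distance` and `k_medoids` are hypothetical, so this is not the authors' implementation.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ("-") are ignored
    (an assumption made for this sketch).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def k_medoids(dist, k, max_iter=100):
    """Pick k medoid indices from a square distance matrix (list of lists).

    Assumes k <= len(dist) and that the sequences are distinct.
    """
    n = len(dist)
    # Deterministic start: the most central point, then farthest-point picks.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(
            max(candidates, key=lambda i: min(dist[i][m] for m in medoids))
        )
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: the new medoid minimises total distance in its cluster.
        updated = [
            min(members or [m], key=lambda c: sum(dist[c][j] for j in members))
            for m, members in clusters.items()
        ]
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids)


# Toy alignment with two obvious groups (made-up data).
seqs = [
    "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT",
    "CCCCCCCCCC", "CCCCCCCCCG", "CCCCCCCCGG",
]
dist = [[p_distance(a, b) for b in seqs] for a in seqs]
representatives = k_medoids(dist, k=2)
```

Unlike k-means centroids, the medoids returned here are always real input sequences, which is what makes them usable as probe templates.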

Keywords: Angiosperms; Hyb-Seq; k-means clustering; k-medoids clustering; machine learning; nuclear genes; phylogenomics; sequence capture; target enrichment.

Figures

**Figure 1.**
Overview of probe design and phylogenetic considerations. Given a hypothetical gene *ABCD1*, the goal of probe design is to include a sufficient diversity of 120-mers (probes) such that any angiosperm *ABCD1* sequence can be recovered by hybridization. If a number of *ABCD1* sequences are known, represented by solid branches and tips with a small gray square in the phylogeny, the minimum number of representatives of those sequences should be selected that maximize the chances of recovering *ABCD1* from any “unknown” sample (dotted lines in the phylogeny). If sequences X, Y, and Z are selected, 120-mer probes are designed, here with 2× tiling, across the entire length of the sequence. The final probe set includes all unique 120-mers; asterisks represent cases in which individual 120-mers are identical from two (*) or all three (**) of the representative sequences X, Y, and Z. In these cases only one or two 120-mers, rather than three, would be necessary for that region of the gene in the final probe set. While this is possible in probe design, we did not encounter any such cases in the Angiosperms353 probe set. For a particular “unknown” sample, here represented by the dark gray dotted line and denoted as sample Q, a sequencing library consisting of size-selected inserts, adapters, and indexes, is hybridized to the final probe set and the resulting sequence reads can be reconstructed to extract both the coding region and flanking non-coding (“splash zone”) regions. In this simplified example, the final probe set represents only gene *ABCD1* but the Angiosperms353 final probe set includes probes tiled across 353 genes.
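The tiling and deduplication step described in this caption can be sketched as follows. The sketch assumes that 2× tiling means a new probe starts every half probe length (60 bp for 120-mers) and that a final probe is placed flush with the end of the sequence; both are assumptions, and `tile_probes` and `build_probe_set` are hypothetical names.

```python
def tile_probes(seq, probe_len=120, tiling=2):
    """Cut one representative sequence into overlapping probes.

    With tiling=2 the stride is probe_len // 2, so most bases
    are covered by two probes (assumed meaning of "2x tiling").
    """
    if len(seq) < probe_len:
        return []
    step = probe_len // tiling
    last_start = len(seq) - probe_len
    starts = list(range(0, last_start + 1, step))
    # Assumption: add one end-anchored probe so the tail is not left uncovered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [seq[s:s + probe_len] for s in starts]


def build_probe_set(representatives, probe_len=120, tiling=2):
    """Pool probes from all representatives, keeping each unique k-mer once."""
    unique = {}
    for name, seq in representatives.items():
        for probe in tile_probes(seq, probe_len, tiling):
            unique.setdefault(probe, name)  # remember the first source only
    return list(unique)


# Made-up example: two identical 240 bp representatives and one distinct one.
reps = {"X": "A" * 240, "Y": "A" * 240, "Z": "C" * 240}
probes = build_probe_set(reps)
```

Here X and Y contribute the same 120-mer, so it is stored once, mirroring the asterisk cases in the figure.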

**Figure 2.**
Comparison of the k-medoids method of selecting representative sequences with the use of the closest available angiosperm genome. a) Each point is one gene, and its position indicates the percentage of angiosperm transcripts (from OneKP) that fall within 30% sequence divergence of a representative sequence. Only genes where the k-medoids could represent 95% or more of angiosperms were selected for probe design. Note the x-axis and y-axis ranges are not identical. The dot-dashed line indicates gene 5348, which is highlighted in the other panels. b) Distribution of distances between each angiosperm sequence in 1KP and the nearest k-medoid for gene 5348. c) Distribution of distances between each angiosperm sequence in 1KP and the nearest published genome sequence for gene 5348.
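The per-gene percentage plotted in panel a can be computed as in the sketch below. It assumes divergence is expressed as a proportion (0.30 for 30%) and that "within" includes the boundary; the names `percent_represented` and `passes_filter` are hypothetical.

```python
def percent_represented(dist_to_reps, threshold=0.30):
    """Percentage of transcripts within `threshold` of their nearest representative.

    dist_to_reps: one row per transcript, one distance per representative.
    """
    nearest = [min(row) for row in dist_to_reps]
    covered = sum(d <= threshold for d in nearest)
    return 100.0 * covered / len(nearest)


def passes_filter(dist_to_reps, threshold=0.30, min_percent=95.0):
    """True if the representatives cover at least min_percent of transcripts."""
    return percent_represented(dist_to_reps, threshold) >= min_percent


# Made-up distances from four transcripts to two representatives.
toy = [
    [0.10, 0.50],
    [0.40, 0.35],
    [0.29, 0.90],
    [0.60, 0.20],
]
pct = percent_represented(toy)
```

The same function works for panel c if the columns hold distances to published genome sequences instead of k-medoids.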

**Figure 3.**
Heatmap of gene recovery efficiency. Each row is one sample, and each column is one gene. Shading indicates the percentage of the target length (calculated by the mean length of all k-medoid transcripts for each gene) recovered. Numbers indicate the Input Category (see main text).
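The shading value for one cell of the heatmap can be sketched as below, assuming the recovered length and the k-medoid transcript lengths are both given in base pairs; `percent_target_recovered` is a hypothetical name.

```python
def percent_target_recovered(recovered_len, medoid_lengths):
    """Recovered length as a percentage of the mean k-medoid transcript length."""
    target_len = sum(medoid_lengths) / len(medoid_lengths)
    return 100.0 * recovered_len / target_len


# Made-up example: 600 bp recovered for a gene whose medoids average 800 bp.
cell = percent_target_recovered(600, [700, 800, 900])
```

Because the denominator is a mean, a sample whose gene copy is longer than average can exceed 100%.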

**Figure 4.**
Relationship between reads mapping to the target genes and the number of loci recovered for 42 angiosperm species. There is a general linear increase in the number of genes recovered below 100,000 mapped reads, above which there are diminishing returns for additional sequencing.

**Figure 5.**
Total length of sequence recovery for both coding and non-coding regions across 353 loci for 42 angiosperm species. Reads were mapped back to either coding sequence (left bar) or coding sequence plus flanking non-coding (i.e., intron) sequence (right bar). Only positions with at least 8× depth were counted. The total length of coding sequence targeted was 260,802 bp. The median recovery of coding sequence was 137,046 bp and the median amount of non-coding sequence recovered was 216,816 bp (with at least 8× depth of coverage).
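Counting positions at or above the 8× depth threshold can be sketched as follows. The input is assumed to be a three-column, tab-separated per-base depth table (reference name, position, depth), the layout produced by tools such as `samtools depth`; the caption does not say which tool was used, and `recovered_length` is a hypothetical name.

```python
def recovered_length(depth_lines, min_depth=8):
    """Count positions whose read depth is at least min_depth."""
    count = 0
    for line in depth_lines:
        if not line.strip():
            continue  # skip blank lines
        _ref, _pos, depth = line.rstrip("\n").split("\t")
        if int(depth) >= min_depth:
            count += 1
    return count


# Made-up depth table for two genes.
toy_depths = [
    "gene1\t1\t0",
    "gene1\t2\t8",
    "gene1\t3\t12",
    "gene2\t1\t7",
]
n_positions = recovered_length(toy_depths)
```

Running this once on reads mapped to coding sequence and once on coding plus flanking sequence gives the two bars per species.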
