Syst Biol. 2022 Oct 12;71(6):1404-1422.
doi: 10.1093/sysbio/syac033.

Towards Large-Scale Integrative Taxonomy (LIT): Resolving the Data Conundrum for Dark Taxa

Emily Hartop et al. Syst Biol. 2022.

Abstract

New, rapid, accurate, scalable, and cost-effective species discovery and delimitation methods are needed for tackling "dark taxa," here defined as groups for which <10% of all species are described and the estimated diversity exceeds 1,000 species. Species delimitation for these taxa should be based on multiple data sources ("integrative taxonomy") but collecting multiple types of data risks impeding a discovery process that is already too slow. We here develop large-scale integrative taxonomy (LIT), an explicit method where preliminary species hypotheses are generated based on inexpensive data that can be obtained quickly and cost-effectively. These hypotheses are then evaluated based on a more expensive type of "validation data" that is only obtained for specimens selected based on objective criteria applied to the preliminary species hypotheses. We here use this approach to sort 18,000 scuttle flies (Diptera: Phoridae) into 315 preliminary species hypotheses based on next-generation sequencing barcode (313 bp) clusters (using objective clustering [OC] with a 3% threshold). These clusters are then evaluated with morphology as the validation data. We develop quantitative indicators for predicting which barcode clusters are likely to be incongruent with morphospecies by randomly selecting 100 clusters for in-depth validation with morphology. A linear model demonstrates that the best predictors for incongruence between barcode clusters and morphology are maximum p-distance within the cluster and a newly proposed index that measures cluster stability across different clustering thresholds. A test of these indicators using the 215 remaining clusters reveals that these predictors correctly identify all clusters that are incongruent with morphology. In our study, all morphospecies are true or disjoint subsets of the initial barcode clusters so that all incongruence can be eliminated by varying clustering thresholds. This leads to a discussion of when a third data source is needed to resolve incongruent grouping statements. The morphological validation step in our study involved 1,039 specimens (5.8% of the total). The formal LIT protocol we propose would only have required the study of 915 (5.1%: 2.5 specimens per species), as we show that clusters without signatures of incongruence can be validated by only studying two specimens representing the most divergent haplotypes. To test the generality of our results across different barcode clustering techniques, we establish that the levels of incongruence are similar across OC, Automatic Barcode Gap Discovery (ABGD), Poisson Tree Processes (PTP), and Refined Single Linkage (RESL) (used by Barcode of Life Data System to assign Barcode Index Numbers [BINs]). OC and ABGD achieved a maximum congruence score with the morphology of 89% while PTP was slightly less effective (84%). RESL could only be tested for a subset of the specimens because the algorithm is not public. BINs based on 277 of the original 1,714 haplotypes were 86% congruent with morphology while the values were 89% for OC, 74% for PTP, and 72% for ABGD. [Biodiversity discovery; dark taxa; DNA barcodes; integrative taxonomy.]
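The barcode-side steps described in the abstract (threshold clustering, maximum within-cluster p-distance, and picking the two most divergent haplotypes for morphological checking) can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes objective clustering behaves like single-linkage clustering on uncorrected p-distances over pre-aligned sequences, that a pair is linked when its distance is at or below the threshold, and that gaps and N are skipped. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance between two aligned sequences.

    Assumption: positions with a gap ('-') or 'N' in either sequence are skipped.
    """
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0


def objective_clusters(seqs: dict, threshold: float) -> list:
    """Single-linkage clusters: two haplotypes are linked if distance <= threshold.

    Assumption: this approximates objective clustering; the real tool may differ
    in tie handling and in how ambiguous bases are treated.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def most_divergent_pair(cluster, seqs: dict):
    """Return (max p-distance, pair of names) within one cluster.

    For a single-member cluster the result is (0.0, None).
    """
    best_d, best_pair = 0.0, None
    for a, b in combinations(sorted(cluster), 2):
        d = p_distance(seqs[a], seqs[b])
        if best_pair is None or d > best_d:
            best_d, best_pair = d, (a, b)
    return best_d, best_pair


if __name__ == "__main__":
    # Hypothetical 20 bp toy haplotypes, not data from the study.
    toy = {
        "h1": "A" * 20,
        "h2": "A" * 19 + "C",
        "h3": "A" * 18 + "CC",
        "h4": "G" * 20,
    }
    for cluster in objective_clusters(toy, threshold=0.06):
        print(sorted(cluster), most_divergent_pair(cluster, toy))
```

Under single linkage, haplotypes can chain together, so the maximum p-distance inside a cluster may exceed the clustering threshold. In the toy data, h1 and h3 end up in the same cluster although they differ at 2 of 20 sites.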

Figures

Figure 1.
LIT protocol. Two data sources are used: the first is collected for all specimens, the second for a select subset of specimens based on analysis of the primary data.
Figure 2.
a) Sites of the Swedish Insect Inventory Project, color-coded by climatic zones identified by the Swedish Horticultural Society. b) Climatic zones (odlingszoner) of the Swedish Horticultural Society (Riksförbundet Svensk Trädgård), used with permission.
Figure 3.
Haplotype network for Cluster 293, color-coded according to the climatic zones of the Swedish Horticultural Society. Nodes represent each unique haplotype, pie slices of nodes indicate the proportion of specimens from a particular site, node diameters are proportional to the number of specimens the haplotype contains, and the lines connecting the nodes have hash marks corresponding to base pair differences.
Figure 4.
Haplotype network for Cluster 101 indicating all morphological species found, with male genitalia illustrated (border colors of genitalia figures match morphospecies boundary colors). Morphospecies are equivalent to 1% clusters (indicated by numbers), except in cases where a 1% subcluster contained multiple morphospecies; in these cases the 1% cluster is indicated by a red dashed line around the morphospecies. For two subclusters (216 and 249), the network is too complex to accurately circumscribe morphospecies in this figure. Morphospecies designations for all specimens are in the cluster table available on the project GitHub page.
Figure 5.
The number of morphospecies and clusters across settings with PTP, ABGD, and OC. OC is plotted without 0–0.5% thresholds, where 1–2 bp differences between haplotypes greatly inflated cluster numbers.
Figure 6.
Match ratios for PTP, ABGD (all priors), and OC (all thresholds) versus morphology across methods and settings.
Figure 7.
The correct delimitation (teal), splitting (dark blue), lumping (coral) and splitting/lumping (yellow) of morphological clusters with ABGD (left) and OC (right) across settings. A color version of this figure appears in the online version of this article.
Figure 8.
BIN designations (each BIN designated by a different color) of the 16 morphospecies of Cluster 101 for which we found a 100% match (to at least one specimen) in BOLD.
Figure 9.
Congruence between morphology, PTP, ABGD, and OC methods with a) optimal settings (ABGD [formula image], OC 1.7%) and b) conservative settings (ABGD [formula image], OC 3.0%) and between morphology, PTP, ABGD, OC, and RESL methods with c) optimal settings (ABGD [formula image], OC 1.7%) and d) conservative settings (ABGD [formula image], OC 3.0%).
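The captions refer to match ratios between barcode clusters and morphology without defining them on this page. As a hedged illustration only, the sketch below assumes a commonly used definition: twice the number of groups with identical specimen membership in both partitions, divided by the total number of groups in the two partitions. The function name and the toy specimen IDs are hypothetical, and the article itself may compute the score differently.

```python
def match_ratio(clusters, morphospecies) -> float:
    """Match ratio between two groupings of the same specimens.

    Assumed definition: 2 * N_match / (N_clusters + N_morphospecies),
    where N_match counts groups whose specimen membership is identical
    in both groupings. Returns 0.0 if both groupings are empty.
    """
    a = {frozenset(group) for group in clusters}
    b = {frozenset(group) for group in morphospecies}
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total


if __name__ == "__main__":
    # Hypothetical specimen IDs, not data from the study.
    barcode = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
    morpho = [{"s1", "s2"}, {"s3", "s4"}, {"s5"}]
    print(match_ratio(barcode, morpho))
```

A value of 1.0 means every barcode cluster corresponds exactly to one morphospecies; splitting or lumping of any group lowers the ratio.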
