Glob Ecol Biogeogr. 2015 Aug;24(8):973-984.
doi: 10.1111/geb.12326. Epub 2015 May 25.

Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases?

Carla Maldonado et al. Glob Ecol Biogeogr. 2015 Aug.

Abstract

Aim: Massive digitalization of natural history collections is leading to a rapid accumulation of publicly available species distribution data. However, the scientific community now acknowledges taxonomic errors and geographical uncertainty in species occurrence records, which calls into question the extent to which such data can reveal correct patterns of biodiversity and distribution. We explore this question through quantitative and qualitative analyses of uncleaned versus manually verified datasets of species distribution records across different spatial scales.

Location: The American tropics.

Methods: As a test case, we used the plant tribe Cinchoneae (Rubiaceae). We compiled four datasets of species occurrences: one created manually and verified through classical taxonomic work, and three derived from GBIF under different cleaning and filling schemes. We used new bioinformatic tools to code species into grids, ecoregions, and biomes following WWF's classification. We analysed species richness and the altitudinal ranges of the species.
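The grid-coding step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' tool (SpeciesGeoCoder) and makes no claim about how that tool works internally; the function names, the species labels and the coordinates are all hypothetical placeholders. It bins occurrence records into one-degree cells, counts distinct species per cell, and takes the per-cell difference between two datasets.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, size=1.0):
    """Return the south-west corner of the grid cell containing (lat, lon)."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def species_richness(records, size=1.0):
    """Count distinct species per grid cell.

    records: iterable of (species, lat, lon) tuples in decimal degrees.
    """
    cells = defaultdict(set)
    for species, lat, lon in records:
        cells[grid_cell(lat, lon, size)].add(species)
    return {cell: len(names) for cell, names in cells.items()}


def richness_difference(a, b):
    """Per-cell richness of dataset a minus dataset b, over all cells in either."""
    return {cell: a.get(cell, 0) - b.get(cell, 0) for cell in set(a) | set(b)}


# Hypothetical placeholder records, not data from the study.
verified = [
    ("sp_a", -3.99, -79.20),
    ("sp_b", -3.50, -79.10),
    ("sp_c", -0.50, -77.80),
]
# The unverified set carries one extra, possibly misplaced, record.
unverified = verified + [("sp_b", -1.50, -78.50)]

diff = richness_difference(species_richness(verified), species_richness(unverified))
```

Negative values in `diff` mark cells where the unverified dataset reports more species than the verified one, which is the direction of error the abstract describes.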

Results: Altitudinal ranges for species and genera were correctly inferred even without manual data cleaning and filling. However, erroneous records affected spatial patterns of species richness. They led to an overestimation of species richness in certain areas outside the centres of diversity in the clade. The location of many of these areas comprised the geographical midpoint of countries and political subdivisions, assigned long after the specimens had been collected.
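One way to screen for the centroid problem described above is to flag records that fall within a small distance of the geographical midpoint of a country or political subdivision. The sketch below is an assumption about how such a check could look, not the procedure used in the study; the centroid coordinates, region name, tolerance and records are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flag_centroid_records(records, centroids, tolerance_km=5.0):
    """Return (species, region) pairs for records lying near a region centroid.

    records:   iterable of (species, lat, lon) tuples.
    centroids: dict mapping region name to (lat, lon) of its midpoint.
    """
    flagged = []
    for species, lat, lon in records:
        for region, (clat, clon) in centroids.items():
            if haversine_km(lat, lon, clat, clon) <= tolerance_km:
                flagged.append((species, region))
                break  # one match is enough to flag the record
    return flagged


# Hypothetical centroid and records, for illustration only.
centroids = {"region_x": (-10.0, -75.0)}
records = [
    ("sp_a", -10.00, -75.00),  # exactly on the centroid
    ("sp_b", -10.01, -75.01),  # roughly 1.5 km away
    ("sp_c", -3.00, -79.00),   # far from the centroid
]
suspicious = flag_centroid_records(records, centroids)
```

Flagged records are candidates for manual review rather than automatic deletion, since a specimen can legitimately have been collected near a centroid.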

Main conclusion: Open databases and integrative bioinformatic tools allow a rapid approximation of large-scale patterns of biodiversity across space and altitudinal ranges. We found that geographic inaccuracy affects diversity patterns more than taxonomic uncertainties, often leading to false positives, i.e. overestimating species richness in relatively species-poor regions. Public databases for species distribution are valuable and should be explored further, but under scrutiny and validation by taxonomic experts. We suggest that database managers implement easy ways for the community to give feedback on data quality.

Keywords: Cinchoneae; GBIF; Rubiaceae; SpeciesGeoCoder; data quality; occurrence data; species richness.

Figures

Figure 1
Plot of all species occurrences in the plant tribe Cinchoneae (Rubiaceae). (a) Verified dataset (i.e., manually compiled through classical taxonomic work including herbarium visits, fieldwork, and information from monographs); (b) unverified dataset downloaded from the Global Biodiversity Information Facility (GBIF) using minor automated cleaning functions (e.g. excluding points in the ocean).
Figure 2
One‐degree grid maps showing species richness of tribe Cinchoneae. (a) Verified dataset (VD); (b) GBIF dataset; (c) Difference between VD and GBIF; (d) GBIF cleaned by the exclusion of uncertain georeferences; (e) Difference between VD and GBIF cleaned; (f) “GBIF cleaned” increased through the addition of records compiled manually; (g) Difference between VD and GBIF cleaned_increased; and (h) grid map showing a previous compilation of specimen records from the taxonomic literature, with dots proportional to species numbers (Antonelli et al., 2009).
Figure 3
Species richness maps coded by ecoregions using the following datasets: (a) verified (VD), (b) GBIF, (c) VD−GBIF, (d) GBIF cleaned, (e) VD−GBIF cleaned, (f) GBIF cleaned_increased, and (g) VD−GBIF cleaned_increased. The colour coding (see legend) refers to species numbers in tribe Cinchoneae. Only ecoregions containing at least one species are delimited.
Figure 4
Maps of species richness coded at the biome level using the following datasets: (a) verified (VD), (b) GBIF, (c) VD−GBIF, (d) GBIF cleaned, (e) VD−GBIF cleaned, (f) GBIF cleaned_increased, and (g) VD−GBIF cleaned_increased. The colour coding (see legend) refers to species numbers in tribe Cinchoneae. Only biomes containing at least one species are delimited.
Figure 5
Altitudinal range for each analysed genus in tribe Cinchoneae, using both the Verified and the GBIF datasets. Boxes indicate the interquartile range (IQ) of all estimates, with the median shown as a horizontal line and the whiskers indicating data range outside the quartiles. There were no significant differences between the ranges of any genus (Mann–Whitney U‐test; P > 0.05).
Figure 6
Altitudinal range for each analysed species in tribe Cinchoneae, using both the Verified and the GBIF datasets. Boxes indicate the interquartile range (IQ) of all estimates, with the median shown as a horizontal line and the whiskers indicating data range outside the quartiles. NOV means new species. There were no significant differences between the ranges of any species (Mann–Whitney U‐test; P > 0.05).
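The Mann–Whitney U-test cited in the captions of Figures 5 and 6 compares two samples, here the altitudes recorded for a taxon in the Verified and GBIF datasets. The following Python sketch computes the U statistic by pairwise counting and a two-sided p-value from the normal approximation. It is a simplified illustration under stated assumptions (no tie correction, no continuity correction, so it is only reasonable for larger samples) and not the software used in the study; the function name and the altitude values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, p) for samples x and y.

    U counts pairs where a value from x exceeds a value from y, with ties
    counted as one half. p is a two-sided p-value from the normal
    approximation, without tie or continuity correction (an assumption of
    this sketch).
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p


# Hypothetical altitudes in metres, not data from the study.
verified_alt = [1200, 1500, 1800, 2100, 2400]
gbif_alt = [1250, 1450, 1850, 2050, 2500]
u_stat, p_value = mann_whitney_u(verified_alt, gbif_alt)
```

A p-value above 0.05 would be read, as in the captions, as no significant difference between the two altitudinal ranges. For real analyses, an established statistics library with exact small-sample p-values and tie handling is preferable.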

References

    1. Andersson, L. (1995) Tribes and genera of the Cinchoneae complex (Rubiaceae). Annals of the Missouri Botanical Garden, 82, 409–427.
    2. Andersson, L. & Antonelli, A. (2005) Phylogeny of the tribe Cinchoneae (Rubiaceae), its position in Cinchonoideae, and description of a new genus, Ciliosemina. Taxon, 54, 17–28.
    3. Antonelli, A., Nylander, J.A., Persson, C. & Sanmartín, I. (2009) Tracing the impact of the Andean uplift on Neotropical plant evolution. Proceedings of the National Academy of Sciences USA, 106, 9749–9754.
    4. Beck, J., Ballesteros‐Mejia, L., Buchmann, C.M., Dengler, J., Fritz, S.A., Gruber, B., Hof, C., Jansen, F., Knapp, S. & Kreft, H. (2012) What's on the horizon for macroecology? Ecography, 35, 673–683.
    5. Beck, J., Böller, M., Erhardt, A. & Schwanghart, W. (2014) Spatial bias in the GBIF database and its effect on modeling species' geographic distributions. Ecological Informatics, 19, 10–15.