Sci Data. 2024 Jan 25;11(1):129.
doi: 10.1038/s41597-024-02962-5.

Semi-automated sequence curation for reliable reference datasets in ITS2 vascular plant DNA (meta-)barcoding

Andreia Quaresma et al. Sci Data.

Abstract

One of the most critical steps for accurate taxonomic identification in DNA (meta-)barcoding is to have an accurate DNA reference sequence dataset for the marker of choice. Developing such a dataset has therefore been a long-term ambition, especially for the kingdom Viridiplantae. Typically, reference datasets are constructed from sequences downloaded from general public databases, which can carry taxonomic and other relevant errors. Herein, we constructed a curated (i) global dataset, (ii) European crop dataset, and (iii) 27 datasets for the EU countries for the ITS2 barcoding marker of vascular plants. To that end, we first developed a pipeline script that entails (i) an automated curation stage comprising five filters, (ii) manual taxonomic correction for misclassified taxa, and (iii) manual addition of newly sequenced species. The pipeline allows easy updating of the curated datasets. With this approach, 13% of the sequences, corresponding to 7% of the species originally imported from GenBank, were discarded. Further, 259 sequences were manually added to the curated global dataset, which now comprises 307,977 sequences of 111,382 plant species.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Schematic representation of the curation pipeline. The components ‘Automated curation’, ‘Manual list curation’, and ‘Manual sequence addition’ can be used singly or in conjunction.
Fig. 2
Taxa representation of the two reference ITS2 datasets generated for each of the 27 EU countries, using the flora information extracted from the Euro + Med PlantBase (https://www.emplantbase.org/home.html) and GBIF platforms (https://www.gbif.org/).
Fig. 3
Number of sequences retained in the ITS2 dataset for Malus pumila (top chart) and Pyrus communis (bottom chart) by the automated curation workflow. Approach A: sequences with a median identity <97% in pairwise all-against-all global alignments are removed in a single iteration; Approach B: sequences are removed iteratively using an incremental drop-out identity threshold of 50%, 75%, 80%, 85%, 90%, 92.5%, 95%, and 97%; Approach C: sequences are removed using the incremental threshold of ‘Approach B’ while ensuring that 50% of the initial sequences are retained in the dataset.
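The three approaches compared in Fig. 3 can be illustrated with a minimal Python sketch. It is not the authors' pipeline script. It assumes that a matrix of pairwise percent identities for one species has already been computed, that a sequence's median excludes its comparison with itself, and that Approach C enforces its 50% floor by stopping before any step that would drop below it. The function names and the toy matrix are hypothetical.

```python
from statistics import median

# Incremental drop-out thresholds (percent identity) listed in the Fig. 3 caption.
THRESHOLDS = (50, 75, 80, 85, 90, 92.5, 95, 97)


def filter_once(identity, kept, threshold):
    """Keep sequences whose median identity to the other kept ones is >= threshold."""
    if len(kept) < 2:  # a single sequence has nothing to be compared against
        return list(kept)
    result = []
    for i in kept:
        # Assumption: the self-comparison is excluded from the median.
        med = median(identity[i][j] for j in kept if j != i)
        if med >= threshold:
            result.append(i)
    return result


def approach_a(identity, threshold=97):
    """Single pass at the final threshold."""
    return filter_once(identity, list(range(len(identity))), threshold)


def approach_b(identity, thresholds=THRESHOLDS):
    """Iterative passes; medians are recomputed after each removal round."""
    kept = list(range(len(identity)))
    for t in thresholds:
        kept = filter_once(identity, kept, t)
    return kept


def approach_c(identity, thresholds=THRESHOLDS, min_fraction=0.5):
    """As approach_b, but stop before a pass that would breach the retention floor.

    Assumption: the source does not say how the 50% floor is enforced;
    stopping early is one plausible reading.
    """
    n = len(identity)
    kept = list(range(n))
    for t in thresholds:
        candidate = filter_once(identity, kept, t)
        if len(candidate) < min_fraction * n:
            break
        kept = candidate
    return kept


def toy_matrix():
    """Hypothetical data: sequences 0-2 agree at 99%; 3 and 4 are 40% outliers."""
    n, good = 5, {0, 1, 2}
    m = [[100.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                m[i][j] = 99.0 if (i in good and j in good) else 40.0
    return m


if __name__ == "__main__":
    m = toy_matrix()
    print(approach_a(m))  # [] : the outliers pull every median below 97
    print(approach_b(m))  # [0, 1, 2] : outliers go at 50%, the rest then pass
    print(approach_c(m))  # [0, 1, 2] : three of five sequences meet the 50% floor
```

In the toy data, two outliers lower the median of each good sequence to 69.5%, so the single pass of Approach A discards everything. The incremental passes remove the outliers first, after which the remaining medians rise to 99%.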
