Environ DNA. 2024 Jan;6(1):e489.
doi: 10.1002/edn3.489. Epub 2023 Nov 29.

rCRUX: A Rapid and Versatile Tool for Generating Metabarcoding Reference libraries in R

Emily E Curd et al. Environ DNA. 2024 Jan.

Abstract

The sequencing revolution requires accurate taxonomic classification of DNA sequences. Key to making accurate taxonomic assignments are curated, comprehensive reference barcode databases. However, the generation and curation of such databases has remained challenging given the large and continuously growing volumes of both DNA sequence data and novel reference barcode targets. Monitoring and research applications require a greater diversity of specialized gene regions and targeted taxa than are currently curated by professional staff. Thus, there is a growing need for an easy-to-implement computational tool that can generate comprehensive metabarcoding reference libraries for any bespoke locus. We address this need by reimagining CRUX from the Anacapa Toolkit and present the rCRUX package in R which, like its predecessor, relies on sequence homology and PCR primer compatibility instead of keyword searches to avoid limitations of user-defined metadata. The typical workflow begins by searching for plausible seed amplicons (get_seeds_local() or get_seeds_remote()) using in silico PCR to acquire a set of sequences analogous to PCR products containing a user-defined set of primer sequences. Next, these seeds are iteratively BLAST-searched against a local copy of the National Center for Biotechnology Information (NCBI) formatted nt database using a taxonomic-rank-based stratified random sampling approach (blast_seeds()). This results in a comprehensive set of sequence matches. This database is dereplicated and cleaned (derep_and_clean_db()) by identifying identical reference sequences and collapsing the taxonomic path to the lowest taxonomic agreement across all matching reads. This results in a curated, comprehensive database of primer-specific reference barcode sequences from NCBI. Databases can then be compared (compare_db()) to determine read and taxonomic overlap.
We demonstrate that rCRUX provides more comprehensive reference databases for the MiFish Universal Teleost 12S, Taberlet trnl, fungal ITS, and Leray CO1 loci than CRABS, MetaCurator, RESCRIPt, and ecoPCR reference databases. We then further demonstrate the utility of rCRUX by generating 24 reference databases for 20 metabarcoding loci, many of which lack dedicated reference database curation efforts. The rCRUX package provides a simple-to-use tool for the generation of curated, comprehensive reference databases for user-defined loci, facilitating accurate and effective taxonomic classification of metabarcoding and DNA sequence efforts broadly.
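
The dereplication step described in the abstract (identical reference sequences collapsed to the lowest taxonomic agreement) can be illustrated with a minimal Python sketch. This is not the rCRUX implementation, which is written in R; the function names, the seven-rank path layout, the "NA" padding value, and the toy records below are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed rank layout for a taxonomic path (illustrative, not taken from rCRUX).
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def lowest_common_path(paths):
    """Return the prefix shared by all paths, padded with 'NA' below the agreement."""
    consensus = []
    for names in zip(*paths):
        if len(set(names)) != 1:
            break  # ranks disagree from here down
        consensus.append(names[0])
    return tuple(consensus) + ("NA",) * (len(RANKS) - len(consensus))


def dereplicate(records):
    """Group (sequence, path) records by identical sequence and collapse their paths."""
    by_sequence = defaultdict(list)
    for sequence, path in records:
        by_sequence[sequence.upper()].append(tuple(path))
    return {seq: lowest_common_path(paths) for seq, paths in by_sequence.items()}


# Hypothetical toy records: two species of one genus share an identical barcode.
toy_records = [
    ("ACGT", ("Eukaryota", "PhylumA", "ClassA", "OrderA", "FamilyA", "GenusA", "GenusA sp1")),
    ("acgt", ("Eukaryota", "PhylumA", "ClassA", "OrderA", "FamilyA", "GenusA", "GenusA sp2")),
    ("TTGA", ("Eukaryota", "PhylumA", "ClassB", "OrderB", "FamilyB", "GenusB", "GenusB sp1")),
]
collapsed = dereplicate(toy_records)
```

In this toy case the shared barcode can only be resolved to genus, so its species field becomes "NA", while the unique barcode keeps its full path.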

Conflict of interest statement

The authors have no conflicts of interest to report.

Figures

Figure 1. Overview of rCRUX workflow
Figure 2. Comparison of rCRUX to the original implementation of CRUX
Comparison of the number of species captured by (a) the rCRUX-implemented get_seeds_local() and CRUX-implemented ecoPCR in silico PCR tools and (b) the rCRUX- and CRUX-implemented blast_seeds() for the MiFish 12S Universal Teleost locus. rCRUX captures the vast majority of species captured by CRUX while also incorporating thousands of additional taxa.
Figure 3. MiFish rCRUX blast_seeds() database comparison with CRABS, ecoPCR, MetaCurator, and RESCRIPt databases created by Jeunen et al. 2023.
For the trnl reference database, only 2,308 species out of 69,705 species were shared across the five trnl reference databases. Each reference database had unique sequences that were not shared with any other database (range: 1 – 15,190). rCRUX captured 91.4% (n=63,719) of all species observed across the trnl reference databases. rCRUX uniquely had 21.8% (n=15,190) of all species observed (Figure 4).
Figure 4. trnl rCRUX blast_seeds() database comparison with CRABS, ecoPCR, MetaCurator, and RESCRIPt databases created by Jeunen et al. 2023.
For the FITS reference database, only 5.2% of all species (n=12,218) were shared across the 4 reference databases. Each reference database had unique sequences that were not shared with any other database (range: 610 – 171,358). rCRUX captured 97.2% (n=228,873) of all species observed across the FITS reference databases. rCRUX uniquely had 72.8% (n=171,358) of all species observed (Figure 5).
Figure 5. FITS rCRUX blast_seeds() database comparison with CRABS, ecoPCR, MetaCurator, and RESCRIPt databases created by Jeunen et al. 2023.
For the CO1 reference database, only 2.8% of all species (n=27,990) were shared across the 4 reference databases. Each reference database had unique species that were not shared with any other database (range: 4 – 823,363). The rCRUX combined CO1 database captured 99.6% (n=990,286) of all species observed across the CO1 reference databases. The rCRUX combined CO1 database uniquely had 82.8% (n=823,363) of all species observed (Figure 6). The three distinct strategies used to generate the rCRUX CO1 combined database captured complementary species (see Supplemental Results).
Figure 6. CO1 rCRUX blast_seeds() database comparison with CRABS, ecoPCR, MetaCurator, and RESCRIPt databases created by Jeunen et al. 2023.
Limiting the seeds and database generation output comparisons to only Eukaryotic reads had minimal effect on the results (Supplemental Figures S15–18). We also note that the rCRUX databases were generated after the other databases; however, they include the majority of species captured by the compared methods. Together, these results benchmark rCRUX favorably against CRABS, MetaCurator, ecoPCR, RESCRIPt, and CRUX across a diversity of metabarcoding loci.
Figure 7. Cross validation and novel taxonomy performance evaluations.
rCRUX had a significantly higher average F-measure for cross validation at the species level than the RESCRIPt, CRABS, ecoPCR, and MetaCurator 12S reference databases (a). Likewise, rCRUX had a significantly higher F-measure for novel species taxonomic assignments at the species level than RESCRIPt, CRABS, ecoPCR, and MetaCurator (b). Violins with different lower-case letters have significantly different means (paired t-test, false discovery rate-corrected p < 0.05).
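
The shared, captured, and unique species figures quoted in the figure notes above are set-overlap statistics. A minimal Python sketch of how such numbers could be derived from per-database species sets follows. It is not rCRUX's compare_db(); the function name, the dictionary layout, and the toy species sets are hypothetical. Only the final two lines use counts taken from the trnl note.

```python
def overlap_summary(databases):
    """Summarise species overlap for a dict of database name -> set of species names."""
    all_species = set().union(*databases.values())
    shared_by_all = set.intersection(*databases.values())
    summary = {"total": len(all_species), "shared_by_all": len(shared_by_all)}
    for name, species in databases.items():
        # Species present in any database other than the current one.
        others = set().union(*(s for n, s in databases.items() if n != name))
        summary[name] = {
            "captured_pct": 100 * len(species) / len(all_species),
            "unique": len(species - others),
        }
    return summary


# Hypothetical toy databases (names and species are placeholders).
toy = {
    "db_a": {"sp1", "sp2", "sp3"},
    "db_b": {"sp2", "sp3"},
    "db_c": {"sp3", "sp4"},
}
toy_summary = overlap_summary(toy)

# Arithmetic check against the trnl counts reported above.
trnl_captured_pct = round(100 * 63719 / 69705, 1)
trnl_unique_pct = round(100 * 15190 / 69705, 1)
```

The sketch assumes at least one non-empty database; the two trnl values reproduce the 91.4% and 21.8% reported in the note.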
