Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Feb 26;25(1):60.
doi: 10.1186/s13059-024-03198-7.

Rapid and sensitive detection of genome contamination at scale with FCS-GX

Affiliations

Rapid and sensitive detection of genome contamination at scale with FCS-GX

Alexander Astashyn et al. Genome Biol. .

Abstract

Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084 .

Keywords: GenBank; Genome assembly; Genome contamination; Genome quality; RefSeq; Software.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Overview of FCS-GX pipeline. FCS-GX splits genome assembly scaffolds into contigs and chunks contigs into 100-kbp subsequences for processing. FCS-GX performs repeat detection and masking in eukaryote assemblies. The GX aligner operates in two passes using modified k-mers (h-mers) to align query sequences first to the entire indexed reference database and second to sequences corresponding to the taxid sets providing best matches for alignment refinement. After collecting coverage and score information FCS-GX assigns likely contaminant sequences by comparing the taxonomic assignment calculated for each sequence by the user-specified taxid. The final output from FCS-GX is a cleaned FASTA alongside an action report that details contaminant cleaning actions taken (FCS-GX actions EXCLUDE, TRIM, FIX) as well as details for additional sequences warranting manual review but are not automatically cleaned (FCS-GX actions REVIEW, REVIEW_RARE, INFO). See “Methods” for descriptions of FCS-GX action categories. In the cartoon example, one complete sequence and one partial sequence assigned as contaminant are removed from the input assembly to produce the final cleaned FASTA. FCS-GX uses a custom reference database totaling 709 Gbp of sequence data from assemblies and common contaminants used in current NCBI screening. Assemblies contributing to the database were screened by FCS-GX while excluding self-hits. High-confidence contaminants were removed in order to use the database for screening new genomes. This can be performed by either adding contaminated database sequence entries to a file which prevents FCS-GX from reporting alignments in subsequent runs or adding heavily contaminated genomes to a separate file which prevents the entire assembly from being used in future database builds
Fig. 2
Fig. 2
Sensitivity and specificity of FCS-GX contamination detection. a Distributions of sensitivity measurements. Distributions are shown for artificially fragmented genomes in six kingdom groups. Sensitivity is shown for genomes fragmented at three different window sizes (1 kbp, 10 kbp, 100 kbp). For each window size, sensitivity is shown for FCS-GX runs while including the same species taxids as the source genome during the alignment stage ( +) and while excluding same species taxids ( −). b Distributions of specificity measurements for the same set of fragmented genomes in a. The dotplot shows an enlarged view of the upper limit of specificity (98–100%). The full dotplot including ten outliers not visualized here is available at Additional file 2: Fig. S3. See Additional file 1: Table S2 for complete sensitivity and specificity score data
Fig. 3
Fig. 3
FCS-GX detection of contamination in NCBI databases. a Distribution of the proportion of contaminated sequence per genome detected by FCS-GX in the NCBI GenBank database. Genome counts (frequency) were computed in 5% intervals. b Aggregate length of total genome sequence (solid line) and contaminated sequence detected by FCS-GX (dashed line) in the NCBI GenBank database from 2017 to 2023. c Percentage of contaminated sequence detected by FCS-GX (dashed line) in the NCBI GenBank database from 2017 to 2023, i.e., the quotient of the contaminant amount divided by the total amount displayed in b. See Additional file 1: Table S9 for supporting numerical data. d Percentage of contaminated genomes in GenBank. Total numbers of screened genomes are shown for six taxonomic kingdom groups: Metazoa (animals), Fungi, Viridiplantae (green plants), Other eukaryotes, Bacteria, and Archaea. Within each group, genomes are placed into four bins corresponding to the amount of contamination per genome and percentages are calculated for the count of genomes in each bin divided by total screened genomes. e Aggregate contamination lengths identified in genomes from six kingdom groups. Colors of grid squares indicate aggregate contamination lengths from eight sources (six kingdoms, plus virus and synthetic) that correspond to percentages of total assembly length for each GenBank kingdom group. See Additional file 1: Table S8 for supporting numerical kingdom contamination summary data
Fig. 4
Fig. 4
FCS-GX detection of contamination in the NCBI RefSeq database. a Aggregate length of total genome sequence (solid line) and contaminated sequence detected by FCS-GX (dashed line) in NCBI RefSeq from 2017 to 2023. b Contaminant fraction detected by FCS-GX (dashed line) in NCBI RefSeq database from 2017 to 2023, i.e., the quotient of the contaminant amount divided by the total amount displayed in a. See Additional file 1: Table S17 for supporting numerical Refseq contamination summary data

Update of

Similar articles

Cited by

References

    1. Sayers EW, Cavanaugh M, Clark K, Pruitt KD, Schoch CL, Sherry ST, Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2022;50:D161–d164. doi: 10.1093/nar/gkab1135. - DOI - PMC - PubMed
    1. Cornet L, Baurain D. Contamination detection in genomic data: more is not enough. Genome Biol. 2022;23:60. doi: 10.1186/s13059-022-02619-9. - DOI - PMC - PubMed
    1. Lu J, Salzberg SL. Removing contaminants from databases of draft genomes. PLoS Comput Biol. 2018;14:e1006277. doi: 10.1371/journal.pcbi.1006277. - DOI - PMC - PubMed
    1. van der Valk T, Vezzi F, Ormestad M, Dalén L, Guschanski K. Index hopping on the Illumina HiseqX platform and its consequences for ancient DNA studies. Mol Ecol Resour. 2020;20:1171–1181. doi: 10.1111/1755-0998.13009. - DOI - PubMed
    1. Sinha R, Stanley G, Gulati GS, Ezran C, Travaglini KJ, Wei E, Chan CK, Nabhan AN, Su T, Morganti RM. Index switching causes “spreading-of-signal” among multiplexed samples in Illumina HiSeq 4000 DNA sequencing. BioRxiv. 2017 doi: 10.1101/125724. - DOI

Publication types