Rapid and sensitive detection of genome contamination at scale with FCS-GX
- PMID: 38409096
- PMCID: PMC10898089
- DOI: 10.1186/s13059-024-03198-7
Abstract
Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084 .
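As a rough sanity check on the scale of these figures, the quoted numbers can be combined arithmetically. The sketch below uses only values stated in the abstract; the derived quantities (implied total bases screened, share of assemblies, mean contamination per assembly) are back-of-the-envelope estimates from rounded inputs, not results reported by the authors, and all variable names are illustrative.

```python
# Figures quoted in the abstract (rounded as published).
contamination_gbp = 36.8          # contaminant sequence detected, in Gbp
contamination_fraction = 0.0016   # 0.16% of total bases
assemblies_screened = 1_600_000   # GenBank assemblies screened
top_assemblies = 161              # assemblies holding half of the contamination

# Implied total bases screened: contamination / fraction.
# Approximate only, because both inputs are rounded.
implied_total_gbp = contamination_gbp / contamination_fraction

# Share of all screened assemblies represented by the 161 assemblies.
top_share = top_assemblies / assemblies_screened

# Mean contamination per assembly within that group (half the total / 161).
mean_top_gbp = (contamination_gbp / 2) / top_assemblies

print(f"Implied total screened: about {implied_total_gbp / 1000:.0f} Tbp")
print(f"Share of assemblies holding half the contamination: {top_share:.4%}")
print(f"Mean contamination in those assemblies: about {mean_top_gbp * 1000:.0f} Mbp")
```

Read this way, the abstract implies on the order of 23 Tbp screened, with half of the detected contamination concentrated in roughly 0.01% of the assemblies.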
Keywords: GenBank; Genome assembly; Genome contamination; Genome quality; RefSeq; Software.
© 2024. This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.
Conflict of interest statement
The authors declare that they have no competing interests.
Update of
Rapid and sensitive detection of genome contamination at scale with FCS-GX. bioRxiv [Preprint]. 2023 Jun 6:2023.06.02.543519. doi: 10.1101/2023.06.02.543519. PMID: 37292984. Update in: Genome Biol. 2024 Feb 26;25(1):60. doi: 10.1186/s13059-024-03198-7.