Genome Res. 2019 Jun;29(6):954-960.
doi: 10.1101/gr.245373.118. Epub 2019 May 7.

Human contamination in bacterial genomes has created thousands of spurious proteins

Florian P Breitwieser et al. Genome Res. 2019 Jun.

Abstract

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
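The mitigation suggested at the end of the abstract, dropping small contigs from a draft assembly, could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `parse_fasta` and `filter_small_contigs` are hypothetical, and the `min_length` default of 1000 bases is an assumed placeholder, since the abstract does not state a cutoff.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if present.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(text: str, min_length: int = 1000):
    """Split contigs into (kept, dropped) by an assumed length cutoff.

    min_length is a hypothetical threshold chosen for illustration;
    the abstract does not specify one.
    """
    kept, dropped = [], []
    for header, seq in parse_fasta(text):
        if len(seq) >= min_length:
            kept.append((header, seq))
        else:
            dropped.append((header, seq))
    return kept, dropped


def retained_fraction(kept, dropped) -> float:
    """Fraction of assembly bases that survive the filter."""
    kept_bases = sum(len(seq) for _, seq in kept)
    total = kept_bases + sum(len(seq) for _, seq in dropped)
    return kept_bases / total if total else 0.0


if __name__ == "__main__":
    # Toy assembly: one longer contig and one very short contig.
    example = ">contig_1\nACGTACGTAC\nGGCC\n>contig_2\nACG\n"
    kept, dropped = filter_small_contigs(example, min_length=10)
    print([h for h, _ in kept], [h for h, _ in dropped])
    print(retained_fraction(kept, dropped))
```

Reporting the retained fraction alongside the filter makes it easy to check the abstract's claim for a given assembly, namely that removing small contigs discards very little genuine sequence.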

Figures

Figure 1.
Alignment of a human whole-genome shotgun sequencing data set to GRCh38 shown in the Integrative Genomics Viewer. This region, which contains a copy of the HSATII repeat, is covered extremely deeply, over 1500-fold deeper than the rest of the genome. The region at the top shows a schematic of Chromosome 1, and below that is a histogram showing the depth of coverage, which peaks at 157,072. Individual reads in their aligned positions are shown as gray rectangles in the bottom portion of the figure. Mismatches are shown by red, blue, green, or brown marks, and gaps are indicated by breaks in the gray rectangles connected with a thin black line. The numerous gaps and mismatches suggest that GRCh38 is missing many other copies of the HSATII repeat, some of which would provide a better match.
Figure 2.
Lengths of scaffolds in prokaryotic genomes that contain or consist entirely of human repeats. (A) Histogram showing the number of scaffolds of a given length that contain human repeats. (B) The coverage depth of contaminant scaffolds is on average 30 times lower than the average genome coverage (red box). Similar-sized scaffolds in the same assemblies do not show the same trend (gray box). Wilcoxon signed-rank test, P < 2.2 × 10^−16.
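The coverage contrast described in panel B suggests a simple screening heuristic: flag scaffolds whose depth is far below the genome-wide average. The sketch below is only an illustration of that idea, not a method from the paper. The function name `flag_low_coverage`, the input layout of `(name, length, mean_depth)` tuples, and the `fold` default of 10 are all assumptions.

```python
from typing import List, Tuple


def flag_low_coverage(
    scaffolds: List[Tuple[str, int, float]], fold: float = 10.0
) -> List[str]:
    """Return names of scaffolds covered at least `fold` times less
    deeply than the length-weighted average of the whole assembly.

    scaffolds: (name, length_in_bases, mean_depth) tuples.
    fold: hypothetical cutoff chosen for illustration only.
    """
    total_length = sum(length for _, length, _ in scaffolds)
    if total_length == 0:
        return []
    # Length-weighted mean depth approximates the average genome coverage.
    genome_depth = (
        sum(length * depth for _, length, depth in scaffolds) / total_length
    )
    return [
        name
        for name, _, depth in scaffolds
        if depth * fold < genome_depth
    ]


if __name__ == "__main__":
    # Toy assembly: one large scaffold and two small ones.
    toy = [
        ("scaffold_1", 1_000_000, 100.0),
        ("scaffold_2", 500, 3.0),
        ("scaffold_3", 600, 95.0),
    ]
    print(flag_low_coverage(toy))
```

Low depth alone is not proof of contamination, so a flag like this would be a prompt for closer inspection (for example, aligning the scaffold against human repeat sequences) rather than a reason to discard it outright.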
Figure 3.
Human repeat element HSATII-derived proteins annotated in bacteria are nearly identical to one another, as shown in this multiple alignment, despite the large evolutionary distances separating the species in which they were reported. Visualized with SeaView (Gouy et al. 2010).
