PLoS Comput Biol. 2021 Oct 21;17(10):e1009428.
doi: 10.1371/journal.pcbi.1009428. eCollection 2021 Oct.

Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

Ryota Sugimoto et al. PLoS Comput Biol. 2021.

Abstract

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
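The abstract treats terminally redundant (TR) sequences as likely complete circular genomes. As a minimal sketch of that idea, and not the authors' pipeline, the following Python checks whether the start of a contig is repeated at its end and, if so, trims one copy. The function names and the `min_len` default of 20 bases are illustrative assumptions, not values taken from the paper.

```python
from typing import Optional


def terminal_repeat_length(seq: str, min_len: int = 20) -> int:
    """Return the length of the longest prefix of `seq` that is repeated
    exactly as its suffix, or 0 if no repeat of at least `min_len` exists.

    `min_len` is an assumed illustrative threshold, not a published value.
    """
    seq = seq.upper()
    # A repeat longer than half the contig would overlap itself; skip those.
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_to_circular(seq: str, min_len: int = 20) -> Optional[str]:
    """Drop the redundant 3' copy of the terminal repeat.

    Returns the trimmed sequence, or None when no terminal repeat is found.
    """
    k = terminal_repeat_length(seq, min_len)
    return seq[:-k] if k else None
```

Exact string matching is used here for clarity; a real assembly would likely need to tolerate mismatches in the repeat.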

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Classification and genetic features of CRISPR-targeted TR sequences.
(A) Length distribution of TR sequences. We used the HK97 capsid and portal proteins as tailed-phage signature genes. The dotted line at 20 kb represents an arbitrary cut-off between small and large sequences. Sequences longer than 100 kb are shown in the inset. (B) Results of the classification of TR sequences. Sequences encoding a detectable capsid gene were assigned to a viral taxon according to capsid type, as follows: Caudovirales, HK97 fold capsid; Inoviridae, Inoviridae MCP; and Microviridae, Microviridae MCP. Capsid-less TR sequences encoding ParA, ParB, ParM, and/or MoBM were classified as “Plasmid-like.” The remaining sequences were labeled as “Unclassified.” (C) Distribution of singleton coverage and coding ratio. Higher k-values were selected for large TR sequences to avoid doubletons occurring by chance.
Fig 2. Predicted targeting hosts of CRISPR-targeted TR sequences.
(A) The targeting host composition of TR sequences. Hosts were predicted by mapping CRISPR DR sequences to the RefSeq database. Sequences containing ≥10 protospacer loci but with less than 90% exclusiveness among the associated DR taxa were classified as having an ambiguous targeting host. When ≥10 protospacers could not be assigned to a taxon, the predicted targeting host was denoted as not available (NA). (B) Predicted targeting host distribution according to GC content. The dotted line at 55% indicates the boundary between low and high GC content. (C) Bayesian phylogeny of Microviridae major capsid proteins. A total of 159 representative major capsid protein sequences from this study and 43 RefSeq sequences were used for the analysis. Taxa without a name denote Microviridae species from this study, and taxa with text denote Microviridae species from RefSeq. Taxa were annotated based on predicted targeting hosts. The phi X174 clade was selected as the outgroup.
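The host-labelling rule in panel (A) can be illustrated with a short Python sketch. This is one possible reading of the caption, not the authors' code: it assumes "exclusiveness" means the share of the most common taxon among taxon-assigned protospacer loci, and it assumes a sequence is labelled NA when fewer than 10 loci carry a taxon assignment. The function name, the input layout (one taxon or `None` per locus), and the returned labels are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def predict_targeting_host(
    locus_taxa: Sequence[Optional[str]],
    min_loci: int = 10,              # threshold stated in the caption
    min_exclusiveness: float = 0.9,  # threshold stated in the caption
) -> str:
    """Label a TR sequence from the taxa linked to its protospacer loci.

    `locus_taxa` holds one entry per protospacer locus: the taxon of the
    associated CRISPR direct repeat, or None when no taxon was assigned.
    """
    assigned = [taxon for taxon in locus_taxa if taxon is not None]
    # Assumed reading of the NA rule: too few taxon-assigned loci.
    if len(assigned) < min_loci:
        return "NA"
    top_taxon, top_count = Counter(assigned).most_common(1)[0]
    # Assumed definition of exclusiveness: share of the dominant taxon.
    if top_count / len(assigned) < min_exclusiveness:
        return "Ambiguous"
    return top_taxon
```

Under these assumptions, a sequence with nine Firmicutes-linked loci and one Bacteroidetes-linked locus sits exactly at the 90% threshold and is assigned to Firmicutes.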
Fig 3
Venn diagrams of database comparisons for (A) large and (B) small TR sequences. Each TR sequence was compared to RefSeq virus, RefSeq plasmid, IMG/VR, and GVD using BLASTN. The minimum criteria for a database hit were 85% sequence identity and a 75% aligned fraction of the query sequence to a unique subject sequence.
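The hit criteria above (85% identity, 75% aligned fraction of the query against a single subject) could be applied to BLASTN alignments roughly as follows. This is a sketch under stated assumptions: the caption does not say whether identity is checked per alignment or in aggregate, or how overlapping alignments are combined, so the code filters each alignment by identity and merges overlapping query intervals per subject. The tuple layout `(subject_id, percent_identity, qstart, qend)` and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def covered_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Total length covered by 1-based inclusive intervals, counting
    overlapping or adjacent regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def database_hits(
    alignments: Iterable[Tuple[str, float, int, int]],
    query_length: int,
    min_identity: float = 85.0,  # percent, from the caption
    min_fraction: float = 0.75,  # aligned fraction of the query, from the caption
) -> List[str]:
    """Return subject IDs that satisfy both criteria for one query."""
    by_subject = defaultdict(list)
    for subject, pident, qstart, qend in alignments:
        if pident >= min_identity:  # assumption: identity applied per alignment
            by_subject[subject].append((min(qstart, qend), max(qstart, qend)))
    return sorted(
        subject
        for subject, intervals in by_subject.items()
        if covered_length(intervals) / query_length >= min_fraction
    )
```

Coverage is computed per subject, so alignments to different database entries are never pooled, which matches the "unique subject sequence" wording.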
Fig 4. Classifications and genome organizations of the discovered Inoviridae species.
(A) Bayesian phylogeny of Zot domains. Representatives were selected from RefSeq and from the TR sequences encoding an Inoviridae major coat protein by clustering Zot amino acid sequences at a 50% identity threshold. Each taxon is colored according to its corresponding family. The families of the discovered genomes (Ino-01 to Ino-08) were predicted using the ICTV-provided taxonomic classification program. A sequence reported in a previous study (Ino-07) is shown in parentheses. (B) Genome organizations of the discovered Inoviridae species. All sequences were phased so that the Rep genes appear first. The predicted ORFs are colored according to the annotation results.
Fig 5
Hierarchical clustering of (A) large and (B) small TR sequences based on gene content. The heatmaps represent the gene content of TR sequences; each row is a TR sequence and each column is a gene cluster. Gray areas in the heatmap indicate sequences encoding a gene that is homologous to the gene cluster. Note that one gene can be homologous to multiple gene clusters. Sequences are annotated with the databases containing similar sequences, GC content, host, and capsid genes. Capsid genes are colored according to their type, as indicated in the figure: HK97, Microviridae major capsid protein (MicroMCP), and Inoviridae major coat protein (InoMCP). Gene clusters were annotated by searching the corresponding HMMs against the UniRef50 database. Several notable RefSeq-listed clusters are denoted on the right side of the heatmaps.
Fig 6. Number of mapped spacers according to sequence identity threshold.
All unique CRISPR spacers were mapped to large TR sequences, small TR sequences, and scrambled sequences. The relaxed sequence identity thresholds applied initially are denoted as red- and orange-colored dashed lines. The spacer mapping process was identical to the protospacer discovery process (see Materials and Methods).
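Mapping spacers to scrambled sequences serves as a negative control: matches there can only arise by chance. The caption does not say how the scrambled sequences were generated, so the following Python is only one plausible way to build such a control, assuming a random shuffle that preserves length and base composition. The function name and the `seed` parameter are illustrative.

```python
import random


def scramble(seq: str, seed: int = 0) -> str:
    """Return a shuffled copy of `seq` with the same length and base
    composition. A fixed `seed` makes the control reproducible."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)
```

A simple shuffle destroys all higher-order structure such as dinucleotide frequencies; a stricter control would preserve those, at the cost of a more involved procedure.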
