Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

Ryota Sugimoto¹, Luca Nishimura¹,², Phuong Thanh Nguyen¹,², Jumpei Ito³, Nicholas F Parrish⁴, Hiroshi Mori⁵, Ken Kurokawa⁶, Hirofumi Nakaoka⁷, Ituro Inoue¹

Affiliations

¹ Human Genetics Laboratory, National Institute of Genetics, Research Organization of Information and Systems, Mishima, Shizuoka, Japan.
² The Graduate University for Advanced Studies, SOKENDAI, Mishima, Shizuoka, Japan.
³ Division of Systems Virology, Department of Infectious Disease Control, International Research Center for Infectious Diseases, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan.
⁴ Genome Immunobiology RIKEN Hakubi Research Team, Center for Integrative Medical Sciences, RIKEN, Tsurumi-ku, Yokohama, Kanagawa, Japan.
⁵ Genome Diversity Laboratory, National Institute of Genetics, Research Organization of Information and Systems, Mishima, Shizuoka, Japan.
⁶ Genome Evolution Laboratory, National Institute of Genetics, Research Organization of Information and Systems, Mishima, Shizuoka, Japan.
⁷ Department of Cancer Genome Research, Sasaki Institute, Chiyoda-ku, Tokyo, Japan.

PMID: 34673779
PMCID: PMC8530359
DOI: 10.1371/journal.pcbi.1009428

Ryota Sugimoto et al. PLoS Comput Biol. 2021 Oct 21;17(10):e1009428. doi: 10.1371/journal.pcbi.1009428. eCollection 2021 Oct.

Abstract

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
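As a rough illustration of what "terminally redundant" means for an assembled sequence, the sketch below looks for the longest stretch shared by both ends of a contig and trims one copy to give a candidate circular genome. The function names, the exact-match requirement, and the `min_len` default are assumptions made for this example; the abstract does not describe the detection procedure at this level of detail.

```python
def terminal_redundancy(seq: str, min_len: int = 10) -> int:
    """Length of the longest prefix of seq that is also its suffix.

    Returns 0 when no shared terminus of at least min_len bases exists.
    Exact matching and the min_len default are assumptions of this sketch.
    """
    seq = seq.upper()
    # Only consider repeats up to half the sequence, so the two copies
    # at the ends cannot overlap each other.
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_terminal_repeat(seq: str, min_len: int = 10):
    """Drop one copy of the terminal repeat; return None if there is none."""
    k = terminal_redundancy(seq, min_len)
    return seq[:-k] if k else None


repeat = "ACGTACGTAC"
contig = repeat + "TTTTGGGGCCCCAAAATTGG" + repeat
print(terminal_redundancy(contig))        # length of the shared terminus
print(len(trim_terminal_repeat(contig)))  # length after trimming one copy
```

Real assemblies would likely need mismatch tolerance and a length threshold tuned to the assembler's k-mer size, neither of which is modeled here.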

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Classification and genetic features of CRISPR-targeted TR sequences.**
**(A)** Length distribution of TR sequences. We used the HK97 capsid and portal proteins as tailed-phage signature genes. The dotted line at 20 kb represents an arbitrary cut-off between small and large sequences. Sequences longer than 100 kb are shown in the inset. **(B)** Results of the classification of TR sequences. Sequences encoding a detectable capsid gene were classified to a viral taxon according to capsid type, as follows. *Caudovirales*: HK97 fold capsid; *Inoviridae*: *Inoviridae* MCP; and *Microviridae*: *Microviridae* MCP. The capsid-less TR sequences with ParA, ParB, ParM, and/or MoBM were classified as Plasmid-like. The remaining sequences were labeled as “Unclassified.” **(C)** Distribution of singleton coverage and coding ratio. Selected k-values were higher in large TR sequences, to avoid doubletons by chance.
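The classification rule in panel (B) can be sketched as a small decision function. The gene labels are illustrative strings taken from the captions (HK97, InoMCP, MicroMCP, and the plasmid markers as spelled above); the order in which capsid types are checked when a sequence carries more than one is an assumption of this sketch, not something the caption specifies.

```python
# Capsid label -> viral taxon, following the Fig 1B caption.
# Checking order for sequences with several capsid types is an assumption.
CAPSID_TO_TAXON = {
    "HK97": "Caudovirales",
    "InoMCP": "Inoviridae",
    "MicroMCP": "Microviridae",
}

# Plasmid marker labels, spelled as in the caption.
PLASMID_MARKERS = {"ParA", "ParB", "ParM", "MoBM"}


def classify_tr_sequence(gene_labels):
    """Classify one TR sequence from the set of marker genes detected on it."""
    labels = set(gene_labels)
    # A detectable capsid gene decides the viral taxon.
    for capsid, taxon in CAPSID_TO_TAXON.items():
        if capsid in labels:
            return taxon
    # Capsid-less sequences with any partition/mobilization marker.
    if labels & PLASMID_MARKERS:
        return "Plasmid-like"
    return "Unclassified"


print(classify_tr_sequence({"HK97", "ParA"}))
print(classify_tr_sequence({"ParB"}))
print(classify_tr_sequence(set()))
```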

**Fig 2. Predicted targeting hosts of CRISPR-targeted TR sequences.**
**(A)** The targeting host composition of TR sequences. Hosts were predicted by mapping CRISPR DR sequences to the RefSeq database. Sequences containing ≥10 protospacer loci but less than 90% associated DR taxa exclusiveness were classified as an ambiguous targeting host. When ≥10 protospacers could not be assigned to a taxon, the predicted targeting host was denoted as not available (NA). **(B)** Predicted targeting host distribution according to GC content. The dotted line indicates the low- and high-GC content boundary, at 55%. **(C)** Bayesian phylogeny of *Microviridae* major capsid proteins. A total of 159 representative major capsid protein sequences from this study, and 43 RefSeq sequences were used for analysis. Taxa without a name denote the *Microviridae* species from this study, and taxa with text denote *Microviridae* species from RefSeq. Taxa were annotated based on predicted targeting hosts. The phi X174 clade was selected as the outgroup.
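One possible reading of the host-assignment rule in panel (A) is sketched below. It assumes that each protospacer locus carries either a taxon label (from its associated DR) or `None`, that fewer than 10 taxon-assigned loci gives NA, and that "exclusiveness" is the share of assigned loci pointing to the most common taxon. These interpretations, and all names in the code, are assumptions for illustration only.

```python
from collections import Counter


def predict_targeting_host(protospacer_taxa, min_loci=10, min_exclusiveness=0.9):
    """Predict a targeting host from per-protospacer taxon labels.

    protospacer_taxa: one entry per protospacer locus; None means the
    associated DR could not be assigned to a taxon.
    """
    assigned = [taxon for taxon in protospacer_taxa if taxon is not None]
    # Too few taxon-assigned protospacers: no prediction.
    if len(assigned) < min_loci:
        return "NA"
    top_taxon, count = Counter(assigned).most_common(1)[0]
    # Require the dominant taxon to account for at least 90% of assigned loci.
    if count / len(assigned) >= min_exclusiveness:
        return top_taxon
    return "Ambiguous"


print(predict_targeting_host(["Firmicutes"] * 12))
print(predict_targeting_host(["Firmicutes"] * 6 + ["Bacteroidetes"] * 6))
print(predict_targeting_host(["Firmicutes"] * 5 + [None] * 20))
```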

**Fig 3. Venn diagrams of database comparisons for (A) large and (B) small TR sequences.**
Each TR sequence was compared to RefSeq virus, RefSeq plasmid, IMG/VR, and GVD using BLASTN. The database hit minimum criteria were set to 85% sequence identity with 75% aligned fraction of the query sequence to a unique subject sequence.
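The hit criteria above can be applied to BLASTN tabular output roughly as follows. The sketch takes the alignments (HSPs) between one query and one subject as `(qstart, qend, pident)` tuples, merges overlapping query intervals to get the aligned fraction, and uses a length-weighted mean identity. How identity is aggregated across several HSPs is an assumption here; the caption gives only the two thresholds.

```python
def aligned_fraction(hsps, query_len):
    """Fraction of the query covered by the union of HSP intervals.

    hsps: iterable of (qstart, qend, pident), 1-based inclusive coordinates.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e, _ in hsps)
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / query_len


def is_database_hit(hsps, query_len, min_identity=85.0, min_fraction=0.75):
    """True if one query/subject pair meets both thresholds."""
    hsps = list(hsps)
    if not hsps:
        return False
    total = sum(abs(e - s) + 1 for s, e, _ in hsps)
    # Length-weighted mean identity across HSPs (an assumption of this sketch).
    mean_identity = sum((abs(e - s) + 1) * p for s, e, p in hsps) / total
    return (mean_identity >= min_identity
            and aligned_fraction(hsps, query_len) >= min_fraction)


print(is_database_hit([(1, 500, 90.0), (400, 800, 95.0)], query_len=1000))
print(is_database_hit([(1, 700, 99.0)], query_len=1000))
```

Merging intervals before computing coverage prevents overlapping HSPs from being counted twice toward the 75% threshold.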

**Fig 4. Classifications and genome organizations of the discovered *Inoviridae* species.**
**(A)** Bayesian phylogeny of Zot domains. Representatives were selected from RefSeq and the *Inoviridae* major coat-protein-encoding TR sequences by clustering Zot amino acid sequences using a 50% identity threshold. Each taxon was colored according to its corresponding family. The families of the discovered genomes (Ino-01 to Ino-08) were predicted using the ICTV-provided taxonomic classification program. A sequence reported in a previous study (Ino-07) is denoted in parentheses. **(B)** Genome organizations of the discovered *Inoviridae* species. All sequences were phased to align so that the *Rep* genes appear first. The predicted ORFs are colored according to the annotation results.

**Fig 5. Hierarchical clustering of (A) large and (B) small TR sequences based on gene content.**
Heatmaps representing the gene content of TR sequences, in which each row is a TR sequence and each column is a gene cluster. The gray areas in the heatmap indicate sequences encoding a gene that is homologous to the gene cluster. Note that one gene can be homologous to multiple gene clusters. Sequences are annotated by database containing similar sequences, GC content, host, and capsid genes. Capsid genes are colored differently according to their types, as indicated in the figure: HK97, *Microviridae* major capsid protein (MicroMCP), and *Inoviridae* major coat protein (InoMCP). Gene clusters were annotated by searching corresponding HMMs in the UniRef50 database. Several notable RefSeq-listed clusters are denoted on the right side of the heatmaps.

**Fig 6. Number of mapped spacers according to sequence identity threshold.**
All unique CRISPR spacers were mapped to large TR sequences, small TR sequences, and scrambled sequences. The relaxed sequence identity thresholds applied initially are denoted as red- and orange-colored dashed lines. The spacer mapping process was identical to the protospacer discovery process (see Materials and Methods).
