Systematic prediction of functionally linked genes in bacterial and archaeal genomes

Sergey A Shmakov¹,², Guilhem Faure¹,³, Kira S Makarova¹, Yuri I Wolf¹, Konstantin V Severinov²,⁴,⁵, Eugene V Koonin⁶

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA.
² Skolkovo Institute of Science and Technology, Skolkovo, Russia.
³ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Waksman Institute of Microbiology, Rutgers, The State University of New Jersey, Piscataway, NJ, USA.
⁵ Institute of Molecular Genetics, Russian Academy of Sciences, Moscow, Russia.
⁶ National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA. koonin@ncbi.nlm.nih.gov.

PMID: 31520072
PMCID: PMC6938587
DOI: 10.1038/s41596-019-0211-1

Nat Protoc. 2019 Oct;14(10):3013-3031. doi: 10.1038/s41596-019-0211-1. Epub 2019 Sep 13.

Abstract

Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a 'bait' gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR-Cas systems using the 'CRISPRicity' metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.
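
To illustrate the kind of relevance metrics the abstract describes, the sketch below computes, for each gene cluster, the fraction of its members that lie inside bait neighborhoods and their mean distance to the nearest bait. The abstract does not give the exact formulas, so the plain ratio, the `Gene` record, the `window` parameter and the `relevance_metrics` function are all assumptions made for illustration, not the protocol's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    contig: str      # contig identifier
    position: int    # ordinal position of the gene on the contig
    cluster: str     # hypothetical protein-cluster label


def relevance_metrics(genes, baits, window=10):
    """Return {cluster: (fraction_near_bait, mean_distance_or_None, size)}.

    `baits` maps a contig to the gene positions of its baits. A gene counts
    as "near" when it lies within `window` genes of a bait on the same
    contig. This plain ratio is an assumed stand-in for the protocol's metric.
    """
    near = defaultdict(list)   # cluster -> distances of members near a bait
    total = defaultdict(int)   # cluster -> number of members overall
    for g in genes:
        total[g.cluster] += 1
        dists = [abs(g.position - b) for b in baits.get(g.contig, [])]
        if dists and min(dists) <= window:
            near[g.cluster].append(min(dists))
    out = {}
    for cluster, n in total.items():
        d = near[cluster]
        out[cluster] = (len(d) / n, sum(d) / len(d) if d else None, n)
    return out
```

A cluster whose members almost always sit close to a bait scores near 1 on the first value, while a cluster spread across the genome scores near 0. This is the intuition behind comparing abundance within and outside the bait neighborhoods.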

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

**Fig. 1 |. The pipeline for the identification of gene families associated with a set of baits.**
Seven stages of the pipeline are shown as boxes; each box contains information on the main action and output for the stage.

**Fig. 2 |. A detailed, step-by-step schematic of the protocol.**
Each stage of the protocol is represented by a gray box. Stage 1: contigs are shown as gray lines and the baits as red stripes within the contigs. Stage 2: ORFs in the contigs are shown as gray polygons. Stage 3: clustering procedure; the color of each ORF reflects the cluster assignment. Stage 4: profile construction from a set of proteins; PSIBLAST hits are shown as red rectangles within ORFs; sorting and filtering of proteins in clusters is performed. Stage 5: strict clustering procedure, Icity calculation and 3D metrics space: Icity, abundance in the genomic database and distance to the baits (red crosses denote clusters that contain Cas proteins, green dots denote clusters containing predicted ancillary CRISPR-linked proteins and blue circles denote clusters that do not include any CRISPR-related proteins). Stage 6: approaches to classify metrics space. Stage 7: methods of manual curation.

**Fig. 3 |. The space of relevance metrics.**
Protein clusters characterized by their Icity, effective abundance and effective distance to the baits are shown. Annotation for each cluster was performed by using PSIBLAST to classify the clusters into categories: ‘Cas’, a known Cas protein; ‘Associated’, predicted ancillary Cas proteins; and ‘Non-Cas’, no CRISPR-related proteins.

**Fig. 4 |. Dissection of the space of relevance metrics.**
The yellow area shows the sector with the maximum F score (optimized recall/precision). Annotation for each cluster was performed by using PSIBLAST to classify the clusters into categories: ‘Cas’, a known Cas protein; ‘Associated’, predicted ancillary Cas proteins; and ‘Non-Cas’, no CRISPR-related proteins.
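
One simple way to pick a sector like the one highlighted in Fig. 4 is a grid search over thresholds that maximizes the F score against clusters with known labels. The sketch below assumes the F1 form of the score, only two metrics (a relevance score and a distance to the bait), and a hypothetical `best_sector` helper. The protocol itself may dissect the metric space differently.

```python
from itertools import product


def f_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall (0.0 when undefined)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_sector(points, score_grid, dist_grid):
    """Find the (min_score, max_dist) thresholds that maximize F1.

    `points` is a list of (score, distance, is_positive) tuples, where
    `is_positive` marks clusters already known to be bait-linked.
    Returns (best_f, min_score, max_dist).
    """
    best = (0.0, None, None)
    for s_min, d_max in product(score_grid, dist_grid):
        tp = fp = fn = 0
        for score, dist, positive in points:
            selected = score >= s_min and dist <= d_max
            if selected and positive:
                tp += 1
            elif selected:
                fp += 1
            elif positive:
                fn += 1
        f = f_score(tp, fp, fn)
        if f > best[0]:
            best = (f, s_min, d_max)
    return best
```

Clusters that fall inside the selected sector but carry no known annotation would then be the candidates passed on to manual curation.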

Grants and funding

R01 GM104071/GM/NIGMS NIH HHS/United States
