Nat Protoc. 2019 Oct;14(10):3013-3031.
doi: 10.1038/s41596-019-0211-1. Epub 2019 Sep 13.

Systematic prediction of functionally linked genes in bacterial and archaeal genomes

Sergey A Shmakov et al. Nat Protoc. 2019 Oct.

Abstract

Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a 'bait' gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR-Cas systems using the 'CRISPRicity' metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.
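As a rough illustration of the kind of relevance metrics the abstract describes (abundance of a gene family within and outside the bait neighborhoods, and distance to the bait), the following minimal sketch computes three per-family values. It is not the authors' implementation: the `Gene` record, the `relevance_metrics` function, the `window` parameter and the use of raw counts are all assumptions made for illustration, and the published protocol may weight or normalize these quantities differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record (not the protocol's data format)."""
    contig: str    # identifier of the contig carrying the gene
    position: int  # ordinal index of the gene along its contig
    cluster: str   # identifier of the protein cluster (gene family)


def relevance_metrics(genes, baits, window=10):
    """Compute simplified per-cluster relevance metrics.

    genes  : iterable of Gene
    baits  : iterable of (contig, position) pairs marking bait genes
    window : assumed neighborhood half-width, in genes, around each bait

    Returns {cluster: {"abundance", "icity", "mean_distance"}} where
    "icity" is the fraction of cluster members found within `window`
    genes of a bait (an assumed, unweighted stand-in for the metric).
    """
    bait_positions = {}
    for contig, pos in baits:
        bait_positions.setdefault(contig, []).append(pos)

    stats = {}
    for g in genes:
        s = stats.setdefault(g.cluster, {"total": 0, "near": 0, "dist_sum": 0})
        s["total"] += 1
        positions = bait_positions.get(g.contig, [])
        if positions:
            # Distance, in genes, to the closest bait on the same contig.
            d = min(abs(g.position - p) for p in positions)
            if d <= window:
                s["near"] += 1
                s["dist_sum"] += d

    result = {}
    for cluster, s in stats.items():
        result[cluster] = {
            "abundance": s["total"],
            "icity": s["near"] / s["total"],
            # Mean distance is undefined if the family never occurs near a bait.
            "mean_distance": s["dist_sum"] / s["near"] if s["near"] else None,
        }
    return result


# Toy data: two baits and two gene families.
example_genes = [
    Gene("c1", 4, "famA"),
    Gene("c1", 30, "famB"),
    Gene("c2", 7, "famA"),
    Gene("c3", 1, "famB"),
    Gene("c2", 9, "famB"),
]
example_baits = [("c1", 5), ("c2", 8)]
example_metrics = relevance_metrics(example_genes, example_baits, window=3)
```

In this toy input, `famA` always occurs next to a bait, whereas `famB` does so in only one of its three occurrences, so `famA` would be the stronger candidate for functional linkage under this simplified scoring.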

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Fig. 1 | The pipeline for the identification of gene families associated with a set of baits.
Seven stages of the pipeline are shown as boxes; each box contains information on the main action and output for the stage.
Fig. 2 | A detailed, step-by-step schematic of the protocol.
Each stage of the protocol is represented by a gray box. Stage 1: contigs are shown as gray lines and the baits as red stripes within the contigs. Stage 2: ORFs in the contigs are shown as gray polygons. Stage 3: clustering procedure; the color of each ORF reflects the cluster assignment. Stage 4: profile construction from a set of proteins; PSIBLAST hits are shown as red rectangles within ORFs; sorting and filtering of proteins in clusters is performed. Stage 5: strict clustering procedure, Icity calculation and 3D metrics space: Icity, abundance in the genomic database and distance to the baits (red crosses denote clusters that contain Cas proteins, green dots denote clusters containing predicted ancillary CRISPR-linked proteins and blue circles denote clusters that do not include any CRISPR-related proteins). Stage 6: approaches to classify metrics space. Stage 7: methods of manual curation.
Fig. 3 | The space of relevance metrics.
Protein clusters characterized by their Icity, effective abundance and effective distance to the baits are shown. Annotation for each cluster was performed by using PSIBLAST to classify the clusters into categories: ‘Cas’, a known Cas protein; ‘Associated’, predicted ancillary Cas proteins; and ‘Non-Cas’, no CRISPR-related proteins.
Fig. 4 | Dissection of the space of relevance metrics.
The yellow area shows the sector with the maximum F score (optimized recall/precision). Annotation for each cluster was performed by using PSIBLAST to classify the clusters into categories: ‘Cas’, a known Cas protein; ‘Associated’, predicted ancillary Cas proteins; and ‘Non-Cas’, no CRISPR-related proteins.
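The idea of selecting the region with the maximum F score can be sketched in one dimension. The example below scans candidate cutoffs on a single metric and keeps the one that maximizes F1 against known labels. This is a simplification under stated assumptions: the figure describes a sector in a multi-metric space, whereas this sketch uses a single threshold, and the function names `f_score` and `best_threshold` as well as the toy scores are hypothetical.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores, labels):
    """Return (best_f1, threshold), calling a cluster positive if score >= threshold.

    scores : per-cluster metric values (e.g. a simplified Icity)
    labels : True for clusters annotated as bait-related, False otherwise
    """
    best = (0.0, None)
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        f = f_score(tp, fp, fn)
        if f > best[0]:
            best = (f, t)
    return best


# Toy data: five clusters with a metric value and a known annotation.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [True, True, False, True, False]
toy_best_f, toy_best_t = best_threshold(toy_scores, toy_labels)
```

For this toy input, the cutoff at 0.3 recovers all three positive clusters at the cost of one false positive, which gives the highest F1 among the candidate cutoffs.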
