PLoS One. 2014 Oct 17;9(10):e110726.
doi: 10.1371/journal.pone.0110726. eCollection 2014.

MacSyFinder: a program to mine genomes for molecular systems with an application to CRISPR-Cas systems

Sophie S Abby et al. PLoS One.

Abstract

Motivation: Biologists often wish to use their knowledge of a few experimental models of a given molecular system to identify homologs in genomic data. We developed a generic tool for this purpose.

Results: Macromolecular System Finder (MacSyFinder) provides a flexible framework to model the properties of molecular systems (cellular machinery or pathways), including their components, evolutionary associations with other systems and genetic architecture. Modelled features also include functional analogs and the use of the same component by several systems. Models are used to search for molecular systems in complete genomes or in unstructured data such as metagenomes. The components of the systems are searched for by sequence similarity using hidden Markov model (HMM) protein profiles. Hits are assigned to a given system based on their compliance with the content and organization of the system model. A graphical interface, MacSyView, facilitates the analysis of the results by showing overviews of component content and genomic context. To exemplify the use of MacSyFinder, we built models to detect and classify CRISPR-Cas systems following a previously established classification. We show that MacSyFinder makes it easy to define an accurate "Cas-finder" using publicly available protein profiles.

Availability and implementation: MacSyFinder is a standalone application implemented in Python. It requires Python 2.7, Hmmer and makeblastdb (version 2.2.28 or higher). It is freely available with its source code under a GPLv3 license at https://github.com/gem-pasteur/macsyfinder. It is compatible with all platforms supporting Python and Hmmer/makeblastdb. The "Cas-finder" (models and HMM profiles) is distributed as a compressed tarball archive as Supporting Information.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Modelling systems with MacSyFinder.
The components of a system assemble into macromolecular systems or correspond to a biological pathway. They are typically encoded in genomes in one or a few different loci (“Genomic context”). We illustrate how systems can be modelled and distinguished with two imaginary systems, “A” and “B”, that have four homologous components (C1–C4, similar colours for the two systems). System “B” has one component that is not found in “A” (C5). The parameter inter_gene_max_space (D) defines the maximal number of genes between two consecutive components (di,j). The two systems are defined by a set of mandatory (green), accessory (black) and forbidden (red) components. The quorum rules allow the definition of the system to be relaxed without altering the list of its components (min_genes_required and min_mandatory_genes_required parameters in XML files). If they are not specified, a default value is computed from the number of components described in the XML files. The bottom part of the figure shows the description of the systems in the XML grammar (see the documentation in File S1). Components listed here refer to protein profiles (Fig. 3). When a component is found in several systems, it is defined only once and can be reused in another system with the system_ref keyword. Much more complex features can be defined, including exchangeable genes, distant genes and component-specific parameters (File S1).
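
The quorum rules described in this caption can be illustrated with a minimal Python sketch. It is not MacSyFinder's implementation: the `SystemModel` class, the `satisfies_quorum` function, the assignment of components to categories, and the default quorum (every mandatory component required) are assumptions made for illustration. Only the parameter names `min_genes_required` and `min_mandatory_genes_required` come from the caption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class SystemModel:
    """Hypothetical in-memory stand-in for one XML system definition."""
    name: str
    mandatory: FrozenSet[str]
    accessory: FrozenSet[str] = frozenset()
    forbidden: FrozenSet[str] = frozenset()
    # Quorum parameters named as in the caption; None means "use a default".
    min_mandatory_genes_required: Optional[int] = None
    min_genes_required: Optional[int] = None


def satisfies_quorum(model: SystemModel, found: Iterable[str]) -> bool:
    """Return True if the detected components comply with the model content."""
    found_set = set(found)
    # Assumed default: every mandatory component is required.
    min_mandatory = (model.min_mandatory_genes_required
                     if model.min_mandatory_genes_required is not None
                     else len(model.mandatory))
    # Assumed default: the total quorum equals the mandatory quorum.
    min_total = (model.min_genes_required
                 if model.min_genes_required is not None
                 else min_mandatory)
    # A single forbidden component rules the system out.
    if found_set & model.forbidden:
        return False
    n_mandatory = len(found_set & model.mandatory)
    n_total = n_mandatory + len(found_set & model.accessory)
    return n_mandatory >= min_mandatory and n_total >= min_total


# Imaginary systems "A" and "B" in the spirit of Figure 1; which component is
# mandatory, accessory or forbidden is invented for this example.
SYSTEM_A = SystemModel("A", mandatory=frozenset({"C1", "C2"}),
                       accessory=frozenset({"C3", "C4"}),
                       forbidden=frozenset({"C5"}),
                       min_genes_required=3)
SYSTEM_B = SystemModel("B", mandatory=frozenset({"C1", "C2", "C5"}),
                       accessory=frozenset({"C3", "C4"}))
```

With these toy definitions, a locus containing C1, C2 and C3 satisfies system "A" but not system "B", whose mandatory C5 is missing, while any locus containing C5 is rejected for "A".
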
Figure 2. Snapshot of MacSyFinder's results as viewed with MacSyView.
A. The MacSyView web-browser-based application allows the visualization of MacSyFinder's output file “results.macsyfinder.json”. B. MacSyView displays the list of systems available in the results file. The user picks a system to visualize by clicking on it in the list. C. The page displaying the system is made of a header and three panels. The header lets the user select another input file or go back to the list of systems. It displays information on the system that is being visualized. The first panel shows how the detected system complies with the model in terms of its components. Boxes represent the number of each mandatory, accessory, and forbidden component. A tooltip gives the name of the component when the mouse hovers over a box. Component boxes can be sorted by decreasing number of components. The second panel shows the genetic context of the system (as transcribed from the input fasta file), with components drawn to scale. When the mouse hovers over a box, a tooltip displays information on the corresponding component, including scores of the Hmmer hit. This view can be exported as an SVG file for drawing purposes (tools circled in red). The third panel gives detailed information on the components of the system.
Figure 3. Functioning of MacSyFinder.
A. The user launches MacSyFinder to detect macromolecular systems A and B (example of Fig. 1). System-specific parameters are read from the corresponding XML definition files. This includes the list of the components of the systems and the corresponding HMM profiles. Other detection parameters are taken, in decreasing order of priority, from the command line, the configuration file, and the XML files. Sequences are indexed with the “formatdb” or “makeblastdb” tools for similarity search with the Hmmer program. MacSyFinder runs (optionally in parallel) the Hmmer searches on a non-redundant list of components' profiles. If the sequence dataset is “unordered”, MacSyFinder only outputs the hits and the components detected for each type of system. B. Step #1: the co-localization criterion can be used in ordered datasets. It involves clustering the hits separated by fewer than D protein-coding genes. The components described as “loner” in the XML definition files can be at any distance from other components. Step #2: the components of each cluster are used to fill the occurrences of the systems. Depending on the quorum, a cluster can describe a “full” system or a “scattered” system. Step #3: clusters with components belonging to more than one system are split into unique systems and then re-directed separately to step #2.
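
Step #1 of this caption, the co-localization clustering, might be sketched in Python as follows. This is an illustration under stated assumptions rather than MacSyFinder's own code: the `Hit` tuple and the `cluster_hits` function are hypothetical, hits are identified by the rank of their gene along a linear replicon, and “loner” components and circular replicons are ignored. Figure 1 describes D as the maximal number of genes between two consecutive components, so the sketch keeps two hits together when at most D genes lie between them; treat that boundary as an assumption.

```python
from typing import List, Tuple

# (rank of the gene along the replicon, component name); hypothetical layout.
Hit = Tuple[int, str]


def cluster_hits(hits: List[Hit], inter_gene_max_space: int) -> List[List[Hit]]:
    """Group hits so that consecutive members of a cluster have at most
    `inter_gene_max_space` genes between them."""
    clusters: List[List[Hit]] = []
    for hit in sorted(hits):
        if clusters:
            previous_rank = clusters[-1][-1][0]
            genes_between = hit[0] - previous_rank - 1
            if genes_between <= inter_gene_max_space:
                # Close enough to the previous hit: extend the current cluster.
                clusters[-1].append(hit)
                continue
        # First hit, or too far from the previous one: start a new cluster.
        clusters.append([hit])
    return clusters


# Toy hits invented for the example.
example_hits = [(3, "C1"), (4, "C2"), (7, "C3"), (20, "C4"), (21, "C1")]
example_clusters = cluster_hits(example_hits, inter_gene_max_space=2)
```

In this toy input the hits at ranks 3, 4 and 7 form one cluster and those at ranks 20 and 21 form another; each cluster would then be checked against the quorum of the candidate systems (step #2).
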
Figure 4. Simplified operon organization of the three major types and ten subtypes of CRISPR-Cas systems.
Each cas gene family is indicated with a distinct colour; those specific to a subtype are in white. Only the main cas gene families are represented.
Figure 5. Frequency of co-occurrence between Cas proteins present in clusters detected with the general model (left) and the subtyping models (right).
Each matrix was normalized by the maximum of each column. Higher frequencies are shown in warmer colours: the red diagonal corresponds to 100% co-occurrence. Only frequencies above 1% are shown; others are in grey.
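
The column-wise normalization mentioned in this caption can be sketched in Python. The caption does not say how the raw counts were obtained, so the sketch assumes that each cluster contributes one count per pair of proteins it contains; the `cooccurrence_frequencies` function and the toy clusters are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_frequencies(
        clusters: Iterable[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Count how often two proteins share a cluster, then divide every
    count by the maximum of its column."""
    counts: Counter = Counter()
    for cluster in clusters:
        # Every ordered pair (row, column), including a protein with itself.
        for row, col in product(cluster, repeat=2):
            counts[(row, col)] += 1
    col_max: Dict[str, int] = {}
    for (_, col), n in counts.items():
        col_max[col] = max(col_max.get(col, 0), n)
    return {pair: n / col_max[pair[1]] for pair, n in counts.items()}


# Toy clusters invented for the example, not data from the paper.
toy_clusters = [{"cas1", "cas2"}, {"cas1", "cas2", "cas9"}, {"cas1"}]
toy_frequencies = cooccurrence_frequencies(toy_clusters)
```

Under this assumption a protein co-occurs with itself in every cluster that contains it, so the diagonal entry is the maximum of its column and normalizes to 100%, which matches the red diagonal described in the caption.
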
