Appl Environ Microbiol. 2013 Dec;79(24):7696-701.
doi: 10.1128/AEM.02411-13. Epub 2013 Oct 4.

GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis

Bruno Contreras-Moreira et al. Appl Environ Microbiol. 2013 Dec.

Abstract

GET_HOMOLOGUES is an open-source software package that builds on popular orthology-calling approaches making highly customizable and detailed pangenome analyses of microorganisms accessible to nonbioinformaticians. It can cluster homologous gene families using the bidirectional best-hit, COGtriangles, or OrthoMCL clustering algorithms. Clustering stringency can be adjusted by scanning the domain composition of proteins using the HMMER3 package, by imposing desired pairwise alignment coverage cutoffs, or by selecting only syntenic genes. The resulting homologous gene families can be made even more robust by computing consensus clusters from those generated by any combination of the clustering algorithms and filtering criteria. Auxiliary scripts make the construction, interrogation, and graphical display of core genome and pangenome sets easy to perform. Exponential and binomial mixture models can be fitted to the data to estimate theoretical core genome and pangenome sizes, and high-quality graphics can be generated. Furthermore, pangenome trees can be easily computed and basic comparative genomics performed to identify lineage-specific genes or gene family expansions. The software is designed to take advantage of modern multiprocessor personal computers as well as computer clusters to parallelize time-consuming tasks. To demonstrate some of these capabilities, we survey a set of 50 Streptococcus genomes annotated in the Orthologous Matrix (OMA) browser as a benchmark case. The package can be downloaded at http://www.eead.csic.es/compbio/soft/gethoms.php and http://maya.ccg.unam.mx/soft/gethoms.php.
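The consensus-cluster idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a consensus cluster is one reported with identical membership by every algorithm; the function name, gene identifiers, and toy clusters below are hypothetical and are not taken from the package itself.

```python
def consensus_clusters(*clusterings):
    """Keep only clusters reported with identical membership by every clustering."""
    # Represent each cluster as a frozenset so that member order does not matter.
    as_sets = [{frozenset(cluster) for cluster in clustering}
               for clustering in clusterings]
    shared = set.intersection(*as_sets)
    # Return a deterministic, sorted list of sorted gene lists.
    return sorted(sorted(cluster) for cluster in shared)


# Hypothetical toy output of two clustering runs (not real data).
omcl = [["gA1", "gB1", "gC1"], ["gA2", "gB2"], ["gA3"]]
cog = [["gB1", "gA1", "gC1"], ["gA2", "gB2", "gC2"], ["gA3"]]

consensus = consensus_clusters(omcl, cog)
print(consensus)
```

In this toy case the second family is dropped because the two runs disagree on whether gC2 belongs to it, which is the sense in which a consensus set is more conservative than either input.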

Figures

Fig 1
GET_HOMOLOGUES flow chart and its outcomes. BLAST and optional Pfam searches are optimized for local (multicore) and cluster computer environments. While the BDBH algorithm uses one sequence from the reference genome to grow clusters, the COG algorithm requires a triangle of reciprocal hits. Instead, the OMCL algorithm groups nodes in a BLAST graph to build clusters. Note that these clustering algorithms can be fine-tuned by customizing parameters such as -C (minimum percentage of coverage in pairwise BLAST alignments), -E (maximum E value for a hit to be considered), -D (require equal Pfam domain composition when defining similarity-based orthology composition), -S (minimum percentage of sequence identity in BLAST query/subject pairs [BDBH|OMCL]) and -N (minimum BLAST neighborhood correlation [BDBH|OMCL]). In addition, the user can choose which genome should be used as the reference using option -r.
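To make the bidirectional best-hit step and the -E and -C cutoffs concrete, here is a simplified sketch for one reference genome and one other genome. The tuple layout, function names, and cutoff values are assumptions chosen for illustration; the package's own implementation and BLAST parsing are not reproduced here.

```python
def best_hits(hits, max_evalue=1e-5, min_coverage=75.0):
    """Map each query to its highest-scoring subject among hits passing the cutoffs.

    Each hit is a tuple: (query, subject, evalue, coverage_percent, bitscore).
    The cutoffs play the role of -E (maximum E value) and -C (minimum coverage).
    """
    best = {}
    for query, subject, evalue, coverage, bitscore in hits:
        if evalue > max_evalue or coverage < min_coverage:
            continue  # hit rejected by the stringency filters
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(ref_vs_other, other_vs_ref, **cutoffs):
    """Return (reference gene, other gene) pairs that are each other's best hit."""
    forward = best_hits(ref_vs_other, **cutoffs)
    backward = best_hits(other_vs_ref, **cutoffs)
    return sorted((r, o) for r, o in forward.items() if backward.get(o) == r)


# Hypothetical toy hits (not real BLAST output).
ref_vs_other = [
    ("ref1", "oth1", 1e-50, 98.0, 300.0),
    ("ref1", "oth2", 1e-20, 90.0, 120.0),
    ("ref2", "oth2", 1e-30, 60.0, 150.0),  # fails the coverage cutoff
    ("ref3", "oth3", 1e-40, 95.0, 200.0),
]
other_vs_ref = [
    ("oth1", "ref1", 1e-50, 98.0, 300.0),
    ("oth2", "ref1", 1e-20, 90.0, 120.0),
    ("oth3", "ref2", 1e-45, 96.0, 210.0),  # oth3 prefers ref2, not ref3
    ("oth3", "ref3", 1e-40, 95.0, 200.0),
]

pairs = bidirectional_best_hits(ref_vs_other, other_vs_ref)
print(pairs)
```

Only ref1 and oth1 form a reciprocal pair here: ref2 loses its hit to the coverage filter, and ref3 is not the best hit of oth3. Raising the coverage cutoff or lowering the E-value cutoff makes the resulting clusters stricter in the same way.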
Fig 2
Pangenome analysis of 50 Streptococcus genomes from 14 species. (A) Venn diagram of core genomes generated by the BDBH, COG, and OMCL strategies. (B) Estimate of core genome size with the Tettelin (blue) and Willenbrock (red) fits (12, 22). (C) Estimate of pangenome size with the Tettelin fit. (D) Venn analysis of pangenomes generated by COG and OMCL. (E and F) Partition of the OMCL pangenomic matrix into shell, cloud, soft-core, and core compartments. These plots can be easily created with GET_HOMOLOGUES auxiliary scripts, as explained in the manual.
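The partition into core, soft-core, shell, and cloud compartments in panels E and F can be sketched from per-family occupancy counts. The thresholds used below (present in all genomes for core, in at least 95% for soft-core, in at most two genomes for cloud) are assumptions made for this example, and the caption does not state the package's exact definitions; the function name and toy counts are hypothetical.

```python
from collections import Counter


def classify_cluster(occupancy, n_genomes, soft_core_fraction=0.95, cloud_max=2):
    """Assign a gene family to a compartment from the number of genomes containing it.

    All thresholds are illustrative assumptions, not values taken from the paper.
    """
    if occupancy == n_genomes:
        return "core"
    if occupancy >= soft_core_fraction * n_genomes:
        return "soft-core"
    if occupancy <= cloud_max:
        return "cloud"
    return "shell"


# Hypothetical occupancy counts for five gene families across 20 genomes.
n_genomes = 20
occupancy = {"fam1": 20, "fam2": 19, "fam3": 10, "fam4": 2, "fam5": 1}

labels = {fam: classify_cluster(n, n_genomes) for fam, n in occupancy.items()}
sizes = Counter(labels.values())
print(labels)
print(sizes)
```

The labels here are mutually exclusive for simplicity; a soft-core definition that also includes the strict core would only require merging the first two branches.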
Fig 3
Parsimony pangenome tree for 50 Streptococcus proteomes derived from presence/absence data in a consensus (OMCL and COGtriangles) pangenome matrix computed from the OMA data set, as detailed in the text. This phylogeny was the most parsimonious tree found in a tree search performed with PARS from the PHYLIP suite, using 50 data jumbles. The tree has a total length of 11,473 steps.
