PLoS Comput Biol. 2023 Nov 7;19(11):e1011498.
doi: 10.1371/journal.pcbi.1011498. eCollection 2023 Nov.

CGG toolkit: Software components for computational genomics

Dimitrios Vasileiou et al. PLoS Comput Biol.

Abstract

Public-domain availability for bioinformatics software resources is a key requirement that ensures long-term permanence and methodological reproducibility for research and development across the life sciences. These issues are particularly critical for widely used, efficient, and well-proven methods, especially those developed in research settings that often face funding discontinuities. We re-launch a range of established software components for computational genomics, as legacy version 1.0.1, suitable for sequence matching, masking, searching, clustering and visualization for protein family discovery, annotation and functional characterization on a genome scale. These applications are made available online as open source and include MagicMatch, GeneCAST, support scripts for CoGenT-like sequence collections, GeneRAGE and DifFuse, supported by centrally administered bioinformatics infrastructure funding. The toolkit may also be conceived as a flexible genome comparison software pipeline that supports research in this domain. We illustrate basic use by examples and pictorial representations of the registered tools, which are further described with appropriate documentation files in the corresponding GitHub release.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Revived software tools.
A 2008 snapshot of the ‘Key software’ section of the CGG website followed by services (partly shown), with the list of tools made available again.
Fig 2. Representation of a typical workflow using the reported tools.
Pre-processing may start with a genome collection (database symbol, upper left), optionally mixed with a curated sequence resource such as UniProt (database symbol in green, upper left). To cross-index entries at the sequence level or simply identify them, MagicMatch can be used as an option. The sequence collection can be submitted to GeneCAST to mask compositional bias and prepare the query for sensitive searches (disk symbol with Q, lower left). For genome-scale analysis, species codes can be generated for the reference (target) set with cogent_utils, to create a uniformly named sequence set (disk symbol with R, lower middle, optionally mixed with UniProt or any other annotated collection). Sequence comparisons are executed with BLAST or other options with query Q vs. reference R (or in the case of all-vs-all, disk symbol in green-blue gradient, upper middle). The vertical gray line divides this pre-processing phase from the next phase, signifying the computationally intensive step or long wall-time. Two (non-mutually exclusive) output alternatives are shown: the pairs-list (in pink, upper right) or full alignments (also in pink, lower right). The former can be treated with clustt_utils that launches Tribe-MCL and generates protein families or can be used as input for network visualization with BioLayout or other similar software, while the latter can be further processed for GeneRAGE or DifFuse for multi-domain or gene-fusion detection, respectively, as well as for inspection and parsing for multiple alignments.
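
The step from sequence comparison to the pairs-list can be illustrated with a minimal sketch. It is not code from the toolkit: it assumes the comparison was run with BLAST and saved in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), and the function name `blast_tab_to_pairs` and the `max_evalue` cutoff of 1e-5 are hypothetical choices made only for this example. The pairs-list format actually expected by clustt_utils is described in the documentation files of the GitHub release and may differ.

```python
import csv
import io


def blast_tab_to_pairs(handle, max_evalue=1e-5):
    """Return sorted (query, subject, evalue) tuples from BLAST tabular output.

    Assumes the standard 12-column tabular layout. Self-hits are dropped,
    and only the lowest E-value is kept for each (query, subject) pair.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip comment lines and rows too short to hold an E-value.
        if len(row) < 11 or row[0].startswith("#"):
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if query == subject or evalue > max_evalue:
            continue
        key = (query, subject)
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return [(q, s, e) for (q, s), e in sorted(best.items())]


# Illustrative input only: four made-up hits, not real search results.
sample = "\n".join([
    "P1\tP1\t100.0\t200\t0\t0\t1\t200\t1\t200\t0.0\t400",
    "P1\tP2\t45.0\t180\t90\t3\t5\t184\t2\t181\t1e-40\t150",
    "P1\tP2\t30.0\t50\t35\t0\t100\t149\t90\t139\t1e-06\t40",
    "P2\tP3\t25.0\t60\t45\t1\t1\t60\t1\t60\t0.5\t20",
])
pairs = blast_tab_to_pairs(io.StringIO(sample))
for query, subject, evalue in pairs:
    print(query, subject, evalue, sep="\t")
```

Keeping a single best hit per pair is one possible design choice; a pipeline that needs every high-scoring segment, for example for multi-domain detection, would work from the full alignments instead.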
