Bioinformatics. 2019 May 15;35(10):1766-1767.
doi: 10.1093/bioinformatics/bty863.

cath-resolve-hits: a new tool that resolves domain matches suspiciously quickly

T E Lewis et al. Bioinformatics. 2019.

Abstract

Motivation: Many areas of bioinformatics require domain matches to be assigned to stretches of a query protein. Starting with a set of candidate matches, we want to identify the optimal subset with little or no overlap between matches. This may be further complicated by discontinuous domains in the input data. Existing tools increasingly face very large datasets, for which they require prohibitive amounts of CPU time and memory.

Results: We present cath-resolve-hits (CRH), a new tool that uses a dynamic-programming algorithm implemented in open-source C++ to handle large datasets quickly (up to ∼1 million hits/second) and in reasonable amounts of memory. It accepts multiple input formats and provides its output in plain text, JSON or graphical HTML. We describe a benchmark against an existing algorithm, which shows CRH delivers very similar or slightly improved results and very much improved CPU/memory performance on large datasets.
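The problem described above, choosing the best-scoring subset of non-overlapping matches, can be illustrated with the classic weighted interval scheduling dynamic program. The sketch below is only an illustration under simplifying assumptions: it is not CRH's actual implementation, it handles only continuous (single-segment) matches, it permits no overlap at all, and the `Hit` type, the `resolve_hits` function and the example scores are all hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """A hypothetical candidate domain match on a query protein."""
    label: str    # illustrative identifier for the match
    start: int    # first residue covered (inclusive)
    stop: int     # last residue covered (inclusive)
    score: float  # higher is better


def resolve_hits(hits: List[Hit]) -> Tuple[float, List[Hit]]:
    """Return the best total score and a non-overlapping subset achieving it."""
    ordered = sorted(hits, key=lambda h: h.stop)
    stops = [h.stop for h in ordered]

    # best[i]: best total score achievable using only the first i sorted hits.
    best = [0.0] * (len(ordered) + 1)
    # prev[i]: how many of the first i-1 hits end before hit i starts.
    prev = [0] * (len(ordered) + 1)

    for i, hit in enumerate(ordered, start=1):
        prev[i] = bisect_right(stops, hit.start - 1, 0, i - 1)
        # Either skip this hit, or take it plus the best compatible prefix.
        best[i] = max(best[i - 1], best[prev[i]] + hit.score)

    # Trace back through the table to recover the chosen hits.
    chosen: List[Hit] = []
    i = len(ordered)
    while i > 0:
        hit = ordered[i - 1]
        if best[prev[i]] + hit.score > best[i - 1]:
            chosen.append(hit)
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[len(ordered)], chosen


if __name__ == "__main__":
    example = [
        Hit("A", 1, 100, 50.0),
        Hit("B", 80, 200, 70.0),
        Hit("C", 150, 300, 60.0),
        Hit("D", 201, 350, 30.0),
    ]
    total, selected = resolve_hits(example)
    print(total, [h.label for h in selected])
```

In this made-up example the single best hit, B, overlaps both A and C, so any selection that starts from B can reach at most 100 (B plus D), whereas the dynamic program finds A plus C with a total of 110. Sorting dominates the cost, so the sketch runs in O(n log n) time for n hits.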

Availability and implementation: CRH is available at https://github.com/UCLOrengoGroup/cath-tools; documentation is available at http://cath-tools.readthedocs.io.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
(A) Performance of CRH, DF3 and Naïve Greedy at 100%, 60% and 30% sequence identity homology removal (see Methods). The axes show the proportion of domains assigned to the correct domain superfamily (y-axis) and to an incorrect domain superfamily (x-axis). A perfect result would appear at the top-left corner. Resolving all the benchmark HMM assignments (475,161 hits) with CRH took 3.3 s (Intel i7-7500U, up to 3.5 GHz) with a peak memory usage of 143 MB. (B, C) CPU time in minutes (B) and memory in GB (C) used per 100,000 inputs to resolve a randomly chosen subset of hits to a large protein (human titin), averaged over 100 runs. The stars indicate the points beyond which DF3 failed to run, even with ample memory available.
