PeerJ. 2023 Feb 8;11:e14779.
doi: 10.7717/peerj.14779. eCollection 2023.

Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

Rachel Nguyen et al. PeerJ.

Abstract

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are grouped into the same cluster rather than broken up into multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. As a result, remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2's clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with respect to the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.

Keywords: Homology; Protein clustering; Protein families.
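
The homogeneity, completeness, and V-measure named in the abstract can be illustrated with a small sketch. The block below is a minimal pure-Python version of the standard entropy-based definitions, not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration only.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given) from two paired label lists."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items()
    )


def homogeneity_completeness_v(true_labels, pred_labels):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_true = _entropy(true_labels)
    h_pred = _entropy(pred_labels)
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - _conditional_entropy(true_labels, pred_labels) / h_true
    )
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - _conditional_entropy(pred_labels, true_labels) / h_pred
    )
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Hypothetical toy data: six sequences from two true families.
true_families = ["a", "a", "a", "b", "b", "b"]
# A conservative clustering that splits each family into two clusters:
# every cluster is pure (high homogeneity) but families are broken up.
over_split = [0, 0, 1, 2, 2, 3]
# The same clustering after merging the split clusters back together.
merged = [0, 0, 0, 2, 2, 2]

before = homogeneity_completeness_v(true_families, over_split)
after = homogeneity_completeness_v(true_families, merged)
```

In this toy case the over-split clustering is perfectly homogeneous but incomplete, and merging the fragments raises completeness (and therefore the V-measure) without reducing homogeneity, which is the kind of trade-off a post-processing step aims for.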

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. Example pipeline depicting Complet+ usage.
For clarity, only the most prominent files required and produced by the MMseqs2 tools and Complet+ are depicted. The pipeline shown includes only one use of the clusterupdate module following the initial clustering with the cluster module, but the tests discussed later feature more successive increments with clusterupdate.
Figure 2. The Complet+ algorithm.
The alignment files contain additional data columns, including the one containing each alignment’s e-value (not depicted) that is used in the alignment sorting and filtering step.
Figure 3. Gephi visualization of the MMseqs2 search of the representative sequences.
Depicted are the reciprocal hits resulting from the sequence alignment. Each dot represents a representative sequence. Dots in close proximity indicate a reciprocal hit. Groups of close reciprocal hits are effectively clusters. Complet+ merges these clusters to improve completeness. Circled in red is one such cluster. Every representative sequence within this circle is a reciprocal hit to at least one other representative sequence in the circle. Complet+ merges these representative sequences, resulting in a single cluster. Every sequence in the image has a reciprocal hit; a dot that appears to be a single sequence is actually two or more dots on top of each other.
Figure 4. Number of clusters having one (singletons) through ten members for the MMseqs2 single-step clustering before and after applying Complet+.
Figure 5. The homogeneity, completeness, AMI, and ARI of the single-batch tests.
(A) All tests aside from “CD-HIT” use MMseqs2 with the clustering module stated. The sensitivity specified in parentheses refers to the MMseqs2 search run by Complet+, not the clustering sensitivity, which was the default value of 4.0 where applicable. Overall, Complet+ substantially improves each test case’s completeness at little expense to homogeneity. The loss in homogeneity is more notable when evaluating the clustering results at the family level of classification, but it is still smaller than the increase in completeness. Complet+ also improves the AMI and ARI of each clustering to varying degrees, having a generally greater improvement. (B) The two leftmost tests are Complet+ run at minimum and maximum MMseqs2 search sensitivity, each on the same Default linclust test results. The two following tests are identical aside from one using the cluster-reassign setting. The last test is the Connected Component (CC) clustering method of MMseqs2, run at the highest sensitivity. As in (A), AMI and ARI are improved in each case, and completeness is improved without significant loss of homogeneity.
Figure 6. Homogeneity/Completeness scatterplots demonstrating base-algorithm + Complet+ vs. base algorithms (Default, Connected Component clustering both with highest sensitivity of 7.5).
Complet+ can improve the completeness of each algorithm by a greater amount than is lost in homogeneity. Merging clusters whose representatives are reciprocal hits of each other (given an e-value threshold) allows more clusters to be merged than the stricter connected-node criteria of the CC algorithm.
Figure 7. Algorithm runtime vs. sensitivity levels for MMseqs2 and Complet+.
Both algorithms’ runtimes increase polynomially.
Figure 8. Homogeneity, completeness, AMI, and ARI at the superfamily vs. family level for both (A) “new classes” and (B) “random” test batching partitions (for five batches) for MMseqs2 cluster and clusterupdate followed by Complet+.
The tick labeled “All” on the graphs represents clustering all sequences in one single batch. Overall, Complet+ increases MMseqs2 completeness by substantially more than it reduces homogeneity relative to the default MMseqs2-generated clusters. Using Complet+ results in an increased AMI and ARI at both family and superfamily levels. Discovery of new classes also yields a large variance in performance, as opposed to the base algorithm obtaining most classes in the first batch. The variance is due to the number of actual families or superfamilies (“true” clusters). (C & D) Number of true and predicted clusters for default MMseqs2 and Complet+. The number of true clusters is always lower than the number that default MMseqs2 finds, and Complet+ is able to reduce the number of predicted clusters by 10–20% by merging proteins that belong to the same family/superfamily.
Figure 9. The Complet+ time vs. the number of clusters produced by a variety of algorithm modes in MMseqs2.
Some algorithms deviate from the line on the log–log plot because of the cluster representatives they output and the different relationships between them (e.g., cascaded clustering tends to produce less similar representatives, while linclust produces more similar ones). Using the same type of algorithm on the 50-batch large dataset, Complet+ scales linearly with the number of input clusters.
Figure 10. Runtimes of MMseqs2 cascade clustering of sequences in a single batch and 50 increments of MMseqs2 clusterupdate, shown for a 3 million protein sequence data set from the COG database, with and without running Complet+ at each step.
While Complet+ takes significantly longer than MMseqs2 clusterupdate, it scales linearly with the number of input clusters.
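
The merging rule described in the captions of Figures 2, 3, and 6 (merge clusters whose representatives are reciprocal hits of each other under an e-value threshold) can be sketched with a union-find structure. This is an illustrative sketch, not the Complet+ implementation: the function name, the in-memory input layout, and the `max_evalue` default are assumptions, whereas the actual tool works from MMseqs2 alignment files.

```python
def merge_by_reciprocal_hits(clusters, alignments, max_evalue=1e-3):
    """Merge clusters whose representatives are reciprocal hits.

    clusters:   dict mapping representative id -> list of member ids
    alignments: iterable of (query_rep, target_rep, evalue) tuples
    max_evalue: hypothetical e-value cutoff for keeping an alignment
    """
    # Keep only non-self alignments that pass the e-value filter.
    kept = {(q, t) for q, t, e in alignments if q != t and e <= max_evalue}

    # Union-find over representatives, with path halving in find().
    parent = {rep: rep for rep in clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # A pair is a reciprocal hit only if both directions survived filtering.
    for q, t in kept:
        if (t, q) in kept and q in parent and t in parent:
            parent[find(q)] = find(t)

    # Pool the members of every cluster that ended up under the same root.
    merged = {}
    for rep, members in clusters.items():
        merged.setdefault(find(rep), []).extend(members)
    return merged


# Hypothetical toy input: four clusters keyed by their representative.
clusters = {
    "A": ["A", "a1"],
    "B": ["B", "b1"],
    "C": ["C"],
    "D": ["D", "d1"],
}
alignments = [
    ("A", "B", 1e-10), ("B", "A", 1e-8),   # reciprocal: merge A and B
    ("B", "C", 1e-5), ("C", "B", 1e-6),    # reciprocal: merge B and C
    ("A", "D", 1e-4),                      # one direction only: no merge
    ("C", "D", 1e-9), ("D", "C", 5.0),     # reverse hit fails the cutoff
]
merged = merge_by_reciprocal_hits(clusters, alignments)
groups = {frozenset(members) for members in merged.values()}
```

Because merging is transitive here, A, B, and C end up in one cluster even though A and C are not direct reciprocal hits, matching the Figure 3 description in which every representative in a merged group is a reciprocal hit to at least one other member. Which representative becomes the key of a merged cluster is arbitrary in this sketch.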
