Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Feb 6;10(1):2022.
doi: 10.1038/s41598-020-58785-y.

Identifying multi-hit carcinogenic gene combinations: Scaling up a weighted set cover algorithm using compressed binary matrix representation on a GPU

Affiliations

Identifying multi-hit carcinogenic gene combinations: Scaling up a weighted set cover algorithm using compressed binary matrix representation on a GPU

Qais Al Hajri et al. Sci Rep. .

Abstract

Despite decades of research, effective treatments for most cancers remain elusive. One reason is that different instances of cancer result from different combinations of multiple genetic mutations (hits). Therefore, treatments that may be effective in some cases are not effective in others. We previously developed an algorithm for identifying combinations of carcinogenic genes with mutations (multi-hit combinations), which could suggest a likely cause for individual instances of cancer. Most cancers are estimated to require three or more hits. However, the computational complexity of the algorithm scales exponentially with the number of hits, making it impractical for identifying combinations of more than two hits. To identify combinations of greater than two hits, we used a compressed binary matrix representation, and optimized the algorithm for parallel execution on an NVIDIA V100 graphics processing unit (GPU). With these enhancements, the optimized GPU implementation was on average an estimated 12,144 times faster than the original integer matrix based CPU implementation, for the 3-hit algorithm, allowing us to identify 3-hit combinations. The 3-hit combinations identified using a training set were able to differentiate between tumor and normal samples in a separate test set with 90% overall sensitivity and 93% overall specificity. We illustrate how the distribution of mutations in tumor and normal samples in the multi-hit gene combinations can suggest potential driver mutations for further investigation. With experimental validation, these combinations may provide insight into the etiology of cancer and a rational basis for targeted combination therapy.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Figure 1
Comparison of different implementations of the multi-hit algorithm for identifying 2-hit combinations. (a) Run time for the original matrix implementation on the CPU ranges from 3–3723 sec compared to 7–223 sec for the compressed binary CPU implementation and 5–33 sec for the optimized GPU implementation. (b) Speedup is on average 10-fold for the compressed binary CPU implementation and 68-fold for the optimized GPU implementation compared to the original matrix CPU implementation. Names for the cancer types shown along the x-axis are listed in Table S1.
Figure 2
Figure 2
Comparison of different implementations of the multi-hit algorithm for identifying 3-hit combinations. (a) Run time for the original matrix CPU implementation ranges from 110 sec to an estimated 282 days, compared to 46 sec to 10 days for the compressed binary CPU implementation and 4 sec to 23 min for the optimized GPU implementation. Run times for the original matrix CPU implementation requiring over 30 days were estimated as described in Methods. (b) Speedup for the compressed binary CPU implementation ranged from 2x–28x, and from 29x–33,690x for the optimized GPU implementation. Names for the cancer types shown along the x-axis are listed in Table S1.
Figure 3
Figure 3
Average contribution of optimizations and parallelization to speedup. Breakdown of contributions due to compressed binary representation, GPU parallelization, removal of branch and bound logic, single two-gene combination per thread, and mapping of upper triangular (UT) gene combination to a sequential thread ID. (a) Breakdown of 2-hit speedup. (b) Breakdown of 3-hit combinations. Contribution due to compressed binary representation is 15x for 3-hits which is not visible in the scale of the figure.
Figure 4
Figure 4
Accuracy of 2- and 3-hit combinations. (a) Sensitivity varies from 63–100% for 2-hit combinations, and from 50–100% for 3-hit combinations, excluding KICH for which there were only a total of 9 tumor samples. (b) Specificity varies from 79–100% for 2-hit combinations, and from 78–100% for 3-hit combinations. Sensitivity and specificity were calculated on a randomly selected 25% Test data set. Error bars represent 95% CI. Cancer types with relatively large 95% CI (CHOL, DLBC, KICH, KIRP, MESO and UCS) are due to small sample size (total of 44, 43, 9, 88, 69 and 46 samples respectively).
Figure 5
Figure 5
2-hit combinations identified for ovarian serous cystadenocarcinoma (OV). The outer circle shows individual chromosomes with corresponding ideograms shown in the inner circle. Genes that comprise 2-hit combinations are labeled inside the circle. Each 2-hit combination is identified by differently colored lines connecting two genes. The red line represents the gene combination discussed in further detail. This image was  generated using RCircos.
Figure 6
Figure 6
3-hit combinations identified for ovarian serous cystadenocarcinoma (OV). The outer circle shows individual chromosomes with corresponding ideograms shown in the inner circle. Genes that comprise 3-hit combinations are labeled inside the circle. Each 3-hit combination is identified by differently colored lines connecting three genes. The red line represents the gene combination discussed in further detail. This image was generated using RCircos.
Figure 7
Figure 7
Distribution of somatic mutations in TP53 in ovarian tumor samples and normal samples. The horizontal bar shows amino acid position within the protein, with labels showing known functional domains. Vertical lines show the number of samples with protein altering mutations at each amino acid position. The most frequently mutated sites for each gene in (a) tumor and (b) normal samples are labeled for comparison. Image generated using g3viz.
Figure 8
Figure 8
Distribution of somatic mutations in KCNB1 in ovarian tumor samples and normal samples. The horizontal bar shows amino acid position within the protein, with labels showing known functional domains. Vertical lines show the number of samples with protein altering mutations at each amino acid position. The most frequently mutated sites for each gene in (a) tumor and (b) normal samples are labeled for comparison. This image was generated using g3viz.
Figure 9
Figure 9
Distribution of somatic mutations in TTN in ovarian tumor samples and normal samples. The horizontal bar shows amino acid position within the protein, with labels showing known functional domains. Vertical lines show the number of samples with protein altering mutations at each amino acid position. The most frequently mutated sites for each gene in (a) tumor and (b) normal samples are labeled for comparison. This image was generated using g3viz.
Figure 10
Figure 10
Algorithm for finding multi-hit combinations, illustrated for 2-hit combinations. The cells marked with x in the Gene-Sample Mutation matrices represent samples with mutations in the corresponding gene. There are H=G(G1)/2 possible 2-hit combinations involving two different genes. The algorithm iterates through three steps. (1) Eq. (1) is used to calculate Fi for each combination. (2) The combination (ga and gb in this example) with the maximum value of Fi, (Fk in this example) is added to the list of selected multi-hit combinations. (3) Tumor samples containing mutation in the selected combination of genes are excluded from consideration in subsequent iterations of the algorithm. The green shaded columns in the Tumor Gene-Sample Mutation matrix represent excluded samples in the iteration shown. The algorithm terminates when all tumor samples have been excluded, i.e. “covered” by the set of multi-hit combinations.
Figure 11
Figure 11
Compressed binary representation and bitwise operation for determining the number of samples with mutations in a combination of two genes. Left: Compressed binary representation of Gene-Sample Mutation matrices, illustrated for a 4-bit unsigned integer. si represents the normal or tumor samples shown in Fig. 10. Elements with 0 in the matrix indicate that the sample does not contain mutations in the corresponding gene, while 1 indicates that the sample does contain a mutation in the corresponding gene. Mutations in four samples can be represented by a single 4-bit unsigned integer. Right: Given any two genes gi, gj, the number of samples containing mutations in both these genes is determined by a bitwise AND operation for each of the integers representing mutations in gi with the corresponding set of integers for gj, and then counting the number of non-zero bits.
Figure 12
Figure 12
Mapping the multi-hit CPU algorithm to the GPU, illustrated for the 3-hit algorithm with the compressed binary representation (Fig. 11). Each GPU thread computes Fmaxi for a subset of all possible combinations. The results of each thread is stored in GPU global memory. Fmax across all subsets of combinations is calculated using parallel reduction.
Algorithm 1
Algorithm 1
Sequential algorithm to compute 3-hit combinations.
Algorithm 2
Algorithm 2
Parallel algorithm to compute 3-hit combinations.

References

    1. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA: A Cancer Journal for Clinicians. 2019;69:7–34. - PubMed
    1. Colditz, G. A., Wolin, K. Y. & Gehlert, S. Applying what we know to accelerate cancer prevention. Science Translational Medicine4, 127rv4–127rv4 (2012). - PMC - PubMed
    1. Maeda H, Khatami M. Analyses of repeated failures in cancer therapy for solid tumors: poor tumor-selective drug delivery, low therapeutic efficacy and unsustainable costs. Clin. Transl. Medicine. 2018;7:11. doi: 10.1186/s40169-018-0185-6. - DOI - PMC - PubMed
    1. Kuchenbaecker KB, et al. Risks of breast, ovarian, and contralateral breast cancer for BRCA1 and BRCA2 mutation carriers. JAMA. 2017;317:2402–2416. doi: 10.1001/jama.2017.7112. - DOI - PubMed
    1. Jasperson, K. W., Patel, S. G. & Ahnen, D. J. APC-associated polyposis conditions. In GeneReviews[Internet] (University of Washington, Seattle, 2017). - PubMed

Publication types

MeSH terms

Substances