Bioinformatics. 2015 Nov 15;31(22):3653-9.
doi: 10.1093/bioinformatics/btv409. Epub 2015 Jul 23.

A mutation profile for top-k patient search exploiting Gene-Ontology and orthogonal non-negative matrix factorization

Sungchul Kim et al. Bioinformatics. 2015.

Erratum in

Abstract

Motivation: As the quantity of genomic mutation data increases, so does the likelihood of finding patients with similar genomic profiles for various disease inferences. However, the difficulty of identifying them increases as well. Similarity search based on patient mutation profiles can support various translational bioinformatics tasks, including prognostics and treatment efficacy prediction, and so enable better clinical decision making from large volumes of data. However, this is a challenging problem because mutation data are heterogeneous, sparse and high-dimensional.

Results: To solve this problem, we introduce a compact representation and search strategy based on Gene-Ontology and orthogonal non-negative matrix factorization. The statistical significance of the association between the identified cancer subtypes and their clinical features is computed for validation; results show that our method can identify and characterize clinically meaningful tumor subtypes that are, in most datasets, comparable to or better than those found by the recently introduced Network-Based Stratification method, while enabling real-time search. To the best of our knowledge, this is the first attempt to simultaneously characterize and represent somatic mutational data for efficient search purposes.

Availability: The implementations are available at: https://sites.google.com/site/postechdm/research/implementation/orgos.

Contact: sael@cs.stonybrook.edu or hwanjoyu@postech.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Overview of the patient profile construction and validation processes. Each mutation profile is represented as a binary vector in which each entry indicates the binary state of a gene. The GO-based mutation profile matrix, X, is obtained by multiplying the mutation profile matrix, S, and the gene function profile matrix, G. The ONMF mutation profile matrix, W, is obtained by factorizing the GO-based mutation profiles (GO-MPs) through ONMF. For stratification, we assign each patient to the cluster that has the highest value in the patient's encoding vector. For query search, the query profile is generated by minimizing the reconstruction error between the mutation profile and the estimated profile multiplied by the latent basis vectors, and patients who are similar to a given query patient are identified by calculating the Euclidean distance between them and the query patient.
Fig. 2.
Association of cancer subtypes and patient survival time for OV, LUAD and GBM data. A, B and C show log-rank statistics with maximum values marked (a P value of significance of 10^(-4k) for A (OV), 10^(-10k) for B (LUAD), and 10^(-6k) for C (GBM) is indicated by k stars). D, E and F show boxplots of the subtypes with minimum and maximum median survival time. The numbers of subtypes analyzed are two for OV, eight for LUAD, and four for GBM.
Fig. 3.
Predicted survival curves for subtypes with minimum and maximum median survival time; the x-axis is survival time (months) and the y-axis is survival rate.
Fig. 4.
Association between UCEC cancer subtypes and histological clinical features: C1 (serous adenocarcinoma, high grade), C2 (other, high grade), C3 (endometrioid type, high grade), C4 (endometrioid type, low grade). Only four features are presented; two features with low frequency (5) are omitted to improve visibility.
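
The pipeline described in the Fig. 1 caption (X = S·G, factorization of X, cluster assignment from the encoding vectors, and Euclidean top-k search) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the multiplicative update rules are of the kind commonly used for orthogonal NMF and may differ from those in the paper, the function names (go_profile, onmf, stratify, encode_query, top_k) are hypothetical, and the tiny S and G matrices are made-up toy data.

```python
import numpy as np

def go_profile(S, G):
    """GO-based mutation profiles: X = S @ G (patients x GO terms)."""
    return np.asarray(S, dtype=float) @ np.asarray(G, dtype=float)

def onmf(X, k, n_iter=300, eps=1e-9, seed=0):
    """Factorize X ~= W @ H with W, H >= 0, pushing W.T @ W toward I.
    Assumed multiplicative updates; the paper's exact scheme may differ."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps   # patient encodings
    H = rng.random((k, m)) + eps   # latent basis vectors
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        XHt = X @ H.T
        W *= np.sqrt(XHt / (W @ (W.T @ XHt) + eps))
    return W, H

def stratify(W):
    """Assign each patient to the cluster with the largest encoding value."""
    return W.argmax(axis=1)

def encode_query(x_q, H, n_iter=300, eps=1e-9):
    """Non-negative w minimizing ||x_q - w @ H||^2 with H held fixed."""
    w = np.ones(H.shape[0])
    HHt = H @ H.T
    xHt = x_q @ H.T
    for _ in range(n_iter):
        w *= xHt / (w @ HHt + eps)
    return w

def top_k(w_q, W, k):
    """Indices of the k patients closest to the query in Euclidean distance."""
    dist = np.linalg.norm(W - w_q, axis=1)
    return np.argsort(dist)[:k]

# Toy data (made up): 6 patients x 4 genes, 4 genes x 3 GO terms.
S = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
              [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1]])
G = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]])

X = go_profile(S, G)
W, H = onmf(X, k=2)
labels = stratify(W)

query = np.array([1, 1, 0, 0])          # binary mutation vector of a new patient
w_q = encode_query(go_profile(query, G), H)
hits = top_k(w_q, W, k=3)
print(labels, hits)
```

Because the search runs over the k-dimensional rows of W rather than the original gene-level vectors, each query costs one small non-negative least-squares fit plus a distance scan over the patients, which is consistent with the compact representation the abstract describes.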
