BMC Med Genomics. 2015 Oct 1;8:58.
doi: 10.1186/s12920-015-0130-0.

TumorTracer: a method to identify the tissue of origin from the somatic mutations of a tumor specimen

Andrea Marion Marquard et al. BMC Med Genomics.

Abstract

Background: A substantial proportion of cancer cases present with a metastatic tumor and require further testing to determine the primary site; many of these are never fully diagnosed and remain cancer of unknown primary origin (CUP). It has been previously demonstrated that the somatic point mutations detected in a tumor can be used to identify its site of origin with limited accuracy. We hypothesized that higher accuracy could be achieved by a classification algorithm based on the following feature sets: 1) the number of nonsynonymous point mutations in a set of 232 specific cancer-associated genes, 2) frequencies of the 96 classes of single-nucleotide substitution determined by the flanking bases, and 3) copy number profiles, if available.
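The count of 96 substitution classes follows from 6 substitution types times 4 possible 5' bases times 4 possible 3' bases. The sketch below enumerates those classes and computes per-tumor class frequencies. It assumes the common convention of reporting each substitution relative to the pyrimidine (C or T) of the base pair; the abstract does not state which convention was used, and the function names and label format are illustrative only.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def substitution_classes():
    """Enumerate the 96 classes as labels such as 'A[C>T]G'."""
    classes = []
    for ref in "CT":  # assumed convention: pyrimidine reference base
        for alt in BASES:
            if alt == ref:
                continue
            for five, three in product(BASES, repeat=2):
                classes.append(f"{five}[{ref}>{alt}]{three}")
    return classes


def classify(five, ref, alt, three):
    """Map one substitution and its flanking bases to a pyrimidine-centered class."""
    if ref in "AG":
        # Purine reference: describe the change on the opposite strand,
        # which complements every base and swaps the two flanks.
        five, ref, alt, three = (
            three.translate(COMPLEMENT),
            ref.translate(COMPLEMENT),
            alt.translate(COMPLEMENT),
            five.translate(COMPLEMENT),
        )
    return f"{five}[{ref}>{alt}]{three}"


def class_frequencies(mutations):
    """Return the fraction of a tumor's substitutions falling in each class.

    `mutations` is an iterable of (five_prime, ref, alt, three_prime) tuples.
    """
    counts = dict.fromkeys(substitution_classes(), 0)
    for five, ref, alt, three in mutations:
        counts[classify(five, ref, alt, three)] += 1
    total = sum(counts.values())
    return {c: (n / total if total else 0.0) for c, n in counts.items()}
```

A tumor's 96 frequencies would then form one of the feature sets passed to a classifier, alongside the per-gene mutation counts.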

Methods: We used publicly available somatic mutation data from the COSMIC database to train random forest classifiers to distinguish among those tissues of origin for which sufficient data was available. We selected feature sets using cross-validation and then derived two final classifiers (with or without copy number profiles) using 80 % of the available tumors. We evaluated the accuracy using the remaining 20 %. For further validation, we assessed accuracy of the without-copy-number classifier on three independent data sets: 1669 newly available public tumors of various types, a cohort of 91 breast metastases, and a set of 24 specimens from 9 lung cancer patients subjected to multiregion sequencing.

Results: The cross-validation accuracy was highest when all three types of information were used. On the left-out COSMIC data not used for training, we achieved a classification accuracy of 85 % across 6 primary sites (with copy numbers), and 69 % across 10 primary sites (without copy numbers). Importantly, a derived confidence score could distinguish tumors that could be identified with 95 % accuracy (32 %/75 % of tumors with/without copy numbers) from those that were less certain. Accuracy in the independent data sets was 46 %, 53 % and 89 % respectively, similar to the accuracy expected from the training data.
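One way to read the "fraction of tumors identifiable with 95 % accuracy" is to rank tumors by confidence score and find the largest top-ranked subset whose accuracy still meets the target. The sketch below does this under that assumption. The abstract does not define the confidence score or the exact selection procedure, so `fraction_callable` and its inputs are hypothetical, and ties in confidence are not treated specially.

```python
def fraction_callable(confidences, correct, target=0.95):
    """Largest fraction of samples, taken in order of decreasing confidence,
    whose cumulative accuracy is at least `target`.

    confidences: one score per sample (higher means more certain).
    correct: one bool per sample, True if the predicted site was right.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    best = 0
    hits = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        hits += is_correct
        if hits / k >= target:
            best = k  # the k most confident samples still meet the target
    return best / len(ranked) if ranked else 0.0
```

With four samples scored 0.9, 0.8, 0.7 and 0.6, of which only the third is wrong, the function returns 0.5 at a 95 % target, because including the third sample drops accuracy to 2/3.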

Conclusions: Identification of primary site from point mutation and/or copy number data may be accurate enough to aid clinical diagnosis of cancers of unknown primary origin.

Figures

Fig. 1
Classifier outline. Somatic point mutation data is used to determine the mutation status of a set of cancer genes and to calculate the distributions of 96 classes of base substitutions. When copy number profiles are available, they are used to infer any somatic copy number aberrations (SCNAs) in the same set of cancer genes. These features are combined and provided to a set of random forest classifiers, one per primary site, each of which generates a classification score. The PM classifier does not use copy number profiles and is trained to distinguish between all 10 primary sites. The PM + CN classifier does use copy number profiles (orange), but can only distinguish between 6 primary sites (white) because less training data was available. Thus, blue boxes are components of the PM classifier only, orange boxes are components of the PM + CN classifier only, and white boxes are components of both classifiers. Primary sites were selected based on the availability of sufficient training data (>200 cases)
Fig. 2
Cross-validation accuracy in the training data using various combinations of feature sets. Random forest ensembles were trained using the feature sets shown in the tables below each bar, and classification accuracy was evaluated by cross-validation. Sufficient SCNA data was available for only six of the ten primary sites; thus we analyzed these six sites separately when including SCNAs. a Classification accuracy when excluding SCNAs and distinguishing between ten primary sites. b Classification accuracy when including SCNAs and distinguishing between six primary sites. Accuracies of individual sites are indicated by colored circles. The two combinations of feature sets selected for further analysis are indicated at the top; PM: point mutations only, PM + CN: point mutations and copy number aberrations
Fig. 3
Performance of final PM classifier on the test data. a Confusion matrix of actual vs. predicted primary sites, with sensitivity, specificity, and marginal frequencies. b Performance of the final classifier in prioritizing primary sites. Each point indicates the cumulative accuracy when, for each sample, the top n highest-scoring sites are considered, or when sites are ranked by frequency or by random guess. c Classification accuracy increases with confidence score. Circles and bars indicate the accuracy and 95 % confidence interval for each bin of samples. Grey columns indicate the number of samples in each bin. d Accuracy vs. fraction of samples called. Accuracy (solid line) and 95 % confidence interval (grey region) of the corresponding fraction of tumors with highest confidence score. The fraction of tumors for which an accuracy of 95 % can be achieved is shown by a red circle with whiskers at the bottom
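The cumulative accuracy in panel b can be illustrated as a top-n accuracy: a sample counts as correct if its true primary site is among the n highest-scoring sites. This is a minimal sketch assuming each sample has one score per site; the function name and data layout are hypothetical and not taken from the paper.

```python
def top_n_accuracy(scores, truth, n):
    """Fraction of samples whose true site is among the n highest-scoring sites.

    scores: list of dicts mapping site name to classification score, one per sample.
    truth: list of true site names, aligned with `scores`.
    """
    hits = 0
    for sample_scores, true_site in zip(scores, truth):
        # Rank site names from highest to lowest score for this sample.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        hits += true_site in ranked[:n]
    return hits / len(truth)
```

Evaluating this for n = 1, 2, 3, and so on gives a non-decreasing curve that reaches 1.0 once n equals the number of sites, provided every true site is among the scored sites.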
Fig. 4
Performance of final PM + CN classifier on the test data. a–d: see Fig. 3 legend
Fig. 5
Performance of the PM classifier on independent validation data. a Tumors of various types from COSMIC v70 (n = 1669). b Metastatic breast tumors from the SAFIR01 trial (n = 91). c Multiregion-sequenced non-small cell lung cancer (n = 9). See Fig. 3b legend. For comparison, the expected performance of our method in each data set was estimated according to the distribution of primary sites and the site-specific accuracies on test data
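The expected performance described here can be read as a weighted average: each site's test-set accuracy weighted by that site's share of the validation cohort. The sketch below assumes this reading. It also assumes that tumors from sites without a test-set accuracy contribute zero, which the legend does not state, and the numbers in the usage note are made up for illustration.

```python
def expected_accuracy(site_counts, site_accuracy):
    """Cohort-level accuracy expected from site-specific accuracies.

    site_counts: dict mapping primary site to number of tumors in the cohort.
    site_accuracy: dict mapping primary site to accuracy observed on test data.
    Sites missing from `site_accuracy` are assumed to contribute 0 (an assumption).
    """
    total = sum(site_counts.values())
    if total == 0:
        raise ValueError("cohort is empty")
    expected_correct = sum(
        n * site_accuracy.get(site, 0.0) for site, n in site_counts.items()
    )
    return expected_correct / total
```

For a hypothetical cohort of 3 breast and 1 lung tumor with site accuracies of 0.5 and 1.0, the expected accuracy is (3 × 0.5 + 1 × 1.0) / 4 = 0.625.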
Fig. 6
Consistency of the PM classifier on data from multiple samples from the same tumor. The classifier was applied to 24 specimens from 9 NSCLC patients, including primary regions (R) and lymph node metastases (L). The proposed primary site is indicated by color along with the confidence score
