BMC Med Genomics. 2015 Oct 1;8:58.

doi: 10.1186/s12920-015-0130-0.

TumorTracer: a method to identify the tissue of origin from the somatic mutations of a tumor specimen

Andrea Marion Marquard¹, Nicolai Juul Birkbak²,³, Cecilia Engel Thomas⁴,⁵, Francesco Favero⁶, Marcin Krzystanek⁷, Celine Lefebvre⁸, Charles Ferté⁹,¹⁰, Mariam Jamal-Hanjani¹¹, Gareth A Wilson¹², Seema Shafi¹³, Charles Swanton¹⁴,¹⁵, Fabrice André¹⁶,¹⁷, Zoltan Szallasi¹⁸,¹⁹, Aron Charles Eklund²⁰

Affiliations

¹ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. marquard@cbs.dtu.dk.
² Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. njuul@cbs.dtu.dk.
³ Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK. njuul@cbs.dtu.dk.
⁴ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. cecilia.thomas@cpr.ku.dk.
⁵ NNF Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, DK-2200, Copenhagen, Denmark. cecilia.thomas@cpr.ku.dk.
⁶ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. favero@cbs.dtu.dk.
⁷ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. marcin@cbs.dtu.dk.
⁸ Inserm Unit U981, Gustave Roussy, Villejuif, France. Celine.LEFEBVRE@gustaveroussy.fr.
⁹ Inserm Unit U981, Gustave Roussy, Villejuif, France. Charles.FERTE@gustaveroussy.fr.
¹⁰ Department of Medical Oncology, Gustave Roussy, Villejuif, France. Charles.FERTE@gustaveroussy.fr.
¹¹ Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK. m.jamal-hanjani@ucl.ac.uk.
¹² Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK. gareth.wilson@ucl.ac.uk.
¹³ Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK. s.shafi@ucl.ac.uk.
¹⁴ Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK. c.swanton@ucl.ac.uk.
¹⁵ Cancer Research UK London Research Institute, London, UK. c.swanton@ucl.ac.uk.
¹⁶ Inserm Unit U981, Gustave Roussy, Villejuif, France. Fabrice.ANDRE@gustaveroussy.fr.
¹⁷ Department of Medical Oncology, Gustave Roussy, Villejuif, France. Fabrice.ANDRE@gustaveroussy.fr.
¹⁸ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. zoltan@cbs.dtu.dk.
¹⁹ Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology (CHIP@HST), Harvard Medical School, Boston, USA. zoltan@cbs.dtu.dk.
²⁰ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark. eklund@cbs.dtu.dk.

PMID: 26429708
PMCID: PMC4590711
DOI: 10.1186/s12920-015-0130-0

Andrea Marion Marquard et al. BMC Med Genomics. 2015.

Abstract

Background: A substantial proportion of cancer cases present with a metastatic tumor and require further testing to determine the primary site; many of these are never fully diagnosed and remain cancer of unknown primary origin (CUP). It has been previously demonstrated that the somatic point mutations detected in a tumor can be used to identify its site of origin with limited accuracy. We hypothesized that higher accuracy could be achieved by a classification algorithm based on the following feature sets: 1) the number of nonsynonymous point mutations in a set of 232 specific cancer-associated genes, 2) frequencies of the 96 classes of single-nucleotide substitution determined by the flanking bases, and 3) copy number profiles, if available.
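The 96 substitution classes in feature set 2 arise from 6 substitution types multiplied by 4 possible 5' flanking bases and 4 possible 3' flanking bases. The sketch below enumerates them and turns a list of mutations into class frequencies. It assumes the common convention of collapsing each substitution onto the pyrimidine (C or T) reference strand; the abstract does not state which convention the authors used, and the function names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def substitution_classes():
    """Enumerate the 96 classes as strings such as 'A[C>T]G'."""
    classes = []
    for ref in "CT":  # assumption: pyrimidine reference base by convention
        for alt in BASES:
            if alt == ref:
                continue
            for five, three in product(BASES, repeat=2):
                classes.append(f"{five}[{ref}>{alt}]{three}")
    return classes


def classify(five, ref, alt, three):
    """Map one substitution and its flanking bases to one of the 96 classes."""
    if ref in "AG":
        # Purine reference: report the event on the opposite strand, which
        # complements every base and swaps the two flanks.
        five, ref, alt, three = (
            three.translate(COMPLEMENT),
            ref.translate(COMPLEMENT),
            alt.translate(COMPLEMENT),
            five.translate(COMPLEMENT),
        )
    return f"{five}[{ref}>{alt}]{three}"


def substitution_frequencies(mutations):
    """Return the fraction of mutations falling in each of the 96 classes.

    mutations: iterable of (five_prime, ref, alt, three_prime) tuples.
    """
    counts = Counter(classify(*m) for m in mutations)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in substitution_classes()}
```

Normalizing to frequencies rather than raw counts keeps the feature comparable between tumors with very different mutation burdens, which is one plausible reason to prefer it.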

Methods: We used publicly available somatic mutation data from the COSMIC database to train random forest classifiers to distinguish among those tissues of origin for which sufficient data was available. We selected feature sets using cross-validation and then derived two final classifiers (with or without copy number profiles) using 80 % of the available tumors. We evaluated the accuracy using the remaining 20 %. For further validation, we assessed accuracy of the without-copy-number classifier on three independent data sets: 1669 newly available public tumors of various types, a cohort of 91 breast metastases, and a set of 24 specimens from 9 lung cancer patients subjected to multiregion sequencing.

Results: The cross-validation accuracy was highest when all three types of information were used. On the left-out COSMIC data not used for training, we achieved a classification accuracy of 85 % across 6 primary sites (with copy numbers), and 69 % across 10 primary sites (without copy numbers). Importantly, a derived confidence score could distinguish tumors that could be identified with 95 % accuracy (32 %/75 % of tumors with/without copy numbers) from those that were less certain. Accuracy in the independent data sets was 46 %, 53 % and 89 % respectively, similar to the accuracy expected from the training data.
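One way to read the confidence-score result is as a ranking exercise: sort tumors by confidence, call only the most confident ones, and find the largest fraction that can be called while accuracy stays at or above 95 %. The sketch below illustrates that calculation. How the authors derive the confidence score is not described in the abstract, so the input here is simply assumed to be a list of (score, correct) pairs, and the function name is hypothetical.

```python
def largest_fraction_at_accuracy(results, target=0.95):
    """Largest fraction of samples callable at or above the target accuracy.

    results: iterable of (confidence_score, is_correct) pairs.
    Samples are called in order of decreasing confidence; ties are kept in
    input order, which is a simplification.
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    best_fraction = 0.0
    correct = 0
    for called, (_, is_correct) in enumerate(ranked, start=1):
        correct += bool(is_correct)
        if correct / called >= target:
            best_fraction = called / len(ranked)
    return best_fraction
```

Applied to held-out predictions, the returned fraction corresponds to the kind of figure quoted above, such as the share of tumors that could be identified with 95 % accuracy.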

Conclusions: Identification of primary site from point mutation and/or copy number data may be accurate enough to aid clinical diagnosis of cancers of unknown primary origin.

Figures

**Fig. 1**
Classifier outline. Somatic point mutation data is used to determine the mutation status of a set of cancer genes and to calculate the distributions of 96 classes of base substitutions. When copy number profiles are available, they are used to infer any SCNAs in the same set of cancer genes. These features are combined and provided to a set of random forest classifiers, one per primary site, each of which generates a classification score. The PM classifier does *not* use copy number profiles and is trained to distinguish between all 10 primary sites. The PM + CN classifier *does* use copy number profiles (orange), but can only distinguish between 6 primary sites (white) due to less training data. Thus, blue boxes are components of the PM classifier only, orange boxes are components of the PM + CN classifier only, and white boxes are components of both classifiers. The primary sites were selected based on the availability of sufficient training data (>200 cases)

**Fig. 2**
Cross-validation accuracy in the training data using various combinations of feature sets. Random forest ensembles were trained using the feature sets shown in the tables below each bar, and classification accuracy was evaluated by cross-validation. Sufficient SCNA data was available for only six of ten primary sites; thus we analyzed these six sites separately when including SCNAs. a Classification accuracy when excluding SCNAs and distinguishing between ten primary sites. b Classification accuracy when including SCNAs and distinguishing between six primary sites. Accuracy for individual sites is indicated by colored circles. The two combinations of feature sets selected for further analysis are indicated at the top; PM: point mutations only, PM + CN: point mutations and copy number aberrations

**Fig. 3**
Performance of final PM classifier on the test data. a Confusion matrix of actual vs. predicted primary sites, with sensitivity, specificity, and marginal frequencies. b Performance of the final classifier in prioritizing primary sites. Each point indicates the cumulative accuracy when, for each sample, the top n highest-scoring sites are considered, or when sites are ranked by frequency or by random guess. c Classification accuracy increases with confidence score. Circles and bars indicate the accuracy and 95 % confidence interval for each bin of samples. Grey columns indicate the number of samples in each bin. d Accuracy vs. fraction of samples called. Accuracy (solid line) and 95 % confidence interval (grey region) of the corresponding fraction of tumors with highest confidence score. The fraction of tumors for which an accuracy of 95 % can be achieved is shown by a red circle with whiskers at the bottom
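The cumulative accuracy in panel b can be illustrated with a short sketch: for each sample, rank the primary sites by classification score and count a hit if the true site is among the top n. The data layout (one dict of site scores per sample) and the function name are assumptions made for illustration, not the authors' implementation.

```python
def top_n_accuracy(score_rows, true_sites, n):
    """Fraction of samples whose true site is among the n highest-scoring sites.

    score_rows: list of dicts mapping primary site -> classification score.
    true_sites: list of the actual primary site for each sample.
    """
    hits = 0
    for scores, truth in zip(score_rows, true_sites):
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += truth in ranked[:n]
    return hits / len(true_sites)
```

Evaluating this for n = 1, 2, 3, and so on traces out a curve of the kind described in the legend; n = 1 is the ordinary classification accuracy.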

**Fig. 4**
Performance of final PM + CN classifier on the test data. a–d see Fig. 3 legend

**Fig. 5**
Performance of the PM classifier on independent validation data. a Tumors of various types from COSMIC v70 (n = 1669). b Metastatic breast tumors from the SAFIR01 trial (n = 91). c Multiregion-sequenced non-small cell lung cancer (n = 9). See Fig. 3b legend. For comparison, the expected performance of our method in each data set was estimated according to the distribution of primary sites and the site-specific accuracies on test data
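A natural reading of the expected-performance estimate is a weighted average: each primary site's accuracy on the test data, weighted by how often that site occurs in the validation cohort. The sketch below shows that arithmetic under this assumption; the legend does not give the exact formula, and the function name is hypothetical.

```python
def expected_accuracy(site_counts, site_accuracy):
    """Accuracy expected in a cohort given its mix of primary sites.

    site_counts: dict mapping primary site -> number of tumors in the cohort.
    site_accuracy: dict mapping primary site -> accuracy for that site on test data.
    """
    total = sum(site_counts.values())
    return sum(n * site_accuracy[site] for site, n in site_counts.items()) / total
```

This makes explicit why a cohort dominated by a hard-to-classify site is expected to show lower overall accuracy than the mixed test set, even if the classifier behaves identically.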

**Fig. 6**
Consistency of the PM classifier on data from multiple samples from the same tumor. The classifier was applied to 24 specimens from 9 NSCLC patients, including primary regions (R) and lymph node metastases (L). The proposed primary site is indicated by color along with the confidence score
