JCO Clin Cancer Inform. 2018:2:CCI.16.00079.
doi: 10.1200/CCI.16.00079. Epub 2018 Mar 22.

Artificial Intelligence Approach for Variant Reporting

Michael G Zomnir et al. JCO Clin Cancer Inform. 2018.

Abstract

Purpose: Next-generation sequencing technologies are actively applied in clinical oncology. Bioinformatics pipeline analysis is an integral part of this process; however, humans cannot yet realize the full potential of the highly complex pipeline output. As a result, the decision to include a variant in the final report during routine clinical sign-out remains challenging.

Methods: We used an artificial intelligence approach to capture the collective clinical sign-out experience of six board-certified molecular pathologists and to build and validate a decision support tool for variant reporting. We extracted all reviewed and reported variants from our clinical database and tested several machine learning models. We used 10-fold cross-validation for our variant call prediction model, which derives a continuous prediction score from 0 to 1 (no to yes) for clinical reporting.
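The cross-validated scoring described in Methods can be illustrated with a minimal sketch. It assumes a plain logistic regression trained by gradient descent on synthetic two-feature data; the function names (fit_logistic, cross_val_scores), the hyperparameters, and the toy data are hypothetical and are not the authors' pipeline, which uses approximately 500 features per variant.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function mapping any real number into (0, 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_score(w, x):
    # Prediction score between 0 (no) and 1 (yes); the bias is the last weight.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def fit_logistic(X, y, epochs=200, lr=0.1):
    # Batch gradient descent on the logistic loss (hypothetical settings).
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            err = predict_score(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def cross_val_scores(X, y, k=10, seed=0):
    # Every variant is scored by a model that did not see it during training.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            scores[i] = predict_score(w, X[i])
    return scores

# Synthetic stand-in for reviewed variants: label 1 = reported, 0 = not reported.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(2.0 * label - 1.0, 1.0), rng.gauss(0.0, 1.0)] for label in y]
scores = cross_val_scores(X, y, k=10)
```

Each entry of scores is an out-of-fold estimate, so any cutoff chosen on these scores is evaluated on variants the corresponding model was not trained on.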

Results: For each of the 19,594 initial training variants, our pipeline generates approximately 500 features, which results in a matrix of > 9 million data points. From a comparison of naive Bayes, decision trees, random forests, and logistic regression models, we selected models that allow human interpretability of the prediction score. The logistic regression model demonstrated 1% false negativity and 2% false positivity. The final models' Youden indices were 0.87 and 0.77 for screening and confirmatory cutoffs, respectively. Retraining on a new assay and performance assessment in 16,123 independent variants validated our approach (Youden index, 0.93). We also derived individual pathologist-centric models (virtual consensus conference function), and a visual drill-down functionality allows assessment of how underlying features contributed to a particular score or decision branch for clinical implementation.
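Two quantities in Results follow from simple arithmetic: the size of the feature matrix and the Youden index (sensitivity + specificity - 1). The sketch below reproduces the matrix size from the stated numbers and computes a Youden index from confusion counts; the counts passed to youden_index are hypothetical and are not taken from the study.

```python
def youden_index(tp, fn, tn, fp):
    # Youden's J statistic: sensitivity + specificity - 1.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

# Matrix size from the abstract: 19,594 variants x approximately 500 features.
n_variants = 19_594
n_features = 500  # approximate value stated in the abstract
n_data_points = n_variants * n_features
print(n_data_points)  # 9797000, consistent with "> 9 million data points"

# Hypothetical confusion counts, for illustration only.
j = youden_index(tp=90, fn=10, tn=95, fp=5)
print(round(j, 2))  # 0.85
```

A Youden index of 1 indicates perfect separation at the chosen cutoff, and 0 indicates performance no better than chance.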

Conclusion: Our decision support tool for variant reporting is a practically relevant artificial intelligence approach to harness the next-generation sequencing bioinformatics pipeline output when the complexity of data interpretation exceeds human capabilities.

Conflict of interest statement

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/jco/site/ifc.

Michael G. Zomnir
No relationship to disclose

Lev Lipkin
Stock and Other Ownership Interests: TEVA Pharmaceuticals Industries, Pfizer, Novartis

Maciej Pacula
Patents, Royalties, Other Intellectual Property: Ute Geigenmuller, Doris Damian, Maciej Pacula, Mark A. DePristo. Methods and Systems for Determining Autism Spectrum Disorder Risk (US patent 9,176,113), granted November 3, 2015 (Inst)

Enrique Dominguez Meneses
No relationship to disclose

Allison MacLeay
Travel, Accommodations, Expenses: InterSystems, Athenahealth (I)

Sekhar Duraisamy
No relationship to disclose

Nishchal Nadhamuni
No relationship to disclose

Figures

Fig 1.
The complexity of variant reporting in clinical practice. (A) The amount and complexity of raw next-generation sequencing (NGS) data requires NGS pipelines for read alignment, variant calling, and variant annotation to provide a (filtered) variant call format (VCF) file for manual review by a pathologist/geneticist. The reporting decision is a complex process that requires experience, involves management of the VCF file and various resources, and ultimately results in a reporting decision. (B) Distribution of tumor types included in the variant training data set (V1). Variants are represented in 37 principal tumor types that combine 383 histologic subtypes. (C) After manual review of 19,594 variants, only 24% (n = 4,787) are reported, and 76% of the review effort is not captured in the final report. (D) The reporting fraction by site (left) and gene (right) shows considerable variation (range, 0% to 100%). (E) The effect of the variant reporting decisions illustrated on a variant frequency matrix; green bars represent the number of variants within each disease site or gene. We used the formula "all" minus "no" equals "yes." Specifically, the filtered pipeline output represents "all" reviewed variants and after subtraction of the variants that received "no" calls (ie, are vetted not to be included in the report), the resulting matrix shows the variant frequencies by gene and site in the final report (ie, "yes" calls). The resulting "yes" matrix is similar to that in recent publications; however, in clinical practice, pathologists/geneticists are confronted with all data ("all" matrix on the left). The portrayed distribution of variants by gene and site represents only two of approximately 500 pipeline features attached to each variant. The full pipeline output and the dimensionality of interrelations exceed the human ability to handle all available data efficiently. 
AD, adenocarcinoma; BAM, binary alignment map; CRC, colorectal cancer; CUP, carcinoma of unknown primary; EGC, esophagogastric cancer; GIST, GI stromal tumor; Heme, hematologic malignancies; LCNEC, large-cell neuroendocrine carcinoma; NE, neuroendocrine carcinoma; Non-Ca, nonepithelial malignancy; NSCLC, non–small-cell lung cancer; PDAC, pancreatic cancer; QC, quality control; SAM, sequence alignment map; SCLC, small-cell lung cancer; SQ, squamous cell carcinoma.
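The "all" minus "no" equals "yes" formula from panel E, together with the reporting fractions in panels C and D, can be sketched as follows. The reviewed list is a hypothetical toy data set; only the totals 19,594 and 4,787 come from the caption.

```python
from collections import Counter

def frequency_matrix(variants):
    # Count variants per (disease site, gene) cell.
    return Counter((site, gene) for site, gene, _ in variants)

# Hypothetical reviewed variants: (site, gene, reported?)
reviewed = [
    ("NSCLC", "EGFR", True),
    ("NSCLC", "EGFR", True),
    ("NSCLC", "EGFR", False),
    ("NSCLC", "TP53", False),
    ("CRC", "KRAS", True),
    ("CRC", "TP53", False),
    ("CRC", "TP53", True),
]

all_matrix = frequency_matrix(reviewed)
no_matrix = frequency_matrix(v for v in reviewed if not v[2])
# Counter subtraction drops cells whose count falls to zero.
yes_matrix = all_matrix - no_matrix

# Reporting fraction by gene (panel D, right), on the toy data.
reviewed_by_gene = Counter(gene for _, gene, _ in reviewed)
reported_by_gene = Counter(gene for _, gene, rep in reviewed if rep)
fraction_by_gene = {g: reported_by_gene[g] / n for g, n in reviewed_by_gene.items()}

# Overall reporting fraction from the caption (panel C).
overall_fraction = 4_787 / 19_594
print(round(100 * overall_fraction))  # 24
```

Cells that appear only in the "all" matrix, such as NSCLC/TP53 in the toy data, correspond to review effort that leaves no trace in the final report.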
Fig 2.
Performance assessment of the artificial intelligence model for variant reporting. (A) Concept of a decision support tool for variant reporting. Current practice (top) is shown with the tested implementation (bottom). The artificial intelligence/machine learning model was built on the basis of prior human reporting decisions. Note that the implemented model provides a reporting decision for each variant on a scale from 0 (no) to 1 (yes) without regard for potential clinical actionability; contextual or clinical consequences (eg, oncology knowledge database) have been excluded intentionally, and we have addressed the topic in prior studies. (B) The number of calls in the aggregate model (by using a naive threshold of 0.5) as well as distribution of no and yes calls per pathologist (A to F). (C) Distribution of 19,594 model scores in the reported and not reported variants. Two call thresholds illustrate two use cases: (1) a more-sensitive 0.25 threshold with fewer false-negative (FN) results (n = 150) and (2) a more-specific 0.75 threshold with fewer false-positive (FP) results (n = 323). (D) Receiver operating characteristic curves with selected performance metrics for model threshold scores of > 0.01 (screening test) and > 0.9 (confirmatory test). (E) The FN rate decreases with increasing prevalence; however, for several genes, the model performance is excellent despite low prevalence (eg, ALK). (F) Specificity over sensitivity (red) and precision over recall (blue) for the aggregate and individual models (A to F). The outline of all individual models can be viewed as the model-based performance spectrum of the examined group practice composed of six pathologists (A to F). TN, true negative; TP, true positive; Sens., sensitivity; VCF, variant call format.
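The trade-off between the two call thresholds in panel C can be illustrated with a small sketch: lowering the cutoff reduces false negatives, and raising it reduces false positives. The scores and labels below are hypothetical; only the cutoffs 0.25 and 0.75 come from the caption.

```python
def confusion_counts(scores, reported, threshold):
    # A variant is called "yes" when its model score exceeds the threshold.
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for score, was_reported in zip(scores, reported):
        called_yes = score > threshold
        if was_reported:
            counts["TP" if called_yes else "FN"] += 1
        else:
            counts["FP" if called_yes else "TN"] += 1
    return counts

# Hypothetical model scores and the pathologists' reporting decisions.
scores = [0.02, 0.10, 0.30, 0.60, 0.20, 0.55, 0.70, 0.80, 0.95, 0.99]
reported = [False, False, False, False, True, True, True, True, True, True]

sensitive = confusion_counts(scores, reported, threshold=0.25)
specific = confusion_counts(scores, reported, threshold=0.75)
print(sensitive)  # {'TP': 5, 'FN': 1, 'TN': 2, 'FP': 2}
print(specific)   # {'TP': 3, 'FN': 3, 'TN': 4, 'FP': 0}
```

On any data set, moving the cutoff upward can only keep or increase the FN count and keep or decrease the FP count, which is why the two thresholds suit different use cases.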
Fig 3.
Model decision exploration in clinical practice. (A) Screenshot shows our variant review and graphic user interface used to select variants for inclusion in the report (background; old assay, V1). The inset shows individual pathologists' model scores (P1 to P6) and the aggregate. When hovering over one model, the drill-down option shows the top five predictors derived from the logistic regression pathologist's model that contributed to the report recommendation (report). (B) Screenshot shows our variant review and graphic user interface used to select variants for inclusion in the report (background; new assay, V2). The machine learning (ML) score links out to the ML tree module, which allows for exploration of 15 random forest decision branches. Each branch contains the order of contributing features and findings that resulted in the decision (green argues for reporting, red against). Each circle represents one feature, and the drill-down option (inset) shows the feature (eg, a quality control [QC] metric of a caller), the finding in this variant (eg, 1), and the cutoff used by the model (here > 0.5). The added level of transparency that allows review of the features that underlie a model-derived decision is an important design component of our implementation in clinical practice, and we propose the term next-generation decision support. CADD, combined annotation dependent depletion; LOFREQ, low frequency; SNV, single-nucleotide variant.
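One plausible way to derive a "top five predictors" drill-down from a logistic regression model is to rank features by the magnitude of coefficient times feature value. This is an assumption about how such a view could be computed, not a description of the authors' implementation; the feature names, coefficients, and values below are hypothetical.

```python
def top_contributors(coefficients, features, n=5):
    # Contribution of each feature to the log-odds: coefficient * value.
    contributions = {
        name: coefficients[name] * value
        for name, value in features.items()
        if name in coefficients
    }
    # Largest absolute contributions first; the sign shows the direction
    # (positive argues for reporting, negative against).
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]

# Hypothetical coefficients of one pathologist's model.
coefficients = {
    "allele_fraction": 2.0,
    "caller_qc_pass": 1.5,
    "cadd_score": 0.05,
    "population_frequency": -3.0,
    "strand_bias": -0.8,
    "read_depth": 0.0005,
}

# Hypothetical pipeline features for a single variant.
variant = {
    "allele_fraction": 0.4,
    "caller_qc_pass": 1,
    "cadd_score": 25,
    "population_frequency": 0.1,
    "strand_bias": 0.5,
    "read_depth": 300,
}

top5 = top_contributors(coefficients, variant)
for name, contribution in top5:
    print(f"{name}: {contribution:+.2f}")
```

Because the contributions add up (with the intercept) to the log-odds behind the score, this kind of ranking keeps the drill-down consistent with the score shown to the reviewer.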
