Pac Symp Biocomput. 2016:21:207-18.

REPURPOSING GERMLINE EXOMES OF THE CANCER GENOME ATLAS DEMANDS A CAUTIOUS APPROACH AND SAMPLE-SPECIFIC VARIANT FILTERING

Amanda Koire et al. Pac Symp Biocomput. 2016.

Abstract

When seeking to reproduce results derived from whole-exome or genome sequencing data that could advance precision medicine, the time and expense required to produce a patient cohort make data repurposing an attractive option. The first step in repurposing is setting some quality baseline for the data so that conclusions are not spurious. This is difficult because there can be variations in quality from center to center, clinic to clinic and even patient to patient. Here, we assessed the quality of the whole-exome germline mutations of TCGA cancer patients using patterns of nucleotide substitution and negative selection against impactful mutations. We estimated the fraction of false positive variant calls for each exome with respect to two gold standard germline exomes, and found large variability in the quality of SNV calls between samples, cancer subtypes, and institutions. We then demonstrated how variant features, such as the average base quality for reads supporting an allele, can be used to identify sample-specific filtering parameters to optimize the removal of false positive calls. We concluded that while these germlines have many potential applications to precision medicine, users should assess the quality of the available exome data prior to use and perform additional filtering steps.
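
One of the nucleotide-substitution patterns used as a quality signal here is the transition/transversion (Ti/Tv) ratio. The following is a minimal sketch of that calculation, not the authors' pipeline: the function name `ti_tv_ratio` and the `(ref, alt)` tuple input are assumptions made for illustration. Because only 4 of the 12 possible single-base substitutions are transitions, purely random calls would give a ratio near 0.5, so a value drifting downward suggests more noise in the call set.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
VALID_BASES = set("ACGT")


def ti_tv_ratio(snvs):
    """Return the Ti/Tv ratio for an iterable of (ref, alt) alleles.

    Hypothetical helper: the input format is an assumption, not the paper's.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in VALID_BASES or alt not in VALID_BASES or ref == alt:
            continue  # skip indels, ambiguous bases, and non-variant records
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return ti / tv


# Made-up calls: three transitions, two transversions, one indel (ignored).
example_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"), ("AT", "A")]
print(ti_tv_ratio(example_calls))  # 1.5
```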

Conflict of interest statement

Conflict of interest: none declared.

Figures

Fig. 1
Illustration depicting the steps taken to calculate λ and Ti/Tv parameters from exome data.
Fig. 2
Simulated noise in exome SNV calls. (a) Effect of increased noise on λ and Ti/Tv values. Shaded regions indicate the standard deviation around the mean. (b) Correlation between λ and Ti/Tv.
Fig. 3
Application of λ to TCGA cohorts. (a) Predicted noise across 21 TCGA cancer types. The data are shown as a box-and-whisker plot in which the center line indicates the median, the box the quartiles, and the whiskers the range. Cancer types are ordered by median. (b) Exponential relationship between λ and the number of missense SNVs in lung adenocarcinoma. Associated open-access clinical data provided by TCGA were used to separate patients by self-identified race. The average λ and average number of missense mutations for the 1000 Genomes Project Caucasian (CEU) and African-American (ASW) cohorts are marked with a blue and a red star, respectively.
Fig. 4
KICH SNV calls from three centers. (a) λ and Ti/Tv of calls. For each patient assessed by each center, λ and Ti/Tv were calculated, and the averages and standard deviations of these values are displayed by institution. For centers 1 and 3, internal ‘pass’ filters were available and are displayed as well. (b) Predicted percentage of true calls for calls agreed upon by 1, 2, or 3 institutions. For 65 KICH patients assessed by all three centers, all calls, regardless of internal filtering, were separated by the institution(s) that identified them. The average number of missense mutations per patient, as well as the predicted percentage of true positive calls derived from the λ value of the call set, is shown for each possible combination of sites.
Fig. 5
Relationship between SNV features and λ for two HNSC patients. (a) Relationship between BQ and λ. For each patient, all missense SNVs were partitioned by BQ value such that every bin contained at least 50 calls; points represent the λ and average BQ of the bin. Solid lines represent sigmoidal fits. (b) Relationship between QUAL and λ. For each patient, all missense SNVs were partitioned by QUAL value such that every bin contained at least 50 calls; points represent the λ and average QUAL of the bin. Solid line represents the fit to the equation $y = Ae^{-kx} + b$. For display purposes, values of QUAL higher than 200 are not shown.
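
The binning described for Fig. 5 (every bin holding at least 50 calls) can be sketched as follows. This is only one way to satisfy that constraint, assuming equal-count bins over calls sorted by the feature value; the excerpt does not state the authors' exact scheme. The names `bin_by_feature` and `bin_means` and the `(value, call_id)` input are hypothetical, and the per-bin λ is left out because its definition is not given here.

```python
def bin_by_feature(calls, min_bin_size=50):
    """Group (feature_value, call_id) pairs into consecutive sorted bins.

    Every bin holds at least min_bin_size calls, provided the input has
    at least that many calls overall (assumed equal-count scheme).
    """
    ordered = sorted(calls, key=lambda call: call[0])
    bins = [ordered[i:i + min_bin_size]
            for i in range(0, len(ordered), min_bin_size)]
    # A short trailing bin is merged into its neighbour to keep the minimum.
    if len(bins) > 1 and len(bins[-1]) < min_bin_size:
        tail = bins.pop()
        bins[-1].extend(tail)
    return bins


def bin_means(bins):
    """Average feature value (for example BQ or QUAL) of each bin."""
    return [sum(value for value, _ in b) / len(b) for b in bins]


# Made-up data: 120 calls with feature values cycling through 0..39.
calls = [(float(q % 40), f"snv{q}") for q in range(120)]
bins = bin_by_feature(calls)
print([len(b) for b in bins])  # [50, 70]
print(bin_means(bins))
```

Each bin's mean feature value would then serve as the x-coordinate of one point, with the bin's λ as the y-coordinate.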

