Sci Rep. 2015 Aug 26;5:13321.
doi: 10.1038/srep13321.

MSIseq: Software for Assessing Microsatellite Instability from Catalogs of Somatic Mutations

Mi Ni Huang et al. Sci Rep. 2015.

Abstract

Microsatellite instability (MSI) is a form of hypermutation that occurs in some tumors due to defects in cellular DNA mismatch repair. MSI is characterized by frequent somatic mutations (i.e., cancer-specific mutations) that change the length of simple repeats (e.g., AAAAA…, GATAGATAGATA…). Clinical MSI tests evaluate the lengths of a handful of simple repeat sites, while next-generation sequencing can assay many more sites and offers a much more complete view of their somatic mutation frequencies. Using somatic mutation data from the exomes of a 361-tumor training set, we developed classifiers to determine MSI status based on four machine-learning frameworks. All frameworks had high accuracy, and after choosing one we determined that it had >98% concordance with clinical tests in a separate 163-tumor test set. Furthermore, this classifier retained high concordance even when classifying tumors based on subsets of whole-exome data. We have released a CRAN R package, MSIseq, based on this classifier. MSIseq is faster and simpler to use than software that requires large files of aligned sequenced reads. MSIseq will be useful for genomic studies in which clinical MSI test results are unavailable and for detecting possible misclassifications by clinical tests.

Figures

Figure 1. Variation in T, S, and S/T across TCGA’s three laboratory-based MSI categories.
MSI-H, microsatellite instable high; MSI-L, microsatellite instable low; MSS, microsatellite stable. P values by Wilcoxon rank-sum tests. Dark horizontal lines are medians; boxes extend from first to third quartiles; whiskers mark the most extreme data points that are ≤1.5 times the length of the box distant from the box.
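The box-and-whisker convention in the Figure 1 caption can be sketched in a few lines of Python. This is an illustration only, not the authors' plotting code: the function name `box_stats` is hypothetical, and the quartile method (`"inclusive"`) is an assumption, since plotting tools differ slightly in how they compute quartiles.

```python
import statistics


def box_stats(values):
    """Summarize one group using the convention described in the caption.

    Assumption: quartiles use Python's "inclusive" method; the plotting
    software behind the figure may use a slightly different definition.
    """
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    box_length = q3 - q1
    low_fence = q1 - 1.5 * box_length
    high_fence = q3 + 1.5 * box_length
    # Whiskers end at the most extreme points still within 1.5 box lengths.
    inside = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

In this made-up sample the value 100 lies more than 1.5 box lengths above the third quartile, so the upper whisker stops at 8 and 100 is reported as an outlier.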
Figure 2. 3-D plot of the variables S.ind, T.sns, and S in the training and test sets.
The lower panel is a close-up view for S.ind  ≤1, T.sns  ≤10, and S  ≤1.23. Tumors with discordant classification by NGSclassifier and laboratory tests are labeled by the last four characters of the tumor identifier.
Figure 3. Prediction accuracy of NGSclassifier (y axis) on exome subsets of varying lengths (x axis).
“Length of exome subset” on the x axis refers to the region that was targeted for sequencing. The prediction accuracy is the number of tumors with concordant MSI status between NGSclassifier and the laboratory test, divided by the total number of tumors. Error bars indicate standard deviations for 1,000 different, random exome subsets at each length. Supplementary Table 1 shows the underlying data.
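The accuracy measure defined in the Figure 3 caption is a simple concordance fraction, and the error bars summarize that fraction over repeated random subsets. The Python sketch below illustrates both steps with made-up labels and accuracies. The function names are hypothetical, and the use of the sample standard deviation is an assumption, because the caption does not say which form was used.

```python
import statistics


def concordance(predicted, laboratory):
    """Fraction of tumors whose predicted MSI status matches the lab test."""
    if len(predicted) != len(laboratory) or not predicted:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p == l for p, l in zip(predicted, laboratory))
    return matches / len(predicted)


def summarize(accuracies):
    """Mean and sample standard deviation across random subsets (assumed form)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Made-up labels for four tumors; one prediction disagrees with the lab test.
predicted = ["MSI-H", "MSS", "MSI-H", "MSS"]
laboratory = ["MSI-H", "MSS", "MSS", "MSS"]
accuracy = concordance(predicted, laboratory)

# Made-up accuracies standing in for repeated random exome subsets.
mean_acc, sd_acc = summarize([0.9, 1.0])
```

In the study each subset length had 1,000 random subsets; here two values stand in for that list to keep the example small.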
Figure 4. 30× depth provides adequate somatic variant calls for NGSclassifier.
Shown are MSI-status classification receiver operating characteristic (ROC) curves. S.ind was calculated from the mutations list generated by a GATK pipeline similar to that used in reference 18. Full-depth or down-sampled exome BAM files from 22 tumor-normal pairs were analyzed. AUC, area under the curve.
Figure 5. Workflow for the R MSIseq package.
Functions and variables in the package are highlighted in blue. MSIseq provides Compute.input.variables() to calculate the potential input variables (S.ind, T.sns, etc.) from (i) a mutation annotation file, (ii) an annotation of the locations of simple repeats in the genome, and (iii) the lengths of the sequenced regions of the genome that were searched for somatic mutations. MSIseq provides these data as used in this paper in the variables NGStraindata, Hg19repeats, and NGStrainseqLen. MSIseq.train() takes the input variables plus (optionally) cancer type information and creates a classifier. Please refer to the MSIseq documentation and vignette for details. MSIseq also provides a pre-computed classifier (called NGSclassifier in the package) that implements the NGSclassifier presented in this paper. For classification of samples with unknown MSI status, input variables can be prepared from the mutation annotation file by Compute.input.variables() and then passed to MSIseq.classify() along with a classifier generated by MSIseq.train().
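To make the three inputs to Compute.input.variables() concrete, the Python sketch below derives two per-megabase rates from a mutation list, a set of repeat intervals, and a sequenced length. It is not the MSIseq implementation and does not use the package's API. It assumes that S.ind counts indels overlapping simple repeats per Mb and that T.sns counts all single-nucleotide substitutions per Mb. The `Mutation` class, the `"SNP"`/`"INS"`/`"DEL"` labels, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    kind: str   # hypothetical labels: "SNP", "INS", or "DEL"


def in_repeat(mutation, repeats):
    """True if the mutation overlaps any (chrom, start, end) repeat interval."""
    return any(
        chrom == mutation.chrom
        and mutation.start <= rep_end
        and mutation.end >= rep_start
        for chrom, rep_start, rep_end in repeats
    )


def input_variables(mutations, repeats, seq_len_mb):
    """Per-Mb rates for one tumor, under the assumed variable definitions."""
    if seq_len_mb <= 0:
        raise ValueError("sequenced length must be positive")
    total_sns = sum(m.kind == "SNP" for m in mutations)
    repeat_indels = sum(
        m.kind in ("INS", "DEL") and in_repeat(m, repeats) for m in mutations
    )
    return {
        "T.sns": total_sns / seq_len_mb,
        "S.ind": repeat_indels / seq_len_mb,
    }


# Made-up data: one simple repeat on chr1 and four somatic mutations.
repeats = [("chr1", 200, 220)]
mutations = [
    Mutation("chr1", 100, 100, "SNP"),
    Mutation("chr1", 205, 207, "DEL"),  # indel inside the repeat
    Mutation("chr2", 50, 51, "INS"),    # indel outside any repeat
    Mutation("chr1", 210, 210, "SNP"),
]
variables = input_variables(mutations, repeats, seq_len_mb=2.0)
```

The linear scan over repeat intervals keeps the sketch short. A genome-scale repeat annotation would need an indexed interval lookup instead.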

References

    1. Iacopetta B., Grieu F. & Amanuel B. Microsatellite instability in colorectal cancer. Asia-Pac J Clin Onco 6, 260–269, 10.1111/J.1743-7563.2010.01335.X (2010).
    2. Boland C. R. et al. A National Cancer Institute workshop on microsatellite instability for cancer detection and familial predisposition: development of international criteria for the determination of microsatellite instability in colorectal cancer. Cancer Res 58, 5248–5257 (1998).
    3. Eshleman J. R. & Markowitz S. D. Mismatch repair defects in human carcinogenesis. Hum Mol Genet 5, 1489–1494 (1996).
    4. Greenman C. et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153–158, 10.1038/nature05610 (2007).
    5. Veigl M. L. et al. Biallelic inactivation of hMLH1 by epigenetic gene silencing, a novel mechanism causing human MSI cancers. P Natl Acad Sci USA 95, 8698–8702, 10.1073/Pnas.95.15.8698 (1998).
