J Am Med Inform Assoc. 2013 Mar-Apr;20(2):349-55.
doi: 10.1136/amiajnl-2012-000928. Epub 2012 Jul 21.

Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm

Justin A Strauss et al. J Am Med Inform Assoc. 2013 Mar-Apr.

Abstract

Objective: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.

Materials and methods: SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.
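The dictionary lookup and hierarchical rules described above can be illustrated with a minimal Python sketch. This is not the authors' SAS implementation: the phrases, the DEMO-prefixed codes (placeholders, not real SNOMED codes), the category names, and the functions assign_codes and classify are all hypothetical and chosen only to show the general idea of normalizing report text, matching it against a concept dictionary, and letting the highest-priority category decide the report's status.

```python
import re

# Hypothetical concept dictionary: phrase -> placeholder code.
# These are NOT real SNOMED codes and not SCENT's actual dictionary.
CONCEPTS = {
    "invasive ductal carcinoma": "DEMO-0001",
    "adenocarcinoma": "DEMO-0002",
    "fibroadenoma": "DEMO-0003",
}

# Hypothetical mapping from code to a coarse category.
CATEGORY = {
    "DEMO-0001": "malignant",
    "DEMO-0002": "malignant",
    "DEMO-0003": "benign",
}

# Assumed hierarchy: the first category found in this order wins.
PRIORITY = ("malignant", "benign")


def assign_codes(report_text):
    """Return the sorted placeholder codes whose phrases occur in the report."""
    # Preprocessing: lowercase and collapse runs of whitespace.
    text = re.sub(r"\s+", " ", report_text.lower())
    return sorted(
        code
        for phrase, code in CONCEPTS.items()
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )


def classify(codes):
    """Apply the assumed hierarchy to a list of codes."""
    categories = {CATEGORY[code] for code in codes}
    for level in PRIORITY:
        if level in categories:
            return level
    return "uncoded"


sample = "Breast, left, biopsy: INVASIVE  DUCTAL\nCARCINOMA with adjacent fibroadenoma."
sample_codes = assign_codes(sample)
sample_status = classify(sample_codes)
```

In this sketch a report that mentions both a malignant and a benign concept is classified as malignant because of the assumed priority order. A real system would also need to handle negation, uncertainty, and history statements, which this sketch leaves out.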

Results: Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.
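As a worked illustration of the accuracy measures named above, the following Python sketch computes sensitivity, specificity, positive and negative predictive value, and a two-class Cohen's κ from a 2x2 table. The counts are pooled from the figures in this abstract under two assumptions that the abstract does not confirm: new primary and recurrent cancers are merged into one "malignant" class, and the four cases SCENT did not identify are treated as false negatives. The abstract reports its measures per cancer group, so the pooled values here are illustrative rather than the published results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # malignant reports flagged as malignant
    fn: int  # malignant reports missed
    fp: int  # benign reports flagged as malignant
    tn: int  # benign reports flagged as benign


def diagnostic_metrics(c):
    """Standard 2x2 accuracy measures as fractions between 0 and 1."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def cohen_kappa(c):
    """Two-class Cohen's kappa: agreement corrected for chance."""
    n = c.tp + c.fn + c.fp + c.tn
    observed = (c.tp + c.tn) / n
    expected = (
        (c.tp + c.fn) * (c.tp + c.fp) + (c.tn + c.fp) * (c.tn + c.fn)
    ) / n**2
    return (observed - expected) / (1 - expected)


# Pooled counts under the stated assumptions:
# 51 of 54 new primary + 60 of 61 recurrent, 3 false positives in 792 benign.
pooled = Confusion(tp=51 + 60, fn=(54 - 51) + (61 - 60), fp=3, tn=792 - 3)

pooled_metrics = diagnostic_metrics(pooled)
pooled_kappa = cohen_kappa(pooled)

for name, value in pooled_metrics.items():
    print(f"{name}: {value:.3f}")
print(f"kappa (two-class): {pooled_kappa:.3f}")
```

Under these assumptions all four measures come out above 0.94, which is consistent with the thresholds stated in the abstract. The κ values reported in the abstract are per cancer group and compare SCENT against abstractor classifications, so they need not match this pooled two-class figure.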

Discussion: Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.

Conclusion: SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.

Conflict of interest statement

Competing interests: None.

Figures

Figure 1. Process diagram for a SAS-based coding, extraction, and nomenclature tool (SCENT).

Figure 2. Sample pathology report text following preprocessing by SCENT.

Figure 3. Sample pathology report text following preprocessing and code assignment by SCENT.

Figure 4. Sample chart review form used by abstractors to classify the pathology reports of breast and prostate cancer patients.
