Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm
- PMID: 22822041
- PMCID: PMC3638182
- DOI: 10.1136/amiajnl-2012-000928
Abstract
Objective: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.
Materials and methods: SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.
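The dictionary-driven coding step described above can be illustrated with a minimal sketch. This is not SCENT itself: it is written in Python rather than SAS, and the concept phrases, the placeholder codes (not real SNOMED codes), the negation cues, and the rule that a malignant finding outranks a benign one are all assumptions chosen for illustration. The distinction between new primary and recurrent cancers is not modeled.

```python
import re

# Hypothetical mini-dictionary: phrase -> (placeholder code, category).
# The codes are placeholders for illustration, NOT real SNOMED codes.
CONCEPTS = {
    "invasive ductal carcinoma": ("PLACEHOLDER-001", "malignant"),
    "adenocarcinoma": ("PLACEHOLDER-002", "malignant"),
    "fibroadenoma": ("PLACEHOLDER-003", "benign"),
    "benign prostatic tissue": ("PLACEHOLDER-004", "benign"),
}

# Assumed negation cues; a mention preceded by one in the same sentence is ignored.
NEGATION_CUES = re.compile(r"\b(?:no|not|negative for|without)\b")

# Assumed hierarchy: a malignant finding outranks a benign one.
RANK = {"malignant": 2, "benign": 1}


def code_report(text):
    """Return (status, sorted codes) for one pathology report's free text."""
    hits = []
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for phrase, (code, category) in CONCEPTS.items():
            pos = sentence.find(phrase)
            if pos == -1:
                continue
            # Skip the mention if a negation cue appears earlier in the sentence.
            if NEGATION_CUES.search(sentence[:pos]):
                continue
            hits.append((code, category))
    if not hits:
        return "unclassified", []
    status = max((category for _, category in hits), key=RANK.__getitem__)
    return status, sorted({code for code, _ in hits})


print(code_report("Left breast biopsy: invasive ductal carcinoma. Fibroadenoma also present."))
print(code_report("Benign prostatic tissue. No evidence of adenocarcinoma."))
```

In this sketch the first report is classified as malignant because the highest-ranked concept wins, while the second stays benign because the only malignant term is negated.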
Results: Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.
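As a worked illustration of how such agreement measures are computed, the sketch below pools the counts quoted above into a single malignant-versus-benign 2×2 table. Two assumptions are made here that the abstract does not state: the new primary and recurrent categories are collapsed into one malignant class, and the four missed cases are treated as having been called benign. The pooled figures are therefore illustrative and need not match the per-group values reported in the paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy measures and Cohen's kappa for a 2x2 confusion table."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # observed agreement
    # Chance agreement from the marginal totals of each rater.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


# Pooled counts under the stated assumptions:
# 51 + 60 = 111 malignant cases detected out of 54 + 61 = 115,
# and 3 false positives among 792 benign cases.
pooled = binary_metrics(tp=111, fn=4, fp=3, tn=789)
for name, value in pooled.items():
    print(f"{name}: {value:.3f}")
```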
Discussion: Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.
Conclusion: SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.