JMIR Med Inform. 2020 Nov 3;8(11):e20826.
doi: 10.2196/20826.

Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study

Carlos R Oliveira et al. JMIR Med Inform.

Abstract

Background: Accurate identification of new diagnoses of human papillomavirus-associated cancers and precancers is an important step toward the development of strategies that optimize the use of human papillomavirus vaccines. The diagnosis of human papillomavirus cancers hinges on a histopathologic report, which is typically stored in electronic medical records as free-form, or unstructured, narrative text. Previous efforts to perform surveillance for human papillomavirus cancers have relied on the manual review of pathology reports to extract diagnostic information, a process that is both labor- and resource-intensive. Natural language processing can be used to automate the structuring and extraction of clinical data from unstructured narrative text in medical records and may provide a practical and effective method for identifying patients with vaccine-preventable human papillomavirus disease for surveillance and research.

Objective: This study's objective was to develop and assess the accuracy of a natural language processing algorithm for the identification of individuals with cancer or precancer of the cervix and anus.

Methods: A pipeline-based natural language processing algorithm was developed, which incorporated machine learning and rule-based methods to extract diagnostic elements from the narrative pathology reports. To test the algorithm's classification accuracy, we used a split-validation study design. Full-length cervical and anal pathology reports were randomly selected from 4 clinical pathology laboratories. Two study team members, blinded to the classifications produced by the natural language processing algorithm, manually and independently reviewed all reports and classified them at the document level according to 2 domains (diagnosis and human papillomavirus testing results). Using the manual review as the gold standard, the algorithm's performance was evaluated using standard measurements of accuracy, recall, precision, and F-measure.
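The evaluation measures named in the Methods can be computed from document-level confusion counts. The following is a minimal sketch, assuming the manual review supplies the reference label and the algorithm output is the prediction. The name `classification_metrics` and the counts in the usage lines are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f_measure: float


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Metrics:
    """Compute accuracy, precision, recall, and F-measure (F1).

    tp/fp/fn/tn are document-level counts of algorithm output compared
    with the manual (gold standard) classification.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one report is required")
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f_measure)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(classification_metrics(tp=40, fp=2, fn=3, tn=55))
```

Here the F-measure is taken to be the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which weighting was used.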

Results: The natural language processing algorithm's performance was validated on 949 pathology reports. The algorithm demonstrated accurate identification of abnormal cytology, histology, and positive human papillomavirus tests with accuracies greater than 0.91. Precision was lowest for anal histology reports (0.87, 95% CI 0.59-0.98) and highest for cervical cytology (0.98, 95% CI 0.95-0.99). The natural language processing algorithm missed 2 out of the 15 abnormal anal histology reports, which led to a relatively low recall (0.68, 95% CI 0.43-0.87).
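The wide confidence intervals reported for anal histology reflect the small number of abnormal reports. The abstract does not say which interval method was used; the sketch below assumes an exact binomial (Clopper-Pearson) interval, computed by bisection with the standard library only. The function names and the counts in the usage lines are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target: float, iters: int = 60) -> float:
    """Bisection on [0, 1] for p with f(p) == target, where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Clopper-Pearson interval for a proportion of k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound: p where P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2
    )
    # Upper bound: p where P(X <= k) = alpha/2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 13 correct out of 15 flagged reports.
    low, high = exact_binomial_ci(13, 15)
    print(f"{13 / 15:.2f} (95% CI {low:.2f}-{high:.2f})")
```

With only about 15 reports in a category, an interval of this kind spans several tenths, so point estimates for the anal histology category should be read with that uncertainty in mind.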

Conclusions: This study outlines the development and validation of a freely available and easily implementable natural language processing algorithm that can automate the extraction and classification of clinical data from cervical and anal cytology and histology.

Keywords: HPV; accuracy; anal cancer; automated data extraction; cancer; cervical cancer; human papillomavirus; natural language processing; pathology reporting; precancer; surveillance.

Conflict of interest statement

Conflicts of Interest: LMN reports previous work as a scientific advisor for Merck. SSS has previously provided consulting services to Merck and received a research grant from Merck. All other authors declare no conflicts of interest.

Figures

Figure 1. Diagrammatic representation of the classification process for pathology reports (color indicates abnormal pathology). AIN: anal intraepithelial neoplasia; AIS: adenocarcinoma in situ; ASC-US: atypical squamous cells of undetermined significance; ASC-H: atypical squamous cells, cannot exclude high-grade squamous intraepithelial lesion; CIN: cervical intraepithelial neoplasia; HSIL: high-grade squamous intraepithelial lesion; LSIL: low-grade squamous intraepithelial lesion; NIEL: negative for intraepithelial lesion; SCC: squamous cell carcinoma.
