IEEE J Biomed Health Inform. 2020 Jul;24(7):1952-1967.
doi: 10.1109/JBHI.2020.2990797. Epub 2020 May 4.

Knowledge Graph-Enabled Cancer Data Analytics

S M Shamimul Hasan et al. IEEE J Biomed Health Inform. 2020 Jul.

Abstract

Cancer registries collect unstructured and structured cancer data for surveillance purposes, providing important insights into cancer characteristics, treatments, and outcomes. Cancer registry data typically (1) categorize each reportable cancer case or tumor at the time of diagnosis, (2) contain demographic information about the patient such as age, gender, and location at time of diagnosis, (3) include planned and completed primary treatment information, and (4) may contain survival outcomes. As structured data are extracted from various unstructured sources, such as pathology reports, radiology reports, and medical records, and stored for reporting and other needs, the associated information representing a reportable cancer is constantly expanding and evolving. While popular analytic tools such as SEER*Stat and SAS exist, we provide a knowledge graph approach to organizing cancer registry data. Our approach offers unique advantages for timely data analysis and for presenting and visualizing valuable information. This knowledge graph approach semantically enriches the data and readily enables linking with third-party data, which can help explain variation in cancer incidence patterns, disparities, and outcomes. We developed a prototype knowledge graph based on the Louisiana Tumor Registry dataset. We present the advantages of the knowledge graph approach by examining: i) scenario-specific queries, ii) links with openly available external datasets, iii) schema evolution for iterative analysis, and iv) data visualization. Our results demonstrate that this graph-based solution can perform complex queries, improve query run-time performance by up to 76%, and make iterative analyses easier to conduct, enhancing researchers' understanding of cancer registry data.

Figures

Fig. 1.
In this figure, we present a high-level overview of the cancer scientific digital library framework.
Fig. 2.
Hierarchical breast cancer treatment sequence scenarios.
Fig. 3.
Age-specific triple negative breast cancer (TNBC) incidence rates (Female, Louisiana 2010–2012). The left panel shows our results and the right panel shows the results of Hossain et al. [5] (right panel credited to Hossain et al. [5]). According to the 2010 census, Louisiana had 1,463,058 EA women and 766,756 AA women (a ratio of almost 2:1), counts that include the age group of <30 years. “The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions” [49].
Fig. 4.
In this figure, we present TNBC CDI distribution (Female, Louisiana 2010–2012). The left figure shows our results and the right figure presents Hossain et al. results [5] (right figure’s credits go to Hossain et al. [5]).
Fig. 5.
Age-specific TNBC incidence rates (Female, Kentucky 2010–2012). According to the 2010 census, Kentucky had 1,963,670 EA women and 173,032 AA women (the ratio is almost 12:1), counts that include the age group of <30 years. “The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions” [49].
Fig. 6.
TNBC CDI Distribution (Female, Kentucky 2010–2012).
Fig. 7.
This figure illustrates an example of the knowledge graph’s easy schema evolution. Here we show a patient with prostate cancer (code: C61.9) and a laterality value of 2 (value for a paired organ). To identify invalid laterality codes, we add an “isValidLaterality” flag dynamically to the graph without substantially changing the software code.
Fig. 8.
A partial high-level visualization of the LTR knowledge graph (Rural-Urban Continuum Codes portion).
Fig. 9.
Visualization of four hypothetical patients' information. Although the patient information was simulated to avoid revealing protected health information (PHI), the underlying schema is consistent with the actual CTC data.
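
The schema evolution described in the Fig. 7 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the graph is modeled as a plain Python set of (subject, predicate, object) triples, and the predicate names `hasPrimarySite` and `hasLaterality`, the tumor identifiers, the `PAIRED_SITES` set, and the validity rule are all assumptions made for illustration. Only the site code C61.9, the laterality value 2, and the `isValidLaterality` flag come from the caption.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# Predicate names and tumor IDs are hypothetical, not the paper's schema.

# Illustrative assumption: the set of paired-organ site codes.
# Breast (C50.9) stands in for the full list a registry would use.
PAIRED_SITES = {"C50.9"}


def add_laterality_flags(triples, paired_sites=PAIRED_SITES):
    """Return a new triple set with an isValidLaterality flag per tumor.

    Existing triples are left untouched; the flag is simply one more
    predicate, which is the sense in which the schema "evolves".
    """
    site = {s: o for s, p, o in triples if p == "hasPrimarySite"}
    laterality = {s: o for s, p, o in triples if p == "hasLaterality"}

    flagged = set(triples)
    for tumor, code in laterality.items():
        if tumor not in site:
            continue  # no primary site recorded, so nothing to check
        # Simplified rule (assumption): a nonzero laterality code is
        # only valid when the primary site is a paired organ.
        valid = site[tumor] in paired_sites or code == 0
        flagged.add((tumor, "isValidLaterality", valid))
    return flagged


graph = {
    ("tumor1", "hasPrimarySite", "C61.9"),  # prostate, as in Fig. 7
    ("tumor1", "hasLaterality", 2),
    ("tumor2", "hasPrimarySite", "C50.9"),  # breast, a paired site
    ("tumor2", "hasLaterality", 2),
}

flagged = add_laterality_flags(graph)
invalid = sorted(
    s for s, p, o in flagged if p == "isValidLaterality" and o is False
)
print(invalid)  # ['tumor1']
```

Because the flag is stored as an additional triple, queries written against the original predicates keep working unchanged, while new queries can filter on `isValidLaterality` directly.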

References

    1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, and Jemal A, “Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians, vol. 68, no. 6, pp. 394–424, 2018.
    2. “National Cancer Institute’s Surveillance, Epidemiology, and End Results Program,” URL: https://seer.cancer.gov, 2018.
    3. “Center for Disease Control’s National Program of Cancer Registries,” URL: https://www.cdc.gov/cancer/npcr, 2018.
    4. “Surveillance, Epidemiology, and End Results (SEER) Linked Databases,” URL: https://seer.cancer.gov/data-software/linked_databases.html, 2018.
    5. Hossain F, Danos D, Prakash O, Gilliland A, Ferguson TF, Simonsen N, Leonardi C, Yu Q, Wu X-C, Miele L. et al., “Neighborhood social determinants of triple negative breast cancer,” Frontiers in public health, vol. 7, 2019.