Nat Commun. 2024 Dec 16;15(1):10685.
doi: 10.1038/s41467-024-54769-y.

GrandQC: A comprehensive solution to quality control problem in digital pathology


Zhilong Weng et al.

Abstract

Histological slides contain numerous artifacts that can significantly deteriorate the performance of image analysis algorithms. Here we develop the GrandQC tool for tissue and multi-class artifact segmentation. GrandQC allows for high-precision tissue segmentation (Dice score 0.957) and segmentation of tissue without artifacts (Dice score 0.919–0.938 depending on magnification). Slides from 19 international pathology departments, digitized with the most common scanning systems, and from The Cancer Genome Atlas dataset were used to establish a QC benchmark, analyzing inter-institutional, intra-institutional, temporal, and inter-scanner slide quality variations. GrandQC improves the performance of downstream image analysis algorithms. We open-source the GrandQC tool, our large manually annotated test dataset, and all QC masks for the entire TCGA cohort to address the problem of QC in digital/computational pathology. GrandQC can be used as a tool to monitor sample preparation and scanning quality in pathology departments and help track and eliminate major artifact sources.
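Segmentation quality throughout the paper is reported as Dice scores. As a minimal illustration of how such a score is computed from binary masks (a generic NumPy sketch, not the authors' implementation):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks: 2*|A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Two 4x4 masks whose 3-pixel foregrounds overlap in 2 pixels: Dice = 4/6.
a = np.zeros((4, 4), dtype=bool); a[0, :3] = True
b = np.zeros((4, 4), dtype=bool); b[0, 1:4] = True
print(round(dice_score(a, b), 3))  # → 0.667
```

A Dice score of 0.957 for tissue segmentation thus means the predicted and annotated tissue masks overlap almost completely.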


Conflict of interest statement

Competing interests: The authors declare no relevant competing interests.

Figures

Fig. 1
Fig. 1. Types of artifacts and training dataset preparation.
A Different types of artifacts: principles of emergence. Shown is a typical processing pipeline of a pathology department, from tissue sampling and submission by clinicians to histological slide preparation. The common mechanisms of artifact emergence are provided, with most artifacts arising during preparation of the slides. The only digitization-specific artifact is out-of-focus regions, which, however, might also be a consequence of suboptimal cutting and staining quality. B Training datasets. Two training datasets with partially overlapping cases were prepared: for the tissue detection (n = 208 slides) and artifact detection (n = 420 slides) tasks. For large slide series included in the training, the organs/tumor types are provided as well as the source of the slides. For details of the datasets see Methods. C Annotation principles and classes. Precise manual annotations were performed by expert analysts for the 9 classes shown (for training purposes, AIR BUBBLE and SLIDE EDGE as well as DARK SPOT and FOREIGN BODY were each merged into one class due to similarity). Not shown are annotations for the tissue detection task, which included two classes (tissue and background). Abbreviations: TCGA, The Cancer Genome Atlas; UKK, University Hospital Cologne; PAI-WSIT, PAI-WSIT cohort. Scale bars in all microscopic images are 200 µm. Created in https://BioRender.com.
Fig. 2
Fig. 2. GrandQC algorithm development and validation of tissue detection module.
A GrandQC: Algorithm development. Shown is the pipeline of algorithm development for two separate modules: tissue detection and artifact detection. Both modules are pixel-wise segmentation networks, with tissue detection working at 1x objective magnification and artifact detection trained in three flavors for 10x, 7x, and 5x magnification; higher resolutions allow more precision, while lower resolutions allow quicker analysis at the cost of minimal changes in accuracy. The two modules together form the GrandQC tool, which is open-sourced for academic research use (https://github.com/cpath-ukk/grandqc). The working principle is shown below. B Example of tissue detection in a biopsy case with multiple very small tissue particles, showing reliable tissue segmentation. C Extreme situations during tissue detection. The algorithm performs very well in challenging situations such as old tissue sections with poor-quality cover glass, detection behind air bubbles, at glass edges, or in out-of-focus regions. D Real-world validation of tissue detection. A heterogeneous real-world dataset containing 600 whole-slide images from different organs and specimen types (5 pathology departments, 5 different scanning systems) was provided to two experienced human analysts, who graded tissue detection on a 0–10 scale per slide. Single points were deducted for overdetection, subtle non-relevant underdetection was graded as 7 points, and any relevant tissue underdetection was graded with a low point number. Both analysts reported excellent tissue detection capabilities, with slide-level and average accuracy/quality results provided (Reviewer 1: 9.48 of a maximum 10 points; Reviewer 2: 9.40 of 10). All inaccuracies, mostly very fine, were considered non-relevant.
For the box plots in the figures, the center line represents the median, the red and blue points represent the mean score, the box bounds depict the interquartile range (IQR), covering the 25th to 75th percentiles (the middle 50% of the scores), and the whiskers extend to 1.5 times the IQR from the lower and upper quartiles, capturing a broader spread of the data. Created in https://BioRender.com. Source data is provided as a Source Data file.
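The multi-magnification design above implies mapping between the scan's base magnification (e.g., 40x) and a module's working magnification (10x, 7x, 5x, or 1x). A hedged sketch of this downsampling arithmetic (function names are illustrative, not from the GrandQC code):

```python
def downsample_factor(base_mag: float, target_mag: float) -> float:
    """Factor by which the base-resolution image must shrink to reach target magnification."""
    if target_mag <= 0 or target_mag > base_mag:
        raise ValueError("target magnification must be in (0, base_mag]")
    return base_mag / target_mag

def patch_size_at_base(patch_px: int, base_mag: float, target_mag: float) -> int:
    """Side length to read at base resolution so the downsampled patch is patch_px wide."""
    return int(round(patch_px * downsample_factor(base_mag, target_mag)))

# A 512 px patch at 7x taken from a 40x scan corresponds to ~2926 px at base resolution.
print(patch_size_at_base(512, 40.0, 7.0))  # → 2926
```

This also illustrates why the 5x flavor is fastest: each processed patch covers twice the tissue area of a 7x patch and four times that of a 10x patch.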
Fig. 3
Fig. 3. Examples of GrandQC application to whole-slide images.
A Example of processing of a whole-slide image by GrandQC (7x version). For demonstration purposes, an image from a prostatectomy case with a substantial number of artifacts is shown. The detectable artifact classes as well as areas without artifacts are shown in different colors (see color legend). Background is shown in white (detection by the tissue detection module). B Further representative, high-resolution examples of artifact detection by the GrandQC artifact detection module. All scale bars are 200 µm. Created in https://BioRender.com.
Fig. 4
Fig. 4. Creation of large test dataset with artifacts and formal validation of artifact detection module.
A Creation of a large test dataset. A large dataset (n = 318 slides) of slides with artifacts was generated. Representative tissue samples from four organs were taken from one department. An experienced lab technician reproduced different types of artifacts. The slides were digitized with a scanner and precisely manually annotated for formal validation of GrandQC, resulting in a dataset of 51,283, 26,571, and 17,145 single image patches for 10x, 7x, and 5x extraction magnification, respectively. B Formal validation results for the artifact detection task. All metrics represent Dice scores for segmentation accuracy. Shown are the results for three different algorithm versions (5x, 7x, and 10x). Note that the relatively low Dice score for Dark Spot & Foreign is due to inter-artifact misclassification. The accuracy of detection of tissue without artifacts is the most important score (0.919–0.938 depending on version). Abbreviation: OoFocus, out-of-focus. C Performance analysis of the algorithm: speed of single-slide analysis. The metrics are shown for both the tissue detection and artifact detection modules, for two different datasets: resection specimens and biopsy specimens (Supplementary Methods; biopsy specimens with at least 3 levels of tissue per slide). The times provided cover only the step of algorithm processing of all slide patches; e.g., generation of overlays, saving images to files, or any further manipulations, which can take additional time, are excluded. The test was performed on a typical PC workstation with a consumer-level GPU (NVIDIA RTX 3090). Created in https://BioRender.com.
Fig. 5
Fig. 5. Analysis of misclassifications in artifact detection.
A A detailed analysis of misclassifications was performed for the artifact detection module. Representative examples are shown and summarized in the table in (B). The numbers refer to the type of misclassification in the table. The color legend depicts different artifacts and types of misclassification (false positive artifact detection or false negative artifact underdetection). All scale bars are 100 µm. B Patterns of misclassification and their relevance. Left side: the patterns of misclassification per class are shown, always horizontally for the single color-coded artifact on the left side. The most prominent inter-artifact misclassification is for DARK SPOT / FOREIGN (green), with 24.6% of the detected artifact area misclassified as AIR/EDGE. This is specific to the test dataset used (synthetically generated real artifacts): synthetic threads were used for foreign body imitation, and these are highly reminiscent of air bubble borders. In real-world applications, this misclassification is mostly not seen. Right side: the provided table summarizes the patterns of misclassification. Most misclassifications result from minimal pixel-level variations in the perception of object boundaries (1), imprecise annotations (2), or inter-artifact misclassifications (4) and are therefore non-significant. Patterns 5 and 6 are also highly tolerable, as they extend the area of properly detected artifacts, with 6 being even beneficial. Several misclassification patterns might be relevant for end users and appear in a very limited number of slides: detection of pigmented regions as AIR in malignant melanoma (Supplementary Fig. 3; partially addressed in the course of algorithm development) and overcalling of DARK SPOT/FOREIGN in fatty tissue (pattern 5) in single slides scanned by 3DHISTECH scanners. These problems can be easily overcome, as GrandQC detects single artifacts as different classes, so misdetected classes can be specifically ignored.
Abbreviations: FP, false positive; FN, false negative; Freq, frequency; Sign, significance.
Fig. 6
Fig. 6. Investigating GrandQC as a benchmark for pathology institutes and scanning systems.
A GrandQC effectively benchmarks slide quality by allowing pathology institutes to assess artifact frequency and area in sample slides. We analyzed slides from 19 different pathology departments: the point plot represents the % area for all artifact types in a single image. All departments are ordered along the line by the mean % area of artifacts in the whole slide set, serving as a baseline and reference for new departments assessing slide quality. In the box plots, the center line represents the median, black points show the mean, and box bounds indicate the interquartile range (IQR), covering the 25th to 75th percentiles. Whiskers extend to 1.5 times the IQR; gray points denote outliers beyond this range. B Detailed analysis for pathology departments concerning single artifact types (% area occupied by artifacts in the whole dataset). Comments: *In the P14 pathology department (3DHISTECH scanner) we observed overcalling of properly detected DARK SPOT (attributed to dust particles and greasy fingerprints). # marks a prostate dataset prepared by the pathology department for a project on immunohistochemistry registration (PESO dataset). Its quality is exceptionally high and uncommon for regular pathology departments, which should be considered in research projects. C GrandQC as a benchmark for scanning systems. The two most common scanning systems were tested on a heterogeneous dataset from University Hospital Cologne containing different organs/specimens (each slide digitized with both scanners). Single points represent slides, with the % area occupied by all or a single artifact type as coordinates (x: Scanner 1; y: Scanner 2). A 95% confidence interval was used. While artifact areas showed general concordance, Scanner 2 produced significantly more out-of-focus regions, making this a notable benchmark for departments selecting scanning systems to optimize slide quality. D Temporal validation of slide quality serves as a third benchmark, enabling pathology departments to monitor changes over time.
Data from University Hospital Cologne are shown on a yearly basis. The analysis can also be done daily, weekly, or monthly, with added outlier detection. Analyzing different artifact types reflects various aspects of slide preparation, from lab work to digitization. Source data is provided as a Source Data file.
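The per-slide % artifact area underlying these benchmarks can be derived from a segmentation mask by counting artifact pixels over tissue pixels. A minimal sketch, assuming integer class IDs (the label map here is illustrative; GrandQC's actual class encoding may differ):

```python
import numpy as np

BACKGROUND = 0                      # assumed class IDs, not GrandQC's official map
ARTIFACT_IDS = (2, 3, 4, 5, 6, 7)   # e.g., fold, out-of-focus, air/edge, dark spot/foreign, pen, stain

def artifact_area_percent(mask: np.ndarray) -> float:
    """% of tissue area (all non-background pixels) occupied by any artifact class."""
    tissue = mask != BACKGROUND
    n_tissue = tissue.sum()
    if n_tissue == 0:
        return 0.0
    artifact = np.isin(mask, ARTIFACT_IDS)
    return 100.0 * float(artifact.sum()) / float(n_tissue)

# 9 tissue pixels, 3 of them artifact (classes 2 and 3): 33.3% artifact area.
mask = np.array([[0, 0, 1, 1],
                 [0, 2, 1, 1],
                 [3, 2, 1, 1]])
print(round(artifact_area_percent(mask), 1))  # → 33.3
```

Aggregating this value per department, per scanner, or per time window yields the benchmark curves shown in the figure.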
Fig. 7
Fig. 7. Analysis of artifacts in the largest open-source multi-organ malignant tumor cohort (TCGA) used by research groups worldwide.
A, B The same analysis as in Fig. 6 is provided for the largest open-source research cohort (The Cancer Genome Atlas), containing multiple cohorts of patient cases and slides for different types of malignant tumors. Only diagnostic slides (formalin-fixed, paraffin-embedded tissue) were analyzed. A Slide-level analysis of the % area occupied by all artifacts (above) and stratification of cohorts (below) based on the mean value. B Cohort-level analysis for single types of artifacts. Comments: *Uveal melanoma (highly pigmented) cohort with focal misclassification of pigmented areas as AIR (high similarity). Refer to https://gdac.broadinstitute.org/ for abbreviations of cohort names. All artifact and tissue detection masks for all TCGA cohorts are open-sourced for academic research purposes. This is of utmost value for research projects with non-supervised approaches. Using the provided masks allows removal of artifact-affected areas (2–10% of slide area) from training, which might be an important confounder for algorithm biases and inaccuracy. For the box plots in (A), the center line represents the median, the black points represent the mean, the box bounds depict the interquartile range (IQR), covering the 25th to 75th percentiles (the middle 50% of the values), the whiskers extend to 1.5 times the IQR from the lower and upper quartiles, and the gray points are outliers beyond the whiskers. Source data is provided as a Source Data file.
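Using the released TCGA QC masks to exclude artifact-affected areas from training can be as simple as thresholding the artifact fraction per candidate patch. A minimal sketch (the 5% threshold and boolean-mask layout are assumptions, not a prescribed recipe):

```python
import numpy as np

def keep_patch(qc_mask_patch: np.ndarray, max_artifact_frac: float = 0.05) -> bool:
    """Keep a patch for training only if its artifact fraction is at or below the threshold.

    qc_mask_patch: boolean array, True where the QC mask flags an artifact.
    """
    frac = float(qc_mask_patch.mean())
    return frac <= max_artifact_frac

clean = np.zeros((64, 64), dtype=bool)                    # artifact-free patch
folded = np.zeros((64, 64), dtype=bool); folded[:16, :] = True  # 25% artifact area
print(keep_patch(clean), keep_patch(folded))  # → True False
```

Such a filter is particularly relevant for self- or non-supervised training, where no annotator would otherwise catch patches dominated by folds, pen marks, or out-of-focus blur.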
Fig. 8
Fig. 8. GrandQC improves performance of diagnostic algorithms in downstream applications.
Three use cases (situations) were evaluated to demonstrate the ability of GrandQC to improve the performance of downstream algorithms: A For diagnostic multi-class tissue segmentation algorithms: preventing false positive tumor classifications in benign tissue regions. Representative examples demonstrate how the detection and masking of artifacts help prevent false positive misclassifications in regions of interest containing benign tissue from two diagnostic domains (colorectal, lung). These regions were analyzed using previously developed multi-class tissue segmentation algorithms for lung and colorectal cancer, respectively. B For diagnostic multi-class tissue segmentation algorithms: improving segmentation accuracy in tumor regions. A total of 126 regions of interest (ROIs) were analyzed for lung cancer and 121 for colorectal cancer, resulting in 2,016 and 1,936 patches, respectively, each sized 512 × 512 px (MPP 1.0). In these ROIs, synthetic out-of-focus regions were generated. The segmentation accuracy results are shown for each of the diagnostic tools (lung and colorectal), comparing a baseline (ROIs without artifacts), OOF regions without quality control, and OOF regions with artifact detection and masking using GrandQC. Notably, while GrandQC detects and masks artifacts, preventing misclassifications, it does not enhance the algorithm's ability to detect structures in the affected areas. Therefore, thresholds may be needed to prompt pathologists to conduct additional reviews when large artifact areas are detected in certain contexts, as these may obscure important findings. C For single-cell detection/classification algorithms: preventing false cell detections and classifications. An algorithm was trained for single-cell detection and classification (epithelial/tumor cells and five classes of immune/stromal cells) in colorectal cancer.
These algorithms are particularly vulnerable to artifacts due to the small size and subtle features of the cells being analyzed. The case presented shows how a tissue fold artifact led to false cell detections and their misclassification as epithelial/tumor cells (brown; various other colors represent other cell classes). This issue can be easily prevented through a preanalytical step using GrandQC, which masks artifacts before downstream processing. All scale bars are 200 µm.
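The preanalytical masking step described above amounts to voiding predictions wherever the QC mask flags an artifact, so that downstream counts and areas exclude unreliable regions. A hedged sketch (the sentinel value 255 and the class layout are illustrative, not the authors' convention):

```python
import numpy as np

IGNORE = 255  # illustrative sentinel meaning "excluded from analysis"

def apply_qc_mask(pred: np.ndarray, artifact_mask: np.ndarray) -> np.ndarray:
    """Void downstream predictions wherever the QC mask flags an artifact."""
    if pred.shape != artifact_mask.shape:
        raise ValueError("prediction and QC mask must be pixel-aligned")
    cleaned = pred.copy()
    cleaned[artifact_mask.astype(bool)] = IGNORE
    return cleaned

pred = np.array([[1, 1, 2], [2, 2, 2]])               # e.g., tumor/benign class map
fold = np.array([[0, 1, 1], [0, 0, 0]], dtype=bool)   # tissue-fold artifact region
print(apply_qc_mask(pred, fold).tolist())  # → [[1, 255, 255], [2, 2, 2]]
```

The same masking applies to cell-level outputs: detections whose coordinates fall inside the artifact mask are simply dropped before counting.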

