Stat Med. 2022 Apr 15;41(8):1361-1375.
doi: 10.1002/sim.9282. Epub 2021 Dec 12.

Determination of the number of observers needed to evaluate a subjective test and its application in two PD-L1 studies

Gang Han et al. Stat Med.

Abstract

In pathological studies, subjective assays, especially companion diagnostic tests, can dramatically affect treatment of cancer. Binary diagnostic test results (i.e., positive vs negative) may vary between pathologists or observers who read the tumor slides. Some tests have clearly defined criteria resulting in highly concordant outcomes, even with minimal training. Other tests are more challenging, and observers may achieve poor concordance even with training. While there are many statistically rigorous methods for measuring concordance between observers, we are unaware of a method that can identify how many observers are needed to determine whether a test can reach an acceptable concordance, if at all. Here we introduce a statistical approach to the assessment of test performance when the test is read by multiple observers, as would occur in the real world. By plotting the number of observers against the estimated overall agreement proportion, we can obtain a curve that plateaus at the average observer concordance. Diagnostic tests that are well-defined and easily judged show high concordance and plateau with few interobserver comparisons. More challenging tests do not plateau until many interobserver comparisons are made, and typically reach a lower plateau or even 0. We further propose a statistical test of whether the overall agreement proportion will drop to 0 with a large number of pathologists. The proposed analytical framework can be used to evaluate the difficulty in the interpretation of pathological test criteria and platforms, and to determine how pathology-based subjective tests will perform in the real world. The method could also be used outside of pathology, where concordance of a diagnosis or decision point relies on the subjective application of multiple criteria.
We apply this method in two recent PD-L1 studies to test whether the curve of overall agreement proportion will converge to 0 and determine the minimal sufficient number of observers required to estimate the concordance plateau of their reads.
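The curve described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`overall_agreement`, `agreement_curves`, `mean_curve`) and the toy ratings matrix are hypothetical, and the sketch only computes the empirical proportion of cases on which the first k raters of a random ordering all agree, averaged over permutations. It omits the paper's inference steps (confidence bounds and the test of convergence to 0).

```python
import random


def overall_agreement(ratings, raters):
    """Proportion of cases on which all listed raters give the same binary call."""
    agree = sum(1 for case in ratings if len({case[r] for r in raters}) == 1)
    return agree / len(ratings)


def agreement_curves(ratings, n_perm=100, seed=0):
    """For each random rater ordering, agreement among the first k raters, k = 2..m."""
    rng = random.Random(seed)
    m = len(ratings[0])  # number of raters (columns)
    curves = []
    for _ in range(n_perm):
        order = list(range(m))
        rng.shuffle(order)
        curves.append(
            [overall_agreement(ratings, order[:k]) for k in range(2, m + 1)]
        )
    return curves


def mean_curve(curves):
    """Average the per-permutation curves at each number of raters."""
    return [sum(col) / len(col) for col in zip(*curves)]


# Hypothetical toy data: rows are cases, columns are raters, entries are 0/1 calls.
toy = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]

curves = agreement_curves(toy, n_perm=50, seed=1)
print(mean_curve(curves))  # mean agreement for k = 2, 3, 4 raters
```

Each per-permutation curve is non-increasing in k, because adding a rater can only break agreement on a case, never restore it. Plotting `mean_curve` against k gives the kind of curve whose plateau the abstract refers to.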

Keywords: Binomial distribution; concordance; inflated binomial distribution; overall agreement proportion; pathological tests.

Figures

FIGURE C1
Plots of the value of p (x-axis) against the exact (solid line) and approximate (dashed line) estimates of θ0 (y-axis) for (A) m = 12, Cm = 15, C0 = 30, θ1 = 0.2; (B) m = 12, Cm = 20, C0 = 30, θ1 = 0.2
FIGURE 1
Bar charts of the PD-L1 expression agreement percentage for (A) the SP263 TNBC data set and (B) the 22c3 tumor NSCLC data set
FIGURE 2
Results of the analysis of the SP263 TNBC data set. (A) ONEST plot from 100 random permutations of the raters; (B) ONEST empirical estimate of the mean and 95% CI using the 100 permutations; (C) ONEST inference about agreement percentage (solid curve with triangles) and the 95% lower bound (dashed curve) at different numbers of raters; (D) ONEST inference about the change in percentage agreement (solid curve with triangles) with 95% upper bound (dashed curve)
FIGURE 3
Results of the analysis of the 22c3 tumor NSCLC data set. (A) ONEST plot from 100 random permutations of the raters; (B) ONEST empirical estimate of the mean and 95% CI using the 100 permutations; (C) ONEST inference about agreement percentage (solid curve with triangles) and the 95% lower bound (dashed curve) at different numbers of raters; (D) ONEST inference about the change in percentage agreement (solid curve with triangles) with 95% upper bound (dashed curve)
