Comparative Study

IEEE Trans Med Imaging. 2004 Jul;23(7):903-21.

doi: 10.1109/TMI.2004.828354.

Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation

Simon K Warfield¹, Kelly H Zou, William M Wells

PMID: 15250643
PMCID: PMC1283110
DOI: 10.1109/TMI.2004.828354

Simon K Warfield et al. IEEE Trans Med Imaging. 2004 Jul.

Affiliation

¹ Harvard Medical School and the Department of Radiology of Brigham and Women's Hospital, 75 Francis St, Boston, MA 02115, USA. warfield@bwh.harvard.edu

Abstract

Characterizing the performance of image segmentation approaches has been a persistent challenge. Performance analysis is important since segmentation algorithms often have limited accuracy and precision. Interactive drawing of the desired segmentation by human raters has often been the only acceptable approach, and yet suffers from intra-rater and inter-rater variability. Automated algorithms have been sought in order to remove the variability introduced by raters, but such algorithms must be assessed to ensure they are suitable for the task. The performance of raters (human or algorithmic) generating segmentations of medical images has been difficult to quantify because of the difficulty of obtaining or estimating a known true segmentation for clinical data. Although physical and digital phantoms can be constructed for which ground truth is known or readily estimated, such phantoms do not fully reflect clinical images due to the difficulty of constructing phantoms which reproduce the full range of imaging characteristics and normal and pathological anatomical variability observed in clinical data. Comparison to a collection of segmentations by raters is an attractive alternative since it can be carried out directly on the relevant clinical imaging data. However, the most appropriate measure or set of measures with which to compare such segmentations has not been clarified and several measures are used in practice. We present here an expectation-maximization algorithm for simultaneous truth and performance level estimation (STAPLE). The algorithm considers a collection of segmentations and computes a probabilistic estimate of the true segmentation and a measure of the performance level represented by each segmentation. The source of each segmentation in the collection may be an appropriately trained human rater or raters, or may be an automated segmentation algorithm. 
The probabilistic estimate of the true segmentation is formed by estimating an optimal combination of the segmentations, weighting each segmentation depending upon the estimated performance level, and incorporating a prior model for the spatial distribution of structures being segmented as well as spatial homogeneity constraints. STAPLE is straightforward to apply to clinical imaging data, readily enables assessment of the performance of an automated image segmentation algorithm, and enables direct comparison of human rater and algorithm performance.
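
The expectation-maximization loop described in the abstract can be sketched for the binary (foreground/background) case as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: voxels are treated as independent (no MRF or spatially varying prior), each rater j is summarized by a sensitivity p_j and specificity q_j, and when no prior is supplied a single global prior is taken as the mean of all rater decisions. The function name `staple_binary` and its parameters are hypothetical.

```python
import numpy as np

def staple_binary(D, prior=None, p0=0.99, q0=0.99, max_iter=100, tol=1e-7):
    """Sketch of binary STAPLE-style EM under voxelwise independence.

    D     : (n_voxels, n_raters) array of 0/1 rater decisions.
    prior : assumed global probability that a voxel is foreground;
            defaults to the mean of all decisions (an assumption here).
    Returns (W, p, q): per-voxel foreground probability, and per-rater
    sensitivity and specificity estimates.
    """
    D = np.asarray(D, dtype=float)
    n_voxels, n_raters = D.shape
    eps = 1e-12  # guards against log(0) and division by zero
    if prior is None:
        prior = D.mean()
    prior = float(np.clip(prior, eps, 1.0 - eps))
    p = np.full(n_raters, p0)
    q = np.full(n_raters, q0)
    W = np.full(n_voxels, prior)

    for _ in range(max_iter):
        # E-step: posterior probability that each voxel is foreground,
        # computed in the log domain for numerical stability.
        log_a = (np.log(prior)
                 + D @ np.log(p + eps)
                 + (1.0 - D) @ np.log(1.0 - p + eps))
        log_b = (np.log(1.0 - prior)
                 + (1.0 - D) @ np.log(q + eps)
                 + D @ np.log(1.0 - q + eps))
        W = 1.0 / (1.0 + np.exp(np.clip(log_b - log_a, -500.0, 500.0)))

        # M-step: re-estimate each rater's sensitivity and specificity,
        # weighting voxels by the current soft truth estimate.
        p_new = (W @ D) / (W.sum() + eps)
        q_new = ((1.0 - W) @ (1.0 - D)) / ((1.0 - W).sum() + eps)

        change = max(np.abs(p_new - p).max(), np.abs(q_new - q).max())
        p, q = p_new, q_new
        if change < tol:
            break
    return W, p, q
```

Thresholding `W` at 0.5 gives a hard estimate of the true segmentation, while `p` and `q` play the role of the per-rater performance levels. Initializing `p0` and `q0` close to 1 reflects the assumption that raters are mostly correct, which keeps the estimate from converging to the label-swapped solution.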

Figures

**Fig. 1**
A specified true segmentation was randomly sampled to generate R = 10 segmentations with (p_j, q_j) = (0.95, 0.90) ∀j ∈ {1, …, 10}. STAPLE converged in fewer than 20 iterations. The estimated performance parameters were $\hat{p}$ = 0.950104 ± 0.001201 (mean ± standard deviation) and $\hat{q}$ = 0.900035 ± 0.001685, which closely match the specified parameters. (a)-(d) Examples of synthetic segmentations. (e) STAPLE estimated true segmentation. The estimate contains seven isolated single-pixel errors, which are circled to enable clearer visualization. (f) Estimated true segmentation from STAPLE incorporating an MRF prior. The estimate exactly matches the known true segmentation.
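
The kind of synthetic experiment described in this caption can be reproduced in outline by sampling rater decisions from a known binary truth. The sketch below assumes each rater labels a true foreground voxel as foreground with probability p_j and a true background voxel as background with probability q_j, independently per voxel. The function name `simulate_raters` is hypothetical, and the truth image in the usage lines is arbitrary rather than the one used in the figure.

```python
import numpy as np

def simulate_raters(truth, p, q, rng):
    """Sample binary rater decisions from a known binary truth.

    truth : array of 0/1 (any shape); flattened to n_voxels.
    p, q  : per-rater sensitivities and specificities, each of length R.
    rng   : a numpy.random.Generator.
    Returns an (n_voxels, R) uint8 array of decisions.
    """
    t = np.asarray(truth).astype(bool).reshape(-1, 1)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Probability of answering "foreground": p_j where the truth is
    # foreground, 1 - q_j where the truth is background.
    prob_one = np.where(t, p, 1.0 - q)
    u = rng.random(prob_one.shape)
    return (u < prob_one).astype(np.uint8)

# Usage: R = 10 raters with (p_j, q_j) = (0.95, 0.90) on an arbitrary truth.
rng = np.random.default_rng(0)
truth = np.zeros((100, 100), dtype=np.uint8)
truth[30:70, 30:70] = 1
D = simulate_raters(truth, [0.95] * 10, [0.90] * 10, rng)
```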

**Fig. 2**
Segmentations generated from R = 3 synthetic experts with parameters specified as (p₁, q₁) = (0.95, 0.95), (p₂, q₂) = (0.95, 0.90), and (p₃, q₃) = (0.90, 0.90). Only three observations of segmentations by these experts were generated, leading to a small and noisy set of data from which STAPLE was used to estimate the performance parameters and the true segmentation. STAPLE was executed to convergence, initialized with f(T_i = 1) = 0.5 and $(p_{j}^{(0)}, q_{j}^{(0)})$ = (0.99999, 0.99999), ∀j. Comparisons were made with and without a spatial homogeneity prior modeled with an MRF prior assuming four-nearest-neighbor pairwise interaction cliques and homogeneous interaction strength β = 2.5. The results indicate the estimated T found by STAPLE with the MRF prior exactly matches the specified true segmentation used for the simulations, whereas without this constraint the estimated T is somewhat noisier. In both cases, the estimated performance level parameters were very close to the parameters specified for the random segmentations. (a) Expert 1 segmentation. (b) Expert 2 segmentation. (c) Expert 3 segmentation. (d) STAPLE true segmentation estimate under voxelwise independence assumption. (e) STAPLE true segmentation estimate assuming spatially homogeneous true segmentation.

**Fig. 3**
R = 3 synthetic true segmentations with equal volume but different spatial locations. STAPLE was initialized with $(p_{j}^{(0)}, q_{j}^{(0)})$ = (0.99, 0.99), ∀j. The STAPLE true segmentation estimate is shown for the different prior probability assumptions: f(T_i = 1) = 0.12, which closely matches the segmentations; f(T_i = 1) = 0.5, which corresponds to a prior belief that half of the field of view should be the foreground class; and automatic estimation of the prior via (35). (a) First synthetic segmentation. (b) Second synthetic segmentation, equal in size to the first but shifted to the left 10 voxels. (c) Third synthetic segmentation, equal in size to the first but shifted to the right 10 voxels. (d) STAPLE true segmentation estimate for f(T_i = 1) = 0.12 ∀i. The estimated performance parameters were $(\hat{p}_{1}, \hat{q}_{1}) = (1.0, 1.0)$, $(\hat{p}_{2}, \hat{q}_{2}) = (\hat{p}_{3}, \hat{q}_{3}) = (0.88, 0.99)$. (e) STAPLE true segmentation estimate for f(T_i = 1) = 0.5 ∀i. The estimated performance parameters were $(\hat{p}_{j}, \hat{q}_{j}) = (0.66, 1.0)$ ∀j. (f) STAPLE true segmentation estimate with automatic prior estimation. The estimated performance parameters were $(\hat{p}_{1}, \hat{q}_{1}) = (1.0, 1.0)$, $(\hat{p}_{2}, \hat{q}_{2}) = (\hat{p}_{3}, \hat{q}_{3}) = (0.88, 0.99)$.

**Fig. 4**
Brain phantom and STAPLE estimated true segmentation with one expert and three inexperienced rater segmentations. Different initializations lead to different estimates. Exact estimation of the true segmentation is possible with STAPLE but not with a majority vote rule. The image that was segmented is shown in (a). The red color indicates the segmentation of the cortex from the consensus segmentation in (b), and from the STAPLE estimate in (c). In (d), the area of most frequent selection by the raters and expert is shown in red, and the light blue color represents the region selected only by the expert.

**Fig. 5**
Comparison of exact and estimated complete data log likelihood function for estimation of rater sensitivity and specificity, derived from the brain phantom with one expert and three inexperienced rater segmentations. (a) Exact ln f(D, T|θ) dependence on sensitivity for inexperienced rater and expert. (b) Estimated function of sensitivity from Q(θ|θ⁽ᵏ⁾) after convergence from unequal rater initialization. (c) Estimated function of sensitivity from Q(θ|θ⁽ᵏ⁾) after convergence from equal rater initialization. (d) Exact ln f(D, T|θ) dependence on specificity for inexperienced rater and expert. (e) Estimated function of specificity from Q(θ|θ⁽ᵏ⁾) after convergence from unequal rater initialization. (f) Estimated function of specificity from Q(θ|θ⁽ᵏ⁾) after convergence from equal rater initialization.

**Fig. 6**
Comparison of STAPLE and voting rule estimates of the true segmentation from segmentations of the cortex generated by one expert and three medical students. The color coding of the frequency of selection is as shown in Fig. 7.

**Fig. 7**
(a) MRI of the prostate. (b) The prostate peripheral zone. (c) The frequency of assignment of voxels to the prostate peripheral zone in five repeated segmentations by the same rater. (d) The probabilistic true segmentation as estimated by STAPLE.

**Fig. 8**
Segmentation of a brain tumor from MRI. Three experts carried out segmentation of the brain tumor, and STAPLE was used to estimate the true segmentation of the tumor. The performance level assessment of each of the three raters and of a semi-automatic tumor segmentation program, based upon the estimated true segmentation, is shown in (c). The estimated sensitivity, specificity, and tumor predictive value are reported. The performance assessment indicates that the program is performing with higher sensitivity than one rater but with lower sensitivity than the other raters, while exhibiting higher specificity than two of the raters and lower specificity than one of the raters. This illustrates that STAPLE can be used to evaluate the performance of segmentation algorithms through comparison to rater segmentations. (a) Estimated true segmentation. (b) Frequency of manual segmentation. (c) Performance level assessment.
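
One plausible way to compute the three quantities named in this caption from a probabilistic reference is sketched below. It assumes `W` holds the estimated per-voxel probability of tumor and `d` is one rater's or program's binary segmentation. The soft definition of predictive value used here (expected fraction of selected voxels that are truly foreground) is an assumption, since the caption does not give the paper's formula, and the function name `performance` is hypothetical.

```python
import numpy as np

def performance(W, d):
    """Sensitivity, specificity, and positive predictive value of a binary
    segmentation d against a probabilistic reference W (values in [0, 1]).

    Assumes W is neither all zeros nor all ones and that d selects at
    least one voxel, so no denominator is zero.
    """
    W = np.asarray(W, dtype=float).ravel()
    d = np.asarray(d, dtype=float).ravel()
    expected_tp = (W * d).sum()                   # expected true positives
    expected_tn = ((1.0 - W) * (1.0 - d)).sum()   # expected true negatives
    sensitivity = expected_tp / W.sum()
    specificity = expected_tn / (1.0 - W).sum()
    ppv = expected_tp / d.sum()
    return sensitivity, specificity, ppv
```

When `W` is thresholded to 0/1 before the call, these reduce to the usual confusion-matrix definitions.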

**Fig. 9**
STAPLE average wallclock time per iteration as a function of the number of raters and as a function of the number of voxels. (a) Average time per iteration versus number of raters. (b) Average time per iteration versus number of voxels.
