Comparative Study

Eur Radiol. 2024 Jul;34(7):4494-4503.

doi: 10.1007/s00330-023-10493-7. Epub 2024 Jan 2.

How subjective CT image quality assessment becomes surprisingly reliable: pairwise comparisons instead of Likert scale

Eva J I Hoeijmakers¹, Bibi Martens²,³, Babs M F Hendriks²,³, Casper Mihl²,³, Razvan L Miclea², Walter H Backes²,⁴, Joachim E Wildberger²,³, Frank M Zijta², Hester A Gietema²,⁵, Patricia J Nelemans⁶, Cécile R L P N Jeukens²

Affiliations

¹ Department of Radiology and Nuclear Medicine, Maastricht University Medical Centre+, P. Debyelaan 25, Maastricht, 6229 HX, The Netherlands. evie.hoeijmakers@mumc.nl.
² Department of Radiology and Nuclear Medicine, Maastricht University Medical Centre+, P. Debyelaan 25, Maastricht, 6229 HX, The Netherlands.
³ CARIM School for Cardiovascular Diseases, Maastricht University, Universiteitssingel 50, Maastricht, 6229 ER, The Netherlands.
⁴ Department of Neurology and School for Mental health and Neuroscience (MheNs), Maastricht University Medical Centre+, P. Debyelaan 25, Maastricht, 6229 HX, The Netherlands.
⁵ GROW School for Oncology and Reproduction, Maastricht University, Universiteitssingel 50, Maastricht, 6229 ER, The Netherlands.
⁶ Department of Epidemiology, Maastricht University, Universiteitssingel 50, Maastricht, 6229 ER, The Netherlands.

PMID: 38165429
PMCID: PMC11213789
DOI: 10.1007/s00330-023-10493-7

Eva J I Hoeijmakers et al. Eur Radiol. 2024 Jul.

Abstract

Objectives: This study aimed to improve the reliability of subjective image quality (IQ) assessment of abdominal CT scans by using a pairwise comparison (PC) method instead of a Likert scale.

Methods: Abdominal CT scans (single center) were retrospectively selected between September 2019 and February 2020 in a prior study. Variation in IQ across the sample was obtained by adding artificial noise with dedicated reconstruction software, including reconstructions with filtered backprojection and varying iterative reconstruction strengths. Two datasets (n = 50 each) were composed, one with higher and one with lower IQ variation; the 25 original scans were part of both datasets. Using in-house developed software, six observers (five radiologists, one resident) rated both datasets with both the PC method, which forces observers to choose the preferred scan from each pair and thereby yields a ranking, and a 5-point Likert scale. The PC method was optimized with a sorting algorithm to minimize the number of comparisons needed. Inter- and intraobserver agreement was assessed for both methods with the intraclass correlation coefficient (ICC).
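
As a rough illustration of how a sorting algorithm can turn forced pairwise choices into a ranking with few comparisons, the sketch below uses binary insertion. This is an assumption made for illustration: the abstract does not say which sorting algorithm the in-house software uses. The names `rank_by_pairwise_comparison`, `simulated_observer`, and `simulated_noise` are hypothetical, and the noise values are invented stand-ins for an observer's preference.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_pairwise_comparison(
    items: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> Tuple[List[str], int]:
    """Rank items from most to least preferred using binary insertion.

    prefers(a, b) must return True when the observer prefers a over b.
    Returns the ranking and the number of pairs the observer had to judge.
    """
    ranking: List[str] = []
    comparisons = 0
    for item in items:
        lo, hi = 0, len(ranking)
        # Binary search for the position of `item` in the current ranking.
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if prefers(item, ranking[mid]):
                hi = mid       # item belongs above ranking[mid]
            else:
                lo = mid + 1   # item belongs below ranking[mid]
        ranking.insert(lo, item)
    return ranking, comparisons


# Hypothetical stand-in for an observer: the scan with less noise is preferred.
simulated_noise = {"scan_a": 12.0, "scan_b": 30.0, "scan_c": 18.0, "scan_d": 25.0}


def simulated_observer(a: str, b: str) -> bool:
    return simulated_noise[a] < simulated_noise[b]


ranking, n_comparisons = rank_by_pairwise_comparison(
    list(simulated_noise), simulated_observer
)
```

With four scans this asks for 5 judgments instead of the 6 needed to show every pair. The saving grows with dataset size: showing every pair of 50 scans would take 50 × 49 / 2 = 1225 judgments, whereas a comparison sort needs on the order of n log2 n. A real observer may answer inconsistently, which this sketch does not handle.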

Results: Twenty-five patients (mean age 61 ± 15.5 years; 56% men) were evaluated. The ICC for interobserver agreement for the high-variation dataset increased from 0.665 (95% CI 0.396-0.814) to 0.785 (95% CI 0.676-0.867) when the PC method was used instead of a Likert scale. For the low-variation dataset, the ICC increased from 0.276 (95% CI 0.034-0.500) to 0.562 (95% CI 0.337-0.729). Intraobserver agreement increased for four out of six observers.
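
For readers unfamiliar with the ICC, the following is a minimal sketch of one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater). The abstract does not state which ICC form the authors used, so this choice is an assumption, and the function name `icc_2_1` and the example ranks are hypothetical. Confidence intervals are not computed here.

```python
from typing import List


def icc_2_1(ratings: List[List[float]]) -> float:
    """ICC(2,1) from a scans-by-observers table of scores or ranks.

    ratings[i][j] is the value given to scan i by observer j.
    """
    n = len(ratings)       # number of scans
    k = len(ratings[0])    # number of observers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares: scans, observers, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ranks of five scans from three observers (illustration only).
example_ranks = [
    [1, 1, 2],
    [2, 3, 1],
    [3, 2, 3],
    [4, 5, 4],
    [5, 4, 5],
]
example_icc = icc_2_1(example_ranks)
```

Values near 1 mean that observers order or score the scans almost identically; values near 0 mean that differences between observers are about as large as differences between scans.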

Conclusion: The PC method is more reliable than the Likert scale for subjective IQ assessment, as indicated by improved inter- and intraobserver agreement.

Clinical relevance statement: This study shows that the pairwise comparison method is more reliable than a Likert scale for subjective image quality assessment. Improved reliability is of key importance for optimization studies, validation of automatic image quality assessment algorithms, and training of AI algorithms.

Key points:
• Subjective assessment of diagnostic image quality via Likert scale has limited reliability.
• A pairwise comparison method improves the inter- and intraobserver agreement.
• The pairwise comparison method is more reliable for CT optimization studies.

Keywords: Computed tomography (X-ray); Interobserver variability; Intraobserver variability.

Conflict of interest statement

The authors of this manuscript declare relationships with the following companies:

Bibi Martens, Babs MF Hendriks and Casper Mihl receive speakers’ fees from Bayer, all outside the submitted work. Joachim E Wildberger reports institutional grants from Bard, Bayer, Boston, Brainlab, GE, Philips, and Siemens and speakers’ fees from Bayer and Siemens, all outside the submitted work.

Figures

**Fig. 1**
Flow diagram with selection of the high- and low-variation datasets

**Fig. 2**
User interface of the in-house developed software for (a) the pairwise comparison method and (b) the Likert scale method. (c) Examples of 3 CT images ranked by an observer using the pairwise comparison method, with corresponding Likert scores

**Fig. 3**
Boxplots representing the spread in ranks of the image quality assessment by all six observers using the pairwise comparison method. The boxes are sorted by the median of the six ranks given by the observers and colored by the median Likert score given by the observers. Results are given for the (a) high- and (b) low-variation datasets

**Fig. 4**
Confusion matrices of random combinations of two observers for the Likert scores of the high-variation dataset (a–d) and low-variation dataset (e–h). The horizontal and vertical axes give the Likert scores given by the observers. A diagonal matrix would indicate complete agreement

**Fig. 5**
The agreement between two assessments using the pairwise comparison method for every single observer. For each scan, two dots are shown, corresponding to the ranking in assessments 1 and 2. The color represents the mean Likert score given by the observer in the two assessments. The scans are sorted by the median rank. As an example, for observer 5, nearly every scan received a Likert score of 4 (purple), even though the images consistently had distinct rankings in both assessments

**Fig. 6**
Confusion matrices of the repeated assessment for every single observer (intraobserver agreement). The horizontal (upper) and vertical axes give the Likert scores given by the observers. A diagonal matrix would indicate total agreement

Comment in

Quantifying image quality: are we approaching the grail?
Yamada A. Eur Radiol. 2024 Jul;34(7):4492-4493. doi: 10.1007/s00330-023-10563-w. Epub 2024 Jan 4. PMID: 38175224. No abstract available.
