JMIR Form Res. 2024 May 31:8:e49907.
doi: 10.2196/49907.

Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics

András Meczner et al. JMIR Form Res.

Abstract

Background: The rapid growth of web-based symptom checkers (SCs) is not matched by advances in quality assurance. Currently, there are no widely accepted criteria for assessing SCs' performance. Vignette studies are widely used to evaluate SCs by measuring the accuracy of outcome. Accuracy behaves as a composite metric because it is affected by a number of individual SC- and tester-dependent factors. In contrast to clinical studies, vignette studies have a small number of testers. Hence, measuring accuracy alone in vignette studies may not provide a reliable assessment of performance due to tester variability.

Objective: This study aims to investigate the impact of tester variability on the accuracy of outcome of SCs, using clinical vignettes. It further aims to investigate the feasibility of measuring isolated aspects of performance.

Methods: Healthily's SC was assessed using 114 vignettes by 3 groups of 3 testers who processed vignettes with different instructions: free interpretation of vignettes (free testers), specified chief complaints (partially free testers), and specified chief complaints with strict instruction for answering additional symptoms (restricted testers). κ statistics were calculated to assess agreement of top outcome condition and recommended triage. Crude and adjusted accuracy was measured against a gold standard. Adjusted accuracy was calculated using only results of consultations identical to the vignette, following a review and selection process. A feasibility study for assessing symptom comprehension of SCs was performed using different variations of 51 chief complaints across 3 SCs.
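The agreement measure described in the Methods can be illustrated with a small sketch. The abstract says only that κ statistics were calculated, so the choice of Fleiss' κ here (a common option when more than 2 raters label the same items) is an assumption, as are the function name `fleiss_kappa` and the toy labels, which are invented for illustration and are not study data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated once by every tester.

    ratings: list of per-vignette lists, one categorical label per tester.
    Assumes every vignette has the same number of testers (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # For each vignette, count how many testers chose each category.
    counts = [Counter(row) for row in ratings]

    # Observed agreement per vignette, then averaged over vignettes.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical top-condition labels from 3 testers on 4 vignettes.
example = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "A", "A"],
    ["A", "B", "B"],
]
kappa = fleiss_kappa(example)
print(round(kappa, 3))
```

The same function could be applied once to the top outcome condition and once to the recommended triage level, giving one κ per tester group for each outcome.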

Results: Intertester agreement of most likely condition and triage was, respectively, 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. For the restricted group, accuracy ranged from 43.9% to 57% for individual testers, averaging 50.6% (SD 5.35%). Adjusted accuracy was 56.1%. Assessing symptom comprehension was feasible for all 3 SCs. Comprehension scores ranged from 52.9% to 68%.
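To make the distinction between crude and adjusted accuracy concrete, here is a minimal sketch under stated assumptions. The abstract does not say how consultations were pooled across testers or whether the reported SD is a sample or population SD. This sketch pools all retained consultations and uses the sample SD. The `Consultation` type, the function names, and every value below are hypothetical and are not study data.

```python
from statistics import mean, stdev
from typing import NamedTuple


class Consultation(NamedTuple):
    correct: bool                 # top outcome matched the gold standard
    identical_to_vignette: bool   # reviewers judged the inputs identical


def crude_accuracy(results):
    """Share of all consultations with a correct outcome."""
    return sum(r.correct for r in results) / len(results)


def adjusted_accuracy(results):
    """Accuracy over consultations judged identical to the vignette only."""
    kept = [r for r in results if r.identical_to_vignette]
    if not kept:
        raise ValueError("no consultation was identical to its vignette")
    return sum(r.correct for r in kept) / len(kept)


# Hypothetical results for 3 testers on 4 vignettes each.
testers = [
    [Consultation(True, True), Consultation(False, True),
     Consultation(True, False), Consultation(False, False)],
    [Consultation(True, True), Consultation(True, True),
     Consultation(False, False), Consultation(True, True)],
    [Consultation(False, True), Consultation(False, False),
     Consultation(True, True), Consultation(False, True)],
]

per_tester = [crude_accuracy(t) for t in testers]
mean_accuracy = mean(per_tester)
sd_accuracy = stdev(per_tester)  # sample SD (assumption)

# Pool every tester's consultations before filtering (assumption).
pooled = [r for t in testers for r in t]
adjusted = adjusted_accuracy(pooled)

print(per_tester, mean_accuracy, sd_accuracy, adjusted)
```

In this toy data the spread of per-tester crude accuracy reflects tester variability, while the adjusted figure depends only on consultations whose inputs matched the vignette.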

Conclusions: We demonstrated that by improving standardization of the vignette testing process, there is a significant improvement in the agreement of outcome between testers. However, significant variability remained due to uncontrollable tester-dependent factors, reflected by varying outcome accuracy. Tester-dependent factors, combined with a small number of testers, limit the reliability and generalizability of outcome accuracy when used as a composite measure in vignette studies. Measuring and reporting different aspects of SC performance in isolation provides a more reliable assessment of SC performance. We developed an adjusted accuracy measure using a review and selection process to assess data algorithm quality. In addition, we demonstrated that symptom comprehension with different input methods can be feasibly compared. Future studies reporting accuracy need to apply vignette testing standardization and isolated metrics.

Keywords: accuracy; evaluation; methods; metrics; mobile phone; performance; symptom checker; triage; variability; vignette; vignette studies.

Conflict of interest statement

Conflicts of Interest: AM, NC, AQ, MR, SS, EB, and TM are all current or ex-employees and shareholders of Healthily. Healthily funded the research.
