Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance

George Hripcsak et al. J Am Med Inform Assoc. 2002 Jan-Feb;9(1):1-15. doi: 10.1136/jamia.2002.0090001.

Abstract

Medical informatics systems are often designed to perform at the level of human experts. Evaluation of the performance of these systems is often constrained by lack of reference standards, either because the appropriate response is not known or because no simple appropriate response exists. Even when performance can be assessed, it is not always clear whether the performance is sufficient or reasonable. These challenges can be addressed if an evaluator enlists the help of clinical domain experts. 1) The experts can carry out the same tasks as the system, and then their responses can be combined to generate a reference standard. 2) The experts can judge the appropriateness of system output directly. 3) The experts can serve as comparison subjects with whom the system can be compared. These are separate roles that have different implications for study design, metrics, and issues of reliability and validity. Diagrams help delineate the roles of experts in complex study designs.

Figures

Figure 1
Experts generate a reference standard. Experts generate responses, which are combined by the evaluator to form a reference standard. The system also generates responses, which are compared by the evaluator with the reference standard to derive performance. It is assumed that tasks are simple enough for responses to be combined and compared unambiguously. (Note: Rounded rectangles indicate tasks, observations, or measurements; ovals indicate actions by the system or experts; and diamonds indicate actions that require no domain expertise, such as simple tallying.)
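
To make the combining and scoring steps concrete, the following is a minimal Python sketch of this design; the expert names, responses, majority-vote rule, and accuracy metric are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Hypothetical binary task: three experts each label the same five cases,
# and a majority vote over the expert labels forms the reference standard.
expert_labels = {
    "expert_a": [1, 0, 1, 1, 0],
    "expert_b": [1, 0, 0, 1, 0],
    "expert_c": [1, 1, 1, 1, 0],
}
system_labels = [1, 0, 1, 0, 0]

def majority_vote(labels):
    """Combine the experts' responses for one case into a single reference label."""
    return Counter(labels).most_common(1)[0][0]

n_cases = len(system_labels)
reference = [
    majority_vote([expert_labels[e][i] for e in expert_labels])
    for i in range(n_cases)
]

# Compare the system's responses with the reference standard (accuracy here,
# but any metric appropriate to the task could be tallied the same way).
accuracy = sum(s == r for s, r in zip(system_labels, reference)) / n_cases
print("Reference standard:", reference)
print(f"System accuracy vs. reference: {accuracy:.2f}")
```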
Figure 2
Experts judge system responses. Experts judge the appropriateness of responses generated by the system, and performance is calculated. (See note to Figure 1.)
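
A sketch of how performance might be tallied in this design, assuming hypothetically that several judges rate each system response and that a response counts as appropriate when a majority of judges approve it:

```python
# Hypothetical judgments: each row is one system response,
# each value is one judge's verdict (True = appropriate).
judgments = [
    [True, True, False],
    [True, True, True],
    [False, False, True],
    [True, False, True],
]

# A response is counted as appropriate if a majority of judges approve it;
# performance is the fraction of system responses judged appropriate.
appropriate = [sum(votes) > len(votes) / 2 for votes in judgments]
performance = sum(appropriate) / len(appropriate)
print(f"Fraction of system responses judged appropriate: {performance:.2f}")
```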
Figure 3
Experts judge system responses using comparison responses. Experts judge the correctness of system responses using comparison responses that they generate by a consensus process. The response-generating experts and the judging experts may be the same or different. This scenario differs from that represented by Figure 1 because the tasks are assumed to be more complex and therefore require expert judgment to determine appropriateness. The comparison responses do not constitute a reference standard in the sense of a single preferred response per task. Instead, they serve as a reference that can be overridden by the judgment of the experts. (See note to Figure 1.)
Figure 4
Experts serve as comparison subjects for interpreting performance. Experts serve as comparison subjects when an external reference standard is available. The responses of both the system and the experts are compared with the external reference standard, and performance is calculated for each. The performance of the system is then compared with that of the experts to determine whether the system performance is adequate or reasonable. (See note to Figure 1.)
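
A sketch of this comparison, assuming a hypothetical binary task, an external reference standard, and simple accuracy as the performance metric:

```python
# Hypothetical external reference standard (e.g., confirmed outcomes), with
# binary responses from the system and from one comparison expert.
external_reference = [1, 0, 1, 1, 0, 1]
system_responses   = [1, 0, 0, 1, 0, 1]
expert_responses   = [1, 0, 1, 0, 0, 1]

def accuracy(responses, reference):
    """Fraction of responses that agree with the external reference standard."""
    return sum(r == ref for r, ref in zip(responses, reference)) / len(reference)

# Performance is calculated separately for the system and for the expert, and
# the system's score is interpreted relative to the expert's score rather than
# against an absolute threshold.
system_perf = accuracy(system_responses, external_reference)
expert_perf = accuracy(expert_responses, external_reference)
print(f"System performance: {system_perf:.2f}")
print(f"Expert performance: {expert_perf:.2f}")
```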
Figure 5
Experts serve as comparison subjects without a reference standard. Lacking a reference standard, the system responses are compared directly with the responses of the experts, resulting in a measure of similarity rather than of performance. This design differs from the design shown in Figure 1, because the experts' responses are not combined and no reference standard is claimed. (See note to Figure 1.)
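
A sketch of such a similarity measure, using raw agreement and Cohen's kappa on hypothetical binary responses (the data and the choice of kappa are illustrative, not prescribed by the paper):

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two sets of binary responses."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical responses; neither set is treated as correct, so the result is
# a measure of similarity (agreement), not of performance.
system_responses = [1, 0, 1, 1, 0, 1, 0]
expert_responses = [1, 0, 0, 1, 0, 1, 1]

raw = sum(s == e for s, e in zip(system_responses, expert_responses)) / len(system_responses)
print(f"Raw agreement: {raw:.2f}")
print(f"Cohen's kappa: {cohens_kappa(system_responses, expert_responses):.2f}")
```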
Figure 6
Experts generate a reference standard and serve as comparison subjects. Experts generate responses, which are combined into a reference standard (middle column). The system responses (left column) and the uncombined responses of the experts (right column) are compared with the reference standard, resulting in estimates of system and expert performance. The performance of the system is compared with that of the experts to determine whether the system performance is adequate or reasonable. The two ovals labeled "Experts generate" may represent two groups of experts or the same experts. In the latter case, the experts generate one set of responses, but to avoid bias, the responses of a given expert are compared only with the combined responses of the other experts, as described in the text. (See note to Figure 1.)
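
A sketch of the bias-avoiding comparison just described, in which the system is scored against the standard combined from all experts while each expert is scored only against the other experts' combined responses (the data and the majority-vote rule are hypothetical):

```python
from collections import Counter

# Hypothetical binary responses from three experts and the system on five cases.
expert_labels = {
    "expert_a": [1, 0, 1, 1, 0],
    "expert_b": [1, 0, 0, 1, 0],
    "expert_c": [1, 1, 1, 1, 1],
}
system_labels = [1, 0, 1, 1, 0]

def majority_vote(labels):
    # Ties are broken by first-seen value in this toy sketch; a real study
    # would specify a tie-breaking or adjudication rule.
    return Counter(labels).most_common(1)[0][0]

def accuracy(responses, reference):
    return sum(r == ref for r, ref in zip(responses, reference)) / len(reference)

n_cases = len(system_labels)

# The system is scored against the reference standard combined from all experts.
full_reference = [majority_vote([v[i] for v in expert_labels.values()])
                  for i in range(n_cases)]
print("system  :", accuracy(system_labels, full_reference))

# Each expert is scored only against the combined responses of the *other*
# experts, so no expert is judged against a standard containing their own answers.
for name, labels in expert_labels.items():
    others = [v for k, v in expert_labels.items() if k != name]
    loo_reference = [majority_vote([v[i] for v in others]) for i in range(n_cases)]
    print(f"{name}:", accuracy(labels, loo_reference))
```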
Figure 7
Experts serve as judges and as comparison subjects. The system and the experts generate responses, which are then judged by experts. The performance metrics for the system and for the response-generating experts are calculated and then compared, to determine whether system performance is adequate or reasonable. The two ovals labeled "Experts judge" indicate the same experts, and the experts are blinded to which responses (i.e., those of the system or those of the expert) they are judging. The experts who generate responses may be the same as the judging experts if bias is avoided (i.e., if judgments on their own responses are not included in the performance estimates). (See note to Figure 1.)
Figure 8
Generating a reference standard from the pooled responses of the system and the experts. The system (left column) and experts (right column) generate responses, which are pooled and judged by experts (middle column). The responses that are judged to be appropriate constitute the reference standard, which is then used to estimate the performance of the system and of the response-generating experts. The performance of the system is compared with that of the experts to determine whether the system performance is adequate or reasonable. Again, the same experts may generate responses and judge responses if bias is avoided. (See note to Figure 1.)
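
A sketch of this pooled-judging design, assuming a hypothetical finding-enumeration task and using sensitivity and positive predictive value as the metrics (the finding names and judgments are invented for illustration):

```python
# Hypothetical task in which responses are not predetermined: the system and
# the experts each propose a set of findings for a case, the pooled proposals
# are judged, and the findings judged appropriate form the reference standard.
system_findings = {"pneumonia", "effusion", "cardiomegaly"}
expert_findings = {"pneumonia", "effusion", "atelectasis"}

# Pool of unique responses sent to the judging experts.
pooled = system_findings | expert_findings

# Assumed judges' verdict: everything except "cardiomegaly" is appropriate.
judged_inappropriate = {"cardiomegaly"}
reference_standard = pooled - judged_inappropriate

def sensitivity_and_ppv(findings, reference):
    """Share of the reference standard that was found, and share of the
    proposed findings that are in the reference standard."""
    true_positives = len(findings & reference)
    return true_positives / len(reference), true_positives / len(findings)

for name, findings in (("system", system_findings), ("experts", expert_findings)):
    sens, ppv = sensitivity_and_ppv(findings, reference_standard)
    print(f"{name}: sensitivity={sens:.2f}, positive predictive value={ppv:.2f}")
```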
