Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance

George Hripcsak et al. J Am Med Inform Assoc. 2002 Jan-Feb;9(1):1-15. doi: 10.1136/jamia.2002.0090001.

Abstract

Medical informatics systems are often designed to perform at the level of human experts. Evaluation of the performance of these systems is often constrained by lack of reference standards, either because the appropriate response is not known or because no simple appropriate response exists. Even when performance can be assessed, it is not always clear whether the performance is sufficient or reasonable. These challenges can be addressed if an evaluator enlists the help of clinical domain experts. 1) The experts can carry out the same tasks as the system, and then their responses can be combined to generate a reference standard. 2) The experts can judge the appropriateness of system output directly. 3) The experts can serve as comparison subjects with whom the system can be compared. These are separate roles that have different implications for study design, metrics, and issues of reliability and validity. Diagrams help delineate the roles of experts in complex study designs.

Figures

Figure 1
Experts generate a reference standard. Experts generate responses, which are combined by the evaluator to form a reference standard. The system also generates responses, which are compared by the evaluator with the reference standard to derive performance. It is assumed that tasks are simple enough for responses to be combined and compared unambiguously. (Note: Rounded rectangles indicate tasks, observations, or measurements; ovals indicate actions by the system or experts; and diamonds indicate actions that require no domain expertise, such as simple tallying.)
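
To make the combining and scoring steps concrete, the following is a minimal Python sketch of this design; the expert names, responses, majority-vote rule, and accuracy metric are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Hypothetical binary task: three experts each label the same five cases,
# and a majority vote over the expert labels forms the reference standard.
expert_labels = {
    "expert_a": [1, 0, 1, 1, 0],
    "expert_b": [1, 0, 0, 1, 0],
    "expert_c": [1, 1, 1, 1, 0],
}
system_labels = [1, 0, 1, 0, 0]

def majority_vote(labels):
    """Combine the experts' responses for one case into a single reference label."""
    return Counter(labels).most_common(1)[0][0]

n_cases = len(system_labels)
reference = [
    majority_vote([expert_labels[e][i] for e in expert_labels])
    for i in range(n_cases)
]

# Compare the system's responses with the reference standard (accuracy here,
# but any metric appropriate to the task could be tallied the same way).
accuracy = sum(s == r for s, r in zip(system_labels, reference)) / n_cases
print("Reference standard:", reference)
print(f"System accuracy vs. reference: {accuracy:.2f}")
```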
Figure 2
Experts judge system responses. Experts judge the appropriateness of responses generated by the system, and performance is calculated. (See note to Figure 1.)
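
A sketch of how performance might be tallied in this design, assuming hypothetically that several judges rate each system response and that a response counts as appropriate when a majority of judges approve it:

```python
# Hypothetical judgments: each row is one system response,
# each value is one judge's verdict (True = appropriate).
judgments = [
    [True, True, False],
    [True, True, True],
    [False, False, True],
    [True, False, True],
]

# A response is counted as appropriate if a majority of judges approve it;
# performance is the fraction of system responses judged appropriate.
appropriate = [sum(votes) > len(votes) / 2 for votes in judgments]
performance = sum(appropriate) / len(appropriate)
print(f"Fraction of system responses judged appropriate: {performance:.2f}")
```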
Figure 3
Experts judge system responses using comparison responses. Experts judge the correctness of system responses using comparison responses that they generate by a consensus process. The response-generating experts and the judging experts may be the same or different. This scenario differs from that represented by Figure 1 because the tasks are assumed to be more complex and therefore require expert judgment to determine appropriateness. The comparison responses do not constitute a reference standard in the sense of a single preferred response per task. Instead, they serve as a reference that can be overridden by the judgment of the experts. (See note to Figure 1.)
Figure 4
Experts serve as comparison subjects for interpreting performance. Experts serve as comparison subjects when an external reference standard is available. The responses of both the system and the experts are compared with the external reference standard, and performance is calculated for each. The performance of the system is then compared with that of the experts to determine whether the system performance is adequate or reasonable. (See note to Figure 1.)
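
A sketch of this comparison, assuming a hypothetical binary task, an external reference standard, and simple accuracy as the performance metric:

```python
# Hypothetical external reference standard (e.g., confirmed outcomes), with
# binary responses from the system and from one comparison expert.
external_reference = [1, 0, 1, 1, 0, 1]
system_responses   = [1, 0, 0, 1, 0, 1]
expert_responses   = [1, 0, 1, 0, 0, 1]

def accuracy(responses, reference):
    """Fraction of responses that agree with the external reference standard."""
    return sum(r == ref for r, ref in zip(responses, reference)) / len(reference)

# Performance is calculated separately for the system and for the expert, and
# the system's score is interpreted relative to the expert's score rather than
# against an absolute threshold.
system_perf = accuracy(system_responses, external_reference)
expert_perf = accuracy(expert_responses, external_reference)
print(f"System performance: {system_perf:.2f}")
print(f"Expert performance: {expert_perf:.2f}")
```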
Figure 5
Experts serve as comparison subjects without a reference standard. Lacking a reference standard, the system responses are compared directly with the responses of the experts, resulting in a measure of similarity rather than of performance. This design differs from the design shown in Figure 1, because the experts' responses are not combined and no reference standard is claimed. (See note to Figure 1.)
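
A sketch of such a similarity measure, using raw agreement and Cohen's kappa on hypothetical binary responses (the data and the choice of kappa are illustrative, not prescribed by the paper):

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two sets of binary responses."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical responses; neither set is treated as correct, so the result is
# a measure of similarity (agreement), not of performance.
system_responses = [1, 0, 1, 1, 0, 1, 0]
expert_responses = [1, 0, 0, 1, 0, 1, 1]

raw = sum(s == e for s, e in zip(system_responses, expert_responses)) / len(system_responses)
print(f"Raw agreement: {raw:.2f}")
print(f"Cohen's kappa: {cohens_kappa(system_responses, expert_responses):.2f}")
```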
Figure 6
Experts generate a reference standard and serve as comparison subjects. Experts generate responses, which are combined into a reference standard (middle column). The system responses (left column) and the uncombined responses of the experts (right column) are compared with the reference standard, resulting in estimates of system and expert performance. The performance of the system is compared with that of the experts to determine whether the system performance is adequate or reasonable. The two ovals labeled "Experts generate" may represent two groups of experts or the same experts. In the latter case, the experts generate one set of responses, but to avoid bias, the responses of a given expert are compared only with the combined responses of the other experts, as described in the text. (See note to Figure 1.)
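
A sketch of the bias-avoiding comparison just described, in which the system is scored against the standard combined from all experts while each expert is scored only against the other experts' combined responses (the data and the majority-vote rule are hypothetical):

```python
from collections import Counter

# Hypothetical binary responses from three experts and the system on five cases.
expert_labels = {
    "expert_a": [1, 0, 1, 1, 0],
    "expert_b": [1, 0, 0, 1, 0],
    "expert_c": [1, 1, 1, 1, 1],
}
system_labels = [1, 0, 1, 1, 0]

def majority_vote(labels):
    # Ties are broken by first-seen value in this toy sketch; a real study
    # would specify a tie-breaking or adjudication rule.
    return Counter(labels).most_common(1)[0][0]

def accuracy(responses, reference):
    return sum(r == ref for r, ref in zip(responses, reference)) / len(reference)

n_cases = len(system_labels)

# The system is scored against the reference standard combined from all experts.
full_reference = [majority_vote([v[i] for v in expert_labels.values()])
                  for i in range(n_cases)]
print("system  :", accuracy(system_labels, full_reference))

# Each expert is scored only against the combined responses of the *other*
# experts, so no expert is judged against a standard containing their own answers.
for name, labels in expert_labels.items():
    others = [v for k, v in expert_labels.items() if k != name]
    loo_reference = [majority_vote([v[i] for v in others]) for i in range(n_cases)]
    print(f"{name}:", accuracy(labels, loo_reference))
```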
Figure 7
Experts serve as judges and as comparison subjects. The system and the experts generate responses, which are then judged by experts. The performance metrics for the system and for the response-generating experts are calculated and then compared, to determine whether system performance is adequate or reasonable. The two ovals labeled "Experts judge" indicate the same experts, and the experts are blinded to which responses (i.e., those of the system or those of the expert) they are judging. The experts who generate responses may be the same as the judging experts if bias is avoided (i.e., if judgments on their own responses are not included in the performance estimates). (See note to Figure 1.)
Figure 8
Generating a reference standard from the pooled responses of the system and the experts. The system (left column) and experts (right column) generate responses, which are pooled and judged by experts (middle column). The responses that are judged to be appropriate constitute the reference standard, which is then used to estimate the performance of the system and of the response-generating experts. The performance of the system is compared with that of the experts to determine whether the system performance is adequate or reasonable. Again, the same experts may generate responses and judge responses if bias is avoided. (See note to Figure 1.)
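
A sketch of this pooled-judging design, assuming a hypothetical finding-enumeration task and using sensitivity and positive predictive value as the metrics (the finding names and judgments are invented for illustration):

```python
# Hypothetical task in which responses are not predetermined: the system and
# the experts each propose a set of findings for a case, the pooled proposals
# are judged, and the findings judged appropriate form the reference standard.
system_findings = {"pneumonia", "effusion", "cardiomegaly"}
expert_findings = {"pneumonia", "effusion", "atelectasis"}

# Pool of unique responses sent to the judging experts.
pooled = system_findings | expert_findings

# Assumed judges' verdict: everything except "cardiomegaly" is appropriate.
judged_inappropriate = {"cardiomegaly"}
reference_standard = pooled - judged_inappropriate

def sensitivity_and_ppv(findings, reference):
    """Share of the reference standard that was found, and share of the
    proposed findings that are in the reference standard."""
    true_positives = len(findings & reference)
    return true_positives / len(reference), true_positives / len(findings)

for name, findings in (("system", system_findings), ("experts", expert_findings)):
    sens, ppv = sensitivity_and_ppv(findings, reference_standard)
    print(f"{name}: sensitivity={sens:.2f}, positive predictive value={ppv:.2f}")
```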
