eLife. 2024 Nov 28;13:RP96017. doi: 10.7554/eLife.96017

A deep learning approach for automated scoring of the Rey-Osterrieth complex figure

Nicolas Langer et al.

Abstract

Memory deficits are a hallmark of many different neurological and psychiatric conditions. The Rey-Osterrieth complex figure (ROCF) is the state-of-the-art assessment tool for neuropsychologists across the globe to assess the degree of non-verbal visual memory deterioration. To obtain a score, a trained clinician inspects a patient's ROCF drawing and quantifies deviations from the original figure. This manual procedure is time-consuming, and scores vary depending on the clinician's experience, motivation, and tiredness. Here, we leverage novel deep learning architectures to automate the rating of memory deficits. For this, we collected more than 20,000 hand-drawn ROCF drawings from patients with various neurological and psychiatric disorders as well as healthy participants. Unbiased ground truth ROCF scores were obtained from crowdsourced human intelligence. This dataset was used to train and evaluate a multihead convolutional neural network. The model is highly unbiased: its predictions were very close to the ground truth, and its errors were symmetrically distributed around zero. The neural network outperforms both online raters and clinicians. The scoring system can reliably identify and accurately score individual figure elements in previously unseen ROCF drawings, which facilitates the explainability of the AI scoring system. To ensure generalizability and clinical utility, the model's performance was successfully replicated in a large independent prospective validation study that was pre-registered prior to data collection. Our AI-powered scoring system provides healthcare institutions worldwide with a digital tool to assess performance on the ROCF test from hand-drawn images objectively, reliably, and time-efficiently.

Keywords: ROCF; Rey–Osterrieth complex figure; crowdsourced human intelligence; deep learning; memory deficit; neuropsychology; neuroscience; non-verbal visual memory.


Conflict of interest statement

NL, MW, BH, DS, LW, AP, JH, SM, CS, MT, JL, DR, FS, QZ, RL, FW, OJ, PB, TZ, RL, CZ No competing interests declared

Figures

Figure 1. Overview of retrospective dataset.
(A) Rey–Osterrieth complex figure (ROCF) figure with 18 elements. (B) Demographics of the participants and clinical population of the retrospective dataset. (C) Examples of hand-drawn ROCF images. (D) The pie chart illustrates the proportion of the different clinical conditions of the retrospective dataset. (E) Performance in the copy and (immediate) recall condition across the lifespan in the retrospective dataset. (F) Distribution of the number of images for each total score (online raters).
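The total scores whose distribution panel (F) shows are sums over the 18 elements in panel (A). As an illustrative sketch only (assuming the standard Osterrieth convention of scoring each element from 0 to 2 in half-point steps, giving a 0–36 total; this is not the authors' code):

```python
# Illustrative sketch: the ROCF total score as the sum of 18 per-element
# scores, each on the standard 0-2 half-point scale (total range 0-36).
VALID_ITEM_SCORES = {0.0, 0.5, 1.0, 1.5, 2.0}

def total_score(item_scores):
    """Sum 18 element scores into the 0-36 total."""
    if len(item_scores) != 18:
        raise ValueError("the ROCF has exactly 18 scored elements")
    if any(s not in VALID_ITEM_SCORES for s in item_scores):
        raise ValueError("each element is scored 0-2 in half-point steps")
    return sum(item_scores)

print(total_score([2.0] * 18))  # a perfect drawing scores 36.0
```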
Figure 1—figure supplement 1. Original scoring system according to Osterrieth.
Figure 1—figure supplement 2. World maps depict the worldwide distribution of the origin of the data.
(A) Retrospective data. (B) Prospective data.
Figure 1—figure supplement 3. The graphical user interface of the crowdsourcing application.
Figure 1—figure supplement 4. Overview of prospective dataset.
(A) Demographics of the participants of the prospectively collected data. (B) Performance in the copy and (immediate) recall condition across the lifespan in the prospectively collected data. (C) Distribution of number of images for each total score for the prospectively collected data.
Figure 1—figure supplement 5. The user interface for the tablet- (and smartphone-) based application.
The application enables explainability by providing a score for each individual item. Furthermore, the total score is displayed. The user can also compare the individual with a selectable norm population.
Figure 2. Model architecture and performance evaluation.
(A) Network architecture, constituted of a shared feature extractor and 18 item-specific feature extractors and output blocks. The shared feature extractor consists of three convolutional blocks, whereas item-specific feature extractors have one convolutional block with global max pooling. Convolutional blocks consist of two convolution and batch normalization pairs, followed by max pooling. Output blocks consist of two fully connected layers. ReLU activation is applied after batch normalization. After pooling, dropout is applied. (B) Item-specific mean absolute error (MAE) for the regression-based network (blue) and multilabel classification network (orange). In the final model, we determine whether to use the regressor or classifier network based on its performance in the validation dataset, indicated by an opaque color in the bar chart. In case of identical performance, the model resulting in the least variance was selected. (C) Model variants were compared and the performance of the best model in the original, retrospectively collected (green) and the independent, prospectively collected (purple) test set is displayed; Clf: multilabel classification network; Reg: regression-based network; NA: no augmentation; DA: data augmentation; TTA: test-time augmentation. (D) Convergence analysis revealed that after ~8000 images, no substantial improvements could be achieved by including more data. (E) The effect of image size on the model performance is measured in terms of MAE. The error bars in all subplots indicate the 95% confidence interval.
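The per-item selection rule described for panel (B) — regressor vs. classifier by validation performance, with ties broken by error variance — can be sketched as follows. This is our own illustrative code, not the authors'; the function names are made up:

```python
from statistics import mean, pvariance

def item_mae(pred, truth):
    """Mean absolute error for one item's predictions."""
    return mean(abs(p - t) for p, t in zip(pred, truth))

def select_head(reg_pred, clf_pred, truth):
    """Pick the regression or classification head for one item:
    lower validation MAE wins; on a tie, the head whose errors
    have the smaller variance is selected."""
    reg_mae = item_mae(reg_pred, truth)
    clf_mae = item_mae(clf_pred, truth)
    if reg_mae != clf_mae:
        return "regressor" if reg_mae < clf_mae else "classifier"
    reg_var = pvariance([p - t for p, t in zip(reg_pred, truth)])
    clf_var = pvariance([p - t for p, t in zip(clf_pred, truth)])
    return "regressor" if reg_var <= clf_var else "classifier"
```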
Figure 3. Contrasting the ratings of our model (A) and clinicians (D) against the ground truth revealed a larger deviation from the regression line for the clinicians.
A jitter is applied to better highlight the dot density. The distribution of errors for our model (B) and the clinicians' ratings (E) is displayed. The mean absolute error (MAE) of our model (C) and the clinicians (F) is displayed for each individual item of the figure (see also Figure 2—source data 1). The corresponding plots for the performance on the prospectively collected data are displayed in Figure 3—figure supplement 1. The model performance for the retrospective (green) and prospective (purple) sample across the entire range of total scores for model (G), clinicians (H), and online raters (I) is presented. The error bars in all subplots indicate the 95% confidence interval.
Figure 3—figure supplement 1. Detailed performance of the model on the prospective data.
(A) Contrasting the ratings of our model against the ground truth. A jitter is applied to better highlight the dot density. (B) The distribution of errors for our model on the prospective data. (C) The mean absolute error (MAE) of our model for each individual item of the figure (see also Figure 3—source data 2).
Figure 3—figure supplement 2. The standard deviation of the human raters is displayed across differently scored drawings.
Figure 4. Model performance across ROCF conditions, demographics, and clinical subgroups in the retrospective dataset.
(A) Displayed are the mean absolute error and bootstrapped 95% confidence intervals of the model performance across different Rey–Osterrieth complex figure (ROCF) conditions (copy and recall), demographics (age and gender), and clinical statuses (healthy individuals and patients) for the retrospective data. (B) Model performance across different diagnostic conditions. (C, D) The number of subjects in each subgroup is depicted. The same model performance analysis for the prospective data is reported in Figure 4—figure supplement 1.
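Bootstrapped confidence intervals of the kind reported in panel (A) are typically computed with a percentile bootstrap over per-drawing errors. A minimal sketch under that assumption (our own code, not the authors' analysis pipeline):

```python
import random
from statistics import mean

def bootstrap_mae_ci(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the MAE:
    resample the per-drawing errors with replacement, recompute
    the MAE each time, and take the alpha/2 and 1 - alpha/2
    quantiles of the resampled MAEs."""
    rng = random.Random(seed)
    abs_err = [abs(e) for e in errors]
    maes = sorted(mean(rng.choices(abs_err, k=len(abs_err)))
                  for _ in range(n_boot))
    lo = maes[int(n_boot * alpha / 2)]
    hi = maes[int(n_boot * (1 - alpha / 2)) - 1]
    return mean(abs_err), (lo, hi)
```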
Figure 4—figure supplement 1. Model performance across ROCF conditions, demographics, and clinical subgroups in the prospective dataset.
(A) Displayed are the mean absolute error and bootstrapped 95% confidence intervals of the model performance across different Rey–Osterrieth complex figure (ROCF) conditions (copy and recall), demographics (age and gender), and clinical statuses (healthy individuals and patients) for the prospective data. (B) The number of subjects in each subgroup is depicted. Please note that we did not have sufficient information on the specific patient diagnoses in the prospective data to decompose the model performance for specific clinical conditions.
Figure 5. Robustness to geometric, brightness, and contrast variations.
The mean absolute error (MAE) is depicted for different degrees of transformations, including (A) rotations; (B) perspective change; (C) brightness decrease; (D) brightness increase; (E) contrast change. In addition, examples of the transformed Rey–Osterrieth complex figure (ROCF) drawings are provided. The error bars in all subplots indicate the 95% confidence interval.
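The brightness and contrast perturbations in panels (C–E) amount to simple pixel-wise operations on a grayscale image. A sketch with made-up helper names (not the evaluation code used in the study):

```python
def adjust_brightness(img, delta):
    """Shift every pixel by delta, clamping to the 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

def adjust_contrast(img, factor):
    """Rescale pixel intensities around mid-grey (128):
    factor > 1 increases contrast, factor < 1 flattens it."""
    return [[min(255, max(0, round(128 + factor * (p - 128)))) for p in row]
            for row in img]
```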
Figure 5—figure supplement 1. Effect of data augmentation.
The mean absolute error (MAE) for the model with data augmentation and without data augmentation is depicted for different degrees of transformations, including (A) rotations; (B) perspective change; (C) brightness decrease; (D) brightness increase; (E) contrast change.

