Comparative Study
Trends Hear. 2025 Jan-Dec;29:23312165251347773. doi: 10.1177/23312165251347773. Epub 2025 Jun 3.

Comparison of Deep Learning Models for Objective Auditory Brainstem Response Detection: A Multicenter Validation Study


Yin Liu et al. Trends Hear. 2025 Jan-Dec.

Abstract

Auditory brainstem response (ABR) interpretation in clinical practice often relies on visual inspection by audiologists, which is prone to inter-practitioner variability. While deep learning (DL) algorithms have shown promise in objectifying ABR detection in controlled settings, their applicability to real-world clinical data is hindered by small datasets and insufficient heterogeneity. This study evaluates the generalizability of nine DL models for ABR detection using large, multicenter datasets. The primary dataset analyzed, Clinical Dataset I, comprises 128,123 labeled ABRs from 13,813 participants across a wide range of ages and hearing levels, and was divided into a training set (90%) and a held-out test set (10%). The models included convolutional neural networks (CNNs; AlexNet, VGG, ResNet), transformer-based architectures (Transformer, Patch Time Series Transformer [PatchTST], Differential Transformer, and Differential PatchTST), and hybrid CNN-transformer models (ResTransformer, ResPatchTST). Performance was assessed on the held-out test set and four external datasets (Clinical II, Southampton, PhysioNet, Mendeley) using accuracy and area under the receiver operating characteristic curve (AUC). ResPatchTST achieved the highest performance on the held-out test set (accuracy: 91.90%, AUC: 0.976). Transformer-based models, particularly PatchTST, showed superior generalization to external datasets, maintaining robust accuracy across diverse clinical settings. Additional experiments highlighted the critical role of dataset size and diversity in enhancing model robustness. We also observed that incorporating acquisition parameters and demographic features as auxiliary inputs yielded performance gains in cross-center generalization. 
These findings underscore the potential of DL models, especially transformer-based architectures, for accurate and generalizable ABR detection, and highlight the necessity of large, diverse datasets in developing clinically reliable systems.
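As an illustration of the patching idea behind PatchTST-style models (this is a hedged sketch, not the authors' code; the patch length and stride are assumed values), a 1-D ABR waveform can be cut into overlapping fixed-length segments that the transformer then treats as tokens:

```python
def patchify(waveform, patch_len=16, stride=8):
    """Split a 1-D waveform into overlapping patches (PatchTST-style tokens)."""
    n_patches = (len(waveform) - patch_len) // stride + 1
    return [waveform[i * stride : i * stride + patch_len] for i in range(n_patches)]

# A 256-sample recording yields (256 - 16) // 8 + 1 = 31 patches of length 16.
waveform = list(range(256))
patches = patchify(waveform)
print(len(patches), len(patches[0]))  # 31 16
```

Tokenizing at the patch level (rather than per sample) shortens the sequence the attention layers must process, which is one reason patch-based transformers scale well to long physiological recordings.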

Keywords: auditory brainstem response; deep learning; generalizability; multicenter validation; objective detection.


Conflict of interest statement

Declaration of Conflicting Interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
Participant demographics and data distribution for Clinical Datasets I and II. (a), (b) Age group distribution by gender: infants (<6 months), children (6 months–18 years), adults (18–60 years), elderly (>60 years); (c), (d) histograms showing the number of responses recorded from each ear; (e), (f) distribution of severity of hearing loss across age groups, defined based on average hearing thresholds at 0.5, 1, 2, and 4 kHz in the better ear. The categories are as follows: normal hearing: threshold < 20 dB; mild hearing loss: 20 dB ≤ threshold < 35 dB; moderate hearing loss: 35 dB ≤ threshold < 50 dB; moderate-severe hearing loss: 50 dB ≤ threshold < 65 dB; severe hearing loss: 65 dB ≤ threshold < 80 dB; profound hearing loss: 80 dB ≤ threshold < 95 dB; and complete or total hearing loss: threshold ≥ 95 dB (World Health Organization, 2021). Clinical Datasets I and II include 5,824 and 94 individuals, respectively, who have not undergone hearing testing.
Figure 2.
Examples of auditory brainstem response (ABR). Four examples of ABR at a range of stimulus levels for hearing threshold estimation. Wave V is labeled where assessed as present.
Figure 3.
Overview of the proposed deep learning framework for detecting auditory brainstem responses (ABRs). (a) Data preparation: Clinical Dataset I is split into 90% for training and 10% as the held-out test set, while Clinical Dataset II, along with the Southampton, PhysioNet, and Mendeley datasets, are used for external validation. (b) Model development: The training set is used for model development, including hyperparameter tuning using 9-fold cross-validation, followed by retraining with the optimized hyperparameters. (c) Model evaluation: The final models classify ABRs as response present or response absent, with performance assessed on five test sets to evaluate generalization across multicenter datasets.
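The data-preparation step in Figure 3 (90% training / 10% held-out, then 9-fold cross-validation within the training set) can be sketched in a few lines. This is an illustrative reconstruction under the assumption that splitting is done at the participant level, so that no individual's recordings leak between splits; the seed and ID list are hypothetical:

```python
import random

def split_for_cv(participant_ids, n_folds=9, test_frac=0.10, seed=0):
    """Hold out test_frac of participant IDs, split the rest into n_folds CV folds."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    held_out, train = ids[:n_test], ids[n_test:]
    folds = [train[k::n_folds] for k in range(n_folds)]  # round-robin assignment
    return train, held_out, folds

train, held_out, folds = split_for_cv(range(1000))
print(len(train), len(held_out), len(folds))  # 900 100 9
```

Splitting by participant rather than by individual ABR recording is the key design choice: one person contributes many waveforms, and a per-recording split would let near-duplicate data appear in both train and test sets, inflating apparent accuracy.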
Figure 4.
Architectures of various DL models. CNN-based models: AlexNet, VGG, and ResNet; Transformer-based models: Transformer, PatchTST, DiffTransformer and DiffPatchTST; Hybrid models: ResTransformer and ResPatchTST. Note. DL = deep learning; CNN = convolutional neural network; PatchTST = patch time series transformer; DiffTransformer = differential transformer; DiffPatchTST = differential PatchTST.
Figure 5.
Performance of deep learning models on Clinical Dataset I. (a) Error bar plots for accuracy, sensitivity, specificity, F1-score, and area under the receiver operating characteristic curve (AUC) (left to right), evaluated using 9-fold cross-validation on the training set. (b) Forest plots for the same metrics evaluated on the held-out test set. Bold values indicate the best performance for each metric.
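The AUC metric reported throughout (e.g. ResPatchTST's 0.976 on the held-out test set) has a simple rank-based definition: the probability that a randomly chosen response-present waveform receives a higher score than a randomly chosen response-absent one. A minimal sketch, not taken from the paper, using the Mann-Whitney U formulation with ties counted as half:

```python
def auc(scores, labels):
    """ROC AUC via pairwise rank comparison (labels are 1 = present, 0 = absent)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking: every positive outscores every negative.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

Because it depends only on the ranking of scores, AUC is insensitive to the decision threshold, which is useful when comparing models whose output scales differ.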
Figure 6.
Distribution of absolute errors in auditory brainstem response (ABR) threshold prediction and corresponding cumulative accuracy for ResPatchTST. Each bar represents the number of ears at a given absolute prediction error level (in dB). Accuracy, 10 dB accuracy, and 20 dB accuracy are annotated in the upper right, with 95.41% of predictions falling within 10 dB of expert-labeled thresholds.
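The "10 dB accuracy" in Figure 6 is the fraction of ears whose predicted threshold falls within 10 dB of the expert-labeled threshold. A hedged sketch of that computation (the example values are made up, not from the study):

```python
def within_accuracy(pred_db, true_db, tol_db):
    """Fraction of ears whose predicted threshold is within tol_db of the label."""
    hits = sum(abs(p - t) <= tol_db for p, t in zip(pred_db, true_db))
    return hits / len(true_db)

# Hypothetical thresholds: absolute errors are 5, 5, 15, 0 dB.
pred = [20, 35, 50, 70]
true = [25, 30, 65, 70]
print(within_accuracy(pred, true, 10))  # 0.75
```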
Figure 7.
Forest plots for accuracies of various models externally evaluated on multicenter datasets: Clinical Dataset I, Clinical Dataset II, Southampton, PhysioNet, and Mendeley Datasets. Bold values indicate the best result for each dataset.
Figure 8.
Impact of training dataset size on the generalization performance of PatchTST. The model is retrained on subsets randomly sampled from the full training dataset (12,256 individuals) and validated using 9-fold cross-validation on the training set, as well as on the held-out test set from Clinical Dataset I and the Independent Clinical Dataset II. (a) Accuracy; (b) AUC. Note. AUC = area under the receiver operating characteristic curve; PatchTST = patch time series transformer.
Figure 9.
Generalization performance of PatchTST trained on age-restricted groups compared to mixed-age groups of equal size. The upper panel illustrates the model's performance when trained on data from a single age group and validated on unseen age groups both individually and collectively. The lower panel shows the corresponding performance metrics when the model is trained on a mixed-age dataset of equivalent size. (a) Accuracy; (b) AUC. In both panels, n represents the number of subjects. Statistical significance of AUC differences was evaluated using DeLong's test: *p < .05, **p < .01, ***p < .001. Note. AUC = area under the receiver operating characteristic curve; PatchTST = patch time series transformer.
Figure 10.
Generalization performance of PatchTST trained on hearing-status-restricted groups versus mixed-hearing-status groups of equal size. The upper panel shows performance when trained on either individuals with NH or HL and validated on the opposite group. The lower panel presents performance when trained on a mixed-hearing-status dataset of the same size. (a) Accuracy; (b) AUC. Statistical significance of AUC differences was evaluated using DeLong's test: *p < .05, **p < .01, ***p < .001. Note. AUC = area under the receiver operating characteristic curve; PatchTST = patch time series transformer; NH = normal hearing; HL = hearing loss.


