Sci Rep. 2023 Jul 11;13(1):11238.

doi: 10.1038/s41598-023-38186-7.

Extended performance analysis of deep-learning algorithms for mice vocalization segmentation

Daniele Baggi¹, Marika Premoli², Alessandro Gnutti³, Sara Anna Bonini², Riccardo Leonardi¹, Maurizio Memo², Pierangelo Migliorati¹

Affiliations

¹ Department of Information Engineering, University of Brescia, Brescia, Italy.
² Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy.
³ Department of Information Engineering, University of Brescia, Brescia, Italy. alessandro.gnutti@unibs.it.

PMID: 37433808
PMCID: PMC10336146
DOI: 10.1038/s41598-023-38186-7

Daniele Baggi et al. Sci Rep. 2023.

Abstract

Ultrasonic vocalization (USV) analysis is a fundamental tool for studying animal communication. It can be used for behavioral investigation of mice in ethological studies and in the fields of neuroscience and neuropharmacology. USVs are usually recorded with a microphone sensitive to ultrasound frequencies and then processed by dedicated software, which helps the operator identify and characterize different families of calls. Recently, many automated systems have been proposed to perform both the detection and the classification of USVs. USV segmentation is the crucial step of the overall framework, since the quality of call processing depends strictly on how accurately the call itself has been detected. In this paper, we investigate the performance of three supervised deep learning methods for automated USV segmentation: an Auto-Encoder Neural Network (AE), a U-NET Neural Network (UNET) and a Recurrent Neural Network (RNN). The proposed models take as input the spectrogram of the recorded audio track and return the regions in which USV calls have been detected. To evaluate the models, we built a dataset by recording several audio tracks and manually segmenting the corresponding USV spectrograms generated with the Avisoft software, thereby producing the ground truth (GT) used for training. All three proposed architectures achieved precision and recall scores exceeding [Formula: see text], with UNET and AE achieving values above [Formula: see text], surpassing the other state-of-the-art methods considered for comparison in this study. The evaluation was also extended to an external dataset, where UNET again exhibited the highest performance. We suggest that our experimental results may serve as a valuable benchmark for future work.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
An example of USV segmentation.

**Figure 2**
Two examples of problems that arise when the segmentation is not post-processed. In (a), some spikes are incorrectly detected before and after the USV, while in (b) the main prediction is split into two regions by an extremely short break inside it. Both cases can be fixed by applying a post-processing algorithm based on mathematical morphology to the output of the models.
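The two failure cases in this caption can be illustrated with a minimal sketch of one-dimensional morphological post-processing. The function names, the kernel size `k`, and the order of operations (closing first, then opening) are assumptions made for illustration; the caption does not specify them. In one dimension, opening with a flat structuring element of length `k` removes runs of ones shorter than `k`, and closing fills interior runs of zeros shorter than `k`, so the sketch works directly on runs.

```python
from itertools import groupby


def _runs(y):
    """Yield (value, start, length) for each run of equal values in y."""
    pos = 0
    for value, group in groupby(y):
        length = len(list(group))
        yield value, pos, length
        pos += length


def opening_1d(y, k):
    """Remove runs of ones shorter than k (isolated spikes)."""
    out = list(y)
    for value, start, length in _runs(y):
        if value == 1 and length < k:
            out[start:start + length] = [0] * length
    return out


def closing_1d(y, k):
    """Fill runs of zeros shorter than k that lie between two runs of ones."""
    out = list(y)
    n = len(y)
    for value, start, length in _runs(y):
        interior = start > 0 and start + length < n
        if value == 0 and interior and length < k:
            out[start:start + length] = [1] * length
    return out


def postprocess(y, k=3):
    """Assumed order: close short breaks first, then drop short spikes."""
    return opening_1d(closing_1d(y, k), k)


# One spurious spike (index 1) and one call split by a one-frame break (index 9).
raw = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
clean = postprocess(raw, k=3)
```

With closing applied first, a spike that sits closer than `k` frames to a real call would be merged into it rather than removed; swapping the order trades that risk for the risk of deleting short call fragments before they are joined.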

**Figure 3**
General framework for USV segmentation. The spectrogram $S_{i}$ associated with an entire audio track is divided into shorter fragments $s_{ij}$. A classification model takes the fragments separately as input and returns for each of them a binary vector $y_{ij}^{(p)}$, whose elements indicate whether a USV is present at the corresponding time frame (white square) or not (black square). The vectors are then concatenated, generating the vector $y_{i}^{(p)}$ associated with $S_{i}$. Finally, a post-processing algorithm removes potential isolated spikes and fills undesired small interruptions, generating $z_{i}^{(p)}$. For the sake of visibility, the squares associated with the single time predictions are drawn larger than the time resolution of the depicted spectrogram.
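The framework in this caption can be sketched in a few lines. This is only an illustration: the spectrogram is represented here as a list of time frames (each a list of magnitudes), `frag_len` is a hypothetical fragment length, and `toy_classifier` is a stand-in threshold rule, not one of the AE, UNET or RNN models.

```python
def split_fragments(spectrogram, frag_len):
    """Divide S_i (a list of time frames) into consecutive fragments s_ij."""
    return [spectrogram[i:i + frag_len]
            for i in range(0, len(spectrogram), frag_len)]


def toy_classifier(fragment, threshold=1.0):
    """Hypothetical stand-in for a trained model.

    Returns one binary value per time frame: 1 if the mean magnitude of the
    frame exceeds the threshold, 0 otherwise.
    """
    return [int(sum(frame) / len(frame) > threshold) for frame in fragment]


def segment_track(spectrogram, frag_len, classify=toy_classifier):
    """Classify each fragment separately and concatenate the vectors y_ij."""
    y = []
    for fragment in split_fragments(spectrogram, frag_len):
        y.extend(classify(fragment))
    return y


# Five time frames with two frequency bins each; frames 1 and 2 are loud.
S = [[0.1, 0.2], [2.0, 3.0], [2.5, 2.5], [0.0, 0.1], [0.2, 0.1]]
y_pred = segment_track(S, frag_len=2)
```

The concatenated vector has one entry per time frame of the original track, so any post-processing step can operate on it without knowing where the fragment boundaries were.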

**Figure 4**
Flowchart of the three proposed models. The meaning of the blocks for all three architectures is given in the legend in (a). k, n and d specify the number of filters, the number of neurons and the dropout rate for the convolutional, dense and dropout layers, respectively.

**Figure 5**
Performance comparison between the models in terms of time frame classification, with and without the post-processing algorithm. (a) and (b) show the accuracy and precision-recall curves, respectively. (c–e) depict the confusion matrices of the models without post-processing. (f–h) depict the confusion matrices of the models with post-processing.
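Time frame classification metrics of the kind named in this caption can be computed from a pair of binary vectors. The sketch below uses the standard definitions of the confusion matrix, accuracy, precision and recall; the function names and the example vectors are made up for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for two equal-length binary vectors."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def frame_metrics(y_true, y_pred):
    """Frame-level accuracy, precision and recall (0.0 when undefined)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


gt = [0, 0, 1, 1, 1, 0]      # illustrative ground truth per time frame
pred = [0, 1, 1, 1, 0, 0]    # illustrative model output per time frame
acc, prec, rec = frame_metrics(gt, pred)
```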

**Figure 6**
Vocalization segmentation performance of the models on the training set as the kernel size used in the post-processing varies.

**Figure 7**
Precision and recall curves of the models as the IoU threshold varies, with and without post-processing.
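Call-level precision and recall at a given IoU threshold can be sketched as follows. Calls are represented as half-open `(start, end)` frame intervals, and a predicted call counts as a true positive when its best IoU with a not-yet-matched ground-truth call reaches the threshold. This greedy one-to-one matching rule is an assumption; the matching procedure behind the figure is not described here.

```python
def iou(a, b):
    """Intersection over union of two half-open intervals (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def precision_recall(pred, gt, thr=0.5):
    """Greedy one-to-one matching of predicted calls to ground-truth calls."""
    matched = set()
    tp = 0
    for p in pred:
        best_j, best = None, 0.0
        for j, g in enumerate(gt):
            if j in matched:
                continue
            value = iou(p, g)
            if value > best:
                best_j, best = j, value
        if best_j is not None and best >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall


gt_calls = [(10, 20), (40, 50)]                 # illustrative ground truth
pred_calls = [(11, 21), (42, 60), (70, 75)]     # illustrative predictions
strict = precision_recall(pred_calls, gt_calls, thr=0.5)
loose = precision_recall(pred_calls, gt_calls, thr=0.3)
```

Sweeping `thr` over a range of values and plotting the two returned numbers gives curves of the same kind as those in the figure: raising the threshold can only turn true positives into misses, so both precision and recall are non-increasing in `thr`.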

**Figure 8**
Histogram of detected USV calls (positives) with respect to the corresponding IoU value with post-processing.

**Figure 9**
Visual comparison of the results of the proposed models with the post-processing algorithm. The pictures show the spectrograms of two audio segments (the first segment is depicted in (a–c), the second in (d–f)) with the associated GT (blue signal) and prediction (orange signal) for each of the three models.

**Figure 10**
Boxplots of the distributions of the relative length errors.
