Nat Commun. 2024 Jan 2;15(1):148.
doi: 10.1038/s41467-023-44516-0.

Spontaneous emergence of rudimentary music detectors in deep neural networks


Gwangsu Kim et al. Nat Commun. 2024.

Abstract

Music exists in almost every society, has universal acoustic features, and is processed by distinct neural circuits in humans even with no experience of musical training. However, it remains unclear how these innate characteristics emerge and what functions they serve. Here, using an artificial deep neural network that models the auditory information processing of the brain, we show that units tuned to music can spontaneously emerge by learning natural sound detection, even without learning music. The music-selective units encoded the temporal structure of music in multiple timescales, following the population-level response characteristics observed in the brain. We found that the process of generalization is critical for the emergence of music-selectivity and that music-selectivity can work as a functional basis for the generalization of natural sound, thereby elucidating its origin. These findings suggest that evolutionary adaptation to process natural sounds can provide an initial blueprint for our sense of music.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Distinct representation of music in deep neural networks trained for natural sound detection without music.
a Example log-Mel spectrograms of the natural sound data in the AudioSet dataset. b Architecture of the deep neural network used to detect the natural sound categories in the input data. The purple box indicates the average pooling layer. c Performance (mean average precision, mAP) of the network trained without music for music-related categories (top, red bars) and other categories (bottom, blue). n = 5 independent networks. Error bars represent mean ± SD. d Density plot of the t-SNE embedding of feature vectors obtained from the network in (c). The lines represent iso-proportion lines at 80%, 60%, 40%, and 20% levels. Source data are provided as a Source Data file.
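The front end in panel (a) is a standard log-Mel transform. Below is a minimal sketch using librosa; the sample rate, FFT size, hop length, and number of Mel bins are placeholder choices for illustration, not the parameters used in the paper.

```python
# Minimal log-Mel front end; all spectrogram parameters here are
# illustrative assumptions, not the paper's settings.
import numpy as np
import librosa

def log_mel_spectrogram(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Convert a waveform to a log-scaled Mel spectrogram (dB)."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64
    )
    return librosa.power_to_db(mel, ref=np.max)  # log compression in dB

# Example on one second of synthetic audio:
y = np.random.randn(16000).astype(np.float32)
print(log_mel_spectrogram(y).shape)  # (64, n_frames)
```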
Fig. 2
Fig. 2. Selective response of units in the network to music.
a Histograms of the average response of the units for music (red) and non-music (blue) stimuli in networks trained without music. The lines represent the response averaged over all units. b Response of the music-selective units to music (red) and non-music stimuli. Inset: response of the units in the untrained network with the top 12.5% MSI values to music and non-music stimuli. The box represents the lower and upper quartiles. The whiskers represent the lower (upper) quartile − (+) 1.5 × interquartile range. nmusic, train = 4539, nmusic, test = 3999, nnon-music, train = 10,483, and nnon-music, test = 11,010 independent sounds. c Invariance of the music-selectivity to changes in sound amplitude. Response of the music-selective units to music (red) and non-music (blue) using the training dataset with normalized amplitude. The whiskers represent the lower (upper) quartile − (+) 1.5 × interquartile range. nmusic = 4539, nnon-music = 10,483 independent sounds. d Illustration of the binary classification of music and non-music using the response of the music-selective units (left), and the performance of the linear classifier (right). One-tailed Wilcoxon rank-sum test, asterisks from left: U12.5-25% = 25, U25-37.5% = 25, U37.5-50% = 25, p12.5-25% = 0.006, p25-37.5% = 0.006, p37.5-50% = 0.006, ES12.5-25% = 1, ES25-37.5% = 1, ES37.5-50% = 1, n = 5 independent networks. Error bars represent mean ± SD. The asterisks represent statistical significance (p < 0.05). Source data are provided as a Source Data file.
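Panels (b) and (d) hinge on ranking units by a music selectivity index (MSI). The sketch below uses a simple normalized response difference as a stand-in; the paper's exact MSI definition may differ, and the response matrices are random placeholders.

```python
# Stand-in MSI: normalized difference of mean responses to music vs.
# non-music. This is an assumption for illustration, not the paper's index.
import numpy as np

def msi(music_resp: np.ndarray, nonmusic_resp: np.ndarray) -> np.ndarray:
    """Inputs: (n_stimuli, n_units) arrays of average unit responses."""
    mu_m = music_resp.mean(axis=0)
    mu_n = nonmusic_resp.mean(axis=0)
    # Bounded in [-1, 1] for non-negative responses.
    return (mu_m - mu_n) / (mu_m + mu_n + 1e-12)

rng = np.random.default_rng(0)
music = rng.random((100, 256))     # placeholder responses to 100 music clips
nonmusic = rng.random((200, 256))  # placeholder responses to 200 non-music clips
scores = msi(music, nonmusic)
top = np.argsort(scores)[-int(0.125 * scores.size):]  # top 12.5% MSI units
print(top.size)  # 32
```

The responses of this top-12.5% unit set could then feed a simple linear classifier (e.g., logistic regression) for the music/non-music discrimination shown in panel (d).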
Fig. 3
Fig. 3. Significance of the music-selectivity emerging in the network trained without music.
a The component response profile inferred by a voxel decomposition method from human fMRI data (data from Fig. 2d of Norman-Haignere, 2015). Bars represent the response magnitude of the music component to 165 natural sounds. Sounds are sorted in descending order of response. b Analysis of the average response of units (in the networks trained without music) with the top 12.5% MSI values (identified with the AudioSet dataset) to the 165 natural sounds. Inset: music/non-music response ratio for the fMRI data in (a) and the networks trained without music. One-tailed, one-sample Wilcoxon signed-rank test, U = 15, p = 0.031, ES = 1, n = 5 independent networks. Error bars represent mean ± SD. c The same analysis for the network trained with music, the randomly initialized network (inset), and the Gabor filter bank model. d The average response of music-selective units to each of the 11 sound categories defined in Norman-Haignere, 2015 in the networks trained without music. The music-selective units showed higher responses to the music categories than to each of the non-music sound categories (one-to-one comparisons). One-tailed Wilcoxon signed-rank test, for all pairs: U = 15, p = 0.031, ES = 1, n = 5 independent networks. e The average music/non-music response ratio (sounds in the training AudioSet dataset) of units with top 12.5% MSI values in each model. Two-tailed Wilcoxon rank-sum test, vs. trained with music: U = 11, p = 0.417, ES = 0.44; vs. untrained: U = 0, p = 0.006, ES = 0, n = 5 independent networks. The asterisks represent statistical significance (p < 0.05). Error bars represent mean ± SD. Source data are provided as a Source Data file.
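The tests in this legend are standard and easy to reproduce mechanically with SciPy. Note that with n = 5 networks and every value on the same side of the null, the exact one-tailed signed-rank p-value is 1/32 ≈ 0.031, which is exactly the p reported above. The ratio values below are invented placeholders, not the paper's data.

```python
# Reproducing the legend's test machinery with SciPy on placeholder data.
import numpy as np
from scipy.stats import wilcoxon, ranksums

ratios = np.array([1.8, 2.1, 1.9, 2.4, 2.0])  # hypothetical ratios, n = 5 networks

# One-tailed one-sample signed-rank test against a ratio of 1.
# With n = 5 and all values above 1, the statistic is 15 and p = 1/32.
stat, p = wilcoxon(ratios - 1.0, alternative="greater")
print(stat, round(p, 3))  # 15.0 0.031

other = np.array([1.7, 2.2, 2.0, 1.8, 2.3])  # hypothetical second group
stat, p = ranksums(ratios, other)  # two-tailed rank-sum test between models
print(round(p, 3))
```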
Fig. 4
Fig. 4. Encoding of the temporal structure of music by music-selective units in the network.
a Schematic diagram of the generation of sound quilts. A change in the order of the letters represents the segment reordering process. b Response of the music-selective units to sound quilts made of music (red) and non-music (blue). A one-tailed Wilcoxon signed-rank test was used to test whether the response was reduced compared to the original condition. For the music quilts: U50 = 15, U100 = 15, U200 = 15, U400 = 8, U800 = 0, U1,600 = 1, p50 = 0.031, p100 = 0.031, p200 = 0.031, p400 = 0.500, p800 = 1.000, p1,600 = 0.969, ES50 = 1.0, ES100 = 1.0, ES200 = 1.0, ES400 = 0.533, ES800 = 0, ES1,600 = 0.067; for the non-music quilts: U50 = 15, U100 = 15, U200 = 15, U400 = 0, U800 = 0, U1,600 = 0, p50 = 0.031, p100 = 0.031, p200 = 0.031, p400 = 1.000, p800 = 1.000, p1,600 = 1.000, ES50 = 1, ES100 = 1, ES200 = 1, ES400 = 0, ES800 = 0, ES1,600 = 0; n = 5 independent networks. Error bars represent mean ± SD. c Response of the other units to sound quilts made of music (red) and non-music (blue). One-tailed Wilcoxon signed-rank test. For the music quilts: U50 = 2, U100 = 3, U200 = 9, U400 = 7, U800 = 2, U1,600 = 9, p50 = 0.938, p100 = 0.906, p200 = 0.406, p400 = 0.594, p800 = 0.938, p1,600 = 0.406, ES50 = 0.133, ES100 = 0.2, ES200 = 0.6, ES400 = 0.467, ES800 = 0.133, ES1,600 = 0.6; for the non-music quilts: U50 = 3, U100 = 2, U200 = 5, U400 = 1, U800 = 1, U1,600 = 4, p50 = 0.906, p100 = 0.938, p200 = 0.781, p400 = 0.969, p800 = 0.969, p1,600 = 0.844, ES50 = 0.2, ES100 = 0.133, ES200 = 0.333, ES400 = 0.067, ES800 = 0.067, ES1,600 = 0.267; n = 5 independent networks. Error bars represent mean ± SD. The asterisks indicate statistical significance (p < 0.05). N.S., not significant. Source data are provided as a Source Data file.
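A sound quilt preserves local acoustic structure within each segment while scrambling structure across segments, so a response drop for short-segment quilts indicates sensitivity to temporal structure longer than the segment. Below is a simplified sketch of the reordering step; published quilting procedures also match segment-boundary transitions to reduce splicing artifacts, which this version omits.

```python
# Simplified sound-quilt generation: cut the waveform into fixed-length
# segments and reorder them at random (boundary matching omitted).
import numpy as np

def make_quilt(y: np.ndarray, sr: int, seg_ms: float, rng=None) -> np.ndarray:
    rng = np.random.default_rng(rng)
    seg_len = int(sr * seg_ms / 1000)              # segment length in samples
    n_segs = len(y) // seg_len
    segs = y[: n_segs * seg_len].reshape(n_segs, seg_len)
    return segs[rng.permutation(n_segs)].ravel()   # reordered segments

y = np.random.randn(16000 * 2).astype(np.float32)  # 2 s of synthetic audio
quilt = make_quilt(y, sr=16000, seg_ms=50)          # 50-ms segment quilt
print(quilt.shape)
```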
Fig. 5
Fig. 5. Music-selectivity as a generalization of natural sounds.
a Illustration of network training to memorize the data by randomizing the labels. b Response of the units with the top 12.5% MSI values to music quilts in the networks trained with randomized labels (black, memorization) compared with that of the network in Fig. 4b (red, generalization). To normalize the two conditions, each response was divided by the average response to the original sound from each network. One-tailed Wilcoxon rank-sum test, U50 = 25, U100 = 25, U200 = 17, U400 = 14, U800 = 10, U1,600 = 15, p50 = 0.006, p100 = 0.006, p200 = 0.202, p400 = 0.417, p800 = 0.735, p1,600 = 0.338, ES50 = 1, ES100 = 1, ES200 = 0.68, ES400 = 0.56, ES800 = 0.4, ES1,600 = 0.6, n = 5 independent networks. Error bars represent mean ± SD. c Performance of the network after the ablation of specific units. One-tailed Wilcoxon signed-rank test, MSI top 12.5% vs. baseline: U = 15, p = 0.031, ES = 1; vs. MSI bot. 12.5%: U = 15, p = 0.031, ES = 1; vs. MSI mid. 12.5%: U = 15, p = 0.031, ES = 1; vs. L1 norm top 12.5%: U = 15, p = 0.031, ES = 1. The asterisks indicate statistical significance (p < 0.05). n = 5 independent networks. Error bars represent mean ± SD. Source data are provided as a Source Data file.
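Panels (a) and (c) rest on two standard manipulations: label randomization, which forces memorization because shuffled labels carry no generalizable structure, and unit ablation, which zeroes the activations of a chosen unit set. The PyTorch-style fragments below are illustrative sketches; the class count, unit indices, and module names are hypothetical, not from the paper.

```python
# Illustrative fragments for label randomization and unit ablation;
# all sizes and names are hypothetical assumptions.
import torch

# (a) Label randomization: permute labels independently of the inputs so the
# network can only fit the training set by memorization.
n_classes = 500                                    # hypothetical category count
labels = torch.randint(0, n_classes, (1000,))
shuffled = labels[torch.randperm(labels.size(0))]  # destroys input-label pairing

# (c) Unit ablation at inference time: a forward hook that zeroes the
# activations of the selected feature channels (e.g., top 12.5% MSI units).
def ablate(unit_idx: torch.Tensor):
    def hook(module, inputs, output):
        output[:, unit_idx] = 0.0  # silence the chosen channels
        return output
    return hook

# Usage on a hypothetical layer:
# handle = net.feature_layer.register_forward_hook(ablate(top_units))
```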


References

    1. Mehr SA, et al. Universality and diversity in human song. Science. 2019;366:eaax0868. doi: 10.1126/science.aax0868. - DOI - PMC - PubMed
    2. Savage PE, Brown S, Sakai E, Currie TE. Statistical universals reveal the structures and functions of human music. Proc. Natl Acad. Sci. USA. 2015;112:8987–8992. doi: 10.1073/pnas.1414495112. - DOI - PMC - PubMed
    3. Zatorre RJ, Salimpoor VN. From perception to pleasure: music and its neural substrates. Proc. Natl Acad. Sci. USA. 2013;110:10430–10437. doi: 10.1073/pnas.1301228110. - DOI - PMC - PubMed
    4. Zatorre RJ, Chen JL, Penhune VB. When the brain plays music: auditory-motor interactions in music perception and production. Nat. Rev. Neurosci. 2007;8:547–558. doi: 10.1038/nrn2152. - DOI - PubMed
    5. Koelsch S. Toward a neural basis of music perception - a review and updated model. Front. Psychol. 2011;2:1–20. doi: 10.3389/fpsyg.2011.00110. - DOI - PMC - PubMed
