Nat Commun. 2024 Jan 2;15(1):148.
doi: 10.1038/s41467-023-44516-0.

Spontaneous emergence of rudimentary music detectors in deep neural networks


Gwangsu Kim et al. Nat Commun. 2024.

Abstract

Music exists in almost every society, has universal acoustic features, and is processed by distinct neural circuits in humans even with no experience of musical training. However, it remains unclear how these innate characteristics emerge and what functions they serve. Here, using an artificial deep neural network that models the auditory information processing of the brain, we show that units tuned to music can spontaneously emerge by learning natural sound detection, even without learning music. The music-selective units encoded the temporal structure of music in multiple timescales, following the population-level response characteristics observed in the brain. We found that the process of generalization is critical for the emergence of music-selectivity and that music-selectivity can work as a functional basis for the generalization of natural sound, thereby elucidating its origin. These findings suggest that evolutionary adaptation to process natural sounds can provide an initial blueprint for our sense of music.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Distinct representation of music in deep neural networks trained for natural sound detection without music.
a Example log-Mel spectrograms of the natural sound data in the AudioSet dataset. b Architecture of the deep neural network used to detect the natural sound categories in the input data. The purple box indicates the average pooling layer. c Performance (mean average precision, mAP) of the network trained without music for music-related categories (top, red bars) and other categories (bottom, blue). n = 5 independent networks. Error bars represent mean ± SD. d Density plot of the t-SNE embedding of feature vectors obtained from the network in (c). The lines represent iso-proportion lines at 80%, 60%, 40%, and 20% levels. Source data are provided as a Source Data file.
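The front end in panel (a) is a standard log-Mel transform. Below is a minimal sketch using librosa; the sample rate, FFT size, hop length, and number of Mel bins are placeholder choices for illustration, not the parameters used in the paper.

```python
# Minimal log-Mel front end; all spectrogram parameters here are
# illustrative assumptions, not the paper's settings.
import numpy as np
import librosa

def log_mel_spectrogram(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Convert a waveform to a log-scaled Mel spectrogram (dB)."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64
    )
    return librosa.power_to_db(mel, ref=np.max)  # log compression in dB

# Example on one second of synthetic audio:
y = np.random.randn(16000).astype(np.float32)
print(log_mel_spectrogram(y).shape)  # (64, n_frames)
```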
Fig. 2
Fig. 2. Selective response of units in the network to music.
a Histograms of the average response of the units for music (red) and non-music (blue) stimuli in networks trained without music. The lines represent the response averaged over all units. b Response of the music-selective units to music (red) and non-music stimuli. Inset: response of the units in the untrained network with the top 12.5% MSI values to music and non-music stimuli. The box represents the lower and upper quartiles. The whiskers represent the lower (upper) quartile − (+) 1.5 × interquartile range. nmusic, train = 4539, nmusic, test = 3999, nnon-music, train = 10,483, and nnon-music, test = 11,010 independent sounds. c Invariance of the music-selectivity to changes in sound amplitude. Response of the music-selective units to music (red) and non-music (blue) using the training dataset with normalized amplitude. The whiskers represent the lower (upper) quartile − (+) 1.5 × interquartile range. nmusic = 4539, nnon-music = 10,483 independent sounds. d Illustration of the binary classification of music and non-music using the response of the music-selective units (left), and the performance of the linear classifier (right). One-tailed Wilcoxon rank-sum test, asterisks from left: U12.5-25% = 25, U25-37.5% = 25, U37.5-50% = 25, p12.5-25% = 0.006, p25-37.5% = 0.006, p37.5-50% = 0.006, ES12.5-25% = 1, ES25-37.5% = 1, ES37.5-50% = 1, n = 5 independent networks. Error bars represent mean ± SD. The asterisks represent statistical significance (p < 0.05). Source data are provided as a Source Data file.
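Panels (b) and (d) hinge on ranking units by a music selectivity index (MSI). The sketch below uses a simple normalized response difference as a stand-in; the paper's exact MSI definition may differ, and the response matrices are random placeholders.

```python
# Stand-in MSI: normalized difference of mean responses to music vs.
# non-music. This is an assumption for illustration, not the paper's index.
import numpy as np

def msi(music_resp: np.ndarray, nonmusic_resp: np.ndarray) -> np.ndarray:
    """Inputs: (n_stimuli, n_units) arrays of average unit responses."""
    mu_m = music_resp.mean(axis=0)
    mu_n = nonmusic_resp.mean(axis=0)
    # Bounded in [-1, 1] for non-negative responses.
    return (mu_m - mu_n) / (mu_m + mu_n + 1e-12)

rng = np.random.default_rng(0)
music = rng.random((100, 256))     # placeholder responses to 100 music clips
nonmusic = rng.random((200, 256))  # placeholder responses to 200 non-music clips
scores = msi(music, nonmusic)
top = np.argsort(scores)[-int(0.125 * scores.size):]  # top 12.5% MSI units
print(top.size)  # 32
```

The responses of this top-12.5% unit set could then feed a simple linear classifier (e.g., logistic regression) for the music/non-music discrimination shown in panel (d).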
Fig. 3
Fig. 3. Significance of the music-selectivity emerging in the network trained without music.
a The component response profile inferred by a voxel decomposition method from human fMRI data (data from Fig. 2d of Norman-Haignere, 2015). Bars represent the response magnitude of the music component to 165 natural sounds. Sounds are sorted in descending order of response. b Analysis of the average response of units (in the networks trained without music) with the top 12.5% MSI values (identified with the AudioSet dataset) to the 165 natural sounds. Inset: music/non-music response ratio for the fMRI data in (a) and the networks trained without music. One-tailed, one-sample Wilcoxon signed-rank test, U = 15, p = 0.031, ES = 1, n = 5 independent networks. Error bars represent mean ± SD. c The same analysis for the network trained with music, the randomly initialized network (inset), and the Gabor filter bank model. d The average response of music-selective units to each of the 11 sound categories defined in Norman-Haignere, 2015 in the networks trained without music. The music-selective units showed higher responses to the music categories than to each of the non-music sound categories (one-to-one comparisons). One-tailed Wilcoxon signed-rank test, for all pairs: U = 15, p = 0.031, ES = 1, n = 5 independent networks. e The average music/non-music response ratio (sounds in the training AudioSet dataset) of units with top 12.5% MSI values in each model. Two-tailed Wilcoxon rank-sum test, vs. trained with music: U = 11, p = 0.417, ES = 0.44; vs. untrained: U = 0, p = 0.006, ES = 0, n = 5 independent networks. The asterisks represent statistical significance (p < 0.05). Error bars represent mean ± SD. Source data are provided as a Source Data file.
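The tests in this legend are standard and easy to reproduce mechanically with SciPy. Note that with n = 5 networks and every value on the same side of the null, the exact one-tailed signed-rank p-value is 1/32 ≈ 0.031, which is exactly the p reported above. The ratio values below are invented placeholders, not the paper's data.

```python
# Reproducing the legend's test machinery with SciPy on placeholder data.
import numpy as np
from scipy.stats import wilcoxon, ranksums

ratios = np.array([1.8, 2.1, 1.9, 2.4, 2.0])  # hypothetical ratios, n = 5 networks

# One-tailed one-sample signed-rank test against a ratio of 1.
# With n = 5 and all values above 1, the statistic is 15 and p = 1/32.
stat, p = wilcoxon(ratios - 1.0, alternative="greater")
print(stat, round(p, 3))  # 15.0 0.031

other = np.array([1.7, 2.2, 2.0, 1.8, 2.3])  # hypothetical second group
stat, p = ranksums(ratios, other)  # two-tailed rank-sum test between models
print(round(p, 3))
```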
Fig. 4
Fig. 4. Encoding of the temporal structure of music by music-selective units in the network.
a Schematic diagram of the generation of sound quilts. A change in the order of the letters represents the segment reordering process. b Response of the music-selective units to sound quilts made of music (red) and non-music (blue). A one-tailed Wilcoxon signed-rank test was used to test whether the response was reduced compared to the original condition. For the music quilts: U50 = 15, U100 = 15, U200 = 15, U400 = 8, U800 = 0, U1,600 = 1, p50 = 0.031, p100 = 0.031, p200 = 0.031, p400 = 0.500, p800 = 1.000, p1,600 = 0.969, ES50 = 1.0, ES100 = 1.0, ES200 = 1.0, ES400 = 0.533, ES800 = 0, ES1,600 = 0.067; for the non-music quilts: U50 = 15, U100 = 15, U200 = 15, U400 = 0, U800 = 0, U1,600 = 0, p50 = 0.031, p100 = 0.031, p200 = 0.031, p400 = 1.000, p800 = 1.000, p1,600 = 1.000, ES50 = 1, ES100 = 1, ES200 = 1, ES400 = 0, ES800 = 0, ES1,600 = 0; n = 5 independent networks. Error bars represent mean ± SD. c Response of the other units to sound quilts made of music (red) and non-music (blue). One-tailed Wilcoxon signed-rank test. For the music quilts: U50 = 2, U100 = 3, U200 = 9, U400 = 7, U800 = 2, U1,600 = 9, p50 = 0.938, p100 = 0.906, p200 = 0.406, p400 = 0.594, p800 = 0.938, p1,600 = 0.406, ES50 = 0.133, ES100 = 0.2, ES200 = 0.6, ES400 = 0.467, ES800 = 0.133, ES1,600 = 0.6; for the non-music quilts: U50 = 3, U100 = 2, U200 = 5, U400 = 1, U800 = 1, U1,600 = 4, p50 = 0.906, p100 = 0.938, p200 = 0.781, p400 = 0.969, p800 = 0.969, p1,600 = 0.844, ES50 = 0.2, ES100 = 0.133, ES200 = 0.333, ES400 = 0.067, ES800 = 0.067, ES1,600 = 0.267; n = 5 independent networks. Error bars represent mean ± SD. The asterisks indicate statistical significance (p < 0.05). N.S., not significant. Source data are provided as a Source Data file.
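A sound quilt preserves local acoustic structure within each segment while scrambling structure across segments, so a response drop for short-segment quilts indicates sensitivity to temporal structure longer than the segment. Below is a simplified sketch of the reordering step; published quilting procedures also match segment-boundary transitions to reduce splicing artifacts, which this version omits.

```python
# Simplified sound-quilt generation: cut the waveform into fixed-length
# segments and reorder them at random (boundary matching omitted).
import numpy as np

def make_quilt(y: np.ndarray, sr: int, seg_ms: float, rng=None) -> np.ndarray:
    rng = np.random.default_rng(rng)
    seg_len = int(sr * seg_ms / 1000)              # segment length in samples
    n_segs = len(y) // seg_len
    segs = y[: n_segs * seg_len].reshape(n_segs, seg_len)
    return segs[rng.permutation(n_segs)].ravel()   # reordered segments

y = np.random.randn(16000 * 2).astype(np.float32)  # 2 s of synthetic audio
quilt = make_quilt(y, sr=16000, seg_ms=50)          # 50-ms segment quilt
print(quilt.shape)
```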
Fig. 5
Fig. 5. Music-selectivity as a generalization of natural sounds.
a Illustration of network training to memorize the data by randomizing the labels. b Response of the units with the top 12.5% MSI values to music quilts in the networks trained with randomized labels (black, memorization) compared with that of the network in Fig. 4b (red, generalization). To normalize the two conditions, each response was divided by the average response to the original sound from each network. One-tailed Wilcoxon rank-sum test, U50 = 25, U100 = 25, U200 = 17, U400 = 14, U800 = 10, U1,600 = 15, p50 = 0.006, p100 = 0.006, p200 = 0.202, p400 = 0.417, p800 = 0.735, p1,600 = 0.338, ES50 = 1, ES100 = 1, ES200 = 0.68, ES400 = 0.56, ES800 = 0.4, ES1,600 = 0.6, n = 5 independent networks. Error bars represent mean ± SD. c Performance of the network after the ablation of specific units. One-tailed Wilcoxon signed-rank test, MSI top 12.5% vs. baseline: U = 15, p = 0.031, ES = 1; vs. MSI bot. 12.5%: U = 15, p = 0.031, ES = 1; vs. MSI mid. 12.5%: U = 15, p = 0.031, ES = 1; vs. L1 norm top 12.5%: U = 15, p = 0.031, ES = 1. The asterisks indicate statistical significance (p < 0.05). n = 5 independent networks. Error bars represent mean ± SD. Source data are provided as a Source Data file.
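Panels (a) and (c) rest on two standard manipulations: label randomization, which forces memorization because shuffled labels carry no generalizable structure, and unit ablation, which zeroes the activations of a chosen unit set. The PyTorch-style fragments below are illustrative sketches; the class count, unit indices, and module names are hypothetical, not from the paper.

```python
# Illustrative fragments for label randomization and unit ablation;
# all sizes and names are hypothetical assumptions.
import torch

# (a) Label randomization: permute labels independently of the inputs so the
# network can only fit the training set by memorization.
n_classes = 500                                    # hypothetical category count
labels = torch.randint(0, n_classes, (1000,))
shuffled = labels[torch.randperm(labels.size(0))]  # destroys input-label pairing

# (c) Unit ablation at inference time: a forward hook that zeroes the
# activations of the selected feature channels (e.g., top 12.5% MSI units).
def ablate(unit_idx: torch.Tensor):
    def hook(module, inputs, output):
        output[:, unit_idx] = 0.0  # silence the chosen channels
        return output
    return hook

# Usage on a hypothetical layer:
# handle = net.feature_layer.register_forward_hook(ablate(top_units))
```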


References

    1. Mehr SA, et al. Universality and diversity in human song. Science. 2019;366:eaax0868. doi: 10.1126/science.aax0868. - DOI - PMC - PubMed
    2. Savage PE, Brown S, Sakai E, Currie TE. Statistical universals reveal the structures and functions of human music. Proc. Natl Acad. Sci. USA. 2015;112:8987–8992. doi: 10.1073/pnas.1414495112. - DOI - PMC - PubMed
    3. Zatorre RJ, Salimpoor VN. From perception to pleasure: music and its neural substrates. Proc. Natl Acad. Sci. USA. 2013;110:10430–10437. doi: 10.1073/pnas.1301228110. - DOI - PMC - PubMed
    4. Zatorre RJ, Chen JL, Penhune VB. When the brain plays music: auditory-motor interactions in music perception and production. Nat. Rev. Neurosci. 2007;8:547–558. doi: 10.1038/nrn2152. - DOI - PubMed
    5. Koelsch S. Toward a neural basis of music perception - a review and updated model. Front. Psychol. 2011;2:1–20. doi: 10.3389/fpsyg.2011.00110. - DOI - PMC - PubMed
