PLoS One. 2025 Feb 12;20(2):e0317183. doi: 10.1371/journal.pone.0317183. eCollection 2025.

Monkeys can identify pictures from words


Elizabeth Cabrera-Ruiz et al. PLoS One. 2025.

Abstract

Humans learn and incorporate cross-modal associations between auditory and visual objects (e.g., between a spoken word and a picture) into language. However, whether nonhuman primates can learn cross-modal associations between words and pictures remains uncertain. We trained two rhesus macaques in a delayed cross-modal match-to-sample task to determine whether they could learn associations between sounds and pictures of different types. In each trial, the monkeys listened to a brief sound (e.g., a monkey vocalization or a human word), and retained information about the sound to match it with one of 2-4 pictures presented on a touchscreen after a 3-second delay. We found that the monkeys learned and performed proficiently in over a dozen associations. In addition, to test their ability to generalize, we exposed them to sounds uttered by different individuals. We found that their hit rate remained high but more variable, suggesting that they perceived the new sounds as equivalent, though not identical. We conclude that rhesus monkeys can learn cross-modal associations between objects of different types, retain information in working memory, and generalize the learned associations to new objects. These findings position rhesus monkeys as an ideal model for future research on the brain pathways of cross-modal associations between auditory and visual objects.


Conflict of interest statement

No authors have competing interests.

Figures

Fig 1. Delayed crossmodal match-to-sample task.
(A) Task events. A trial begins with the monkey pressing a lever in response to a cross appearing at the center of the touchscreen. This is followed by a 0.5-second reference sound and then a 3-second delay. After the delay, 2–4 pictures are presented simultaneously on the touchscreen. The monkey must then release the lever and touch the picture that matches the sample sound to receive a reward. LP indicates lever press. (B) Examples of crossmodal associations (CMAs). Each column displays a CMA between a sound, represented visually by its sonogram and spectrogram, and a picture. The sounds, marked in black, include two Spanish words (in IPA notation) and the vocalizations of a monkey and a cow. (C) Hit rates (HR, closed boxplots) and false alarms (FAs, open boxplots) during presentations of the CMAs shown in B. The dashed line indicates chance-level performance (i.e., 25% for sounds discriminated against four pictures). The reference sound is labeled in red at the top of the graph. (D) Same as in C, but for Monkey M. The dashed line is set at the 50% chance level (i.e., two pictures on the screen). The pictures are similar but not identical to the original images used in the study and are therefore for illustrative purposes only.
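For readers who think in code, the trial logic in panel A reduces to a few lines. The following Python sketch is a toy simulation, not the authors' task-control software: run_trial, the guessing agent, and the stimulus names are hypothetical stand-ins, used only to reproduce the chance levels marked by the dashed lines in C and D.

    import random

    def run_trial(sample_sound, match_picture, distractors, choose):
        # One delayed crossmodal match-to-sample trial (panel A): after the
        # sample sound and the 3-second delay, 2-4 pictures appear and the
        # subject picks one; a hit is choosing the matching picture.
        pictures = [match_picture] + list(distractors)
        random.shuffle(pictures)
        picked = choose(sample_sound, pictures)  # stand-in for the monkey's decision
        return picked == match_picture           # True = hit, False = false alarm

    # A guessing agent performs at chance, 1 / len(pictures): 25% with four
    # pictures (dashed line in C) and 50% with two (dashed line in D).
    guess = lambda sound, pictures: random.choice(pictures)
    hits = sum(run_trial("coo", "monkey", ["cow", "human", "red"], guess)
               for _ in range(10_000))
    print(f"simulated chance hit rate: {hits / 10_000:.2f}")  # ~0.25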
Fig 2. Learning CMAs in monkeys.
(A) Monkey G’s learning progress for three CMAs across sessions, with trials presenting 2, 3, or 4 pictures simultaneously on the screen. The black line represents the average performance across sessions, while the blue line maps the first derivative of performance over training sessions (y′ values), illustrating the rate of change at each session. The initial HR (Y0) was near chance level (indicated by the black line at the ordinates), followed by γ (the left edge of the gray box), the session where the HR first statistically exceeded chance. The learning parameter δ marks the period when HR increased consistently above chance, culminating in a performance plateau at the session denoted by the learning asymptote λ. (B) Sessions before δ for each CMA. (C) Average performance of Monkey G across all CMAs over sessions. (D) Same as in C, but for Monkey M. The pictures are similar but not identical to the original images used in the study and are therefore for illustrative purposes only.
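A session where the HR statistically exceeds chance, as at the learning onset γ, can be flagged with a one-sided binomial test. The paper does not state which test it used, so the sketch below is an assumed analysis with invented session counts, shown only to make the γ criterion concrete.

    from scipy.stats import binomtest

    CHANCE = 0.25  # chance level with four pictures on the screen
    # (hits, trials) per training session; numbers invented for illustration
    sessions = [(26, 100), (30, 100), (37, 100), (49, 100), (63, 100)]

    for i, (hits, trials) in enumerate(sessions, start=1):
        p = binomtest(hits, trials, CHANCE, alternative="greater").pvalue
        status = "above chance" if p < 0.05 else "at chance"
        print(f"session {i}: HR = {hits / trials:.2f}, p = {p:.3f} ({status})")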
Fig 3. Crossmodal associations influenced the monkeys’ reaction times.
(A) Cumulative probabilities of reaction and motor times across four CMAs. (B) Left panel: pie charts displaying hit rates in sets presenting three CMAs. In all trials, the reference sound was consistently a "coo," but the match in each session was one of four monkey pictures. Hits are depicted in colors, while false alarms (FAs), which occurred when the monkey chose a non-matching picture, are shown in gray or white. Right panel: reaction time (RT) distributions of hits, illustrated with the same color coding as in the left panel. Inset: FA distributions produced in trials where one of the four monkey pictures was presented as the match but a picture of a ‘human’ or a ‘cow’ was selected. (C) Same format as B, but for ‘cow’ CMAs. (D) The standard deviations (STDs) of the RT distributions increased as a function of their means during hits and FAs, and in trials with two, three, or four pictures on the screen. (E) Plot of the monkeys’ HRs as a function of the mean RTs of the hit distributions in D.
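Panel D describes a scalar relation in which RT variability grows with the mean. A minimal way to check such a relation, sketched here with made-up numbers rather than the paper's data, is a straight-line fit to the (mean, STD) pairs:

    import numpy as np

    # Invented (mean, STD) pairs of RT distributions, in seconds, standing in
    # for the hit/FA and two-to-four-picture conditions of panel D.
    mean_rt = np.array([0.45, 0.55, 0.68, 0.80])
    std_rt = np.array([0.09, 0.12, 0.15, 0.18])

    slope, intercept = np.polyfit(mean_rt, std_rt, 1)  # least-squares line
    print(f"STD = {slope:.2f} * mean + {intercept:+.2f}")
    # A positive slope with a near-zero intercept implies a roughly constant
    # coefficient of variation, consistent with the increase described in D.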
Fig 4. Monkeys recognized sounds uttered by different individuals.
(A) Spectrograms from various speakers of the Spanish word [ˈro.xo] (red). The spectrogram of the learned sound is on the left. (B) Hit rate of monkey G for all versions of the sounds. Closed boxes on the left represent the HR for the learned sounds (L). Open boxes, the HR for the different versions. Closed boxes on the right of each group correspond to the HR for versions composed of double repetitions of some sounds, including L. (C) Cumulative density functions of the RTs for the learned sounds (bold lines) of monkey G and their versions. Notice how the distributions group by picture category rather than by sound. (D) Same as in B, but for monkey M. (E) Same as C, but for monkey M. The pictures are similar but not identical to the original images used in the study and are therefore for illustrative purposes only.

References

    1. Bowerman M, Choi S. Shaping meanings for language: universal and language–specific in the acquisition of spatial semantic categories. In: Bowerman M, Levinson S, editors. Language acquisition and conceptual development. Cambridge, UK; 2001. p. 475–511.
    1. Beauchamp MS, Lee KE, Argall BD, Martin A. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron. 2004;41(5):809–23. doi: 10.1016/s0896-6273(04)00070-4 - DOI - PubMed
    1. Noesselt T, Rieger JW, Schoenfeld MA, Kanowski M, Hinrichs H, Heinze HJ, et al.. Audiovisual temporal correspondence modulates human multisensory superior temporal sulcus plus primary sensory cortices. Journal of Neuroscience. 2007. Oct 17;27(42):11431–41. doi: 10.1523/JNEUROSCI.2252-07.2007 - DOI - PMC - PubMed
    1. Mesulam MM, Wieneke C, Hurley R, Rademaker A, Thompson CK, Weintraub S, Rogalski EJ. Words and objects at the tip of the left temporal lobe in primary progressive aphasia. Brain. 2013. Feb;136(Pt 2):601–18. doi: 10.1093/brain/aws336 Epub 2013 Jan 29. ; PMCID: PMC3572925. - DOI - PMC - PubMed
    1. Vihman M, Croft W. Phonological development: Toward a “radical” templatic phonology. Linguistics. 2007. Jul 20;45(4):683–725.

LinkOut - more resources