Sci Adv. 2023 Mar 31;9(13):eadf3197.
doi: 10.1126/sciadv.adf3197. Epub 2023 Mar 31.

The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation

Andrew Reece et al. Sci Adv. 2023.

Abstract

People spend a substantial portion of their lives engaged in conversation, and yet, our scientific understanding of conversation is still in its infancy. Here, we introduce a large, novel, and multimodal corpus of 1656 conversations recorded in spoken English. This 7+ million word, 850-hour corpus totals more than 1 terabyte of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, together with an extensive survey of speakers' postconversation reflections. By taking advantage of the considerable scope of the corpus, we explore many examples of how this large-scale public dataset may catalyze future research, particularly across disciplinary boundaries, as scholars from a variety of fields appear increasingly interested in the study of conversation.

Figures

Fig. 1. A framework for studying conversation.
The results are organized according to an analytic framework that distinguishes between three related levels of conversation. Low-level features can be observed directly, vary over short time periods, and often relate to conversational structure (e.g., a pause at the end of a speaker’s turn). Mid-level features are generally inferred indirectly by human perceivers or algorithms that approximate human perception, vary on a medium-frequency or turn-by-turn basis, and capture linguistic or paralinguistic conversational content (e.g., a happy facial expression or vocal emotional intensity). High-level features relate to people’s subjective judgments of a conversation (e.g., postconversation-reported enjoyment or people’s evaluations of their partner). Subsequent sections present empirical results at each level of the hierarchy, as well as analyses that demonstrate the interplay across levels that we believe will represent an increasingly common and important type of research.
Fig. 2. Distribution of gaps and overlaps across speaker transitions.
Negative intervals are classified as “overlaps,” indicating the presence of simultaneous speech on the part of two conversation partners. Positive values are “gaps,” indicating a stretch of silence between turns. The results indicated that the median between-speaker turn interval was +80 ms and was distributed approximately normally. These results are similar to those previously observed in conversations across many cultures and communication modalities (33). Short gaps are consistently the most common type of turn transition in naturally occurring conversation.
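
The gap/overlap classification described in this caption can be illustrated with a minimal Python sketch. The turn timestamps, the `(speaker, start_ms, end_ms)` layout, and the function names are assumptions made for illustration; they are not taken from the CANDOR data format or the authors' code.

```python
from statistics import median


def turn_intervals(turns):
    """Return between-speaker intervals in milliseconds.

    `turns` is a list of (speaker, start_ms, end_ms) tuples sorted by
    start time (a hypothetical layout). An interval is the next turn's
    start minus the previous turn's end, computed only where the
    speaker changes.
    """
    intervals = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev[0] != nxt[0]:
            intervals.append(nxt[1] - prev[2])
    return intervals


def classify_transition(interval_ms):
    """Negative intervals are overlaps; positive intervals are gaps."""
    if interval_ms < 0:
        return "overlap"
    if interval_ms > 0:
        return "gap"
    return "no gap"


# Made-up timestamps, for illustration only.
turns = [("A", 0, 1200), ("B", 1280, 2500), ("A", 2400, 3900), ("B", 4100, 5000)]
intervals = turn_intervals(turns)
labels = [classify_transition(i) for i in intervals]
median_interval = median(intervals)
```

With these toy timestamps, the second transition is an overlap because speaker A starts 100 ms before speaker B finishes.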
Fig. 3. A depiction of turn segmentation by the Audiophile and Cliffhanger turn models.
The baseline Audiophile model treats each interjection as initiating a new turn and thus disrupts the flow of Fatima’s self-introduction (red). In contrast, our improved Cliffhanger model organizes the same information into a more intuitive format in which Fatima and Eduardo (blue) alternate pleasantries.
Fig. 4. Example transcripts from Audiophile and Backbiter turn models.
Audiophile treats each backchannel as initiating a new turn that disrupts the flow of speaker 1’s self-introduction (red). In contrast, our Backbiter turn model organizes the same information and presents it in a more intuitive format in which speaker 1 offers a single introductory turn (while speaker 2 is backchanneling). Speaker 1 then concludes their turn and yields the floor to speaker 2 (blue), at which point speaker 2 takes their first turn and also provides a self-introduction (while speaker 1 occupies the backchannel).
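
The idea behind this caption, that a listener's backchannel should not start a new turn, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' Backbiter implementation: the `BACKCHANNELS` word list, the `(speaker, text)` input layout, and the helper names are all hypothetical.

```python
# Hypothetical backchannel vocabulary; the paper's actual list is not given here.
BACKCHANNELS = {"uh huh", "mhm", "yeah", "wow", "right"}


def is_backchannel(text):
    """True if the whole utterance is one of the assumed backchannel phrases."""
    return text.lower().strip(" .,!?") in BACKCHANNELS


def merge_turns(utterances):
    """Group (speaker, text) utterances into turns.

    A backchannel from the listener is attached to the current speaker's
    turn instead of opening a new one; consecutive utterances from the
    same speaker are joined into a single turn.
    """
    turns = []
    for speaker, text in utterances:
        if turns and is_backchannel(text) and speaker != turns[-1]["speaker"]:
            turns[-1]["backchannels"].append(text)
        elif turns and speaker == turns[-1]["speaker"]:
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "text": text, "backchannels": []})
    return turns


# Made-up exchange, for illustration only.
utterances = [
    ("S1", "Hi, I'm from Ohio."),
    ("S2", "Uh huh."),
    ("S1", "I teach high school math."),
    ("S2", "Wow."),
    ("S2", "Nice, I'm a nurse in Texas."),
    ("S1", "Yeah."),
]
turns = merge_turns(utterances)
```

Here the six raw utterances collapse into two turns: speaker 1's introduction with speaker 2's backchannels attached, followed by speaker 2's introduction with speaker 1's backchannel attached.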
Fig. 5. The frequency of backchannel words across the corpus.
Backchannel words are a foundational element of conversation that occur at an approximate rate of 1000 per hour of speech; listeners deployed them in nearly two-thirds of speaker turns that were five words or longer. “Generic” continuers, such as “uh huh,” may function to signal to speakers that they should keep talking. In contrast, “specific” backchannel words, such as “wow,” may convey context-specific responses such as mirroring a speaker’s emotion while a story is told. This distribution of backchannels reflects English spoken in the United States in 2020.
Fig. 6. Positive affect is significantly greater after than before a conversation.
Each row of density plots corresponds to an age group. Respondents were asked to report their mood immediately before (red) and after (blue) their conversation. Conversation’s effect on people’s mood was positive, significant, and of considerable magnitude.
Fig. 7. Turn exchange is related nonlinearly to partner enjoyment.
The x axis indicates an individual’s mean interval between the end of their conversation partner’s turn and the beginning of their turn. Positive intervals indicate gaps between turns, and negative intervals indicate overlaps in speech near turn boundaries. We found that longer positive intervals (gaps) between turns were negatively associated with partner enjoyment, but we found no such relation between enjoyment and negative intervals (overlaps). This underscores the importance of connecting lower-level features of conversation with higher-level features, and the interdisciplinary understanding required to do so.
Fig. 8. Behavior patterns of good and bad conversationalists.
(A to F) The behavioral patterns of good conversationalists (top 25% of partner-rated conversationalist score; depicted in blue) and bad conversationalists (bottom 25%; depicted in red) are depicted. Horizontal axes denote turn-level feature deciles. The y axis indicates the mean proportion of turns in a category for a good or bad conversationalist. Error bars represent 95% confidence intervals. Top, middle, and bottom rows correspond to text, audio, and visual modalities, respectively; left and right columns include features that can be observed directly and those that require an additional layer of machine learning to estimate.
Fig. 9. Topic flow within the CANDOR corpus.
The topics people chose to talk about, as measured in CANDOR transcripts by a simple keyword dictionary, reflect the ebb and flow of societal issues in an unusually tumultuous year. COVID-19 (red) surged from unknown to the talk of the nation by mid-2020, matching or even exceeding family-related discussion (blue), a reliable staple of conversation. CANDOR frequencies for the presidential election (purple) and policing (green) highlight the trajectories of these nationally debated issues.
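
A simple keyword dictionary of the kind mentioned in this caption might look like the Python sketch below. The `TOPIC_KEYWORDS` entries, the example transcripts, and the function names are assumptions chosen for illustration; the authors' actual dictionary is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical keyword dictionary; not the one used in the paper.
TOPIC_KEYWORDS = {
    "covid": {"covid", "coronavirus", "pandemic", "quarantine"},
    "family": {"mom", "dad", "kids", "sister", "brother", "family"},
    "election": {"election", "vote", "ballot"},
    "policing": {"police", "policing", "protest"},
}


def topic_counts(transcript):
    """Count keyword hits per topic in one transcript."""
    tokens = re.findall(r"[a-z0-9']+", transcript.lower())
    counts = Counter()
    for token in tokens:
        for topic, keywords in TOPIC_KEYWORDS.items():
            if token in keywords:
                counts[topic] += 1
    return counts


def topic_share(transcripts):
    """Fraction of transcripts that mention each topic at least once."""
    mentions = Counter()
    for transcript in transcripts:
        for topic in topic_counts(transcript):
            mentions[topic] += 1
    return {topic: mentions[topic] / len(transcripts) for topic in TOPIC_KEYWORDS}


# Made-up transcripts, for illustration only.
transcripts = [
    "My kids are home because of the pandemic.",
    "Did you vote in the election?",
    "My sister got covid during quarantine.",
]
share = topic_share(transcripts)
```

Computing `topic_share` separately for transcripts grouped by recording date would give a per-period topic frequency, which is one plausible way to trace trajectories like those described in the caption.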

References

    1. H. H. Clark, Arenas of Language Use (University of Chicago Press, 1992).
    2. N. J. Enfield, How We Talk: The Inner Workings of Conversation (Basic Books, 2017).
    3. M. J. Pickering, S. Garrod, Understanding Dialogue: Language Use and Social Interaction (Cambridge Univ. Press, 2021).
    4. H. Sacks, E. A. Schegloff, G. Jefferson, A simplest systematics for the organization of turn-taking for conversation, in Studies in the Organization of Conversational Interaction, J. Schenkein, Ed. (Academic Press, 1978), pp. 7–55.
    5. M. Tomasello, Constructing a Language: A Usage-based Theory of Language Acquisition (Harvard Univ. Press, 2003).