Nat Commun. 2020 Dec 14;11(1):6386. doi: 10.1038/s41467-020-19712-x.

Detection of eye contact with deep neural networks is as accurate as human experts

Eunji Chong et al. Nat Commun.

Abstract

Eye contact is among the most fundamental means of social communication used by humans. Quantification of eye contact is valuable as part of the analysis of social roles and communication skills, and for clinical screening. Estimating a subject's looking direction is a challenging task, but eye contact can be effectively captured by a wearable point-of-view camera, which provides a unique viewpoint. While moments of eye contact from this viewpoint can be hand-coded, such a process tends to be laborious and subjective. In this work, we develop a deep neural network model to automatically detect eye contact in egocentric video. It is the first to achieve accuracy equivalent to that of human experts. We train a deep convolutional network on a dataset of 4,339,879 annotated images drawn from 103 subjects with diverse demographic backgrounds; 57 of the subjects have a diagnosis of Autism Spectrum Disorder. The network achieves an overall precision of 0.936 and recall of 0.943 on 18 validation subjects, and its performance is on par with that of 10 trained human coders, who have a mean precision of 0.918 and recall of 0.946. Our method will be instrumental in gaze behavior analysis by serving as a scalable, objective, and accessible tool for clinicians and researchers.
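The precision and recall figures reported above are standard frame-level binary-classification metrics. As a minimal illustration of how they are computed (this is not the authors' evaluation code, and all function and variable names are my own), a sketch in plain Python:

```python
def precision_recall(pred, truth):
    """Frame-level precision and recall for binary eye-contact labels.

    pred and truth are equal-length sequences of 0/1 labels,
    where 1 marks a frame labeled as eye contact.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 predicted positives (3 correct), 4 actual positives.
pred = [1, 1, 1, 1, 0, 0]
truth = [1, 1, 1, 0, 1, 0]
p, r = precision_recall(pred, truth)  # p = 0.75, r = 0.75
```

In practice these counts would be accumulated over every annotated frame of the validation sessions before computing the ratios.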

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the approach.
Wearable glasses with a small outward-facing camera embedded in the bridge are used to record the face of the camera wearer’s interactive partner. By virtue of its placement, gaze during eye contact is directed toward the camera, and is captured in video, enabling automated detection. Due to its ease of use, the approach can be widely deployed in a variety of settings, as illustrated in the figure, for which eye tracking may be infeasible due to cost, burden, compliance, or distraction issues.
Fig. 2. Precision and recall (PR) of the deep learning model and human raters. The blue line is the PR curve for the model, zoomed into the range 0.5–1.0. Human rater data are presented as mean values ± SD. Improved model PR (red diamond) is obtained by temporally smoothing the model output. The PR for each of the ten expert raters (yellow dots) is obtained by comparing that expert's ratings to the consensus ratings of the other nine experts. a PR curve on all 18 validation sessions. The model (red diamond) achieves higher precision than the average of the expert raters (green diamond) at the same recall. The model PR (red diamond) lies within one standard deviation (green error bars) of the mean rater, and both have similar F1 scores. We therefore conclude that the deep learning model performs comparably to expert human raters. b PR curves computed separately for the BOSCC (top) and ESCS (bottom) protocols. c PR curves computed separately for male (top) and female (bottom) samples. d PR curves computed separately for TD (top) and ASD (bottom) samples. In all cases, the model PR lies within one SD of the mean rater.
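The caption notes that temporally smoothing the per-frame model output improves PR. The exact smoothing method is not specified in this excerpt; a sliding median filter is one common choice for suppressing single-frame flicker in a detection signal. A minimal sketch under that assumption (the window size and all names are mine):

```python
def median_smooth(scores, window=5):
    """Temporally smooth per-frame scores with a sliding median filter.

    Each output value is the median of the scores in a window centered
    on that frame; the window is truncated at the sequence boundaries.
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        neighborhood = sorted(scores[lo:hi])
        smoothed.append(neighborhood[len(neighborhood) // 2])
    return smoothed

# A single-frame dropout inside a run of high scores is removed:
scores = [0.9, 0.9, 0.1, 0.9, 0.9]
print(median_smooth(scores, window=3))  # [0.9, 0.9, 0.9, 0.9, 0.9]
```

Smoothing of this kind trades a small amount of temporal resolution for fewer spurious on/off transitions, which is what shifts the operating point toward higher precision at a similar recall.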
Fig. 3. Pairwise Cohen's kappa distributions among all human pairs and human–algorithm pairs, represented as box plots. a 18 validation sessions, b ESCS, c BOSCC. Kappa scores above 0.8 are generally considered almost perfect agreement. Across all sessions annotated by the ten human experts, agreement among humans and agreement between each human and the algorithm are similar in terms of kappa values.
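Cohen's kappa measures agreement between two raters beyond what chance would produce. For two raters labeling the same frames with binary eye-contact labels, it can be computed as follows; this is a generic sketch of the standard formula, not the authors' code:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters over the same frames.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    given each rater's marginal label frequencies.
    """
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    p_a1 = sum(a) / n  # rater A's rate of positive labels
    p_b1 = sum(b) / n  # rater B's rate of positive labels
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.75
```

Computing this for every human–human pair and every human–algorithm pair yields the box-plot distributions compared in the figure.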
Fig. 4. Average duration of eye contact during conversation and interactive play in child and adolescent samples (n = 15), measured at time 1 and time 2. a based on human coding, b based on automated coding. Data are presented as mean values ± SEM.
Fig. 5. Deep neural network layout. Given a frame extracted from a point-of-view camera, the subject's face is automatically detected and cropped as input to the deep neural network. The network computes features from the facial image via a series of convolutions. At the end of the network, features are combined through average pooling and fully connected layers, and a softmax operation produces the final eye contact score. Using this score, the algorithm decides whether the input face represents eye contact. The authors have obtained consent to publish the sample picture from the study participants.
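The final stage described in the caption, a softmax producing an eye contact score that is then thresholded into a decision, can be sketched as follows. The two-logit layout ([no-contact, contact]), the 0.5 threshold, and all names are assumptions for illustration, not the paper's exact configuration:

```python
import math

def softmax(logits):
    """Convert the network's output logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_eye_contact(logits, threshold=0.5):
    """Decide eye contact from assumed [no-contact, contact] logits."""
    score = softmax(logits)[1]  # probability assigned to eye contact
    return score >= threshold

print(is_eye_contact([0.2, 2.1]))  # True: the contact logit dominates
```

Because the softmax outputs sum to 1, the contact probability alone is a sufficient score; raising or lowering the threshold trades recall against precision, which is how a PR curve like the one in Fig. 2 is traced out.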
