Proc Natl Acad Sci U S A. 2020 Apr 7;117(14):7684-7689. doi: 10.1073/pnas.1915768117. Epub 2020 Mar 23.

Racial disparities in automated speech recognition


Allison Koenecke et al. Proc Natl Acad Sci U S A. 2020.

Abstract

Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the population. Here, we examine the ability of five state-of-the-art ASR systems (developed by Amazon, Apple, Google, IBM, and Microsoft) to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. In total, this corpus spans five US cities and consists of 19.8 h of audio matched on the age and gender of the speaker. We found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers. We trace these disparities to the underlying acoustic models used by the ASR systems, as the race gap was equally large on a subset of identical phrases spoken by black and white individuals in our corpus. We conclude by proposing strategies, such as using more diverse training datasets that include African American Vernacular English, to reduce these performance differences and ensure speech recognition technology is inclusive.
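For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal Python sketch of that convention follows; this is an illustration, not the authors' evaluation code, and the lowercase/whitespace normalization is an assumption.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()   # normalization here is an assumption
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("he is going to the store", "he going to a store"))  # 2 errors / 6 words ≈ 0.33
```

On this definition, an average WER of 0.35 corresponds to roughly one error for every three words spoken.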

Keywords: fair machine learning; natural language processing; speech-to-text.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
The average WER across ASR services is 0.35 for audio snippets of black speakers, as opposed to 0.19 for snippets of white speakers. The maximum SE among the 10 WER values displayed (across black and white speakers and across ASR services) is 0.005. For each ASR service, the average WER is calculated across a matched sample of 2,141 black and 2,141 white audio snippets, totaling 19.8 h of interviewee audio. Snippets of black and white speakers were matched via nearest neighbors on speaker age, gender, and snippet duration.
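As intuition for the matching step, a greedy nearest-neighbor matcher over numeric features might look like the sketch below. This is an illustration only: the feature encoding, distance metric, and matching-without-replacement rule are assumptions, not the paper's documented procedure.

```python
import numpy as np

def greedy_match(black_feats, white_feats):
    """Pair each black-speaker snippet with its nearest still-unused
    white-speaker snippet by Euclidean distance in feature space
    (e.g., standardized age, gender code, and snippet duration)."""
    black_feats = np.asarray(black_feats, dtype=float)
    white_feats = np.asarray(white_feats, dtype=float)
    used = np.zeros(len(white_feats), dtype=bool)
    pairs = []
    for i, b in enumerate(black_feats):
        dists = np.linalg.norm(white_feats - b, axis=1)
        dists[used] = np.inf          # matching without replacement (assumed)
        j = int(np.argmin(dists))
        used[j] = True
        pairs.append((i, j))
    return pairs
```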
Fig. 2.
The CCDF (complementary cumulative distribution function) denotes the share of audio snippets having a WER greater than the value specified along the horizontal axis. The two CCDFs, for audio snippets of white speakers (blue) and black speakers (red), use each snippet's average WER across the five ASR services tested. If we assume that a WER >0.5 implies a transcript is unusable, then 23% of audio snippets of black speakers result in unusable transcripts, whereas only 1.6% of audio snippets of white speakers do.
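The empirical CCDF itself is straightforward to compute: for each threshold, count the fraction of snippets whose WER exceeds it. A minimal sketch, assuming snippet-level WERs have already been computed:

```python
import numpy as np

def ccdf(wers, thresholds):
    """Fraction of snippets whose WER strictly exceeds each threshold."""
    wers = np.asarray(wers, dtype=float)
    return np.array([(wers > t).mean() for t in thresholds])

# Under the WER > 0.5 "unusable transcript" rule of thumb, the unusable
# share is just the CCDF evaluated at 0.5, e.g.:
#   unusable_share = ccdf(snippet_wers, [0.5])[0]
```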
Fig. 3.
For each audio snippet, we first computed the average error rate across the five ASR services we consider: Amazon, Apple, Google, IBM, and Microsoft. These average WERs were then grouped by interview location, with the distributions summarized in the boxplots above. In the three AAVE sites, denoted by a gray background (Princeville, NC; Washington, DC; and Rochester, NY), the error rates are typically higher than in the two white sites (Sacramento, CA, and Humboldt, CA), although error rates in Rochester are comparable to those in Sacramento.
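A hypothetical pandas version of this aggregation, with the input file and column names (snippet_id, site, service, wer) invented for illustration:

```python
import pandas as pd

# Hypothetical long-format table: one row per (snippet, service).
df = pd.read_csv("snippet_wers.csv")  # columns: snippet_id, site, service, wer

per_snippet = (df.groupby(["snippet_id", "site"])["wer"]
                 .mean()              # average WER over the five services
                 .reset_index())
per_snippet.boxplot(column="wer", by="site")  # one box per interview site
```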
Fig. 4.
The relationship between a measure of dialect density (DDM, on the horizontal axis) and average ASR error rate (WER, on the vertical axis) for a random sample of 50 snippets in each of the three AAVE sites we consider. The dashed vertical lines indicate the average DDM in each location. The solid black line shows a linear regression fit to the data and indicates that speakers who exhibit more linguistic features characteristic of AAVE tend to have higher WER.
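The fitted line in this figure is a simple linear regression of WER on DDM. A minimal numpy sketch of such an ordinary least-squares fit (variable names are assumptions):

```python
import numpy as np

def ols_fit(ddm, wer):
    """Least-squares fit of wer ≈ intercept + slope * ddm."""
    ddm = np.asarray(ddm, dtype=float)
    X = np.column_stack([np.ones_like(ddm), ddm])
    coef, *_ = np.linalg.lstsq(X, np.asarray(wer, dtype=float), rcond=None)
    return coef  # [intercept, slope]; Fig. 4's trend implies a positive slope
```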

