Proc Natl Acad Sci U S A. 2020 Apr 7;117(14):7684-7689. doi: 10.1073/pnas.1915768117. Epub 2020 Mar 23.

Racial disparities in automated speech recognition


Allison Koenecke et al. Proc Natl Acad Sci U S A. 2020.

Abstract

Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the population. Here, we examine the ability of five state-of-the-art ASR systems (developed by Amazon, Apple, Google, IBM, and Microsoft) to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. In total, this corpus spans five US cities and consists of 19.8 h of audio matched on the age and gender of the speaker. We found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers. We trace these disparities to the underlying acoustic models used by the ASR systems, as the race gap was equally large on a subset of identical phrases spoken by black and white individuals in our corpus. We conclude by proposing strategies, such as using more diverse training datasets that include African American Vernacular English, to reduce these performance differences and ensure speech recognition technology is inclusive.
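For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal Python sketch of that convention follows; this is an illustration, not the authors' evaluation code, and the lowercase/whitespace normalization is an assumption.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()   # normalization here is an assumption
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("he is going to the store", "he going to a store"))  # 2 errors / 6 words ≈ 0.33
```

On this definition, an average WER of 0.35 corresponds to roughly one error for every three words spoken.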

Keywords: fair machine learning; natural language processing; speech-to-text.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
The average WER across ASR services is 0.35 for audio snippets of black speakers, as opposed to 0.19 for snippets of white speakers. The maximum SE among the 10 WER values displayed (across black and white speakers and across ASR services) is 0.005. For each ASR service, the average WER is calculated across a matched sample of 2,141 black and 2,141 white audio snippets, totaling 19.8 h of interviewee audio. Snippets of black and white speakers were matched via nearest neighbors on speaker age, gender, and snippet duration.
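As intuition for the matching step, a greedy nearest-neighbor matcher over numeric features might look like the sketch below. This is an illustration only: the feature encoding, distance metric, and matching-without-replacement rule are assumptions, not the paper's documented procedure.

```python
import numpy as np

def greedy_match(black_feats, white_feats):
    """Pair each black-speaker snippet with its nearest still-unused
    white-speaker snippet by Euclidean distance in feature space
    (e.g., standardized age, gender code, and snippet duration)."""
    black_feats = np.asarray(black_feats, dtype=float)
    white_feats = np.asarray(white_feats, dtype=float)
    used = np.zeros(len(white_feats), dtype=bool)
    pairs = []
    for i, b in enumerate(black_feats):
        dists = np.linalg.norm(white_feats - b, axis=1)
        dists[used] = np.inf          # matching without replacement (assumed)
        j = int(np.argmin(dists))
        used[j] = True
        pairs.append((i, j))
    return pairs
```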
Fig. 2.
The CCDF (complementary cumulative distribution function) denotes the share of audio snippets having a WER greater than the value specified along the horizontal axis. The two CCDFs, for audio snippets of white speakers (blue) and black speakers (red), use each snippet's average WER across the five ASR services tested. If we assume that a WER >0.5 implies a transcript is unusable, then 23% of audio snippets of black speakers result in unusable transcripts, whereas only 1.6% of audio snippets of white speakers do.
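The empirical CCDF itself is straightforward to compute: for each threshold, count the fraction of snippets whose WER exceeds it. A minimal sketch, assuming snippet-level WERs have already been computed:

```python
import numpy as np

def ccdf(wers, thresholds):
    """Fraction of snippets whose WER strictly exceeds each threshold."""
    wers = np.asarray(wers, dtype=float)
    return np.array([(wers > t).mean() for t in thresholds])

# Under the WER > 0.5 "unusable transcript" rule of thumb, the unusable
# share is just the CCDF evaluated at 0.5, e.g.:
#   unusable_share = ccdf(snippet_wers, [0.5])[0]
```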
Fig. 3.
For each audio snippet, we first computed the average error rate across the five ASR services we consider: Amazon, Apple, Google, IBM, and Microsoft. These average WERs were then grouped by interview location, with the distributions summarized in the boxplots above. In the three AAVE sites, denoted by a gray background (Princeville, NC; Washington, DC; and Rochester, NY), the error rates are typically higher than in the two white sites (Sacramento, CA, and Humboldt, CA), although error rates in Rochester are comparable to those in Sacramento.
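A hypothetical pandas version of this aggregation, with the input file and column names (snippet_id, site, service, wer) invented for illustration:

```python
import pandas as pd

# Hypothetical long-format table: one row per (snippet, service).
df = pd.read_csv("snippet_wers.csv")  # columns: snippet_id, site, service, wer

per_snippet = (df.groupby(["snippet_id", "site"])["wer"]
                 .mean()              # average WER over the five services
                 .reset_index())
per_snippet.boxplot(column="wer", by="site")  # one box per interview site
```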
Fig. 4.
The relationship between a measure of dialect density (DDM, on the horizontal axis) and average ASR error rate (WER, on the vertical axis) for a random sample of 50 snippets in each of the three AAVE sites we consider. The dashed vertical lines indicate the average DDM in each location. The solid black line shows a linear regression fit to the data and indicates that speakers who exhibit more linguistic features characteristic of AAVE tend to have higher WER.
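The fitted line in this figure is a simple linear regression of WER on DDM. A minimal numpy sketch of such an ordinary least-squares fit (variable names are assumptions):

```python
import numpy as np

def ols_fit(ddm, wer):
    """Least-squares fit of wer ≈ intercept + slope * ddm."""
    ddm = np.asarray(ddm, dtype=float)
    X = np.column_stack([np.ones_like(ddm), ddm])
    coef, *_ = np.linalg.lstsq(X, np.asarray(wer, dtype=float), rcond=None)
    return coef  # [intercept, slope]; Fig. 4's trend implies a positive slope
```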

