A large-scaled corpus for assessing text readability

Scott Crossley¹, Aron Heintz², Joon Suh Choi³, Jordan Batchelor³, Mehrnoush Karimi³, Agnes Malatinszky²

Affiliations

¹ Georgia State University, Atlanta, GA, USA. scrossley@gsu.edu.
² CommonLit, Baltimore, MD, USA.
³ Georgia State University, Atlanta, GA, USA.

PMID: 35297016
PMCID: PMC10027808
DOI: 10.3758/s13428-022-01802-x

Behav Res Methods. 2023 Feb;55(2):491-507. doi: 10.3758/s13428-022-01802-x. Epub 2022 Mar 16.

Abstract

This paper introduces the CommonLit Ease of Readability (CLEAR) corpus, which provides unique readability scores for ~5000 text excerpts along with information about each excerpt's year of publication, genre, and other metadata. The CLEAR corpus will provide researchers interested in discourse processing and reading with a resource from which to develop and test readability metrics and to model text readability. The CLEAR corpus improves on previous readability corpora in several respects: its size; the breadth of the excerpts available, which cover over 250 years of writing in two different genres; and the unique readability criterion provided for each text, which is based on teachers' ratings of text difficulty for student readers. This paper discusses the development of the corpus and presents reliability metrics for the human ratings of readability.

Keywords: Corpus linguistics; Natural language processing; Readability; Readability formulas.

Figures

**Fig. 1**
Rater interface for pairwise readability comparisons

**Fig. 2**
Correlation plot between Bradley-Terry scores (full and split half) and readability formulas

**Fig. 3**
Boxplot for Bradley-Terry scores for excerpts based on genre

**Fig. 4**
Scatterplot for Bradley-Terry scores by year of excerpt publication
