PLoS One. 2010 Jun 2;5(6):e10729.
doi: 10.1371/journal.pone.0010729.

SUBTLEX-CH: Chinese word and character frequencies based on film subtitles

Qing Cai et al. PLoS One. 2010.

Abstract

Background: Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, only a few sources of word frequency measures are available to researchers, and their quality falls short of what researchers working on other languages are used to.

Methodology: Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

Conclusions: Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
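The measures named in the abstract (word frequency and contextual diversity) can be illustrated with a minimal sketch. It assumes the subtitles are already segmented into words and grouped per film; the function name `frequency_measures`, the output field names, and the toy data are hypothetical and do not reflect the actual layout of the SUBTLEX-CH files or the authors' processing pipeline.

```python
import math
from collections import Counter


def frequency_measures(documents):
    """Compute simple frequency measures from pre-segmented documents.

    documents: list of token lists, one list per film or episode.
    Field names in the result are illustrative, not the SUBTLEX-CH columns.
    """
    counts = Counter()      # total occurrences of each word
    doc_counts = Counter()  # number of documents containing each word
    for tokens in documents:
        counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per document

    total_tokens = sum(counts.values())
    n_docs = len(documents)

    measures = {}
    for word, count in counts.items():
        measures[word] = {
            "count": count,
            # frequency normalised to occurrences per million tokens
            "per_million": count * 1_000_000 / total_tokens,
            # log10 of the raw count (an assumed transformation)
            "log10_count": math.log10(count),
            # contextual diversity: documents in which the word appears
            "cd": doc_counts[word],
            "cd_percent": 100 * doc_counts[word] / n_docs,
        }
    return measures


# Toy corpus: three "films", each a list of already-segmented words.
toy = [
    ["我", "是", "学生"],
    ["我", "爱", "电影"],
    ["电影", "是", "电影"],
]
m = frequency_measures(toy)
print(m["电影"]["count"], m["电影"]["cd"])
```

In this toy corpus the word 电影 occurs three times but in only two of the three documents, which is the distinction between raw frequency and contextual diversity.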

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Layout of the SUBTLEX-CH-CHR file.
Figure 2. Layout of the SUBTLEX-CH-WF file.
Figure 3. Layout of the SUBTLEX-CH-WF_PoS file.
Figure 4. The light gray points in the background represent the 28,336 two-character words included in both SUBTLEX-CH and LCMC, together with their log10 frequencies; the black diamonds represent the 400 words selected for the lexical decision validation study.
