Complementing but Not Replacing: Comparing the Impacts of GPT-4 and Native-Speaker Interaction on Chinese L2 Writing Outcomes

Zhaoyang Shan¹, Zhangyuan Song², Xu Jiang², Wen Chen³, Luyao Chen²,⁴,⁵

Affiliations

¹ School of Foreign Languages and Literature, Shandong University, Jinan 250100, China.
² School of International Chinese Language Education, Beijing Normal University, Beijing 100875, China.
³ Lishui Experimental School, Beijing Normal University, Lishui 323000, China.
⁴ Institute of Educational System Science, School of Systems Science, Beijing Normal University, Beijing 100875, China.
⁵ Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, 04103 Leipzig, Germany.

PMID: 40282161
PMCID: PMC12023996
DOI: 10.3390/bs15040540

Zhaoyang Shan et al. Behav Sci (Basel). 2025 Apr 17;15(4):540. doi: 10.3390/bs15040540.

Abstract

This study explored the efficacy of large language models (LLMs), namely GPT-4, in supporting second language (L2) writing in comparison with interaction with a human language partner in the pre-writing phase. A within-subject behavioral experiment was conducted with 23 Chinese L2 learners who were exposed to three conditions: "without interaction", "interaction with GPT-4", and "interaction with a language partner". They then completed an L2 writing task. It was found that interaction with the language partner yielded significantly improved results compared with both interaction with GPT-4 and the case without interaction in terms of overall writing scores, organization, and language. Additionally, both types of interaction enhanced the participants' topic familiarity and writing confidence and reduced the task's perceived difficulty compared with the case without interaction. Interestingly, in the "interaction with GPT-4" condition, topic familiarity was positively correlated with better writing outcomes, whereas in the "interaction with a language partner" condition, perceived difficulty was positively correlated with content scores; however, content scores were negatively associated with writing confidence. This study suggests that LLMs should be used to complement and not replace human language partners in the L2 pre-writing phase.

Keywords: GPT-4; human language partner; large language models; second language writing.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Figure 1**
The experimental procedure. Each participant was invited to participate in a three-session writing experiment. Between each session, there was a minimum 10-day interval. In order to avoid carryover effects from the interactions, all participants first underwent a session involving writing without interaction; subsequently, the two interaction conditions (i.e., interaction with GPT-4 and interaction with the language partner) were counterbalanced across the participants. The writing topics were also counterbalanced across the three writing sessions.

**Figure 2**
Writing score comparison results. p-values for the Friedman test with effect sizes (W denotes Kendall’s W) and p-values for the Wilcoxon signed-rank test are provided. On the horizontal axis, W—without interaction; G—interaction with GPT-4; and P—interaction with language partner.

**Figure 3**
Rating score comparison results. p-values for the Friedman test with effect sizes (W denotes Kendall’s W) and p-values for the Wilcoxon signed-rank test are provided.
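The Friedman test and Kendall's W named in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. The scores below are hypothetical and are not the study's data, the function names are illustrative, and the statistic is computed without a tie correction; the authors' actual analysis software is not stated on this page.

```python
def average_ranks(values):
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_kendall_w(scores):
    """scores: one row per participant, one column per condition.

    Returns (chi_square, kendalls_w), with no tie correction applied.
    """
    n = len(scores)       # participants
    k = len(scores[0])    # conditions
    rank_sums = [0.0] * k
    for row in scores:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    chi_square = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    kendalls_w = chi_square / (n * (k - 1))
    return chi_square, kendalls_w


# Hypothetical writing scores for 4 participants under W, G, and P.
scores = [
    [60, 65, 72],
    [55, 58, 70],
    [62, 61, 75],
    [58, 63, 69],
]
chi_square, kendalls_w = friedman_kendall_w(scores)
print(chi_square, kendalls_w)
```

Kendall's W rescales the Friedman chi-square to the range 0 to 1, so it can be read as the degree of agreement among participants about the ordering of the three conditions.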

**Figure 4**
Spearman’s correlation test results. Spearman’s R values, as well as p-values, are provided. Post_W: rating score change, i.e., post-interaction rating score − rating score under the “without interaction” condition; G_W: writing score change, i.e., writing score under interaction with GPT-4 − writing score under the “without interaction” condition; and P_W: writing score change, i.e., writing score under interaction with a language partner − writing score under the “without interaction” condition. (A) Heat map of the correlation between interaction with GPT-4 ratings (familiarity, confidence, and difficulty) and writing scores (total & sub-dimensions). (B) Heat map of the correlation between interaction with language partner ratings (familiarity, confidence, and difficulty) and writing scores (total & sub-dimensions). (C) Significant correlation between Gpost_W familiarity and writing scores (total & sub-dimensions). (D) Significant correlation between Ppost_W difficulty/confidence and content score.
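The change scores defined in the Figure 4 caption, and the rank correlation computed on them, can be sketched as follows. All values are hypothetical (including the rating scale), the variable names simply mirror the caption's G_W and Gpost_W labels, and Spearman's coefficient is computed here as the Pearson correlation of average ranks; this is an illustration, not the authors' analysis code.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data for 5 participants.
writing_w = [60, 55, 62, 58, 64]      # writing score, without interaction
writing_g = [65, 58, 61, 62, 70]      # writing score, interaction with GPT-4
familiarity_w = [2, 3, 3, 2, 1]       # familiarity rating, without interaction
familiarity_g = [4, 4, 3, 5, 6]       # familiarity rating after GPT-4 interaction

# Change scores relative to the "without interaction" baseline.
g_w = [g - w for g, w in zip(writing_g, writing_w)]
gpost_w = [g - w for g, w in zip(familiarity_g, familiarity_w)]

rho = spearman_rho(gpost_w, g_w)
print(g_w, gpost_w, rho)
```

Correlating change scores rather than raw scores removes each participant's baseline level, so the coefficient reflects whether larger gains in a rating accompany larger gains in writing score.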

Grants and funding

2019YFA0709503/the National Key R&D Program of China
