Behav Sci (Basel). 2025 Apr 17;15(4):540.
doi: 10.3390/bs15040540.

Complementing but Not Replacing: Comparing the Impacts of GPT-4 and Native-Speaker Interaction on Chinese L2 Writing Outcomes

Zhaoyang Shan et al. Behav Sci (Basel). 2025.

Abstract

This study compared the efficacy of a large language model (LLM), GPT-4, with that of a human language partner in supporting second language (L2) writing during the pre-writing phase. A within-subject behavioral experiment was conducted with 23 Chinese L2 learners, who were exposed to three conditions: "without interaction", "interaction with GPT-4", and "interaction with a language partner". After each condition, they completed an L2 writing task. Interaction with the language partner yielded significantly higher overall writing, organization, and language scores than both interaction with GPT-4 and no interaction. Additionally, both types of interaction enhanced the participants' topic familiarity and writing confidence and reduced the task's perceived difficulty compared with the case without interaction. Interestingly, in the "interaction with GPT-4" condition, topic familiarity was positively correlated with writing outcomes, whereas in the "interaction with a language partner" condition, content scores were positively correlated with perceived difficulty but negatively correlated with writing confidence. This study suggests that LLMs should be used to complement and not replace human language partners in the L2 pre-writing phase.

Keywords: GPT-4; human language partner; large language models; second language writing.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1
The experimental procedure. Each participant was invited to participate in a three-session writing experiment. Consecutive sessions were separated by an interval of at least 10 days. To avoid carryover effects from the interactions, all participants first completed the session involving writing without interaction; subsequently, the order of the two interaction conditions (i.e., interaction with GPT-4 and interaction with the language partner) was counterbalanced across the participants. The writing topics were also counterbalanced across the three writing sessions.
Figure 2
Writing score comparison results. p-values for the Friedman test with effect sizes (W denotes Kendall’s W) and p-values for the Wilcoxon signed-rank test are provided. On the horizontal axis, W—without interaction; G—interaction with GPT-4; and P—interaction with language partner.
Figure 3
Rating score comparison results. p-values for the Friedman test with effect sizes (W denotes Kendall’s W) and p-values for the Wilcoxon signed-rank test are provided.
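The captions for Figures 2 and 3 report Kendall's W as the effect size for the Friedman test. The following is a minimal sketch of how W relates to the Friedman statistic, not the authors' analysis code: the function names and the toy scores are hypothetical, the textbook formula without tie correction is assumed, and no p-values are computed.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..k; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_chi2(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square (no tie correction).

    scores[p][c] is the score of participant p under condition c.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for c, r in enumerate(average_ranks(row)):
            rank_sums[c] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def kendalls_w(scores: Sequence[Sequence[float]]) -> float:
    """Kendall's W = chi2 / (n * (k - 1)); ranges from 0 to 1 without ties."""
    n, k = len(scores), len(scores[0])
    return friedman_chi2(scores) / (n * (k - 1))


# Hypothetical toy data (NOT from the study): rows are participants,
# columns are the conditions W, G, P in that order.
toy_scores = [
    [10, 12, 15],
    [11, 13, 16],
    [9, 10, 14],
    [12, 14, 15],
]
print(kendalls_w(toy_scores))
```

In the toy data every participant ranks the three conditions in the same order, so W sits at its maximum; a W near 0 would mean the conditions are ranked inconsistently across participants.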
Figure 4
Spearman’s correlation test results. Spearman’s R values, as well as p-values, are provided. Post_W: rating score change, i.e., the post-interaction rating score minus the rating score under the “without interaction” condition; G_W: writing score change, i.e., the writing score under interaction with GPT-4 (G) minus the writing score under the “without interaction” condition (W); and P_W: writing score change, i.e., the writing score under interaction with a language partner (P) minus the writing score under the “without interaction” condition (W). (A) Heat map of the correlation between interaction with GPT-4 ratings (familiarity, confidence, and difficulty) and writing scores (total and sub-dimensions). (B) Heat map of the correlation between interaction with language partner ratings (familiarity, confidence, and difficulty) and writing scores (total and sub-dimensions). (C) Significant correlation between Gpost_W familiarity and writing scores (total and sub-dimensions). (D) Significant correlation between Ppost_W difficulty/confidence and content score.
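The change scores and rank correlation described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names and the example numbers are hypothetical, Spearman's coefficient is computed as the Pearson correlation of average ranks, and p-values are not computed.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def change_scores(condition: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Per-participant change: score under a condition minus the baseline score."""
    return [c - b for c, b in zip(condition, baseline)]


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical numbers (NOT from the study), one entry per participant.
familiarity_without = [2, 3, 3, 4]   # rating under "without interaction"
familiarity_gpt4 = [4, 4, 5, 7]      # rating after interaction with GPT-4
writing_without = [60, 62, 58, 70]   # writing score under "without interaction"
writing_gpt4 = [63, 64, 62, 78]      # writing score under interaction with GPT-4

gpost_w = change_scores(familiarity_gpt4, familiarity_without)  # rating change
g_w = change_scores(writing_gpt4, writing_without)              # writing change
print(spearman_rho(gpost_w, g_w))
```

Correlating change scores rather than raw scores uses each participant's "without interaction" session as their own baseline, which matches the within-subject design described in Figure 1.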
