Sci Rep. 2025 Oct 13;15(1):35622.
doi: 10.1038/s41598-025-19488-4.

Research literacy and its predictors among university students and graduates identified by machine learning and spatial analysis

Mohammed A Mamun et al. Sci Rep.

Abstract

The landscape of academic publishing has evolved dramatically, leading to a surge in publications and journals. The 'publish or perish' culture has encouraged undesirable practices, with many researchers publishing in predatory journals because of institutional pressure and lack of awareness. While numerous studies have investigated knowledge of predatory journals, overall research literacy remains underexplored. This study is the first to assess research literacy comprehensively, incorporating GIS and machine learning techniques alongside traditional statistical analyses. It used a cross-sectional survey with a questionnaire covering socio-demographics, academic information, research training and experience, and research literacy. Traditional statistical analyses were performed in SPSS. Supervised classification algorithms were used for the machine learning models, which were developed in Python on Google Colab, and maps were produced with the 'bangladesh' package for R. The findings revealed that over half of the participants had poor research literacy. Significant predictors of higher research literacy included satisfaction with university research courses, research courses taken outside university, and research-related professional engagement. Machine learning analysis identified taking research courses outside university as the most impactful factor for research literacy, while having researchers among family members had minimal influence. The Random Forest and CatBoost models performed strongly in predicting literacy, achieving accuracy rates of 73.04% and 71.57%, respectively, precision values of 73.29% and 71.69%, respectively, and log loss values of 0.57 and 0.56. GIS-based spatial analyses showed regional variation in research literacy, with certain divisions exhibiting a higher prevalence of low literacy, although the differences across divisions were not statistically significant at the 0.05 level (χ² = 9.234, p = 0.236).
This study highlights that a substantial portion of the participants lack research literacy, which is associated with multiple factors. The findings suggest the need for intervention programs to enhance research practices and awareness among students and professionals, fostering a culture of academic excellence.
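The abstract reports χ² = 9.234 with p = 0.236 for the regional comparison but does not state the degrees of freedom. The sketch below shows how such a p-value follows from the statistic. It assumes df = 7, which is what a test of eight divisions against two literacy levels would give, (8 − 1) × (2 − 1); that value is an inference from the Fig. 3 caption, not something the source states. The function name `chi2_sf` is illustrative, and the implementation uses only the standard library rather than the authors' SPSS workflow.

```python
import math

def chi2_sf(stat, df, max_terms=10_000):
    """Upper-tail probability P(X >= stat) for a chi-square variable.

    Uses the power series for the regularized lower incomplete gamma
    function P(a, x) with a = df / 2 and x = stat / 2, then returns 1 - P.
    """
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    term = 1.0 / a          # first series term: 1 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)  # next term: x^n / (a (a+1) ... (a+n))
        total += term
        if term < 1e-16 * total:
            break
    # Prefactor x^a * exp(-x) / Gamma(a), computed in log space.
    log_prefactor = a * math.log(x) - x - math.lgamma(a)
    return 1.0 - total * math.exp(log_prefactor)

# Reported statistic from the abstract; df = 7 is an ASSUMPTION (see lead-in).
p_value = chi2_sf(9.234, df=7)
print(f"p = {p_value:.3f}")
```

Under that assumption the computed p-value rounds to the reported 0.236, which is above the conventional 0.05 threshold.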

Keywords: Plagiarism; Predatory journals; Research evaluation; Research literacy; Research training; Thesis students.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests. Ethics approval and consent to participate: Before implementation, the study received approval from the Institutional Review Board at Patuakhali Science and Technology University, Bangladesh [Reference Number: PSTU/IEC/2023/81]. The study also adhered to the principles of the revised Helsinki Declaration of 2013, ensuring that ethical standards for human subject participation were met. Participants were informed about the purpose and objectives of the study through a brief description at the beginning of the questionnaire, where their written informed consent was sought. They were assured of their right to withdraw from the study at any time without any obligation. No monetary or non-monetary remuneration or benefits were offered for participation. Informed consent: Written informed consent was sought from participants, as described above.

Figures

Fig. 1
Relative impact of individual features on research literacy predictions, ranked by SHAP values from the XGBoost model. SHAP (Shapley Additive Explanations) values represent each feature’s contribution to model output, with higher absolute SHAP values indicating greater influence on predicting research literacy category. XGBoost: Extreme Gradient Boosting. Feature value is color-coded (red: high value, blue: low value).
Fig. 2
Distribution of research literacy item familiarity among participants. DOAJ: Directory of Open Access Journals. “Not familiar at all” = 0, “Slightly familiar” = 1, “Moderately familiar” = 2, “Very familiar” = 3. Percentages reflect self-reported familiarity with each concept. Higher percentages in “Not familiar at all” and “Slightly familiar” indicate knowledge gaps. Items correspond to the ten research literacy domains assessed in the study.
Fig. 3
Geographic distribution of low research literacy among participants by division and thesis status in Bangladesh. Panel A (left) displays the percentage of students/graduates with low research literacy in each of the eight divisions. Panels B and C (right) show the percentages stratified by thesis group and non-thesis group, respectively. Percentages are color-coded (darker shades indicate higher proportions of low research literacy). Maps were created using the R ‘bangladesh’ package, based on self-reported division of residence. Thesis group: completed or ongoing thesis. Non-thesis group: undecided, unwilling, or no opportunity for thesis.
Fig. 4
Receiver Operating Characteristic (ROC) curves for machine learning models predicting research literacy status. Curves represent the classification performance of K-Nearest Neighbor (KNN), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Gradient Boosting Machine (GBM), and Categorical Boosting (CatBoost) algorithms. The Area Under the Curve (AUC) values for each model are provided in the legend. Higher AUC indicates better model discrimination between high and low research literacy.
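The performance measures named in the abstract and in the ROC caption (accuracy, precision, log loss, and AUC) can be illustrated with a minimal standard-library sketch for a binary high/low literacy label. The toy labels and probabilities below are invented for illustration and are not study data. The function names are hypothetical, and the source does not say how precision was averaged across classes, so the version here (precision for the positive class only) is an assumption.

```python
import math

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are truly positive."""
    hits = [t for t, p in zip(y_true, y_pred) if p == positive]
    return sum(t == positive for t in hits) / len(hits) if hits else 0.0

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half); equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (NOT study data): 1 = high literacy, 0 = low literacy.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]         # predicted P(label = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # 0.5 decision threshold

print("accuracy :", round(accuracy(y_true, y_pred), 4))
print("precision:", round(precision(y_true, y_pred), 4))
print("log loss :", round(log_loss(y_true, y_prob), 4))
print("AUC      :", round(roc_auc(y_true, y_prob), 4))

# Reference point: always predicting 0.5 gives a log loss of ln 2.
baseline = log_loss([1, 0], [0.5, 0.5])
```

As a reference point for reading log loss values, a classifier that always outputs a probability of 0.5 scores ln 2 ≈ 0.693, and an AUC of 0.5 corresponds to chance-level ranking.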
