Sci Rep. 2024 Oct 28;14(1):25764.
doi: 10.1038/s41598-024-77563-8.

Utilizing machine learning to predict participant response to follow-up health surveys in the Millennium Cohort Study

Wisam Barkho et al. Sci Rep.

Abstract

The Millennium Cohort Study is a longitudinal study that collects self-reported survey data to examine the long-term effects of military service. Participant nonresponse to follow-up surveys poses a potential threat to the validity and generalizability of study findings. In recent years, predictive analytics has emerged as a promising tool for identifying predictors of nonresponse. Here, we develop a high-skill classifier using machine learning techniques to predict participant response to follow-up surveys of the Millennium Cohort Study. Six supervised algorithms were employed to predict response to the 2021 follow-up survey. Using latent class analysis (LCA), we classified participants based on historical survey response and compared prediction performance with and without this variable. Feature analysis was subsequently conducted on the best-performing model. When the LCA variable was included, all six algorithms performed comparably. Without the LCA variable, random forest outperformed the benchmark regression model, although overall prediction performance decreased. Feature analysis identified the LCA variable as the most important predictor. Our findings highlight the importance of historical response for predicting participant response to follow-up surveys. Machine learning algorithms can be especially valuable when historical data are not available. Implementing these methods in longitudinal studies can enhance outreach efforts by strategically targeting participants, ultimately boosting survey response rates and mitigating nonresponse.

Keywords: Latent class analysis; Longitudinal studies; Machine learning; Survey nonresponse; Survey outreach efforts; Survey response.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Temporal prequential approach, using a sliding window schematic, to train and test supervised machine learning models using Panel 1 data from the Millennium Cohort Study. Survey response was predicted using predictor variables from the prior survey cycle in both the train and test data.
Fig. 2
Comparison of prediction performance using six supervised machine learning algorithms on the 2016–2021 test data, employing the receiver operating characteristic and precision-recall curves. Panels (a) and (b) demonstrate the results when latent class analysis (LCA) is incorporated, while panels (c) and (d) show the outcomes without LCA. The dotted lines represent the performance at random chance levels. Prediction performance among the six algorithms exhibited comparable results with the inclusion of the LCA variable, whereas the random forest and interaction forest algorithms displayed noticeably higher area under the curve values compared with the other curves when the LCA variable was omitted.
Fig. 3
Feature importance of predictor variables in the random forest model, with and without the latent class analysis (LCA) variable, using the 2016–2021 test data. In panel (a), where LCA is used, the LCA variable is by far the most important predictor. In panel (b), without LCA, the model relies on a different set of predictors, and the overall importance of the variables shifts noticeably. AD, active duty; GWV, Gulf War veteran.
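
The sliding-window scheme described in the Fig. 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_window_splits` and the cycle labels are hypothetical, and the sketch assumes that each model is trained on one (predictor cycle, response cycle) pair and tested on the next pair, so that predictors always come from the cycle before the response being predicted.

```python
from typing import List, Sequence, Tuple

# A (predictor cycle, response cycle) pair: predictors are taken from the
# first cycle and the response label from the cycle that follows it.
Pair = Tuple[str, str]


def sliding_window_splits(cycles: Sequence[str]) -> List[Tuple[Pair, Pair]]:
    """Return (train_pair, test_pair) tuples for a sliding window.

    Hypothetical helper: the window advances one survey cycle at a time,
    so the test pair always lies one cycle later than the train pair.
    """
    if len(cycles) < 3:
        raise ValueError("need at least three survey cycles")
    # Consecutive cycles form the (predictor, response) pairs.
    pairs = [(cycles[i], cycles[i + 1]) for i in range(len(cycles) - 1)]
    # Each pair serves as training data for the pair that follows it.
    return [(pairs[i], pairs[i + 1]) for i in range(len(pairs) - 1)]


# Placeholder cycle labels, not the study's actual survey years.
cycles = ["cycle_1", "cycle_2", "cycle_3", "cycle_4"]
splits = sliding_window_splits(cycles)

for train, test in splits:
    print(f"train: {train[0]} -> {train[1]} | test: {test[0]} -> {test[1]}")
```

Under these assumptions, the final window would correspond to the 2016–2021 test data named in the captions, with 2016 predictors used to predict 2021 response.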
