A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning

Anusha P Bishop^{1,2}, Giuseppe Amatulli³, Chaz Hyseni⁴, Evlyn Pless^{1,5}, Rosemary Bateta⁶, Winnie A Okeyo^{6,7}, Paul O Mireji^{6,8}, Sylvance Okoth⁶, Imna Malele⁹, Grace Murilla⁶, Serap Aksoy¹⁰, Adalgisa Caccone¹, Norah P Saarman^{1,11}

Affiliations

¹ Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA.
² Department of Environmental Science, Policy, & Management, University of California, Berkeley, CA, USA.
³ School of the Environment, Yale University, New Haven, CT, USA.
⁴ Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden.
⁵ Department of Anthropology, University of California, Davis, CA, USA.
⁶ Biotechnology Research Institute, Kenya Agricultural and Livestock Research Organization, Kikuyu, Nairobi, Kenya.
⁷ Department of Biomedical Sciences and Technology, School of Public Health and Community Development, Maseno University, Maseno, Kisumu, Kenya.
⁸ Centre for Geographic Medicine Research Coast, Kenya Medical Research Institute, Kilifi, Kenya.
⁹ Vector and Vector Borne Diseases Research Institute, Tanzania Veterinary Laboratory Agency, Tanga, Tanzania.
¹⁰ Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA.
¹¹ Department of Biology, Utah State University, Logan, UT, USA.

PMID: 34295362
PMCID: PMC8288027
DOI: 10.1111/eva.13237

Anusha P Bishop et al. Evol Appl. 2021 May 5;14(7):1762-1777. doi: 10.1111/eva.13237. eCollection 2021 Jul.


Abstract

Vector control is an effective strategy for reducing vector-borne disease transmission, but requires knowledge of vector habitat use and dispersal patterns. Our goal was to improve this knowledge for the tsetse species Glossina pallidipes, a vector of human and animal African trypanosomiasis, diseases that pose serious health and socioeconomic burdens across sub-Saharan Africa. We used random forest regression to (i) build and integrate models of G. pallidipes habitat suitability and genetic connectivity across Kenya and northern Tanzania and (ii) provide novel vector control recommendations. Inputs for the models included field survey records from 349 trap locations, genetic data from 11 microsatellite loci from 659 flies and 29 sampling sites, and remotely sensed environmental data. The suitability and connectivity models explained approximately 80% and 67% of the variance in the occurrence and genetic data, respectively, and exhibited high accuracy based on cross-validation. The bivariate map showed that suitability and connectivity vary independently across the landscape and was used to inform our vector control recommendations. Post hoc analyses showed spatial variation in the correlations between the most important environmental predictors from our models and each response variable (i.e., suitability and connectivity), as well as heterogeneity in the expected future climatic change of these predictors. The bivariate map suggests that vector control is most likely to be successful in the Lake Victoria Basin and supports the previous recommendation that G. pallidipes from most of eastern Kenya should be managed as a single unit. We further recommend that future monitoring efforts focus on tracking potential changes in vector presence and dispersal around the Serengeti and the Lake Victoria Basin based on projected local climatic shifts. The strong performance of the spatial models suggests potential for our integrative methodology to be used to understand future impacts of climate change in this and other vector systems.

Keywords: disease vector; gene flow; habitat suitability; landscape genetics; random forest; spatial modeling.


Conflict of interest statement

The authors have no conflicts of interest to declare.

Figures

**FIGURE 1**
Map of sampling sites in Kenya and Tanzania, color coded by genetic cluster. The boxed area of detail is the location of the study region in Africa. The approximate area of the Serengeti ecosystem is shaded in green (combination of the Maasai Mara National Reserve and the Serengeti National Park), and the approximate outline of the Great Rift Valley is shaded in purple. The three new sampling sites for this study (OTT, CNP, and AMR) are labeled. CNP was split into CNPa and CNPb for our analysis as some trap locations from this sampling site were found to be farther than two kilometers apart (see methods). This map was created using the R packages “ggplot2” (Wickham, 2016), “raster” (Hijmans, 2019), and “rgdal” (Bivand et al., 2019) with publicly available data from DIVA-GIS (March 2020; http://www.diva-gis.org), Map Library (March 2020; http://www.maplibrary.org), World Map (March 2020; https://worldmap.harvard.edu), and MaMaSe (March 2020; http://maps.mamase.org)

**FIGURE 2**
Diagram of simplified methods. Light gray shaded boxes indicate the separate pipelines for the suitability (A1, C1) and connectivity (A2, C2) models. The original data inputs are presence‐background data (A1) and microsatellite data (A2) from flies caught during trapping surveys in Kenya and northern Tanzania as well as remotely sensed data from CHELSA, MERIT, and DIVA‐GIS repositories (A3). See methods for more details on calculation of genetic distances (A2), manipulation of environmental data (B1, B2), and selection of background points (A1). Dark gray outlined boxes (C1, C2, C3, C4) illustrate the final outputs of the pipelines (C1, C2), the bivariate map of connectivity and suitability (C3), and post hoc analyses (C4)

**FIGURE 3**
Maps of RMSE values for each sampling site from the leave‐one‐out cross‐validation results. Sampling sites are color coded by genetic cluster: (a) RMSE values from external validation of the genetic connectivity model and (b) RMSE values from the spatial evaluation of the genetic connectivity map (the projection of the genetic connectivity model). Sites with high error compared to other sites and to the null models are labeled (File S1)
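As a rough illustration of the per-site error reported in Figure 3, the following Python sketch holds out one sampling site at a time and scores a mean-only null predictor by RMSE. It is not the authors' pipeline (which used random forest models in R); the site names, the toy distances, and the helper `leave_one_site_out_rmse` are all hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def leave_one_site_out_rmse(values_by_site):
    """Hold out each site in turn and predict its values with the
    mean of all remaining sites (a simple null model, assumed here)."""
    errors = {}
    for held_out, test_values in values_by_site.items():
        train = [v for site, vals in values_by_site.items()
                 if site != held_out for v in vals]
        null_prediction = sum(train) / len(train)
        errors[held_out] = rmse(test_values,
                                [null_prediction] * len(test_values))
    return errors

# Hypothetical scaled genetic distances grouped by site (not from the paper).
toy = {"siteA": [0.2, 0.4], "siteB": [0.6, 0.8], "siteC": [0.5, 0.5]}
site_errors = leave_one_site_out_rmse(toy)
```

Replacing the mean-only predictor with a fitted model would give the model's per-site RMSE, which can then be compared against the null values site by site.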

**FIGURE 4**
Predicted genetic connectivity and habitat suitability based on machine learning (random forest) models. White areas in all three maps are regions where the predicted probability of *G. pallidipes* presence is less than ten percent, based on the habitat suitability map. (a) Scaled map of habitat suitability (combination of our final model and the FAO model), (b) scaled and transformed ($1 - \text{scaled genetic distance}$) map of genetic connectivity, and (c) bivariate map of genetic connectivity versus habitat suitability. The bivariate legend in the bottom left‐hand corner shows the corresponding colors for the different percentiles of genetic connectivity and habitat suitability (dark red: high genetic connectivity/high habitat suitability, yellow: high genetic connectivity/low habitat suitability, blue: low genetic connectivity/high habitat suitability, gray: low genetic connectivity/low habitat suitability)
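The transformation and bivariate classification described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the cell values are invented, the 3 × 3 class scheme is a guess (the caption says only "percentiles"), and the helpers `min_max_scale` and `percentile_class` are hypothetical names rather than anything from the authors' R code.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def percentile_class(values, n_classes=3):
    """Assign each value a class 0..n_classes-1 from its rank-based percentile."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, idx in enumerate(order):
        classes[idx] = min(rank * n_classes // len(values), n_classes - 1)
    return classes

# Hypothetical per-cell values (not from the paper).
genetic_distance = [0.10, 0.50, 0.90, 0.30, 0.70, 0.20]
suitability = [0.80, 0.20, 0.60, 0.90, 0.10, 0.40]

# Connectivity as 1 - scaled genetic distance, as in panel (b).
connectivity = [1 - d for d in min_max_scale(genetic_distance)]

# Pair the two class indices per cell: (connectivity class, suitability class).
bivariate = list(zip(percentile_class(connectivity),
                     percentile_class(suitability)))
```

In this toy scheme a cell classed `(2, 2)` would take the "high connectivity/high suitability" color and `(0, 0)` the "low/low" color; a real map would apply the same pairing to every raster cell.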

**FIGURE 5**
Variable importance plots for (a) the 10 replicate habitat suitability models and (b) the final genetic connectivity model. Only the top 10 most important variables are shown; for the full variable importance plots see Figure S6. The R package “randomForest” measures importance based on the increase in node purity (IncNodePurity). Variables correspond to those described in Table S1. (c) Post hoc analyses of the most important predictor variable for habitat suitability (left column) and genetic connectivity (right column). The first row of maps shows the current environmental conditions (color palette from the “wesanderson” package; Ram & Wickham, 2018). The second row of maps shows the local Pearson's correlations between the top predictor variables and response variables of interest, i.e., maximum temperature of the warmest month vs suitability (probability of presence) and precipitation of the driest season vs connectivity ($1 - \text{scaled genetic distance}$). The local correlation coefficients were calculated with the corLocal() function from the R package “raster” (neighborhood size = 21; Hijmans, 2019). The third row shows maps of the predicted future change in the top predictor variables under the NASA RCP 4.5 climate change model for 2041–2060. White areas in all maps are regions where the predicted probability of *G. pallidipes* presence is less than ten percent, based on the habitat suitability model. Abbreviations: Precipitation (Prec), Temperature (Temp), Maximum (Max), Correlation (Corr), Month (Mo)
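The idea behind the local correlation maps in panel (c) can be sketched in Python as a moving-window Pearson correlation between two grids. This is not a reimplementation of `corLocal()`: the edge handling (windows truncated at the grid border), the 3 × 3 window, the toy grids, and the function `local_correlation` are assumptions made for illustration, whereas the caption reports a neighborhood size of 21.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def local_correlation(grid_a, grid_b, size=3):
    """Correlate two equally shaped grids within a size x size window
    centered on each cell; windows are truncated at the grid edges."""
    half = size // 2
    rows, cols = len(grid_a), len(grid_a[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            xs, ys = [], []
            for i in range(max(0, r - half), min(rows, r + half + 1)):
                for j in range(max(0, c - half), min(cols, c + half + 1)):
                    xs.append(grid_a[i][j])
                    ys.append(grid_b[i][j])
            out[r][c] = pearson(xs, ys)
    return out

# Hypothetical grids: suitability falls linearly as temperature rises.
temperature = [[20, 21, 22], [23, 24, 25], [26, 27, 28]]
suitability = [[100 - 2 * t for t in row] for row in temperature]
corr = local_correlation(temperature, suitability)
```

Because the toy suitability grid is an exact decreasing linear function of temperature, every window yields a correlation of about -1; real rasters would show the spatial variation in sign and strength that the caption describes.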


References

1. Allouche, O., Tsoar, A., & Kadmon, R. (2006). Assessing the accuracy of species distribution models: Prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology, 43(6), 1223–1232. doi:10.1111/j.1365-2664.2006.01214.x
2. Amatulli, G., McInerney, D., Sethi, T., Strobl, P., & Domisch, S. (2020). Geomorpho90m, empirical evaluation and accuracy assessment of global high‐resolution geomorphometric layers. Scientific Data, 7(1), 162. doi:10.1038/s41597-020-0479-6
3. Auguie, B. (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. Retrieved from https://cran.r-project.org/package=gridExtra
4. Baddeley, A., & Turner, R. (2005). spatstat: An R package for analyzing spatial point patterns. Journal of Statistical Software, 12(6), 1–42. Retrieved from http://www.jstatsoft.org/v12/i06/
5. Barbet‐Massin, M., Jiguet, F., Albert, C. H., & Thuiller, W. (2012). Selecting pseudo‐absences for species distribution models: How, where and how many? Methods in Ecology and Evolution, 3, 327–338. doi:10.1111/j.2041-210X.2011.00172.x
