Biol Direct. 2020 Nov 23;15(1):27.
doi: 10.1186/s13062-020-00278-z.

A machine learning framework to determine geolocations from metagenomic profiling

Lihong Huang et al. Biol Direct.

Abstract

Background: Studies of metagenomic data from environmental microbial samples have found that microbial communities appear to be geolocation-specific, and that the microbiome abundance profile can serve as a differentiating feature for identifying a sample's geolocation. In this paper, we present a machine learning framework to determine geolocations from the metagenomic profiling of microbial samples.

Results: Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.

First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets, trained the prediction models on the training set, and evaluated prediction performance on the validation set. Using logistic regression with L2 normalization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits into training and validation sets.

The testing data consist of samples from cities that do not occur in the training data. To predict these previously unsampled "mystery" cities, we first defined biological coordinates for the sampled cities based on the similarity of their microbial samples. We then applied an affine transform to the map so that the distance between cities measures their biological difference rather than their geographical distance. Finally, we used Kriging interpolation to derive the probability that a given testing sample comes from each unsampled city from its predicted probabilities on the sampled cities. Results show that this method can successfully assign high probabilities to the true cities of origin of the testing samples.
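The validation protocol described above (repeated random splits, a logistic regression classifier, accuracy averaged over the splits) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, reads "L2 normalization" as an L2 penalty on the coefficients (scikit-learn's default), uses a hypothetical 80/20 split ratio, and substitutes synthetic blobs for the MetaSUB abundance features, so the resulting accuracy says nothing about the 86% reported in the abstract.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score


def mean_split_accuracy(X, y, n_splits=100, test_size=0.2, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    # Each split reshuffles the data and holds out `test_size` for validation.
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    # LogisticRegression applies an L2 penalty by default; C is its inverse strength.
    model = LogisticRegression(C=1.0, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=splitter, scoring="accuracy")
    return float(np.mean(scores))


# Synthetic stand-in data (assumption): 4 "cities", 50 features per sample.
X, y = make_blobs(n_samples=300, centers=4, n_features=50,
                  cluster_std=3.0, random_state=0)
acc = mean_split_accuracy(X, y)
print(f"mean validation accuracy over 100 splits: {acc:.3f}")
```

Averaging over many splits reduces the dependence of the reported accuracy on any single lucky or unlucky partition of the training data.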

Conclusion: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples' geolocations for samples from locations that are not in the training dataset.
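The two-step idea for unsampled cities (map geographic coordinates to biological coordinates with an affine transform, then interpolate class probabilities by Kriging) can be sketched as below. Everything specific here is an assumption for illustration: the abstract does not state how the affine map was estimated or which variogram was used, so this sketch fits the map by least squares, uses ordinary Kriging with a linear variogram, and runs on made-up coordinates and probabilities. The helper names `fit_affine`, `apply_affine`, and `ordinary_kriging` are hypothetical.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine map from src to dst points, as a 3x2 matrix."""
    src_h = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coordinates
    M, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return M


def apply_affine(M, pts):
    """Apply the fitted affine map to one or more 2-D points."""
    pts = np.atleast_2d(pts)
    return np.hstack([pts, np.ones((len(pts), 1))]) @ M


def ordinary_kriging(coords, values, target):
    """Ordinary Kriging estimate at `target`, linear variogram gamma(h) = h."""
    n = len(coords)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Kriging system: variogram matrix bordered by the unbiasedness constraint.
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = dists
    K[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = np.linalg.norm(coords - target, axis=1)
    weights = np.linalg.solve(K, rhs)[:n]  # last unknown is the Lagrange multiplier
    return float(weights @ values)


# Made-up data (assumption): 4 sampled cities and their biological coordinates.
geo = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
bio = geo @ np.array([[2.0, 0.0], [0.0, 0.5]]).T + np.array([1.0, -1.0])
M = fit_affine(geo, bio)

# Predicted probabilities of one test sample on the 4 sampled cities (made up).
probs = np.array([0.70, 0.10, 0.15, 0.05])
unsampled_bio = apply_affine(M, [2.0, 2.0])[0]
estimate = ordinary_kriging(bio, probs, unsampled_bio)
print(f"interpolated probability at the unsampled city: {estimate:.3f}")
```

Kriging weights can be negative, so an interpolated value is not guaranteed to lie in [0, 1]; a real pipeline would need to decide how to clip or renormalize such values.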

Keywords: Abundance profiling; Affine transform; Binning; Kriging interpolation.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Testing samples were geotagged with longitude and latitude coordinates via global positioning system (GPS). After that, we used affine transformation to transform geographic points to biological points, and applied Kriging interpolation to predict the probabilities of the testing samples from unsampled cities
Fig. 2
The flowchart of the proposed framework
Fig. 3
Prediction performance on the validation set vs. the number of features used for training. Performance is measured as the prediction accuracy averaged over 1000 random shuffle-splits of the Challenge training dataset into training and validation sets
Fig. 4
Two-dimensional hierarchical clustering of abundance profiles on the 50 selected features for all training samples. Abundance profiles are shown after a log10(value + 1e-6) transformation
Fig. 5
The total number of raw features is 5503, including 4, 48, 73, 160, 344, 913 and 3961 for Kingdom, Phylum, Class, Order, Family, Genus and Species clade levels, respectively. After feature selection, we retained 4 clade levels including Order, Family, Genus and Species. Species has the highest number of features (41) and Order has the lowest number of features (1)
Fig. 6
t-SNE visualization (a) before and (b) after feature binning. Before feature binning, it is very difficult to separate the cities. After binning, most cities can be clearly separated from others
Fig. 7
The confusion matrix of the training dataset calculated using binned abundance profiles. All cities are highly distinguishable except for Hamilton and Auckland. Both cities are well separated from the other cities, but it is relatively difficult to distinguish between these two cities
Fig. 8
Probabilities were interpolated from the 9 European cities other than Stockholm using the proposed framework. Kriging interpolation is performed on the biological coordinates. For each sample, the probability on the left results from the original training data and the one on the right from permuted training samples. P-values for significant differences are noted
Fig. 9
The predicted probabilities on nine cities from the training set and the interpolated probabilities of four samples from Stockholm, shown in biological coordinates. The circle size indicates the probability
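The log10(value + 1e-6) operation mentioned in the Fig. 4 caption can be illustrated with a few lines of Python. The offset 1e-6 comes from the caption; the function name and the example abundances are hypothetical. The offset keeps zero abundances finite, since log10(0) is undefined.

```python
import math

PSEUDOCOUNT = 1e-6  # offset quoted in the Fig. 4 caption


def log_abundance(profile):
    """Apply log10(value + 1e-6) to each relative abundance in a profile."""
    return [math.log10(v + PSEUDOCOUNT) for v in profile]


# Hypothetical relative abundances for three taxa in one sample.
profile = [0.0, 0.001, 0.25]
transformed = log_abundance(profile)
print(transformed)
```

An absent taxon (abundance 0) maps to about -6, which sets the floor of the color scale in a heatmap of the transformed values.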
