PLoS One. 2021 Sep 10;16(9):e0256782.
doi: 10.1371/journal.pone.0256782. eCollection 2021.

Identifying indicator species in ecological habitats using Deep Optimal Feature Learning

Yiting Tsai et al. PLoS One. 2021.

Abstract

Much of the current research on supervised modelling is focused on maximizing outcome prediction accuracy. However, in engineering disciplines, an arguably more important goal is that of feature extraction, the identification of relevant features associated with the various outcomes. For instance, in microbial communities, the identification of keystone species can often lead to improved prediction of future behavioral shifts. This paper proposes a novel feature extractor based on Deep Learning, which is largely agnostic to underlying assumptions regarding the training data. Starting from a collection of microbial species abundance counts, the Deep Learning model first trains itself to classify the selected distinct habitats. It then identifies indicator species associated with the habitats. The results are then compared and contrasted with those obtained by traditional statistical techniques. The indicator species are similar when compared at top taxonomic levels such as Domain and Phylum, despite visible differences in lower levels such as Class and Order. More importantly, when our estimated indicators are used to predict final habitat labels using simpler models (such as Support Vector Machines and traditional Artificial Neural Networks), the prediction accuracy is improved. Overall, this study serves as a preliminary step that bridges modern, black-box Machine Learning models with traditional, domain expertise-rich techniques.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overall data analysis workflow in block diagram form.
(Step 1): The collection of raw input data samples, as well as a corresponding set of labelled “ground-truth” targets. (Step 2): The pre-processing of raw input data into suitable structures for modelling, guided by any available domain or expert knowledge. (Step 3): The training of several types of classification models (including Deep Learning), which map inputs to their corresponding discrete class labels. (Step 4): The design of a special objective function within a Deep Learning classifier, which identifies a latent space with improved class separation. The most dominant latent features are then distinguished by the magnitude (e.g., the 2-norm) of the neural network weights.
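The ranking in Step 4 can be illustrated with a minimal sketch. It assumes a weight matrix W of shape (number of inputs, number of latent features), in which column j holds the weights feeding latent feature j, and it scores each latent feature by the 2-norm of its column. The function name, the matrix layout, and the example weights are hypothetical and are not taken from the paper, whose exact procedure may differ.

```python
import numpy as np

def rank_latent_features(W, names=None):
    """Rank latent features by the 2-norm of their incoming weights.

    Assumption: W has shape (n_inputs, n_latent) and column j holds the
    weights that feed latent feature j.
    """
    W = np.asarray(W, dtype=float)
    norms = np.linalg.norm(W, axis=0)          # one 2-norm per column
    order = np.argsort(-norms, kind="stable")  # largest norm first
    if names is None:
        names = [f"z{j + 1}" for j in range(W.shape[1])]
    return [(names[j], float(norms[j])) for j in order]

# Hypothetical weights: 3 inputs, 3 latent features.
W = np.array([
    [3.0, 0.1,  1.0],
    [4.0, 0.2, -1.0],
    [0.0, 0.2,  1.0],
])
ranking = rank_latent_features(W)
for name, norm in ranking:
    print(f"{name}: {norm:.3f}")
```

In this toy example the first latent feature dominates because its column has the largest norm.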
Fig 2. Printout of pandas dataframes containing raw data collected directly from ecological sites.
(Left) Dataframe containing raw input variables. (Right) Dataframe containing output class labels.
Fig 3. Discovery of an optimally-separating latent feature space.
(Top Left) The high-dimensional and confounded raw inputs X make class separation a challenging task. (Bottom Left) An example with six microorganism species (A through F) that are entangled in the raw input space. (Top Right) A DL model that learns a latent space Z that optimally separates the classes. (Bottom Right) The disentanglement of the six species into distinct classes, which can be further aggregated into two major classes: Class 0 (A and B) and Class 1 (C through F).
Fig 4. Species abundance counts after each pre-processing transformation.
(Top) Distribution of counts from all 21721 species; the density is extremely skewed towards the low end. (Middle) Distribution of only the top 50 species by sum. (Bottom) Distribution of the top 50 species after a log10 transformation. Rarer but higher-abundance species are now recognizable.
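The two transformations described in this caption (keeping the top species by total count, then applying a log10 transformation) can be sketched as follows. The pseudocount of 1 used to handle zero counts is an assumption, as are the function name and the toy data; the caption does not state how zeros were treated.

```python
import numpy as np

def top_species_log10(counts, n_top=50, pseudocount=1.0):
    """Keep the n_top species with the largest total count, then log10-transform.

    counts: array of shape (n_samples, n_species).
    pseudocount: added before the logarithm so that zero counts stay finite
                 (an assumption, not stated in the source).
    Returns (selected_column_indices, transformed_counts).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)                          # total per species
    top_idx = np.argsort(-totals, kind="stable")[:n_top] # largest totals first
    return top_idx, np.log10(counts[:, top_idx] + pseudocount)

# Toy data: 3 samples, 3 species, keeping the top 2 species.
counts = np.array([
    [0, 9,  99],
    [1, 0, 999],
    [0, 0,   9],
])
top_idx, logged = top_species_log10(counts, n_top=2)
print(top_idx)
print(logged)
```

On a skewed count distribution, the logarithm compresses the few very large counts so that moderately abundant species become visible on the same scale.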
Fig 5. Histogram of maximum abundance counts.
Fig 6. Visualization of a linear separating hyperplane and separating margins in a 2-class SVM model.
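For a linear two-class SVM of the kind shown in Fig 6, the margin width and the hinge loss can be computed directly from the hyperplane parameters. The sketch below uses the standard soft-margin formulation with labels in {-1, +1}; the weights, bias, and samples are hypothetical values chosen for illustration, and this is not necessarily the objective used in the paper.

```python
import numpy as np

def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / np.linalg.norm(w)

def mean_hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y * (w.x + b)) with labels in {-1, +1}."""
    scores = X @ w + b
    return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))

# Hypothetical hyperplane and samples.
w = np.array([3.0, 4.0])
b = 0.0
X = np.array([
    [ 1.0,  1.0],   # score  7.0, outside the margin, loss 0
    [-1.0, -1.0],   # score -7.0, outside the margin, loss 0
    [ 0.1,  0.0],   # score  0.3, inside the margin, loss 0.7
])
y = np.array([1.0, -1.0, 1.0])

width = margin_width(w)
loss = mean_hinge_loss(w, b, X, y)
print(width, loss)
```

Only the third sample contributes to the loss: it is on the correct side of the hyperplane but lies inside the margin.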
Fig 7. Visualization of a classification problem with a non-linear separation boundary between the two classes.
(Left) The raw feature-space spanned by raw variables x1 and x2 renders linear separation impossible. (Right) A desired latent feature space which optimally separates the two classes. The goal of the DL model is to learn its coordinates, z1 and z2.
Fig 8. A two-layer neural network with a hinge-loss-like objective function.
Fig 9. Feature selection based on direction of optimal separation.
Fig 10. Selection of relevant input variables, by reverse-engineering matrix multiplication.
Fig 11. The overall data analysis workflow applied to the Mount Polley case study.
Fig 12. The ANN neuron architecture used in the Mount Polley case study.
Fig 13. The 4-layer ANN architecture used to classify the Fisher Iris dataset.
Fig 14. Visualization of data separation within the first ANN hidden layer.
(Left) Inter-class separation distance in the traditional ANN, between Class 2 (red) samples and Class 1 (black) samples. (Right) Inter-class separation distance in the optimally-separating ANN.
Fig 15. Inter-class separations in the traditional ANN.
Black samples belong to the undisturbed class. Red samples belong to the disturbed class. Only the first 4 out of the total 5000 runs are shown.
Fig 16. Inter-class separations in the optimally-separating ANN.
The separations are noticeably larger than those in Fig 15. Black samples belong to the undisturbed class. Red samples belong to the disturbed class. Only the first 4 out of the total 5000 runs are shown.
Fig 17. Taxonomic comparison of indicator species at the domain level.
Blue bars represent the indicators identified by Garris et al. [34]. Purple bars represent the indicators identified by our proposed feature extractor. The horizontal axis represents the percentage of indicators belonging to each species.
Fig 18. Taxonomic comparison of indicator species at the phylum level.
Blue bars represent the indicators identified by Garris et al. [34]. Purple bars represent the indicators identified by our proposed feature extractor. The horizontal axis represents the percentage of indicators belonging to each species.
Fig 19. Taxonomic comparison of indicator species at the class level.
Blue bars represent the indicators identified by Garris et al. [34]. Purple bars represent the indicators identified by our proposed feature extractor. The horizontal axis represents the percentage of indicators belonging to each species.

References

    1. Rajput DS, Basha SM, Xin Q, Gadekallu TR, Kaluri R, Lakshmanna K, et al. Providing diagnosis on diabetes using cloud computing environment to the people living in rural areas of India. Journal of Ambient Intelligence and Humanized Computing. 2021; p. 1–12.
    2. Dufrêne M, Legendre P. Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological monographs. 1997;67(3):345–366. doi: 10.1890/0012-9615(1997)067[0345:SAAIST]2.0.CO;2
    3. Podani J, Csányi B. Detecting indicator species: Some extensions of the IndVal measure. Ecological Indicators. 2010;10(6):1119–1124. doi: 10.1016/j.ecolind.2010.03.010
    4. Penczak T. Fish assemblage compositions after implementation of the IndVal method on the Narew River system. Ecological modelling. 2009;220(3):419–423. doi: 10.1016/j.ecolmodel.2008.11.005
    5. Antonelli L, Foata J, Quilichini Y, Marchand B. Influence of season and site location on European cultured sea bass parasites in Corsican fish farms using indicator species analysis (IndVal). Parasitology research. 2016;115(2):561–568. doi: 10.1007/s00436-015-4772-9