Sci Rep. 2022 Nov 17;12(1):19737.
doi: 10.1038/s41598-022-22258-1.

The role of individual variability on the predictive performance of machine learning applied to large bio-logging datasets

Marianna Chimienti et al. Sci Rep. 2022.

Abstract

Animal-borne tagging (bio-logging) generates large and complex datasets. In particular, accelerometer tags, which provide information on the behaviour and energy expenditure of wild animals, produce high-resolution, multi-dimensional data that can be challenging to analyse. We tested the performance of commonly used artificial intelligence tools on datasets of increasing volume and dimensionality. Because bio-logging data are collected across several sampling seasons, the resulting datasets are inherently characterized by inter-individual variability, and this variability should be considered when predicting behaviour. We integrated unsupervised and supervised machine learning approaches to predict behaviours in two penguin species. The behaviours classified by the unsupervised approach, Expectation Maximisation, were used to train the supervised approach, Random Forest. We assessed agreement between the approaches, the performance of Random Forest on unknown data, and the implications for the calculation of energy expenditure. Accounting for behavioural variability resulted in high agreement (> 80%) in behavioural classifications and minimal differences in energy expenditure estimates. However, some outliers with < 70% agreement highlighted how behaviours characterized by similar signals are confused. We advise the broad bio-logging community approaching these large datasets to be cautious when upscaling predictions, as this might lead to less accurate estimates of behaviour and energy expenditure.
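
The agreement measure described in the abstract can be illustrated with a minimal sketch. It assumes each record pairs one label from the unsupervised classification (Expectation Maximisation) with one from the supervised classification (Random Forest) for the same data point. The function names, individual identifiers, behaviour labels and toy records are hypothetical and are not taken from the study; only the 70% threshold echoes the outlier level mentioned above.

```python
from collections import defaultdict


def agreement_by_individual(records):
    """Return percentage agreement per individual.

    records: iterable of (individual_id, em_label, rf_label) tuples,
    a hypothetical layout chosen for this illustration.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for individual, em_label, rf_label in records:
        totals[individual] += 1
        if em_label == rf_label:
            matches[individual] += 1
    return {ind: 100.0 * matches[ind] / totals[ind] for ind in totals}


def flag_low_agreement(agreement, threshold=70.0):
    """List individuals whose agreement falls below the threshold (percent)."""
    return sorted(ind for ind, pct in agreement.items() if pct < threshold)


# Toy records, invented for illustration only (not data from the study).
records = [
    ("bird_A", "descend", "descend"),
    ("bird_A", "hunt", "hunt"),
    ("bird_A", "swim", "swim"),
    ("bird_A", "ascend", "ascend"),
    ("bird_B", "hunt", "swim"),
    ("bird_B", "swim", "swim"),
    ("bird_B", "hunt", "swim"),
    ("bird_B", "ascend", "ascend"),
]

agreement = agreement_by_individual(records)
low = flag_low_agreement(agreement)
```

Computing agreement per individual rather than over the pooled data keeps poorly predicted individuals visible, which is the kind of outlier the abstract refers to.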

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Conceptual overview of the research questions and approach followed in this study. Both unsupervised and supervised machine learning approaches are applied to large accelerometer datasets to detect and predict behavioural activities. We quantify the behavioural variation included in the training datasets used for the supervised approach and test their predictive performance. Finally, we validate the daily energy expenditure (DEE) estimates resulting from the behavioural activity budgets of the two approaches, based on the energetic validation obtained for Adélie penguins (Pygoscelis adeliae) by Hicks et al.
Figure 2
Example of behavioural classification in Adélie penguins (Pygoscelis adeliae) while diving. (a) diving profiles, (b) Pitch (degrees) indicating body posture while diving, (c) Vectorial Dynamic Body Acceleration (VeDBA, g) indicating overall acceleration while moving underwater. Colours indicate the behavioural states identified while diving.
Figure 3
Example of behavioural classification in Little penguins (Eudyptula minor) while diving. (a) diving profiles, (b) Pitch (degrees) indicating body posture while diving, (c) Vectorial Dynamic Body Acceleration (VeDBA, g) indicating overall acceleration while moving underwater. Colours indicate the behavioural states identified while diving.
Figure 4
Repeatability scores. Values are calculated for pitch (blue) and VeDBA (yellow) on training datasets obtained from Season 1 (squares) and both seasons (triangles) for both species, Adélie penguin (Pygoscelis adeliae) (a) and little penguin (Eudyptula minor) (b).
Figure 5
Agreement between machine learning approaches. Overall agreement between the behavioural classification performed by the unsupervised machine learning algorithm Expectation Maximisation (EM) and that returned by the supervised machine learning algorithm Random Forest (RF), run on data collected during foraging trips of Adélie penguins (Pygoscelis adeliae) (a) and little penguins (Eudyptula minor) (b).
Figure 6
Daily Energy Expenditure estimated using results from the two machine learning approaches. Energy Expenditure obtained from the activity budgets predicted by the unsupervised machine learning algorithm Expectation Maximisation (EM) and the supervised machine learning algorithm Random Forest (RF) for Adélie penguins (Pygoscelis adeliae). (a) Activity budgets used for calculating energy expenditure as in Hicks et al. 2020, (b) boxplots of the distributions of the resulting Energy Expenditure, (c) comparison and regression between the Energy Expenditure calculated with the two approaches.
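
The captions for Figures 2 and 3 refer to Vectorial Dynamic Body Acceleration (VeDBA). As a rough sketch of how this quantity is commonly derived, the static (gravitational) component of each axis is estimated with a running mean, subtracted from the raw signal, and the vector norm of the three dynamic components is taken. The function names, the edge handling and the window length below are assumptions made for illustration, not the settings used in the study.

```python
import math


def running_mean(values, window):
    """Centred running mean; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def vedba(ax, ay, az, window=3):
    """VeDBA per sample from three raw acceleration axes (in g).

    window is the smoothing length in samples (an assumed value here;
    in practice it depends on the sampling rate and the species).
    """
    static_x = running_mean(ax, window)
    static_y = running_mean(ay, window)
    static_z = running_mean(az, window)
    result = []
    for i in range(len(ax)):
        dx = ax[i] - static_x[i]  # dynamic component on each axis
        dy = ay[i] - static_y[i]
        dz = az[i] - static_z[i]
        result.append(math.sqrt(dx * dx + dy * dy + dz * dz))
    return result


# A motionless tag records only gravity, so its VeDBA is zero throughout.
still = vedba([0.0] * 4, [0.0] * 4, [1.0] * 4)

# A single spike on one axis yields non-zero dynamic acceleration.
spike = vedba([0.0, 1.0, 0.0], [0.0] * 3, [0.0] * 3)
```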
