BMC Genomics. 2022 May 18;23(1):377.

doi: 10.1186/s12864-022-08540-6.

PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data

Diogo Pinheiro¹, Sergio Santander-Jiménez², Aleksandar Ilic³

Affiliations

¹ INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal.
² Department of Computer and Communications Technologies, University of Extremadura, Campus universitario s/n, Cáceres, 10003, Spain.
³ INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal. aleksandar.ilic@inesc-id.pt.

PMID: 35585494
PMCID: PMC9116704
DOI: 10.1186/s12864-022-08540-6

Diogo Pinheiro et al. BMC Genomics. 2022.

Abstract

Background: In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted as phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems in building phylogenetic trees is the presence of missing biological data. More specifically, the possibility of inferring wrong phylogenetic trees increases in proportion to the amount of missing values in the input data. Although methods have been proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints.

Results: We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data from its known entries. It provides a robust and configurable framework that combines multiple search strategies and machine learning, complemented by phylogenetic techniques, to infer lost phylogenetic distances more accurately. We evaluate our framework on three real-world datasets (two DNA-based sequence alignments and one containing amino acid data) and two additional instances with simulated DNA data. Moreover, we follow a design of experiments methodology to define the hyperparameter values of our algorithm, a more concise approach than the well-known exhaustive parameter search. Varying the percentage of missing data from 5% to 60%, we generally outperform the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results on simulated data also show improved imputations when dealing with large percentages of missing data.

Conclusions: By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art.
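The core idea in the abstract, predicting missing distance-matrix entries from the known ones with a random forest, can be illustrated with a generic missForest-style loop. The following is only a minimal sketch under stated assumptions, not the PhyloMissForest implementation: the function names (`build_tree`, `fit_forest`, `impute_distances`), the tiny pure-Python regression trees, the mean-based initial guess, and all hyperparameter values are hypothetical. The search strategies and phylogenetic techniques mentioned in the abstract are not shown.

```python
import random
from statistics import mean

def build_tree(X, y, depth, rng):
    """Grow a small regression tree; a leaf stores the mean of its targets."""
    if depth == 0 or len(y) < 2 or len(set(y)) == 1:
        return mean(y)
    n_feat = len(X[0])
    best = None
    # Random feature subset per split (assumed size: one third of the features).
    for f in rng.sample(range(n_feat), max(1, n_feat // 3)):
        # Candidate thresholds; dropping the maximum keeps the right side non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            ml, mr = mean(y[i] for i in left), mean(y[i] for i in right)
            sse = (sum((y[i] - ml) ** 2 for i in left)
                   + sum((y[i] - mr) ** 2 for i in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, left, right)
    if best is None:  # no usable split among the sampled features
        return mean(y)
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], depth - 1, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], depth - 1, rng))

def predict_tree(node, row):
    """Descend from the root until a leaf value is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees, rng, depth=3):
    """Bagging: each tree is trained on a bootstrap sample of the rows."""
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], depth, rng))
    return trees

def predict_forest(trees, row):
    """Aggregation: average the predictions of all trees."""
    return mean(predict_tree(t, row) for t in trees)

def impute_distances(D, n_iter=3, n_trees=10, seed=0):
    """Impute None entries of a symmetric distance matrix (diagonal must be known)."""
    rng = random.Random(seed)
    n = len(D)
    known = [D[i][j] for i in range(n) for j in range(i + 1, n) if D[i][j] is not None]
    guess = mean(known)  # initial guess: mean of the known off-diagonal distances
    M = [[guess if v is None else v for v in row] for row in D]
    for _ in range(n_iter):
        for j in range(n):
            rows_missing = [i for i in range(n) if D[i][j] is None]
            rows_known = [i for i in range(n) if D[i][j] is not None and i != j]
            if not rows_missing or len(rows_known) < 2:
                continue
            # Features for a row: its current distances to every taxon except j.
            feats = lambda i: [M[i][k] for k in range(n) if k != j]
            trees = fit_forest([feats(i) for i in rows_known],
                               [M[i][j] for i in rows_known], n_trees, rng)
            for i in rows_missing:
                M[i][j] = M[j][i] = predict_forest(trees, feats(i))  # keep symmetry
    return M

# Toy 5 x 5 matrix with three missing pairs (values are made up for illustration).
D = [[0.0, 0.3, 0.5, 0.6, None],
     [0.3, 0.0, 0.4, None, 0.7],
     [0.5, 0.4, 0.0, 0.2, 0.6],
     [0.6, None, 0.2, 0.0, 0.5],
     [None, 0.7, 0.6, 0.5, 0.0]]
M = impute_distances(D)
print([round(v, 3) for v in M[0]])
```

In this sketch every imputed value is an average of known distances from the same column, so it always stays within the observed range. The symmetric write-back (`M[i][j] = M[j][i]`) is a design choice to keep the result a valid distance matrix.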

Keywords: Machine learning; Missing data imputation; Phylogenetic tree; Random forest.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Missing matrix inputs obtained from incomplete sequences

**Fig. 2**
Results obtained with the 9 × 9 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF

**Fig. 3**
Results obtained with the 37 × 37 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF

**Fig. 4**
Results obtained with the 55 × 55 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF

**Fig. 5**
Results obtained with the 40 × 40 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF

**Fig. 6**
Phylogenetic trees estimated with the full distance matrix (upper left), in comparison with the trees obtained with PhyloMissForest (bottom left), matrix factorization (upper right) and autoencoder (bottom right) in the 9 × 9 dataset with 5% of missing data

**Fig. 7**
Random forest scheme with bootstrap and aggregation steps

**Fig. 8**
Flowchart of the phases of PhyloMissForest
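
The captions for Figs. 2 to 5 report NRF values. Assuming NRF stands for the normalized Robinson-Foulds distance (the captions do not expand the abbreviation), the metric compares the bipartitions induced by two tree topologies. The sketch below is illustrative only: the nested-tuple tree encoding and the function names (`leaf_set`, `splits`, `normalized_rf`) are hypothetical and not taken from the paper.

```python
def leaf_set(tree):
    """Collect the taxa of a tree given as nested tuples with string leaves."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree, taxa):
    """Return the non-trivial bipartitions of the tree, each in a canonical form."""
    ref = min(taxa)  # reference taxon used to pick one side of every split
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = taxa - below
        if len(below) > 1 and len(other) > 1:
            # Store the side that does not contain the reference taxon.
            found.add(other if ref in below else below)
        return below

    visit(tree)
    return found

def normalized_rf(t1, t2):
    """Symmetric difference of the split sets, divided by their total size."""
    taxa = leaf_set(t1)
    if taxa != leaf_set(t2):
        raise ValueError("trees must share the same taxa")
    s1, s2 = splits(t1, taxa), splits(t2, taxa)
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0

# Two 5-taxon topologies that differ in one internal branch.
t_ref = ((("A", "B"), "C"), ("D", "E"))
t_est = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(t_ref, t_est))  # 0.5
print(normalized_rf(t_ref, t_ref))  # 0.0
```

Under this definition, 0 means the two topologies are identical and 1 means they share no internal branch, which is consistent with the captions labelling the minimum as "best" and the maximum as "worst".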
