Virus Evol. 2023 Nov 22;9(2):vead069.

doi: 10.1093/ve/vead069. eCollection 2023.

Optimizing ancestral trait reconstruction of large HIV Subtype C datasets through multiple-trait subsampling

Xingguang Li, Nídia S Trovão¹, Joel O Wertheim², Guy Baele³, Adriano de Bernardi Schneider⁴ ⁵ ⁶

Affiliations

¹ Division of International Epidemiology and Population Studies, Fogarty International Center, National Institutes of Health, 31 Center Dr, Bethesda, MD 20892, USA.
² Department of Medicine, University of California, La Jolla, San Diego, CA 92093, USA.
³ Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven BE-3000, Belgium.
⁴ Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
⁵ Ningbo No.2 Hospital, Ningbo 315010, China.
⁶ Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315000, China.

PMID: 38046219
PMCID: PMC10691791
DOI: 10.1093/ve/vead069

Xingguang Li et al. Virus Evol. 2023.

Abstract

Large datasets and sampling bias pose a challenge for phylodynamic reconstructions, particularly when the study data are obtained from heterogeneous sources and/or through convenience sampling. In this study, we evaluate how unbalanced the sampling distributions of human immunodeficiency virus Type 1 Subtype C are by collection date, location, and risk group using a comprehensive subsampling strategy, and we assess the impact of this imbalance on the reconstruction of viral spatial and risk group dynamics using phylogenetic comparative methods. Our study shows that the most suitable dataset for ancestral trait reconstruction is obtained by subsampling on all available traits, particularly when using multigene datasets. We also demonstrate that sampling bias is inflated when considerable information for a given trait is unavailable or of poor quality, as we observed for the trait risk group. In conclusion, we suggest that, even if traits are not well recorded, deliberately including them improves the representativeness of the original dataset compared with excluding them completely. Therefore, we advise including as many traits as possible, with the aid of subsampling approaches, to optimize the dataset for phylodynamic analysis while reducing the computational burden. This will benefit research communities investigating the evolutionary and spatio-temporal patterns of infectious diseases.

Keywords: HIV Subtype C; ancestral trait reconstruction; multiple-trait subsampling; phylogenetic comparative methods; subsampling approaches.
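As an illustration of what subsampling by several traits at once can look like, the following is a minimal sketch and not the authors' pipeline. It assumes each sequence has a metadata record with hypothetical fields `year`, `country`, and `risk`, and it keeps at most a fixed number of sequences per combination of trait values. The function name `subsample_by_traits` and the toy records are invented for this example; missing trait values are kept as their own level, labeled NR as in the figure legends.

```python
import random
from collections import defaultdict


def subsample_by_traits(records, traits, per_stratum, seed=0):
    """Keep at most `per_stratum` records from each combination of trait values.

    Hypothetical helper for illustration; the study's actual procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    strata = defaultdict(list)
    for rec in records:
        # A missing value becomes its own level ("NR") instead of being dropped.
        key = tuple(rec.get(t) or "NR" for t in traits)
        strata[key].append(rec)

    kept = []
    for key in sorted(strata):
        group = strata[key]
        kept.extend(rng.sample(group, min(per_stratum, len(group))))
    return kept


# Toy metadata: one heavily oversampled stratum and two small ones.
records = (
    [{"id": f"ZA{i}", "year": 2010, "country": "ZA", "risk": "SH"} for i in range(50)]
    + [{"id": f"BR{i}", "year": 2010, "country": "BR", "risk": None} for i in range(5)]
    + [{"id": f"IN{i}", "year": 2012, "country": "IN", "risk": "SH"} for i in range(2)]
)

subset = subsample_by_traits(records, ("year", "country", "risk"), per_stratum=3)
counts = defaultdict(int)
for rec in subset:
    counts[rec["country"]] += 1
print(dict(counts))
```

Capping each stratum flattens the oversampled combination while leaving sparse combinations untouched, which is the balancing effect the abstract describes. Dropping a trait from `traits` merges its strata, so imbalance in that trait passes through to the subsample.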

Published by Oxford University Press 2023. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1.**
Sampling distributions of metadata traits for the full and subsampled datasets of HIV-1 Subtype C. (A) Country distribution for the full and subsampled datasets. The distribution of the original dataset shows a disproportionate number of samples from BR, BW, IN, MW, SE, TZ, ZA, and ZM. (B) PGRHA distribution for the full and subsampled datasets. The distribution of the data shows a large amount of missing data (labeled NR) and a higher number of SH samples than of other PGRHA. The number of sequences for the full dataset is labeled on the right y-axis.

**Figure 2.**
Sampling distributions of metadata traits for the pol and subsampled datasets of HIV-1 Subtype C. (A) Country distribution for the pol and subsampled datasets. The distribution of the original dataset shows a larger number of samples from BR, BW, ET, GB, IN, MW, MZ, TZ, US, ZM, and ZW, with a disproportionate number of samples from ZA. (B) PGRHA distribution for the pol and subsampled datasets. The distribution of the data shows a large amount of missing data (labeled NR) and a slightly higher number of SH samples than of other PGRHA. The number of sequences for the pol dataset is labeled on the right y-axis.

**Figure 3.**
Cluster info distance comparison of the phylogenetic topologies of the full (A) and pol (B) datasets with their respective subsampled dataset subtrees. A cluster info distance of zero indicates identical trees. The topologies of the full and pol subsampled dataset subtrees are overall similar to those of their respective original datasets.

**Figure 4.**
Correlation of SHR and degree centrality metrics as a proxy for transmission network structures of HIV-1 Subtype C for the full and pol datasets with their respective datasets subsampled by date, country, and PGRHA. The closer the correlation coefficient is to 1, the more similar the transmission network structures. (A) Estimates of similarity of the spatial transmission network structure for all subsampled datasets based on the SHR metric; (B) estimates of similarity of the spatial transmission network structure of HIV-1 Subtype C for all subsampled datasets based on the degree centrality metric; (C) estimates of similarity of the PGRHA transmission network structure of HIV-1 Subtype C for all subsampled datasets based on the SHR metric; (D) estimates of similarity of the PGRHA transmission network structure of HIV-1 Subtype C for all subsampled datasets based on the degree centrality metric. The degree of connectivity of each country or PGRHA node in the overall transmission network is generally maintained irrespective of the subsampling strategy; however, the behavior of the country and PGRHA nodes differs between strategies, as indicated by the SHR varying with the subsampling strategy.

**Figure 5.**
Distribution of the average degree centrality among all subsamples for the top clusters in the pol and full subsets. (A) Degree centrality for the top five countries (country trait) for full subsets; (B) degree centrality for the top nine countries (country trait) for pol subsets; (C) degree centrality for all PGRHA (PGRHA trait) for full subsets; (D) degree centrality for all PGRHA (PGRHA trait) for pol subsets. Including country and PGRHA as subsampling traits yields the most consistent results for both traits’ transmission networks. The distribution of degree centrality among the nodes of the networks evaluated shows that the datasets subsampled by PGRHA and country or solely by country result in patterns similar to those for the original pol and full datasets. The top countries in terms of degree centrality are mostly conserved across the full datasets, with wider variance observed in the pol datasets, likely due to the larger number of locations for the country trait in these datasets.
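The comparison summarized in Figures 4 and 5 can be pictured with a small sketch, given here under stated assumptions rather than as the study's actual computation. It treats a transmission network as a list of directed (source, target) transitions between trait states, uses one common definition of degree centrality (distinct incoming plus outgoing edges divided by n - 1), and compares a full and a subsampled network with a Pearson correlation. The edge lists and the function names `degree_centrality` and `pearson` are invented for illustration; the SHR metric is not sketched because the text above does not define it.

```python
from math import sqrt


def degree_centrality(edges, nodes):
    """Distinct in + out edges per node, divided by n - 1 (assumed definition)."""
    distinct = {(s, t) for s, t in edges if s != t}  # ignore repeats and self-loops
    degree = {n: 0 for n in nodes}
    for s, t in distinct:
        degree[s] += 1
        degree[t] += 1
    scale = max(len(nodes) - 1, 1)
    return {n: degree[n] / scale for n in nodes}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical country-to-country transitions (toy data, not from the paper).
nodes = ["ZA", "BW", "IN", "BR"]
full_edges = [("ZA", "BW"), ("ZA", "IN"), ("ZA", "BR"), ("BW", "ZA"), ("IN", "BR")]
sub_edges = [("ZA", "BW"), ("ZA", "IN"), ("ZA", "BR"), ("IN", "BR")]

full_dc = degree_centrality(full_edges, nodes)
sub_dc = degree_centrality(sub_edges, nodes)
r = pearson([full_dc[n] for n in nodes], [sub_dc[n] for n in nodes])
print(round(r, 3))
```

A coefficient near 1 means the subsampled network ranks nodes by connectivity much as the full network does, which is how the figure legends read the correlations.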

Cited by

The emergence and circulation of human immunodeficiency virus (HIV)-1 subtype C.
Li X, Tamim S, Trovão NS. J Med Microbiol. 2024 May;73(5):001827. doi: 10.1099/jmm.0.001827. PMID: 38757423. Free PMC article.
