Virus Evol. 2023 Nov 22;9(2):vead069.
doi: 10.1093/ve/vead069. eCollection 2023.

Optimizing ancestral trait reconstruction of large HIV Subtype C datasets through multiple-trait subsampling

Xingguang Li et al. Virus Evol.

Abstract

Large datasets and sampling bias pose a challenge for phylodynamic reconstructions, particularly when the study data are obtained from heterogeneous sources and/or through convenience sampling. In this study, we evaluate unbalanced sampling distributions by collection date, location, and risk group in human immunodeficiency virus Type 1 Subtype C datasets using a comprehensive subsampling strategy, and we assess their impact on the reconstruction of viral spatial and risk group dynamics using phylogenetic comparative methods. Our study shows that the most suitable dataset for ancestral trait reconstruction can be obtained by subsampling on all available traits, particularly with multigene datasets. We also demonstrate that sampling bias is inflated when considerable information for a given trait is unavailable or of poor quality, as we observed for the risk group trait. In conclusion, we suggest that, even if traits are not well recorded, deliberately including them, rather than excluding them completely, optimizes the representativeness of the original dataset. Therefore, we advise including as many traits as possible, with the aid of subsampling approaches, to optimize the dataset for phylodynamic analysis while reducing the computational burden. This will benefit research communities investigating the evolutionary and spatio-temporal patterns of infectious diseases.

Keywords: HIV Subtype C; ancestral trait reconstruction; multiple-trait subsampling; phylogenetic comparative methods; subsampling approaches.
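
The multiple-trait subsampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline, which this record does not describe; it assumes a simple stratified scheme in which sequences are grouped by every combination of the chosen traits and each group is capped at a fixed size. The function name, the record fields (`year`, `country`, `risk`), the cap, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_by_traits(records, traits, per_stratum, seed=0):
    """Keep at most `per_stratum` records per combination of trait values.

    Hypothetical helper: a missing trait is treated as its own category
    ("NR"), so poorly recorded traits are still used rather than excluded.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(t, "NR") for t in traits)
        strata[key].append(rec)

    kept = []
    for key in sorted(strata, key=str):  # deterministic stratum order
        group = strata[key]
        if len(group) <= per_stratum:
            kept.extend(group)  # small strata are kept whole
        else:
            kept.extend(rng.sample(group, per_stratum))  # cap large strata
    return kept


# Toy metadata (invented): one country/year/risk stratum is oversampled.
records = [
    {"id": f"za{i}", "year": 2005, "country": "ZA", "risk": "SH"}
    for i in range(6)
] + [
    {"id": "br1", "year": 2005, "country": "BR", "risk": "NR"},
    {"id": "in1", "year": 2006, "country": "IN", "risk": "SH"},
]

subset = subsample_by_traits(records, ["year", "country", "risk"], per_stratum=2)
za_kept = [r for r in subset if r["country"] == "ZA"]
print(len(subset), len(za_kept))
```

Capping each stratum thins the oversampled combination while leaving rare combinations untouched, which is one simple way to reduce the imbalance across date, location, and risk group at the same time.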

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1.
Sampling distributions of metadata traits for the full and subsampled datasets of HIV-1 Subtype C. (A) Country distribution for the full and subsampled datasets. The distribution of the original dataset shows a disproportionate number of samples from BR, BW, IN, MW, SE, TZ, ZA, and ZM. (B) PGRHA distribution for the full and subsampled datasets. The distribution of the data shows a large amount of missing data (labeled NR) and a higher number of SH than of other PGRHA. The number of sequences for the full dataset is labeled on the right y-axis.
Figure 2.
Sampling distributions of metadata traits for the pol and subsampled datasets of HIV-1 Subtype C. (A) Country distribution for the pol and subsampled datasets. The distribution of the original dataset shows a larger number of samples from BR, BW, ET, GB, IN, MW, MZ, TZ, US, ZM, and ZW, with a disproportionate number of samples from ZA. (B) PGRHA distribution for the pol and subsampled datasets. The distribution of the data shows a large amount of missing data (labeled NR) and a slightly higher number of SH than of other PGRHA. The number of sequences for the pol dataset is labeled on the right y-axis.
Figure 3.
Cluster info distance comparison of the phylogenetic topologies of the full (A) and pol (B) datasets with their respective subsampled dataset subtrees. A cluster info distance of zero indicates identical trees. The topologies of the full and pol subsampled dataset subtrees are overall similar to those of their respective original datasets.
Figure 4.
Correlation of SHR and degree centrality metrics as a proxy for transmission network structures of HIV-1 Subtype C for the full and pol datasets and their respective datasets subsampled by date, country, and PGRHA. The closer the correlation coefficient is to 1, the more similar the transmission network structures. (A) Estimates of similarity of the spatial transmission network structure for all subsampled datasets based on the SHR metric; (B) estimates of similarity of the spatial transmission network structure of HIV-1 Subtype C for all subsampled datasets based on the degree centrality metric; (C) estimates of similarity of the PGRHA transmission network structure of HIV-1 Subtype C for all subsampled datasets based on the SHR metric; (D) estimates of similarity of the PGRHA transmission network structure of HIV-1 Subtype C for all subsampled datasets based on the degree centrality metric. The degree of connectivity of each country or PGRHA node in the overall transmission network is generally maintained irrespective of the subsampling strategy; however, the behavior of the country and PGRHA nodes differs between strategies, as indicated by the SHR varying with the subsampling strategy.
Figure 5.
Distribution of the average degree centrality among all subsamples for the top clusters in the pol and full subsets. (A) Degree centrality for the top five countries (country trait) for full subsets; (B) degree centrality for the top nine countries (country trait) for pol subsets; (C) degree centrality for all PGRHA (PGRHA trait) for full subsets; (D) degree centrality for all PGRHA (PGRHA trait) for pol subsets. Including country and PGRHA as subsampling traits yields the most consistent results for both traits’ transmission networks. The distribution of degree centrality among the nodes of the networks evaluated shows that the datasets subsampled by PGRHA and country, or solely by country, result in patterns similar to those for the original pol and full datasets. The top countries in terms of degree centrality are mostly conserved across the full datasets, with wider variance observed in the pol datasets, likely due to the larger number of locations for the country trait in these datasets.
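
The degree centrality comparison behind Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it assumes each transmission network is given as an undirected list of links between trait states (here countries), uses the standard normalized degree centrality (number of distinct neighbours divided by n - 1), and compares the original and subsampled networks with a Pearson correlation. The edge lists are invented, and the SHR metric is not covered.

```python
from math import sqrt


def degree_centrality(edges):
    """Normalized degree centrality for an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical country-to-country links (not taken from the paper).
full_edges = [("ZA", "BW"), ("ZA", "ZM"), ("ZA", "MW"), ("ZA", "IN"), ("BW", "ZM")]
sub_edges = [("ZA", "BW"), ("ZA", "ZM"), ("ZA", "MW"), ("ZA", "IN")]

full_dc = degree_centrality(full_edges)
sub_dc = degree_centrality(sub_edges)

# Correlate centrality over the nodes present in both networks.
shared = sorted(set(full_dc) & set(sub_dc))
r = pearson([full_dc[c] for c in shared], [sub_dc[c] for c in shared])
print(shared, round(r, 3))
```

A coefficient close to 1 means the subsampled network ranks the nodes by connectivity in much the same way as the original network, which is the sense in which the captions describe the structures as similar.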
