Gigascience. 2025 Jan 6;14:giaf032.
doi: 10.1093/gigascience/giaf032.

VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files

Lian Xu et al. Gigascience.

Abstract

Background: Genetic distance metrics are crucial for understanding the evolutionary relationships and population structure of organisms. Progress in next-generation sequencing technology has given rise to genotyping data from thousands of individuals. The standard Variant Call Format (VCF) is widely used to store genomic variation information, but calculating genetic distance and constructing population phylogeny directly from large VCF files can be challenging. Moreover, the existing tools that implement such functions remain limited and perform poorly when processing large-scale genotype data, especially in terms of memory efficiency.

Findings: To address these challenges, we introduce VCF2Dis, an ultra-fast and efficient tool that calculates pairwise genetic distance directly from large VCF files and then constructs distance-based population phylogeny using the ape package. Benchmarking results demonstrate the tool's efficiency, with rapid processing times, minimal memory usage (e.g., 0.37 GB for the complete analysis of 2,504 samples with 81.2 million variants), and high accuracy, even when handling datasets with millions of variants from thousands of individuals. Its straightforward command-line interface, compatibility with downstream phylogenetic analysis tools (e.g., MEGA, Phylip, and FastTree), and support for multithreading make it a valuable tool for researchers studying population relationships. Owing to these advantages, VCF2Dis has already been widely used in many published genomic studies.
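To illustrate what a pairwise p-distance over VCF genotypes looks like, the following is a minimal Python sketch, not the VCF2Dis implementation. It assumes biallelic sites and diploid genotypes, and it scores each site as the absolute difference in ALT-allele dosage divided by 2; this per-site scoring is an assumption chosen for illustration, since the abstract does not specify the exact formula. The function names `parse_gt` and `p_distance_matrix` are hypothetical.

```python
from itertools import combinations


def parse_gt(field):
    """Return the ALT-allele dosage (0, 1 or 2) of a diploid GT, or None if missing.

    Assumes a biallelic site, so every allele index is 0 or 1.
    """
    gt = field.split(":")[0].replace("|", "/")
    alleles = gt.split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return sum(int(a) for a in alleles)


def p_distance_matrix(vcf_lines):
    """Compute a symmetric p-distance matrix from an iterable of VCF text lines."""
    samples, n = [], 0
    diff, count = [], []
    for line in vcf_lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names start at the 10th column
            n = len(samples)
            diff = [[0.0] * n for _ in range(n)]
            count = [[0] * n for _ in range(n)]
            continue
        dosages = [parse_gt(f) for f in cols[9:]]
        for i, j in combinations(range(n), 2):
            if dosages[i] is None or dosages[j] is None:
                continue  # ignore sites where either genotype is missing
            diff[i][j] += abs(dosages[i] - dosages[j]) / 2  # illustrative scoring
            count[i][j] += 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Average over the sites that both samples have called.
        d = diff[i][j] / count[i][j] if count[i][j] else float("nan")
        matrix[i][j] = matrix[j][i] = d
    return samples, matrix


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|0\t./.",
]
names, dist = p_distance_matrix(example_vcf)
```

This sketch holds two n-by-n accumulators in memory and streams the variants line by line, which is one simple way to keep memory use independent of the number of variants; whether VCF2Dis uses the same strategy is not stated in the abstract.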

Conclusion: We present VCF2Dis, a straightforward and efficient tool for calculating genetic distance and constructing population phylogeny directly from large-scale genotype data. VCF2Dis has been widely applied, facilitating the exploration of population relationships in extensive genome sequencing studies.

Keywords: VCF; VCF2Dis; p-distance; population phylogeny.

Conflict of interest statement

The authors declare no potential competing interests.

Figures

Figure 1:
The workflow of VCF2Dis and the NJ phylogeny generated from a test dataset consisting of 203 samples and 81.2 million biallelic SNPs from the 1000 Genomes Project. a, The VCF2Dis workflow involves several key steps, including parameter checks (e.g., input format), p-distance calculation, construction of population phylogeny, and phylogeny visualization. VCF2Dis accepts input in VCF, fasta, and “phy” formats. The outputs include a p-distance matrix, a population phylogeny in Newick format, and an associated figure. b, Neighbor-joining phylogeny of 203 individuals. Colors indicate individuals from distinct populations. YRI, Africa; CEU, Europe; CHB, China; JPT, Japan.
Figure 2:
The memory and runtime performance of VCF2Dis, fastreeR, and ngsDist when calculating genetic distance, assessed as a function of the number of variants and samples. a, Memory test with an increasing number of variants in a dataset containing 91 samples. b, Runtime test with an increasing number of variants in a dataset containing 91 samples. c, Memory test with an increasing number of individuals, each containing 2 million variants. d, Runtime test with an increasing number of individuals, each containing 2 million variants. The runtimes of VCF2Dis and fastreeR are also shown separately in the inset.
