Gigascience. 2025 Jan 6;14:giaf063.
doi: 10.1093/gigascience/giaf063.

PVGA: a precise viral genome assembler using an iterative alignment graph

Zhi Song et al. Gigascience.

Abstract

Background: Viral genome analysis is crucial for understanding virus evolution and mutation. Investigations into viral evolutionary dynamics and mutation patterns have garnered significant research attention since the outbreak of COVID-19. The basic structure of many virus genomes is highly conserved [1]. RNA viruses have high mutation rates, and single-nucleotide variations may induce substantial phenotypic alterations in terms of viral function and pathogenicity. Thus, special assembly methods are required for viral genome analysis.

Results: PVGA takes a reference genome and a set of sequencing reads as input. It first constructs an alignment graph from the reference genome and the reads. The optimal genomic path is then determined by dynamic programming, maximizing the cumulative edge weights, which reflect read support density across the alignment graph. The obtained path corresponds to a refined genome. Finally, the process is repeated with the refined genome as the new reference until no further improvement is possible. We evaluate PVGA's performance on both assembly and polishing tasks using simulated and real datasets, including both long reads and short reads. The experiments demonstrate that PVGA consistently outperforms popular existing programs in assembly quality, while its running time is comparable to theirs. In particular, on simulated Nanopore datasets, our method correctly reports the true genomes with 0 mismatches and 0 indels.
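
The path-selection step described above can be illustrated with a minimal sketch of a maximum total weight path search over a directed acyclic graph. This is not PVGA's implementation: the function name `max_weight_path`, the `(source, target, weight)` edge format, the node labels, and the toy weights are all assumptions made for illustration.

```python
from collections import defaultdict


def max_weight_path(edges, source, sink):
    """Return (total_weight, path) for the heaviest source-to-sink path.

    `edges` is a list of (u, v, weight) tuples describing a DAG.
    Hypothetical helper for illustration only, not PVGA's own code.
    """
    adj = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {source, sink}
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order via Kahn's algorithm.
    order = []
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # Dynamic programming: best[v] is the heaviest path weight from source to v.
    best = {source: 0}
    prev = {}
    for u in order:
        if u not in best:
            continue  # node not reachable from source
        for v, w in adj[u]:
            if v not in best or best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    if sink not in best:
        return None

    # Trace back from sink to source.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[sink], path[::-1]


# Toy graph: the backbone spells A-C-G, while reads support a T at position 2.
toy_edges = [
    ("S", "A1", 5),
    ("A1", "C2", 2), ("C2", "G3", 2),   # backbone branch, weaker support
    ("A1", "T2", 3), ("T2", "G3", 3),   # read-supported variant branch
    ("G3", "E", 5),
]
print(max_weight_path(toy_edges, "S", "E"))  # (16, ['S', 'A1', 'T2', 'G3', 'E'])
```

In this toy case the variant branch carries more read support than the backbone, so the selected path replaces C with T at position 2. In an iterative scheme of the kind the abstract describes, the resulting sequence would serve as the backbone for the next round.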

Conclusions: PVGA is a novel viral genome assembler that seamlessly integrates assembly and polishing into a unified workflow. Its design prioritizes high accuracy, enabling the detection of subtle genomic variations that can impact viral function and pathogenicity. By addressing the unique challenges of viral genome assembly, PVGA provides a reliable and precise solution for advancing our understanding of viral evolution and behavior.

Keywords: alignment graph; genome assembler; iterative method; maximum total weight path; virus genome.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1: Flowchart of construction of alignment graph and iteration process. (A) PVGA takes a reference genome as the backbone graph. (B) PVGA aligns the first read Read1 to the backbone. (C) Four reads are aligned with the backbone, awaiting the subsequent merging process. (D) PVGA merges edges that point to the same node; the new edge’s weight is equal to the sum of the weights of the merged edges. This process can be performed either after aligning all reads or during the alignment process, with a final merge conducted after all reads have been aligned. (E) Iteratively construct the alignment graph using the result from the previous iteration as the backbone.
Figure 2: Results on simulated Nanopore HIV-1 datasets with an average read length of 2 kb and 4 kb, respectively. The 4 subfigures in each row represent mismatch, indels, indel length, and edit distance from left to right, respectively.
Figure 3: Results on simulated Nanopore SARS-CoV-2 datasets with an average read length of 2 kb and 4 kb, respectively. The 4 subfigures in each row represent mismatch, indels, indel length, and edit distance from left to right, respectively.
Figure 4: Pairwise similarity matrix of 5 HIV-1 strains.
Figure 5: Comparison of CPU times for the 5 tools on the 3 datasets of 50×, 100×, and 200× coverage, respectively. (A) HIV-1 virus 89.6 strain. (B) Measles virus. (C) SARS-CoV-2.
Figure 6: Comparison of maximum memory consumption during the runtime across 3 datasets with 50×, 100×, and 200× coverage: (A) HIV-1 89.6 strain, (B) measles virus, and (C) SARS-CoV-2.
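
The edge-merging step in the Figure 1 caption (panel D) can be sketched as follows. This is a minimal illustration, not PVGA's code: it assumes that "edges that point to the same node" means edges sharing both a source and a target node, and the function name `merge_parallel_edges` and the `(source, target, weight)` tuples are hypothetical.

```python
from collections import Counter


def merge_parallel_edges(edges):
    """Collapse edges with the same (source, target) pair into one edge.

    The merged edge's weight is the sum of the merged weights.
    Hypothetical helper for illustration only.
    """
    merged = Counter()
    for u, v, w in edges:
        merged[(u, v)] += w
    return sorted((u, v, w) for (u, v), w in merged.items())


# Two reads support A1 -> T2; one read follows the backbone edge A1 -> C2.
read_edges = [("A1", "T2", 1), ("A1", "T2", 1), ("A1", "C2", 1)]
print(merge_parallel_edges(read_edges))  # [('A1', 'C2', 1), ('A1', 'T2', 2)]
```

Because the weights are summed, merging can be applied incrementally as reads are aligned or once at the end, and both orders give the same totals, which matches the caption's note that either schedule is possible.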

References

    1. Hofacker IL, Stadler PF, Stocsits RR. Conserved RNA secondary structures in viral genomes: a survey. Bioinformatics. 2004;20(10):1495–99. doi: 10.1093/bioinformatics/bth108.
    2. Harvey WT, Carabelli AM, Jackson B, et al. SARS-CoV-2 variants, spike mutations and immune escape. Nat Rev Microbiol. 2021;19(7):409–24. doi: 10.1038/s41579-021-00573-0.
    3. Jain M, Koren S, Miga KH, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36(4):338–45. doi: 10.1038/nbt.4060.
    4. Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323(5910):133–38. doi: 10.1126/science.1162986.
    5. Wenger AM, Peluso P, Rowell WJ, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155–62. doi: 10.1038/s41587-019-0217-9.
