. 2021 May 20;22(3):bbaa123.

doi: 10.1093/bib/bbaa123.

Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

Zhi-Luo Deng¹, Akshay Dhingra², Adrian Fritz¹, Jasper Götting², Philipp C Münch³, Lars Steinbrück², Thomas F Schulz², Tina Ganzenmüller², Alice C McHardy¹

Affiliations

¹ Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research.
² Institute of Virology in Hannover Medical School.
³ Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research and Max von Pettenkofer Institute in Ludwig Maximilian University of Munich.

PMID: 34020538
PMCID: PMC8138829
DOI: 10.1093/bib/bbaa123

Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

Zhi-Luo Deng et al. Brief Bioinform. 2021.

. 2021 May 20;22(3):bbaa123.

doi: 10.1093/bib/bbaa123.

Authors

Zhi-Luo Deng¹, Akshay Dhingra², Adrian Fritz¹, Jasper Götting², Philipp C Münch³, Lars Steinbrück², Thomas F Schulz², Tina Ganzenmüller², Alice C McHardy¹

Affiliations

¹ Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research.
² Institute of Virology in Hannover Medical School.
³ Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research and Max von Pettenkofer Institute in Ludwig Maximilian University of Munich.

PMID: 34020538
PMCID: PMC8138829
DOI: 10.1093/bib/bbaa123

Abstract

Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a 'G.G' context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.

Keywords: HCMV; benchmark; genome assembly; strain mixtures; variant calling; virus.

PubMed Disclaimer

Figures

**Figure 1**
Radar plot of assembler performances with six metrics across the 10 samples. Values were scaled based on the average performance for each metric across tools and samples (best performance 1, lowest performance 0). All the metrics were calculated using the combined reference statistics from MetaQuast, except for NGA50, which is only available in genome-specific report. The NGA50 is the average NGA50 for individual genomes and samples.

**Figure 2**
Performance of nine genome assemblers and one haplotype assembler (Savage) in reconstructing HCMV strains from mixed sequence samples. In the two mixture types, HCMV strains AD169 and Merlin are the high abundant strains, respectively, and TB40/E is less abundant, at ratios ranging from 1:1 to 1:50. Boxplots display the median (black line) and interquartile range of individual assembly quality metrics (panels A–F) across the 10 data sets. ‘metaSPAdes’ is abbreviated as ‘metaSPA’. Strain genomes are denoted by their abundances in individual mixtures. For instance, 1/10 (blue) indicates the low abundant strain in the 1:10 mixture (i.e. TB40/E in TA-1-10 and TM-1-10), and 10/1 (purple) represents the abundant strain in the 1:10 mixture (i.e. AD169 and Merlin in TA-1-10 and TM-1-10, respectively). The best assembler for each metric is marked in bold red. To illustrate the performance of assemblers on each strain in the mixture with different abundance ratios, we used genome-specific reports here from metaQUAST. It should be noted that contrary to the combined reference statistics above (Figure 1), genome-specific estimates are based on nonunique mappings over the entire set of reference genomes with metaQUAST even when using the unique option, which leads to minor discrepancies in metrics, such as increased genome fraction, duplication ratio and number of mismatches.

**Figure 3**
Input/output (I/O), memory consumption and run time of assembly programs measured with the snakemake benchmarking function for all samples. Each dot in the boxplot indicates one sample.

**Figure 4**
Schematic presentation of the variant calling benchmark. Mixed samples include two HCMV strains at different abundance ratios. A genome alignment of the two strains was used to generate the ground truth, defining divergent and conserved sites, denoted by G (‘genomic difference’) and G (‘no genomic difference’), respectively. Based on the caller, sites were classified as predicted variants C, or predicted conserved C, defining true positive (TP, in blue), false positive (FP, in red) and false negative (FN, in orange) variant calls. If a variant was called that differed from the ground truth variant, this was also considered a FP.

**Figure 5**
Comparison of variant caller performances across samples. (A) and (B) SNPs with quality score > =20. (C) and (D) All variants including SNPs and InDels with threshold of quality scores maximizing F1 (analyzed with RTG-tools). In the scatterplots, shapes indicate samples and colors the variant caller. The box in the boxplots indicates interquartile range (colored boxes), median (black line) and outliers (points outside of box) of caller performances across the samples. (A) and (C) share the same figure legend (color and shape).

**Figure 6**
Intersection of SNPs between genomes and SNP calls from LoFreq, VarScan and CLC with score ≥20.

**Figure 7**
Genomic context of LoFreq predictions for the TA (A) and TM (B) strain mixtures. ‘All’: genomic context pattern of all predicted SNPs, ‘TP’: true positive variant calls and ‘FP’: false positive variant calls. The 16 mutation contexts for each mutation type are indicated once for the box in the lower right.

See this image and copyright information in PMC

Cited by

Human cytomegalovirus harnesses host L1 retrotransposon for efficient replication.
Hwang SY, Kim H, Denisko D, Zhao B, Lee D, Jeong J, Kim J, Park K, Park J, Jeong D, Park S, Choi HJ, Kim S, Lee EA, Ahn K. Hwang SY, et al. Nat Commun. 2024 Sep 2;15(1):7640. doi: 10.1038/s41467-024-51961-y. Nat Commun. 2024. PMID: 39223139 Free PMC article.
High-throughput engineering of cytoplasmic- and nuclear-replicating large dsDNA viruses by CRISPR/Cas9.
López-Muñoz AD, Rastrojo A, Martín R, Alcami A. López-Muñoz AD, et al. J Gen Virol. 2022 Oct;103(10):001797. doi: 10.1099/jgv.0.001797. J Gen Virol. 2022. PMID: 36260063 Free PMC article.
Grapevine Virology in the Third-Generation Sequencing Era: From Virus Detection to Viral Epitranscriptomics.
Javaran VJ, Moffett P, Lemoyne P, Xu D, Adkar-Purushothama CR, Fall ML. Javaran VJ, et al. Plants (Basel). 2021 Oct 31;10(11):2355. doi: 10.3390/plants10112355. Plants (Basel). 2021. PMID: 34834718 Free PMC article. Review.
Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs.
Jochheim A, Jochheim FA, Kolodyazhnaya A, Morice É, Steinegger M, Söding J. Jochheim A, et al. Microbiome. 2024 Oct 1;12(1):187. doi: 10.1186/s40168-024-01904-y. Microbiome. 2024. PMID: 39354646 Free PMC article.
Identifying high-confidence variants in human cytomegalovirus genomes sequenced from clinical samples.
Camiolo S, Hughes J, Baldanti F, Furione M, Lilleri D, Lombardi G, Angelini M, Gerna G, Zavattoni M, Davison AJ, Suárez NM. Camiolo S, et al. Virus Evol. 2022 Dec 5;8(2):veac114. doi: 10.1093/ve/veac114. eCollection 2022. Virus Evol. 2022. PMID: 37091479 Free PMC article.

See all "Cited by" articles

References

1. Goodrum F, Caviness K, Zagallo P. Human cytomegalovirus persistence. Cell Microbiol 2012;14:644–55. - PMC - PubMed
1. Griffiths P, Baraniak I, Reeves M. The pathogenesis of human cytomegalovirus. J Pathol 2015;235:288–97. - PubMed
1. Manicklal S, Emery VC, Lazzarotto T, et al. The ‘silent’ global burden of congenital cytomegalovirus. Clin Microbiol Rev 2013;26:86–102. - PMC - PubMed
1. Dolan A, Cunningham C, Hector RD, et al. Genetic content of wild-type human cytomegalovirus. J Gen Virol 2004;85:1301–12. - PubMed
1. Campillo-Balderas JA, Lazcano A, Becerra A. Viral genome size distribution does not correlate with the antiquity of the host lineages. Front Ecol Evol 2015;3:728.

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

Affiliations

Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

Authors

Affiliations

Abstract

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Miscellaneous

Abstract

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Related information

LinkOut - more resources

Full Text Sources

Miscellaneous