Nat Biotechnol. 2018 Oct;36(9):875-879.
doi: 10.1038/nbt.4227. Epub 2018 Aug 20.

Variation graph toolkit improves read mapping by representing genetic variation in the reference

Erik Garrison et al. Nat Biotechnol. 2018 Oct.

Abstract

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.
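To make the data structure described in the abstract concrete, the following is a minimal sketch of a bidirected sequence graph with embedded paths. It is an illustration only and not vg's implementation or file format: the class name `VariationGraph`, its methods, and the toy sequences are all hypothetical. Each path step visits a node in either forward or reverse orientation, which is how a single graph can hold a SNP and an inversion side by side.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class VariationGraph:
    """Toy bidirected sequence graph (hypothetical; not the vg API)."""
    nodes: dict = field(default_factory=dict)  # node id -> DNA sequence
    edges: set = field(default_factory=set)    # (visit, visit) pairs
    paths: dict = field(default_factory=dict)  # path name -> list of visits

    def add_node(self, node_id: int, seq: str) -> None:
        self.nodes[node_id] = seq

    def add_edge(self, src: tuple, dst: tuple) -> None:
        # A visit is (node_id, is_reverse). In a bidirected graph, walking
        # src -> dst implies the opposite-strand walk flip(dst) -> flip(src),
        # so both forms are stored for simple lookup.
        self.edges.add((src, dst))
        self.edges.add(((dst[0], not dst[1]), (src[0], not src[1])))

    def add_path(self, name: str, visits: list) -> None:
        # Reject paths that step across a pair of visits with no edge.
        for a, b in zip(visits, visits[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge between {a} and {b}")
        self.paths[name] = list(visits)

    def spell(self, name: str) -> str:
        """Concatenate node sequences along a path, honouring orientation."""
        return "".join(
            reverse_complement(self.nodes[n]) if rev else self.nodes[n]
            for n, rev in self.paths[name]
        )


# Toy graph: nodes 2 and 3 form a SNP (T/C); node 4 can also be read inverted.
g = VariationGraph()
for node_id, seq in [(1, "ACG"), (2, "T"), (3, "C"), (4, "GGA")]:
    g.add_node(node_id, seq)
g.add_edge((1, False), (2, False))
g.add_edge((1, False), (3, False))
g.add_edge((2, False), (4, False))
g.add_edge((3, False), (4, False))
g.add_edge((3, False), (4, True))   # enters node 4 on its reverse strand

g.add_path("ref", [(1, False), (2, False), (4, False)])
g.add_path("snp", [(1, False), (3, False), (4, False)])
g.add_path("inv", [(1, False), (3, False), (4, True)])

print(g.spell("ref"))  # ACGTGGA
print(g.spell("snp"))  # ACGCGGA
print(g.spell("inv"))  # ACGCTCC
```

Storing each edge in both strand directions keeps the validity check in `add_path` to a single set lookup, at the cost of doubling the edge set; a production implementation would likely use a canonical edge form and succinct indexes instead.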

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

ML is an employee of, and EG consults for, DNAnexus Inc. RD holds shares in and consults for Congenica Ltd and Dovetail Inc. The remaining authors declare no competing financial interests.

Figures

Figure 1.
A region of a yeast genome variation graph. This displays the start of the subtelomeric region on the left arm of chromosome 9 in a multiple alignment of the strains sequenced by Yue et al. [22], built using vg from a full genome multiple alignment generated with the Cactus alignment package [6]. The inset shows a subregion of the alignment at single base level. The colored paths correspond to separate contiguous chromosomal segments of these strains. This illustrates the ability of vg to represent paths corresponding to both colinear (inset) and structurally rearranged (main figure) regions of genomic variation.
Figure 2.
Mapping accuracy for vg against the human genome. (a) ROC curves parameterised by mapping quality for 10M read pairs simulated from NA24385, as mapped by bwa mem, vg with the 1000GP 1% allele frequency threshold pangenome reference, and vg with a linear reference, using single-end (se) or paired-end (pe) mapping. Left: all reads; middle: reads simulated from segments matching the linear reference; right: reads simulated from segments different from the linear reference. (b) The mean alternate allele fraction at heterozygous variants previously called [19] in NA24385, as a function of deletion or insertion size (SNPs at 0). Error bars are ± one standard error.
Figure 3.
Mapping short and long reads with vg to yeast genome references. (a) ROC curves obtained by mapping 100,000 simulated SK1 yeast strain 150 bp paired reads against a variety of references described in the text. (b) A density plot of identity fraction when mapping 43,337 Pacific Biosciences long reads from the SK1 strain to the drop.SK1 reference or the S288c reference.
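The ROC curves in Figures 2a and 3a are described as parameterised by mapping quality. One plausible way to build such a curve from simulated reads is sketched below, assuming each read carries a mapping quality and a flag saying whether it mapped to its true origin. The function name `roc_by_mapq`, the input layout, and the choice to normalise both axes by the total number of reads are assumptions for illustration; the captions do not specify the exact normalisation the authors used.

```python
from collections import Counter


def roc_by_mapq(reads):
    """Build ROC points by sweeping a mapping-quality threshold downward.

    reads: iterable of (mapq, is_correct) tuples (hypothetical layout).
    Returns a list of (threshold, fp_rate, tp_rate), highest threshold first,
    where each rate counts reads with mapq >= threshold, divided by the total
    number of reads (an assumed normalisation).
    """
    reads = list(reads)
    total = len(reads)
    if total == 0:
        return []

    correct, wrong = Counter(), Counter()
    for mapq, is_correct in reads:
        (correct if is_correct else wrong)[mapq] += 1

    points = []
    tp = fp = 0
    # Lowering the threshold admits more reads, so counts accumulate.
    for q in sorted(set(correct) | set(wrong), reverse=True):
        tp += correct[q]
        fp += wrong[q]
        points.append((q, fp / total, tp / total))
    return points


# Six made-up reads: (mapping quality, mapped to the true position?)
example = [(60, True), (60, True), (60, False),
           (30, True), (0, False), (0, True)]
for threshold, fp_rate, tp_rate in roc_by_mapq(example):
    print(threshold, round(fp_rate, 3), round(tp_rate, 3))
```

Each point answers the question "if only alignments at or above this mapping quality are kept, what fraction of reads is placed correctly and what fraction incorrectly?", so a mapper whose curve sits higher and further left is assigning more trustworthy mapping qualities.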

Comment in

  • Genomes for all.
    Church DM. Nat Biotechnol. 2018 Sep 6;36(9):815-816. doi: 10.1038/nbt.4244. PMID: 30188541. No abstract available.

References

    1. Paten B, Novak AM, Eizenga JM & Garrison E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).
    2. Dilthey A, Cox C, Iqbal Z, Nelson MR & McVean G. Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47, 682–688 (2015).
    3. Eggertsson HP et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat. Genet. 49, 1654–1660 (2017).
    4. Rakocevic G et al. Fast and accurate genomic analyses using genome graphs. bioRxiv preprint doi:10.1101/194530 (2017).
    5. Siren J. Indexing variation graphs. Proc. 19th Workshop on Algorithm Engineering and Experiments (ALENEX) (2017).

Publication types