F1000Res. 2019 Oct 14:8:1751.
doi: 10.12688/f1000research.19630.2. eCollection 2019.

A strategy for building and using a human reference pangenome

Bastien Llamas et al. F1000Res.

Abstract

In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.

Keywords: Graph Genome; Hackathon; Pangenome; RNAseq; Structural Variant.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Proposed graph coordinate system to represent multiple haplotypes.
A) Example of a GFA file (https://github.com/GFA-spec/GFA-spec) that represents a reference genome and one alternate haplotype. The first line, beginning with ‘H’, is the header, with an optional ‘VN’ SAM-tag version number. Nodes, represented by lines starting with ‘S’, have a name in the second column and a nucleotide sequence in the third column. Edges, represented by lines starting with ‘L’, connect nodes whose sequences appear adjacent to each other in one of the haplotypes. The node names appear in the second and fourth columns, and the orientations appear in the third and fifth columns. The line beginning with ‘P’ is from GFA version 1, and encodes subgraphs and paths. B) A path file accompanying the GFA file includes paths for the reference genome and haplotype 1. The haplotype name is in column 2, and the sequence of nodes and their orientations is in column 3. The nucleotide sequence for any haplotype can be resolved by reading out the sequence for each node in the path. C) Visualization of A using path labels from B. The red path represents ref1, while the blue path represents haplotype ref1@h1.
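The column layout described in this caption can be illustrated with a minimal sketch. The figure's actual file contents are not reproduced on this page, so the toy graph below (four nodes, path names ref1 and ref1@h1 borrowed from the caption) is an assumption made purely for illustration, as are the helper names parse_gfa and path_sequence. The sketch reads ‘S’, ‘L’ and ‘P’ lines and resolves each haplotype by reading out the node sequences along its path, reverse-complementing nodes traversed in the ‘-’ orientation.

```python
# Hypothetical helpers; not part of any GFA library.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_gfa(text):
    """Collect nodes (S), edges (L) and paths (P) from tab-separated GFA 1 text."""
    segments, links, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # Column 2: node name, column 3: nucleotide sequence.
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # Columns 2 and 4: node names, columns 3 and 5: orientations.
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            # Column 2: path name, column 3: comma-separated oriented nodes.
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return segments, links, paths


def path_sequence(segments, steps):
    """Read out the nucleotide sequence of a path, node by node."""
    parts = []
    for name, orientation in steps:
        seq = segments[name]
        parts.append(seq if orientation == "+" else reverse_complement(seq))
    return "".join(parts)


# Toy graph invented for this sketch: one SNV bubble (node 2 versus node 3).
EXAMPLE = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tT",
    "S\t4\tCCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tref1\t1+,2+,4+\t*",
    "P\tref1@h1\t1+,3+,4+\t*",
])

segments, links, paths = parse_gfa(EXAMPLE)
ref_seq = path_sequence(segments, paths["ref1"])
hap_seq = path_sequence(segments, paths["ref1@h1"])
print(ref_seq, hap_seq)
```

In this toy graph the two paths share nodes 1 and 4 and differ only in the middle node, which is how a single-base difference between the reference and the haplotype would be encoded.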
Figure 2. Pipeline diagram of the mapper.
Input reads are scanned for minimizers, which are searched against a precomputed minimizer index of the graph reference. Minimizer hits for sufficiently rare minimizers are located in graph space, and the hits for all minimizers are clustered. The clusters are extended gaplessly, with a tolerance for mismatches. If a cluster produces a single full-length gapless extension, it is output as the alignment. Otherwise, partial gapless extensions are chained together by performing alignments of the intervening sequences and graph paths that connect them.
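The first stages of this pipeline (minimizer scanning, index lookup, rarity filtering and clustering) can be sketched on a linear sequence. This is a simplified illustration under several assumptions, not the mapper's implementation: k-mers are ordered lexicographically rather than by a hash, the "graph" is a single linear string, hits are clustered by diagonal (reference position minus read position) instead of by graph distance, and the names minimizers, build_index, cluster_hits and max_occurrences are hypothetical. Gapless extension and chaining are not shown.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer in every
    window of w consecutive k-mers (ties go to the leftmost k-mer)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(reference, k, w):
    """Precompute a minimizer index: k-mer -> positions in the reference."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


def cluster_hits(read, index, k, w, max_occurrences=2):
    """Look up the read's minimizers, skip those that occur too often in the
    index, and group the remaining hits by diagonal."""
    clusters = defaultdict(list)
    for read_pos, kmer in minimizers(read, k, w):
        ref_positions = index.get(kmer, [])
        if len(ref_positions) > max_occurrences:
            continue  # not "sufficiently rare"
        for ref_pos in ref_positions:
            clusters[ref_pos - read_pos].append((read_pos, ref_pos))
    return dict(clusters)


# Toy data invented for this sketch.
reference = "GATTACAGGCTTACCGA"
read = reference[4:13]  # an error-free read starting at offset 4

index = build_index(reference, k=3, w=3)
clusters = cluster_hits(read, index, k=3, w=3)
print(clusters)
```

Because the toy read is an exact substring of the reference, all of its minimizer hits fall on one diagonal, which corresponds to the case where a cluster yields a single full-length gapless extension.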
Figure 3. Pipeline diagram for mapper evaluation on Zea mays graphs.
After constructing graphs with vg construct and with minimap2 and seqwish (Graph method 1), we sought to simulate reads from the vg construct graph and align them to the minimap2/seqwish graph with our faster, better short-read mapper with hit chaining. We then sought to evaluate the mapper’s accuracy based on the simulated reads’ original and realigned positions along corresponding positional paths in the two graphs.
Figure 4.
Adding an additional haplotype, going from A to B. The existing sequence and coordinates remain the same even though the nodes and edges change.
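The stability claim in this caption can be illustrated with a small sketch, assuming coordinates are expressed as offsets along a named path rather than as node identifiers. The node names, sequences and the helper base_at are hypothetical and chosen only for illustration: graph A holds the reference as one node, and graph B splits that node so that an alternate allele can branch off.

```python
def base_at(segments, path, offset):
    """Return the base at a 0-based offset along a path of forward-oriented nodes."""
    for node in path:
        seq = segments[node]
        if offset < len(seq):
            return seq[offset]
        offset -= len(seq)
    raise IndexError("offset beyond end of path")


# Graph A: the reference is a single node (toy data).
segments_a = {"1": "ACGTGCCA"}
ref_path_a = ["1"]

# Graph B: node 1 is split into three nodes and node 2 carries the new allele.
segments_b = {"1a": "ACGT", "1b": "G", "1c": "CCA", "2": "T"}
ref_path_b = ["1a", "1b", "1c"]
hap_path_b = ["1a", "2", "1c"]

# Every reference offset resolves to the same base before and after the split.
reference_unchanged = all(
    base_at(segments_a, ref_path_a, i) == base_at(segments_b, ref_path_b, i)
    for i in range(len(segments_a["1"]))
)
print(reference_unchanged)
```

The node set and edges differ between the two graphs, yet a position such as "offset 4 on the reference path" points to the same base in both, while the new haplotype path carries a different base at that offset.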
