Genome Biol. 2020 Oct 16;21(1):265.
doi: 10.1186/s13059-020-02168-z.

The design and construction of reference pangenome graphs with minigraph

Heng Li et al. Genome Biol.

Abstract

Recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinates of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.

Keywords: Bioinformatics; Genomics; Pangenome.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Example rGFA and GAF formats. (a) Example rGFA format. rGFA-specific tags include SN, the name of the stable sequence from which the vertex is derived; SO, the offset on the stable sequence; and SR, the rank: 0 if the vertex or edge is on the linear reference, >0 for non-reference. (b) Corresponding sequence graph. Each thick arrow represents an oriented DNA sequence. (c) Example GAF format, using the segment coordinate, for reads “GTGGCT” and “CGTTTCC” mapped to the graph. (d) Equivalent GAF format using the stable coordinate.
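
To make the SN, SO and SR tags in the caption concrete, the following is a minimal sketch of reading one rGFA segment line and converting an offset within the segment to the stable coordinate. It assumes the usual GFA layout of tab-separated fields with optional tags written as `TAG:TYPE:VALUE`; the helper names `parse_rgfa_segment` and `to_stable`, and the example record itself, are hypothetical and are not taken from the figure or from the minigraph code.

```python
from typing import Dict, Tuple, Union

Record = Dict[str, Union[str, int]]


def parse_rgfa_segment(line: str) -> Record:
    """Parse one rGFA 'S' (segment) line into a dictionary.

    Assumed layout: S <name> <sequence> followed by TAG:TYPE:VALUE fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "S":
        raise ValueError("not a segment line")
    record: Record = {"name": fields[1], "sequence": fields[2]}
    for tag in fields[3:]:
        key, typ, value = tag.split(":", 2)
        # Type 'i' marks an integer tag (SO, SR); everything else stays a string.
        record[key] = int(value) if typ == "i" else value
    return record


def to_stable(segment: Record, offset: int) -> Tuple[str, int]:
    """Map an offset inside a segment to (stable sequence name, stable offset)."""
    return str(segment["SN"]), int(segment["SO"]) + offset


# Made-up example record: a reference segment starting at offset 100 of chr1.
example = "S\ts2\tCGTTTCC\tSN:Z:chr1\tSO:i:100\tSR:i:0"
seg = parse_rgfa_segment(example)
on_reference = seg["SR"] == 0        # rank 0 means the segment is on the linear reference
stable_pos = to_stable(seg, 3)       # position of the 4th base in stable coordinates
```

Because every segment carries its own SN and SO, a position reported in segment coordinates can be translated to the stable coordinate without consulting the rest of the graph.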
Fig. 2
Minigraph algorithms. (a) Diagram of the minigraph mapping algorithm. Minigraph seeds alignments with minimizers, finds good-enough linear chains, connects them in the graph and seeks the highest-weight path as a graph chain. (b) Diagram of incremental graph construction. A graph is constructed iteratively by mapping each assembly to the existing graph and augmenting the graph with long, poorly mapped sequences in the assembly.
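
The seeding step mentioned in the caption can be illustrated with a small sketch of minimizer selection: in every window of `w` consecutive k-mers, the smallest k-mer is kept as a seed. This is only an illustration under simplifying assumptions: it orders k-mers lexicographically and ignores the reverse strand, whereas practical mappers typically order k-mers by a hash. The function name `minimizers` and the parameter values are hypothetical.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window covers w consecutive k-mers; ties are broken by taking
    the leftmost k-mer. Lexicographic order is an assumption made here
    for readability.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer in this window (leftmost on ties).
        j = min(range(w), key=lambda i: (window[i], i))
        picked.add((start + j, window[j]))
    return sorted(picked)


seeds = minimizers("ACGTACGT", k=3, w=2)
```

Adjacent windows often share the same minimizer, so the number of seeds is smaller than the number of k-mers, which is what makes minimizers attractive for seeding.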
Fig. 3
Characteristics of the human and the great ape graphs. (a) Human variations stratified by repeat class and by the number of alleles of each variation. The repeat annotation was obtained from the longest allele of each variation. VNTR: variable-number tandem repeat, a tandem repeat with unit motif length ≥7 bp. STR: short tandem repeat, a tandem repeat with unit motif length ≤6 bp. LCR: low-complexity region. Mixed-inter.: a variation involving ≥2 types of interspersed repeats. (b) Great ape variations stratified by repeat class and by the number of alleles. (c) Human biallelic variations stratified by repeat class and by insertion to/deletion from GRCh38. Both alleles are required to be covered in all assemblies. (d) Human-specific biallelic variations stratified by repeat class and by insertion to/deletion from GRCh38. Red bars correspond to insertions to the human lineage. (e) Distribution of different types of human variations along chromosomes. (f) Boxplot of the longest allele length in each repeat class. Outliers are omitted for clarity.
Fig. 4
IGV screenshot of a region enriched with long insertions. Numbers on wide purple bars indicate insertion lengths. CLR: PacBio noisy continuous long reads. HiFi: PacBio high-fidelity reads.
Fig. 5
Implementing a one-dimensional range-minimum query (RMQ). Given a set of 2-tuples, a binary search tree is built on the first values of the tuples. Each node p in the tree is associated with a pointer to the node that lies in the subtree descended from p and has the minimal second value. In this example, RMQ(20,50)=14.
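
The data structure described in the caption can be sketched as follows: a balanced binary search tree keyed on the first tuple value, where each node also keeps a pointer (`best`) to the node with the minimal second value in its subtree. This is an illustrative sketch, not the minigraph implementation. It assumes a static set of tuples and a closed query interval, and the input tuples are made up rather than read from the figure.

```python
from typing import List, Optional, Tuple


class Node:
    """BST node keyed on `key`; `best` points to the min-value node below it."""
    __slots__ = ("key", "value", "left", "right", "best")

    def __init__(self, key: int, value: int) -> None:
        self.key, self.value = key, value
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.best: "Node" = self


def build(pairs: List[Tuple[int, int]]) -> Optional[Node]:
    """Build a balanced tree from (key, value) pairs by recursive halving."""
    items = sorted(pairs)

    def rec(lo: int, hi: int) -> Optional[Node]:
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node = Node(*items[mid])
        node.left, node.right = rec(lo, mid), rec(mid + 1, hi)
        for child in (node.left, node.right):
            if child is not None and child.best.value < node.best.value:
                node.best = child.best
        return node

    return rec(0, len(items))


def rmq(node: Optional[Node], lo: int, hi: int,
        sub_lo: float = float("-inf"),
        sub_hi: float = float("inf")) -> Optional[Node]:
    """Return the node with minimal value among keys in [lo, hi], or None.

    [sub_lo, sub_hi] bounds every key in the current subtree.
    """
    if node is None or sub_hi < lo or sub_lo > hi:
        return None                      # subtree cannot intersect the query
    if lo <= sub_lo and sub_hi <= hi:
        return node.best                 # whole subtree lies inside the query
    found = [rmq(node.left, lo, hi, sub_lo, node.key),
             rmq(node.right, lo, hi, node.key, sub_hi)]
    if lo <= node.key <= hi:
        found.append(node)
    found = [n for n in found if n is not None]
    return min(found, key=lambda n: n.value) if found else None


# Made-up input tuples (first value = key, second value = quantity to minimize).
tree = build([(5, 9), (12, 4), (18, 7), (27, 2), (33, 8), (41, 6)])
hit = rmq(tree, 30, 45)
```

The `best` pointer lets a query return immediately for any subtree that falls entirely inside the interval, so only subtrees straddling an interval endpoint need to be descended.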
