Review

Nat Rev Genet. 2020 Apr;21(4):243-254.

doi: 10.1038/s41576-020-0210-7. Epub 2020 Feb 7.

Pan-genomics in the human genome era

Rachel M Sherman¹,², Steven L Salzberg³,⁴,⁵,⁶

Affiliations

¹ Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA. rsherman@jhu.edu.
² Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA. rsherman@jhu.edu.
³ Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
⁴ Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
⁵ Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA.
⁶ Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA.

PMID: 32034321
PMCID: PMC7752153
DOI: 10.1038/s41576-020-0210-7

Rachel M Sherman et al. Nat Rev Genet. 2020 Apr.

Abstract

Since the early days of the genome era, the scientific community has relied on a single 'reference' genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

**Fig. 1 | Core and dispensable genomes.**
a | Bacterial and other prokaryotic genomes consist predominantly of genes, with little intergenic sequence. The core genome of a species consists of genes shared by all strains. The dispensable genome is made up of genes shared by some but not all strains (accessory genes) and genes present in only one strain (unique genes). Together, the core and dispensable genomes make up the pan-genome. b | Eukaryotic genomes are not highly variable in their genic content. Pan-genomes consider intergenic sequence as well as genes, resulting in an ordered pan-genome of all sequence present in at least one individual.
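The partition described in panel a can be sketched with plain set operations. This is a minimal illustration, not a method from the review: the function name `partition_pan_genome`, the strain names and the gene names are all hypothetical, and real pan-genome tools first cluster genes by sequence similarity rather than matching them by name.

```python
from collections import Counter

def partition_pan_genome(strains):
    """Split genes into core, accessory and unique sets.

    `strains` maps a strain name to the set of gene names it carries
    (a hypothetical input format chosen for this sketch).
    """
    # Count in how many strains each gene occurs.
    counts = Counter(gene for genes in strains.values() for gene in genes)
    n = len(strains)
    core = {g for g, c in counts.items() if c == n}                # in every strain
    unique = {g for g, c in counts.items() if c == 1 and n > 1}    # in exactly one strain
    accessory = set(counts) - core - unique                        # in some, but not all
    return core, accessory, unique

# Toy data, invented for illustration only.
strains = {
    "strain_A": {"gene1", "gene2", "gene3", "gene4"},
    "strain_B": {"gene1", "gene2", "gene3", "gene5"},
    "strain_C": {"gene1", "gene2", "gene6"},
}
core, accessory, unique = partition_pan_genome(strains)
dispensable = accessory | unique
pan_genome = core | dispensable
```

Here the dispensable genome is the union of the accessory and unique genes, and the pan-genome is the union of the core and dispensable genomes, matching the definitions in the caption.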

**Fig. 2 | Graphical representations of pan-genomes.**
a | A single-nucleotide polymorphism (SNP) or insertion or deletion (indel) can be represented as two diverging paths (black lines) through the genome. Graph aligners can determine, for a read, which path is the best alignment. b | Nested variation can be represented in a graph. Here, both reads that contain the insertion and reads that do not can be aligned to the graph with no mismatches. Reads with the insertion sequence can be aligned to one of two paths within the insertion based on the A/C SNP they contain, again resulting in fewer mismatches in the alignments. c | To avoid a read alignment through the graph that does not represent any individual, colours can be tracked to indicate the population or individual of origin (yellow, orange and purple). Segments with no colour are equivalent to all colours, as they must be traversed in all paths. In this graph, a path containing the base A at the first SNP position, the insertion, and A as the within-insertion SNP would be a disallowed path for a read, because it is not colour-consistent: the first A SNP is only purple, and the second is only orange.
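The colour-consistency rule in panel c amounts to intersecting the colour sets along a path and rejecting the path if nothing remains. The sketch below assumes a deliberately simplified graph: the segment names, the colours assigned to the insertion and to the alternative alleles, and the function names are assumptions made for illustration. Only the two colours stated in the caption (first A is purple only, within-insertion A is orange only) come from the source.

```python
ALL_COLOURS = frozenset({"yellow", "orange", "purple"})

def path_colours(path, segment_colours):
    """Return the colours shared by every segment on the path.

    Segments missing from `segment_colours` are uncoloured, which the
    caption treats as equivalent to carrying all colours.
    """
    colours = set(ALL_COLOURS)
    for segment in path:
        colours &= segment_colours.get(segment, ALL_COLOURS)
    return colours

def is_colour_consistent(path, segment_colours):
    """A path is allowed only if at least one colour spans all of it."""
    return bool(path_colours(path, segment_colours))

# Hypothetical colouring; only snp1_A and ins_snp_A follow the caption.
segment_colours = {
    "snp1_A": {"purple"},
    "snp1_alt": {"yellow", "orange"},
    "insertion": {"orange", "purple"},
    "ins_snp_A": {"orange"},
    "ins_snp_C": {"purple"},
}

disallowed = ["left_flank", "snp1_A", "insertion", "ins_snp_A", "right_flank"]
allowed = ["left_flank", "snp1_A", "insertion", "ins_snp_C", "right_flank"]
```

With this colouring, `disallowed` has no colour in common across its segments, so an aligner that tracks colours would refuse to place a read along it, whereas `allowed` is consistent with the purple individual.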

**Fig. 3 | Addition of variants increases alignment ambiguity.**
A graph-based representation includes alternate variants (blue, green) at position P1, whereas the reference contains only the pink reference allele. These variants are within a repeat (dark blue). The addition of each alternate variant increases alignment ambiguity. The six reads with the blue variant allele align perfectly only to P3 in the original reference, and now align to P1 and P3 equally well. Likewise, the six reads with the green variant allele now align to P1 or P2 perfectly, not just P2. Ambiguous reads are highlighted with yellow outlines.
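The effect in this figure can be reproduced with a toy model in which each repeat copy is a short string and a read aligns perfectly to a position when it equals one of the alleles stored there. The sequences and the helper `perfect_hits` are invented for this sketch; real aligners score partial and gapped matches rather than testing exact equality.

```python
def perfect_hits(read, alleles_at):
    """Return the positions where the read matches a stored allele exactly."""
    return sorted(pos for pos, alleles in alleles_at.items() if read in alleles)

# Hypothetical repeat copies that differ at one base.
PINK, GREEN, BLUE = "ACGTA", "ACCTA", "ACATA"

# Linear reference: one allele per repeat copy.
linear_reference = {"P1": {PINK}, "P2": {GREEN}, "P3": {BLUE}}

# Graph reference: the blue and green alleles are added as alternates at P1.
graph_reference = {**linear_reference, "P1": {PINK, GREEN, BLUE}}

blue_before = perfect_hits(BLUE, linear_reference)
blue_after = perfect_hits(BLUE, graph_reference)
green_before = perfect_hits(GREEN, linear_reference)
green_after = perfect_hits(GREEN, graph_reference)
```

A read carrying the blue allele has a single perfect placement against the linear reference and two against the graph, which is the ambiguity the caption describes.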

**Fig. 4 | Two-step alignment method.**
First, alignment to a graph is performed. Reads can align to either variant A or B at the first variant locus, and to C or D at the second. The path through the graph with the most reads aligned to it is then extracted — in this case, the path containing B and then C. In the second step, reads are realigned to the extracted linear genome. This allows for reads that may have been misaligned in the initial step (due to the introduction of variants) to be realigned only to the alleles they are most likely to have originated from. Here, the four reads that aligned to variant A now align to variant B, allowing a single-nucleotide polymorphism (SNP) to be detected that was undetectable from the graph alignment alone, as the reads with the SNP were misaligned.
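The two steps can be sketched as follows, under strong simplifying assumptions: graph alignment is reduced to a list of per-read allele labels, realignment is an ungapped Hamming-distance scan, and every sequence, read count and function name is hypothetical. It is meant to show the flow of the method, not the behaviour of any published aligner.

```python
from collections import Counter
from itertools import product

def extract_best_path(loci, graph_hits):
    """Step 1: pick the path (one allele per locus) supported by the most reads."""
    support = Counter(graph_hits)
    return max(product(*loci), key=lambda path: sum(support[a] for a in path))

def realign(read, linear):
    """Step 2: place the read on the linear sequence with the fewest mismatches.

    Returns the chosen offset and the mismatching reference coordinates,
    which serve as candidate SNP positions in this toy model.
    """
    def mismatches(offset):
        window = linear[offset:offset + len(read)]
        return [offset + j for j, (r, g) in enumerate(zip(read, window)) if r != g]

    best = min(range(len(linear) - len(read) + 1), key=lambda o: len(mismatches(o)))
    return best, mismatches(best)

# Alleles at two variant loci and invented graph-alignment results.
loci = [["A", "B"], ["C", "D"]]
graph_hits = ["A"] * 4 + ["B"] * 6 + ["C"] * 7 + ["D"] * 1
best_path = extract_best_path(loci, graph_hits)

# Invented allele sequences; the linear genome is flanks plus the chosen alleles.
allele_seq = {"A": "GACCACA", "B": "GATTACA", "C": "CCGGTTA", "D": "CCGATTA"}
linear = "TT" + allele_seq[best_path[0]] + "GG" + allele_seq[best_path[1]]

# A read that carries a SNP relative to allele B.
offset, snp_candidates = realign("GATCACA", linear)
```

In this example the extracted path contains B and then C, and the realigned read lands on allele B with a single mismatch, which is the position a variant caller would then examine.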

**Fig. 5 | Variant discovery from a pan-genome reference.**
When the reference genome sequence is augmented with a known insertion, reads will align to this region for individuals containing this insertion. The 1,250-bp insertion included on chromosome 17 (chr 17) is within the gene *KDM6B* and has been reported in numerous studies, including at a frequency of 1 in the Trans-Omics Precision Medicine (TOPMed) dataset of over 53,000 individuals, and thus appears to be present in all or most individuals. With the insertion included in a pan-genome reference, reads from sequenced individuals will align to the region, allowing for the detection of single-nucleotide polymorphisms (SNPs). Here a SNP can be detected that is present in individual A but not individual B. However, when no pan-genomic variation is included in the reference, neither the insertion sequence nor the SNP in individual A can be detected. The depicted coordinates and the length of the *KDM6B* insertion were taken from Sherman et al. (2019), although they are nearly identical in all reports.
