Genome Biol. 2019 Aug 9;20(1):159.
doi: 10.1186/s13059-019-1774-4.

Is it time to change the reference genome?

Sara Ballouz et al. Genome Biol.

Abstract

The use of the human reference genome has shaped methods and data across modern genomics. This has offered many benefits while creating a few constraints. In the following opinion, we outline the history, properties, and pitfalls of the current human reference genome. In a few illustrative analyses, we focus on its use for variant-calling, highlighting its nearness to a 'type specimen'. We suggest that switching to a consensus reference would offer important advantages over the continued use of the current reference with few disadvantages.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The reference genome is a type specimen. (a) Cumulative distributions of variants in the reference genome and those in personal/individual genomes. If we collapse the diploid whole genomes genotyped in the 1000 Genomes Project into haploid genomes, we can observe just how similar the reference is to an individual genome. First, taking population allele frequencies from a random sample of 100 individual genomes, we generated new haploid ‘reference’ sequences. We replaced the alleles of the reference genome with the personal homozygous variant, and a randomly chosen heterozygous allele. For simplicity, all calculations were performed against the autosomal chromosomes of the GRCh37 assembly and include only single nucleotide bi-allelic variants (i.e., only two alleles per single nucleotide polymorphism (SNP)). (b) Cumulative distributions of allele frequencies for variants called in 100 randomly chosen personal genomes, computed against the reference genome. Here, the presence of a variant with respect to the reference is quite likely to mean that the reference itself has the ‘variant’ with respect to any default expectation, particularly if the variant is homozygous.
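The haploid collapse described in this caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function name `collapse_to_haploid`, the tuple layout `(pos, ref, alt, genotype)`, and the 0-based positions are all assumptions made for the example, and only bi-allelic single nucleotide sites are handled.

```python
import random


def collapse_to_haploid(reference, genotypes, rng=None):
    """Collapse one diploid genome into a haploid sequence (hypothetical helper).

    reference : str, the reference sequence
    genotypes : iterable of (pos, ref, alt, (a, b)), where pos is 0-based and
                a, b are 0 (reference allele) or 1 (alternate allele)
    rng       : optional random.Random, used to pick one allele at
                heterozygous sites
    """
    rng = rng or random.Random()
    seq = list(reference)
    for pos, ref, alt, (a, b) in genotypes:
        # Homozygous sites keep their single allele; heterozygous sites
        # contribute one of the two alleles, chosen at random.
        chosen = a if a == b else rng.choice((a, b))
        seq[pos] = alt if chosen == 1 else ref
    return "".join(seq)


example = collapse_to_haploid(
    "ACGT",
    [(0, "A", "G", (1, 1)),   # homozygous alternate: replaced
     (3, "T", "C", (0, 0))],  # homozygous reference: unchanged
)
print(example)  # GCGT
```

Passing a seeded `random.Random` makes the heterozygous choices reproducible, which is convenient when comparing many collapsed genomes against the same reference.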
Fig. 2
How consensus alleles improve the interpretability of the reference. (a) To build a consensus genome, we replaced minor alleles within the current reference with their major alleles (allele frequency (AF) > 0.5) across all bi-allelic SNPs. (b) Cumulative distributions of variants in the consensus genome (red line) and the current reference (blue line). (c) Cumulative distributions of AFs for variants in 100 randomly chosen personal genomes, computed against a consensus genome. (d) Distribution of the number of homozygous single nucleotide variants (SNVs) in 2504 personal genomes, computed against the reference, against an all-human consensus, the mean of the super-population consensuses and the mean of the population consensuses. The consensus reference for each of the five super-populations leads to an additional reduction in the number of homozygous variants in the personal genomes for each super-population (dark red curve). Further breakdown into 26 representative populations does not dramatically reduce the number of homozygous variants (dashed red line). Super-populations are defined broadly as: AFR African, AMR admixed American, EAS East Asian, EUR European, SAS South Asian.
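The consensus construction in panel (a) amounts to swapping the reference base for the alternate base wherever the alternate allele frequency exceeds 0.5. The Python sketch below shows that rule under stated assumptions: `build_consensus`, the `(pos, ref, alt, alt_af)` tuple layout, and the 0-based positions are hypothetical choices for illustration, not the authors' pipeline, and anything other than a bi-allelic single nucleotide site is skipped.

```python
def build_consensus(reference, variants, threshold=0.5):
    """Replace minor reference alleles with the major allele (hypothetical helper).

    reference : str, the current reference sequence
    variants  : iterable of (pos, ref, alt, alt_af), where pos is 0-based and
                alt_af is the population frequency of the alternate allele
    threshold : the alternate allele is taken as major when alt_af > threshold
    """
    seq = list(reference)
    for pos, ref, alt, alt_af in variants:
        # Only bi-allelic single nucleotide sites are considered here.
        if len(ref) != 1 or len(alt) != 1:
            continue
        if seq[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        if alt_af > threshold:
            seq[pos] = alt
    return "".join(seq)


consensus = build_consensus(
    "ACGT",
    [(1, "C", "T", 0.8),   # reference carries the minor allele: replaced
     (2, "G", "A", 0.3)],  # reference already carries the major allele: kept
)
print(consensus)  # ATGT
```

A population-specific consensus follows from the same function by supplying allele frequencies computed within that population rather than across all samples.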
Fig. 3
How-to reference. For future or new populations, sequencing is followed by building the consensus sequence from those genomes. Any new genomes will only adjust and improve on the current consensus on the basis of a change in allele frequencies. Finally, the reference can be replicated and diversified into other population-specific references
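One way to picture the update step in this caption is to keep running allele counts per site, so that each newly sequenced genome shifts the allele frequency and the consensus base flips only when the frequency crosses 0.5. The sketch below is an assumption-laden illustration: the `SiteCounts` class, its field names, and the 0/1 genotype encoding are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Running allele counts for one bi-allelic SNP (hypothetical structure)."""
    ref: str
    alt: str
    alt_count: int = 0
    total: int = 0

    def add_genotype(self, genotype):
        # genotype is a pair such as (0, 1): 0 = ref allele, 1 = alt allele.
        self.alt_count += sum(genotype)
        self.total += len(genotype)

    @property
    def alt_af(self):
        return self.alt_count / self.total if self.total else 0.0

    @property
    def consensus_allele(self):
        # The alternate allele becomes the consensus once it is the major allele.
        return self.alt if self.alt_af > 0.5 else self.ref


site = SiteCounts(ref="A", alt="G")
for gt in [(0, 1), (0, 0)]:
    site.add_genotype(gt)
print(site.consensus_allele)  # A  (alt AF = 1/4)

for gt in [(1, 1), (1, 1)]:
    site.add_genotype(gt)
print(site.consensus_allele)  # G  (alt AF = 5/8)
```

Keeping one such set of counts per population would give the population-specific references mentioned in the caption, since each set yields its own major allele at every site.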
