Cell Rep Methods. 2023 Aug 2;3(8):100543.
doi: 10.1016/j.crmeth.2023.100543. eCollection 2023 Aug 28.

Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome

HoJoon Lee et al. Cell Rep Methods. 2023.

Abstract

The human pangenome, a new reference sequence, addresses many limitations of the current GRCh38 reference. The first release is based on 94 high-quality haploid assemblies from individuals with diverse backgrounds. We employed a k-mer indexing strategy for comparative analysis across multiple assemblies, including the pangenome reference, GRCh38, and CHM13, a telomere-to-telomere reference assembly. Our k-mer indexing approach enabled us to identify a valuable collection of universally conserved sequences across all assemblies, referred to as "pan-conserved segment tags" (PSTs). By examining intervals between these segments, we discerned highly conserved genomic segments and those with structurally related polymorphisms. We found 60,764 polymorphic intervals with unique geo-ethnic features in the pangenome reference. In this study, we utilized ultra-conserved sequences (PSTs) to forge a link between human pangenome assemblies and reference genomes. This methodology enables the examination of any sequence of interest within the pangenome, using the reference genome as a comparative framework.
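The k-mer indexing idea described in the abstract can be illustrated with a minimal Python sketch. It assumes that a pan-conserved k-mer is one that occurs exactly once in each assembly and is present in all of them, which is one reading of the PST definition. The function names `unique_kmers` and `pan_conserved_kmers` and the toy sequences are hypothetical; the sketch ignores reverse complements and N bases and is not the authors' implementation.

```python
from collections import Counter


def unique_kmers(sequence: str, k: int) -> set[str]:
    """Return the k-mers that occur exactly once in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def pan_conserved_kmers(assemblies: dict[str, str], k: int) -> set[str]:
    """Return k-mers that are unique within every assembly and shared by all.

    Assumption: "pan-conserved" means unique per assembly and present in all.
    """
    per_assembly = [unique_kmers(seq, k) for seq in assemblies.values()]
    return set.intersection(*per_assembly) if per_assembly else set()


# Toy input (hypothetical sequences, k=5 instead of the 31-mers used in the study).
toy_assemblies = {
    "assm1": "ACGTACGGTTCA",
    "assm2": "TTACGGTTCAGG",
    "assm3": "ACGGTTCAACGT",
}
shared = pan_conserved_kmers(toy_assemblies, k=5)
print(sorted(shared))
```

Runs of overlapping pan-conserved k-mers could then be merged into segments. On real assemblies, a disk-backed k-mer counter would be needed in place of an in-memory `Counter`.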

Keywords: k-mer; pan-conserved segment; pangenome; reference genome; structural polymorphism; structural variations.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
Identification of pan-conserved segment tags in HPRC assemblies and their properties based on GRCh38 coordinates. (A) A PST is defined as a set of consecutive unique sequences that is present in all assemblies. (B) The distribution of PSTs on GRCh38. PST density was calculated in 500 kb windows as the number of pan-conserved 31-mers divided by the window size. (C) The distribution of PSTs across the different types of genomic regions on chr20. Genomic regions containing N’s are annotated as N regions. (D) Change rate (%) in the number of pan-conserved 31-mers in relation to the number of included haploid assemblies.
Figure 2
Intervals between PSTs. (A) The three types of interval lengths relative to the interval length on GRCh38: (1) no arrangement: interval length on an assembly is identical to the length on GRCh38; (2) insertion: interval length on an assembly is larger than the length on GRCh38; and (3) deletion: interval length on an assembly is less than the length on GRCh38. (B) Interval lengths were measured between adjacent pan-conserved sequence pairs after sorting them by GRCh38 coordinates. Thanks to the high quality of HPRC assemblies, only a small number (<0.00001%) of tandem pairs of PSTs were on different contigs for a given haploid genome (e.g., S2 and S3 in Assm1). (C) The distribution of polymorphic intervals by the Shannon diversity index of the divergent lengths across assemblies. Si indicates the ith PST, while Assm stands for assembly.
Figure 3
Polymorphic intervals on chr18. (A) The locations of polymorphic intervals across chr18. Blue dots indicate the median of interval lengths, while gray dots indicate the interval length of an assembly. (B) A highly polymorphic interval, with a size of 4.67 kb as per GRCh38, exhibited 92 different lengths, resulting in a diversity index of 6.51. (C) A long polymorphic interval with the highest IQR of divergent lengths relative to the reference interval size of 47 kb. (D) A biallelic polymorphic interval with high frequency (0.457) has a bimodal distribution of lengths (658 bp deletion for 43 assemblies; no change for 51 assemblies). The entire region of this interval is annotated as LINE by RepeatMasker. (E) Population-specific intervals with divergent lengths only for AFR and AMR. The interval at chr16:82,043,368–82,043,731 had a deletion of 322 bp in intron 1 of HSD17B2. This deletion was present exclusively among the 25 AFR assemblies, 12 of which came from 6 homozygous individuals.
Figure 4
The SV plots using 31-mers from polymorphic intervals. (A) The scheme of plotting 31-mers from both reference and query assemblies to depict SVs. (B–G) Examples of different types of SVs: (B) insertion, (C) deletion, (D) tandem duplication, (E) multiple duplications, (F) inversion, and (G) complex SVs.
Figure 5
Anatomy of a complex SV. An SV plot displays the various components within a complex SV.
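The interval classification in Figure 2A and the length diversity in Figure 2C can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the logarithm base (2) is an assumption because the captions do not state it, and the 1,000 bp reference length in the example is made up. Only the 43/51 split and the 658 bp deletion come from the Figure 3D caption.

```python
import math
from collections import Counter


def classify_interval(assembly_len: int, reference_len: int) -> str:
    """Classify an interval by comparing its length with the GRCh38 length."""
    if assembly_len > reference_len:
        return "insertion"
    if assembly_len < reference_len:
        return "deletion"
    return "unchanged"  # type (1) in Figure 2A: identical length


def shannon_diversity(lengths: list[int]) -> float:
    """Shannon diversity of interval lengths across assemblies.

    Assumption: log base 2; the source does not specify the base.
    """
    total = len(lengths)
    counts = Counter(lengths)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Biallelic example modelled on Figure 3D: 43 assemblies carry a 658 bp
# deletion and 51 match the reference. The reference length is hypothetical.
reference_len = 1000
lengths = [reference_len - 658] * 43 + [reference_len] * 51

types = Counter(classify_interval(n, reference_len) for n in lengths)
diversity = shannon_diversity(lengths)
print(types, round(diversity, 3))
```

Under this definition, an interval with a single length across all assemblies has a diversity of 0, and the index grows with the number of distinct lengths and the evenness of their frequencies.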
