Mol Genet Genomics. 2024 Jul 7;299(1):65.
doi: 10.1007/s00438-024-02158-x.

Unveiling novel genetic variants in 370 challenging medically relevant genes using the long read sequencing data of 41 samples from 19 global populations


Yanfeng Ji et al. Mol Genet Genomics. 2024.

Abstract

Background: A large number of challenging medically relevant genes (CMRGs) are situated in complex or highly repetitive regions of the human genome, hindering comprehensive characterization of genetic variants using next-generation sequencing technologies. In this study, we employed long-read sequencing technology, extensively utilized in studying complex genomic regions, to characterize genetic alterations, including short variants (single nucleotide variants and short insertions and deletions) and copy number variations, in 370 CMRGs across 41 individuals from 19 global populations.

Results: Our analysis revealed high levels of genetic variants in CMRGs, with 68.73% exhibiting copy number variations and 65.20% containing short variants that may disrupt protein function across individuals. Such variants can influence pharmacogenomics, genetic disease susceptibility, and other clinical outcomes. We observed significant differences in CMRG variation across populations, with individuals of African ancestry harboring the highest number of copy number variants and short variants compared to samples from other continents. Notably, 15.79% to 33.96% of short variants were exclusively detectable through long-read sequencing. While the T2T-CHM13 reference genome significantly improved the assembly of CMRG regions, thereby facilitating variant detection in these regions, some regions still lacked resolution.

Conclusion: Our results provide an important reference for future clinical and pharmacogenetic studies, highlighting the need for a comprehensive representation of global genetic diversity in the reference genome and improved variant calling techniques to fully resolve medically relevant genes.

Keywords: Challenging medically relevant genes; Copy number variation; Genome sequencing; Long read sequencing; Short insertion and deletion; Single nucleotide polymorphism.

Conflict of interest statement

Declarations

Conflicts of interest: F.S. receives research support from Illumina, PacBio, and Oxford Nanopore.

Figures

Fig. 1
Assessing CMRGs in the GRCh38 and T2T-CHM13 assemblies using PacBio HiFi data from the T2T-CHM13 assembly. A Numbers of CMRGs that contain windows with depth of coverage (DoC) values significantly deviating from the average DoC values of protein-coding gene regions in the GRCh38 and T2T-CHM13 assemblies. B, C Windows with significantly greater DoC values than the average DoC values of the protein-coding gene regions that overlap with the first two exons of TERT in both the GRCh38 (B) and T2T-CHM13 (C) assemblies
Fig. 2
Summary of CMRGs with CNV signals in the global populations. A Summary of CNV signals within 323 CMRGs in the T2T-CHM13 assembly. The values give the numbers and percentages of CMRGs. A region is likely to carry a minor allele of the global populations when CNV signals are detected at one locus in > 95% of the samples. CNV signals were detected in all regions (left panel) and non-segmental duplication regions (right panel) of 323 CMRGs. B A Vietnamese sample (HG02059) exhibited an individual-specific inversion duplication affecting the first six exons of FLAD1 in the T2T-CHM13 assembly. Left panel: normalized depth of coverage at the FLAD1 locus. Right panel: alignment of one LRS read of HG02059 against the T2T-CHM13 assembly using LASTZ version 1.04.15 with default parameters (Harris 2007). The gray and red lines indicate the average DoC values of protein-coding gene regions and DoC + 3*SD, respectively. C Three African-specific duplications were identified at the CYP4F12 locus in the T2T-CHM13 assembly. HG02011 carries a duplication (Chr19:15,798,417–15,807,516) affecting the whole gene body of CYP4F12. HG02818 and NA19239 carry a ~ 1,400 bp (Chr19:15,812,817–15,814,216) duplication in intron 9 and a ~ 11,600 bp (Chr19:15,809,717–15,821,316) duplication overlapping exons 8 to 12 of CYP4F12. No segmental duplication was identified in this region. The gray and red lines indicate the average DoC values of protein-coding gene regions and DoC + 3*SD, respectively. D A ~ 12,000 bp (Chr6:35,607,140–35,619,359) duplication affecting the whole gene body of CLPS (not a CMRG) and the first two exons of LHFPL5 in the T2T-CHM13 assembly was identified in samples across super-continental populations. No segmental duplication was detected in this region. The gray and red lines indicate the average DoC values of protein-coding gene regions and DoC + 3*SD, respectively
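The "DoC + 3*SD" line in the Fig. 2 panels can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `flag_high_doc_windows`, the use of the population standard deviation, and all depth values below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def flag_high_doc_windows(window_doc, background_doc, n_sd=3.0):
    """Return indices of windows whose depth exceeds mean + n_sd * SD.

    window_doc: normalized depth-of-coverage values for windows in a locus.
    background_doc: normalized depth values standing in for protein-coding
        gene regions (hypothetical data, not taken from the paper).
    """
    mu = mean(background_doc)
    sd = pstdev(background_doc)  # assumption: population SD; the paper does not specify
    threshold = mu + n_sd * sd
    return [i for i, depth in enumerate(window_doc) if depth > threshold]


# Hypothetical normalized DoC values: windows 2 and 3 look duplicated.
background = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
windows = [1.0, 1.02, 2.1, 2.0, 0.98]
flagged = flag_high_doc_windows(windows, background)
print(flagged)
```

With these made-up values the threshold is roughly 1.19, so only the two windows near a normalized depth of 2 are flagged as candidate duplication signals.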
Fig. 3
Summary of short variants detected among 323 CMRGs in the T2T-CHM13 assembly using LRS data. A Numbers of short variants (SNVs and InDels) in samples from humans with different ancestries. The X-axis indicates sample ID with the sequencing technology in brackets, while the Y-axis represents the numbers of SNVs and InDels. SNV single nucleotide variant; InDel insertion and deletion; AFR African; EUR European; AMR American; EAS East Asian; SAS South Asian. B Distribution summary of 154,675 short variants among 323 CMRGs in global populations. C Functional annotation of 152,298 short variants in CMRGs using VEP analysis. The numbers before and after the slash are the numbers and percentages of variants of each functional consequence. When multiple consequences were predicted for a variant, the most severe one was used, based on the order of severity estimated by VEP. The colors from light yellow to black indicate the severity of the functional consequences, from low to high, based on VEP
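The rule in panel C, keeping only the most severe predicted consequence per variant, can be sketched as follows. The list `SEVERITY_ORDER` is an abbreviated, illustrative subset of consequence terms ranked from most to least severe; it is an assumption for this example and not the full ordering used by VEP or by the authors. The helper name `most_severe` is likewise hypothetical.

```python
# Abbreviated, illustrative ranking (most severe first); an assumption,
# not the complete VEP consequence ordering.
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
    "intergenic_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY_ORDER)}


def most_severe(consequences):
    """Return the highest-ranked consequence term, or None if none is known."""
    known = [c for c in consequences if c in RANK]
    if not known:
        return None
    # A lower rank index means a more severe consequence.
    return min(known, key=RANK.__getitem__)


# A variant annotated on two transcripts with different predicted effects.
print(most_severe(["intron_variant", "missense_variant"]))
```

Terms missing from the abbreviated list are ignored here; a real analysis would need the complete ordering from the annotation tool's documentation.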
Fig. 4
Side-by-side comparison of the short variant calls based on the SRS and LRS data of 13 individuals. A More short variants were identified in the CMRGs per sample using PacBio HiFi data than when using SRS data. X- and Y-axes indicate the numbers of short variants and sample ID, respectively. B Significantly more SNVs were detected in segmental duplication and low-complexity/simple repeat regions using PacBio HiFi data (p < 0.01, Wilcoxon two-tailed test) than when using the SRS data. Differences were observed in the exonic, intronic, LINE, and SINE regions, but were not statistically significant. *: p < 0.01, ****: p < 0.00001. C A C to T mutation generated a premature stop codon in DUX4. This mutation was only detected using PacBio data, since SRS reads cannot be reliably mapped to this region. D A C to T mutation caused a stop codon loss in KIR2DL3. This mutation was only detected using PacBio data. E The GCTGAAAAGACA to G InDel generated a frameshift mutation in the open reading frame of NUTM2B. This InDel was only detected using PacBio data. F A 41 bp insertion that caused a frameshift mutation in the open reading frame of CHMP1A was only detectable using PacBio data. This region is difficult to resolve using SRS reads, because the depth of coverage in the region dropped considerably. The red arrow indicates the direction of transcription

