Nature. 2020 May;581(7809):452-458.
doi: 10.1038/s41586-020-2329-2. Epub 2020 May 27.

Transcript expression-aware annotation improves rare variant interpretation

Beryl B Cummings et al. Nature. 2020 May.

Erratum in

  • Author Correction: Transcript expression-aware annotation improves rare variant interpretation.
    Cummings BB, Karczewski KJ, Kosmicki JA, Seaby EG, Watts NA, Singer-Berk M, Mudge JM, Karjalainen J, Satterstrom FK, O'Donnell-Luria AH, Poterba T, Seed C, Solomonson M, Alföldi J; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium; Daly MJ, MacArthur DG. Nature. 2021 Feb;590(7846):E54. doi: 10.1038/s41586-020-03175-7. PMID: 33536626.

Abstract

The acceleration of DNA sequencing in samples from patients and population studies has resulted in extensive catalogues of human genetic variation, but the interpretation of rare genetic variants remains problematic. A notable example of this challenge is the existence of disruptive variants in dosage-sensitive disease genes, even in apparently healthy individuals. Here, by manual curation of putative loss-of-function (pLoF) variants in haploinsufficient disease genes in the Genome Aggregation Database (gnomAD) [1], we show that one explanation for this paradox involves alternative splicing of mRNA, which allows exons of a gene to be expressed at varying levels across different cell types. Currently, no existing annotation tool systematically incorporates information about exon expression into the interpretation of variants. We develop a transcript-level annotation metric known as the 'proportion expressed across transcripts', which quantifies isoform expression for variants. We calculate this metric using 11,706 tissue samples from the Genotype Tissue Expression (GTEx) project [2] and show that it can differentiate between weakly and highly evolutionarily conserved exons, a proxy for functional importance. We demonstrate that expression-based annotation selectively filters 22.8% of falsely annotated pLoF variants found in haploinsufficient disease genes in gnomAD, while removing less than 4% of high-confidence pathogenic variants in the same genes. Finally, we apply our expression filter to the analysis of de novo variants in patients with autism spectrum disorder and intellectual disability or developmental disorders to show that pLoF variants in weakly expressed regions have similar effect sizes to those of synonymous variants, whereas pLoF variants in highly expressed exons are most strongly enriched among cases. Our annotation is fast, flexible and generalizable, making it possible for any variant file to be annotated with any isoform expression dataset, and will be valuable for the genetic diagnosis of rare diseases, the analysis of rare variant burden in complex disorders, and the curation and prioritization of variants in recall-by-genotype studies.

Conflict of interest statement

K.J.K. owns stock in Personalis. A.H.O’D.-L. has received honoraria from ARUP and Chan Zuckerberg Initiative. M.J.D. is a founder of Maze Therapeutics. D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme.

Figures

Fig. 1. Curation of pLoF variants in haploinsufficient disease genes found in gnomAD reveals transcript errors as a major confounding error mode in variant annotation.
We identified and manually curated 401 pLoF variants in the gnomAD dataset in 61 haploinsufficient severe developmental delay genes and flagged any reason the pLoF may not be a true LoF variant. Top, the frequency of each error mode present in the 306 variants classified as unlikely to be a true LoF. Transcript errors emerge as a major putative error mode in the annotation of these pLoF variants. Bottom, bee swarm plot shows the average pext score across GTEx tissues for each variant in the error categories. The pext values are distinctly lower for variants that are annotated as possible transcript errors than for variants in the other error modes (P = 4.1 × 10^−38, two-sided Wilcoxon test between transcript errors and other error modes).
Fig. 2. Summary of transcript-expression based annotation method.
a, Overview of transcript-aware annotation. Most genes have many annotated isoforms, which can have varying expression patterns across tissues. Using the number of reads aligning to exonic regions in transcriptome datasets as a proxy for exon expression (top, black) has confounding effects, due to 3′ bias. In this example, although exons 3 and 8 have markedly different expression levels in brain cortex, the number of reads aligning to the two exons is similar, and this masks the differences in exon usage. Transcript-aware annotation defines the expression of every variant as the sum of transcripts that have the same annotation. The resulting transcript-level expression plots do not exhibit 3′ bias, and reveal differences in exon usage, such as those in exons 3 and 8, across tissues. b, Example of utility of transcript-expression based annotation. There are 20 high-quality pLoF variants in the haploinsufficient developmental delay gene TCF4 in gnomAD, annotated as dashed lines and arrows. All 20 variants have no evidence of expression in the GTEx dataset, which suggests that functional TCF4 protein can be made in the presence of these variants.
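As a rough illustration of the calculation described in Fig. 2a, the Python sketch below computes a per-tissue proportion for one variant: the summed expression of the transcripts on which the variant carries a given annotation, divided by the summed expression of all transcripts of the gene. The normalization by total gene expression is an assumption drawn from the name 'proportion expressed across transcripts'. The function names, transcript identifiers, consequence label and expression values are hypothetical and are not taken from the authors' pipeline.

```python
def pext_per_tissue(variant_annotations, transcript_expression, consequence):
    """Proportion of a gene's expression carried by transcripts on which
    the variant has the given consequence, computed separately per tissue.

    variant_annotations: dict of transcript ID -> consequence for ONE variant.
    transcript_expression: dict of tissue -> {transcript ID -> expression},
        covering every transcript of the gene (assumed input layout).
    """
    scores = {}
    for tissue, expression in transcript_expression.items():
        gene_total = sum(expression.values())
        annotated = sum(
            expression.get(transcript, 0.0)
            for transcript, csq in variant_annotations.items()
            if csq == consequence
        )
        # Assumption: a gene with no expression in a tissue gets no score.
        scores[tissue] = annotated / gene_total if gene_total > 0 else None
    return scores


def mean_pext(per_tissue_scores):
    """Average the per-tissue values, skipping tissues without a score."""
    values = [v for v in per_tissue_scores.values() if v is not None]
    return sum(values) / len(values) if values else None


# Toy, made-up numbers: the variant is a stop-gained change only on "tx2",
# which is weakly expressed in one tissue and co-dominant in the other.
toy_expression = {
    "tissue_a": {"tx1": 9.0, "tx2": 1.0},
    "tissue_b": {"tx1": 2.0, "tx2": 2.0},
}
toy_annotations = {"tx2": "stop_gained"}
toy_scores = pext_per_tissue(toy_annotations, toy_expression, "stop_gained")
print(toy_scores, mean_pext(toy_scores))
```

Keeping the per-tissue values separate from their mean reflects the captions, which refer both to tissue-specific expression and to the average pext across GTEx.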
Fig. 3. Functional validation of transcript-expression based annotation.
a, We define highly conserved and unconserved regions as phyloCSF > 1,000 (n = 9,817) and phyloCSF < −100 (n = 11,860), respectively, and compare the expression status of these regions across GTEx. Regions with high phyloCSF scores are enriched for near-constitutive expression, whereas unconserved regions are enriched for little to no usage across GTEx. This difference is significant after correcting for gene length (logistic regression P < 1 × 10^−100). We note that unconserved regions with high levels of expression (pext > 0.9) are enriched for immune-related genes, which are selected for diversity and thus have low conservation, but represent true coding regions. b, Transcript-expression based annotation recapitulates, and adds information to, existing interpretation tools. High-confidence pLoF LOFTEE variants in gnomAD with no flags (n = 458,880) are enriched for higher pext values, whereas high-confidence pLoF variants falling on low phyloCSF (n = 44,373) or unlikely open-reading frame regions (n = 2,437) are enriched for low expression. However, high-confidence pLoF variants can also have a low pext score. Variants flagged as falling on regions that are unlikely open-reading frames or have weak conservation are enriched for lower pext values. Red dots denote the median pext value across GTEx. c, Non-synonymous variants found on near-constitutive regions tend to be more deleterious. We compared the MAPS score for variants with low (pext < 0.1), medium (0.1 ≤ pext ≤ 0.9) and high (pext > 0.9) expression. Variants with near-constitutive expression have a higher MAPS score, which indicates higher deleteriousness than those with little to no evidence of expression. Points represent MAPS values and error bars denote the 95% confidence interval. Dashed grey and orange lines represent MAPS values for all gnomAD missense and synonymous variants, respectively. The number of variants evaluated per category and unadjusted proportion singleton values can be found in Supplementary Table 5a.
Fig. 4. Transcript-expression based annotation aids Mendelian variant interpretation.
a, Comparison of the proportion of high-quality pLoF variants filtered in a curated list of 61 haploinsufficient developmental delay genes in gnomAD versus ClinVar with a cut-off value of average pext across GTEx ≤ 0.1 (low expression). Expression-based filtering results in removal of 22.8% of gnomAD pLoFs and 3.8% of a confidently curated set of pLoFs in ClinVar. b, Expression-based annotation filters 30% of pLoF variants found in gnomAD in a homozygous state in at least one individual, and 3.2% of any pLoF variants found in the same genes in ClinVar. c, We extended this filtering approach to pLoF and synonymous variants in gnomAD pLoF-intolerant genes (defined by LOEUF < 0.35). This filters 16.8% of pLoF and 5.2% of synonymous variants. The total number of high-quality variants considered in each group is shown. For all pLoFs, only high-confidence LOFTEE variants were considered. P values were determined by two-sided Fisher’s exact test for counts.
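The filtering step in this caption can be sketched in a few lines of Python, assuming each variant already has an average pext value across tissues. The cut-off of 0.1 comes from the caption; the function names, variant identifiers and scores are hypothetical placeholders, not data from the study.

```python
LOW_EXPRESSION_CUTOFF = 0.1  # "average pext across GTEx <= 0.1" in the caption


def split_by_expression(mean_pext_by_variant, cutoff=LOW_EXPRESSION_CUTOFF):
    """Separate variants into those kept and those filtered as low expression."""
    kept, filtered = {}, {}
    for variant_id, score in mean_pext_by_variant.items():
        # Variants at or below the cut-off are treated as weakly expressed.
        target = filtered if score <= cutoff else kept
        target[variant_id] = score
    return kept, filtered


def fraction_filtered(mean_pext_by_variant, cutoff=LOW_EXPRESSION_CUTOFF):
    """Share of variants removed by the expression filter."""
    if not mean_pext_by_variant:
        return 0.0
    _, filtered = split_by_expression(mean_pext_by_variant, cutoff)
    return len(filtered) / len(mean_pext_by_variant)


# Made-up identifiers and scores, for illustration only.
toy_variants = {"var_a": 0.02, "var_b": 0.10, "var_c": 0.55, "var_d": 0.97}
kept, filtered = split_by_expression(toy_variants)
print(sorted(filtered), fraction_filtered(toy_variants))
```

Running the same function on a population set and on a curated pathogenic set gives the two proportions that a caption of this kind compares.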
Fig. 5. Application of transcript-expression based annotation to de novo variant analyses in ASD and DD/ID.
a, b, Transcript-expression-based analyses in patients with DD/ID (a) or ASD (b). We find that de novo pLoF variants found on near-constitutively expressed regions in GTEx brain tissues have larger effect sizes than de novo LoF variants in weakly expressed regions in both disorders. Notably, de novo pLoF variants found on regions with little evidence for expression are distributed between cases and controls much as de novo synonymous variants are, which suggests that such variants can be removed from gene burden analyses to boost discovery power. The high pext expression bin contains 46.1%, 42.3% and 11.4%, and the low-expression bin contains 4.0%, 6.0% and 11.4% of 1,249, 752 and 166 de novo pLoF variants found in patients with DD/ID, ASD and controls, respectively. Points represent the rate ratio estimate and error bars represent the 95% confidence interval from the Poisson exact test.
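For readers unfamiliar with the statistic in this caption, the Python sketch below shows one standard way to compare two Poisson rates exactly: conditional on the total number of events, the case count follows a binomial distribution under the null hypothesis of equal rates. The caption does not say how the test was implemented, so this is an assumption about the general technique rather than a reproduction of the authors' analysis. The function names are hypothetical, and the sketch returns only the rate ratio and a two-sided P value, not the confidence interval.

```python
from math import exp, lgamma, log


def rate_ratio(case_events, case_n, control_events, control_n):
    """Ratio of per-individual event rates, cases over controls."""
    case_rate = case_events / case_n
    control_rate = control_events / control_n
    return case_rate / control_rate if control_rate > 0 else float("inf")


def poisson_exact_p(case_events, case_n, control_events, control_n):
    """Two-sided exact P value for equal rates in two Poisson samples.

    Given the total event count, the case count is Binomial(total, p_null)
    under the null, with p_null the cases' share of all individuals.
    """
    total = case_events + control_events
    p_null = case_n / (case_n + control_n)

    def binom_pmf(k):
        # Computed in log space so large counts do not overflow.
        log_coeff = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        return exp(log_coeff + k * log(p_null) + (total - k) * log(1 - p_null))

    pmf = [binom_pmf(k) for k in range(total + 1)]
    observed = pmf[case_events]
    # Sum every outcome that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-7)))
```

With made-up counts such as 20 events among 100 cases and 2 events among 100 controls, `rate_ratio` gives 10 and `poisson_exact_p` gives a small P value, whereas equal counts in equally sized groups give a ratio of 1 and a P value of 1.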

References

    1. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature doi:10.1038/s41586-020-2308-7 (2020).
    2. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
    3. MacArthur, D. G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014).
    4. Goldstein, D. B. et al. Sequencing studies in human genetics: design and interpretation. Nat. Rev. Genet. 14, 460–470 (2013).
    5. Dick, I. E., Joshi-Mukherjee, R., Yang, W. & Yue, D. T. Arrhythmogenesis in Timothy Syndrome is associated with defects in Ca2+-dependent inactivation. Nat. Commun. 7, 10370 (2016).