Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction

Adam Frankish, Barbara Uszczynska, Graham R S Ritchie, Jose M Gonzalez, Dmitri Pervouchine, Robert Petryszak, Jonathan M Mudge, Nuno Fonseca, Alvis Brazma, Roderic Guigo, Jennifer Harrow

PMID: 26110515
PMCID: PMC4502323
DOI: 10.1186/1471-2164-16-S8-S2

Adam Frankish et al. BMC Genomics. 2015;16 Suppl 8(Suppl 8):S2. doi: 10.1186/1471-2164-16-S8-S2. Epub 2015 Jun 18.

Abstract

Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation, and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.

Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.

Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.

Figures

**Figure 1**
**General properties of GENCODE and RefSeq protein-coding genes**. A) Mean number of alternatively spliced transcripts per multi-exon protein-coding locus B) Mean number of unique CDS per multi-exon protein-coding locus C) Mean number of unique (non-redundant) exons per multi-exon protein-coding locus D) Percentage genomic coverage of unique (non-redundant) exons at multi-exon protein-coding loci

**Figure 2**
**Common and unique annotated features of GENCODE and RefSeq protein-coding genes**. Venn diagram to show intersection between A) transcripts annotated at GENCODE Comprehensive and RefSeq NXR protein-coding loci B) unique (non-redundant) translations annotated at GENCODE Comprehensive and RefSeq NXR protein-coding loci C) unique (non-redundant) exons annotated at GENCODE Comprehensive and RefSeq NXR protein-coding loci

**Figure 3**
**Expression of GENCODE and RefSeq exons and introns**. Cumulative distributions of RNAseq read count for GENCODE-only (Red), RefSeq-only (Blue) and GENCODE-RefSeq common (Green) exons and introns. A) Maximum expression, i.e. read density in the sample with the highest expression B) Median expression, i.e. read density in the sample with median expression

**Figure 4**
**Non-concordance of variant functional annotation**. Percentage non-concordant annotation, i.e. variants with annotation in only one dataset (unique) or different annotation between datasets (discordant). The variants are represented in four broad classes (CDS, other, splice and LoF), with comparisons between GENCODE Comprehensive and RefSeq NXR using 1KG data (Blue), GENCODE Basic and RefSeq NXR using 1KG data (Red), GENCODE Comprehensive and RefSeq NXR using ESP data (Green), and GENCODE Basic and RefSeq NXR using ESP data (Purple).
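The non-concordance measure in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the variant identifiers and the choice of denominator (every variant annotated by at least one geneset) are assumptions made for illustration, and the consequence labels are Sequence Ontology terms used only as example values.

```python
from collections import Counter

def classify(gencode_call, refseq_call):
    """Label one variant as 'concordant', 'discordant' or 'unique'.

    A call of None means the geneset gives the variant no annotation.
    Returns None when neither geneset annotates the variant.
    """
    if gencode_call is None and refseq_call is None:
        return None
    if gencode_call is None or refseq_call is None:
        return "unique"  # annotated in only one dataset
    return "concordant" if gencode_call == refseq_call else "discordant"

def percent_non_concordant(calls):
    """Percentage of annotated variants that are unique or discordant.

    `calls` maps a variant ID to a (gencode_call, refseq_call) pair.
    Assumption: the denominator is all variants annotated by at least
    one geneset; the paper may define it differently.
    """
    counts = Counter()
    for gencode_call, refseq_call in calls.values():
        label = classify(gencode_call, refseq_call)
        if label is not None:
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (counts["unique"] + counts["discordant"]) / total

# Hypothetical toy input, not data from the paper.
calls = {
    "var1": ("stop_gained", "stop_gained"),             # concordant
    "var2": ("missense_variant", None),                 # unique to one set
    "var3": ("splice_donor_variant", "intron_variant"), # discordant
    "var4": ("synonymous_variant", "synonymous_variant"),
}
print(percent_non_concordant(calls))
```

In practice the per-variant calls would come from running a variant annotation tool once per reference geneset and joining the outputs on variant ID, with the percentage computed separately for each class (CDS, other, splice, LoF).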
