Genome annotation for clinical genomic diagnostics: strengths and weaknesses

Review

Genome Med. 2017 May 30;9(1):49.

Authors

Charles A Steward¹,², Alasdair P J Parker³, Berge A Minassian⁴,⁵, Sanjay M Sisodiya⁶,⁷, Adam Frankish⁸,⁹, Jennifer Harrow⁸,¹⁰

Affiliations

¹ Congenica Ltd, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1DR, UK. charles.steward@congenica.com.
² The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. charles.steward@congenica.com.
³ Addenbrooke's Hospital and University of Cambridge, Cambridge, CB2 0QQ, UK.
⁴ Department of Pediatrics (Neurology), University of Texas Southwestern, Dallas, TX, USA.
⁵ Program in Genetics and Genome Biology and Department of Paediatrics (Neurology), The Hospital for Sick Children and University of Toronto, Toronto, Canada.
⁶ Department of Clinical and Experimental Epilepsy, UCL Institute of Neurology, London, WC1N 3BG, UK.
⁷ Chalfont Centre for Epilepsy, Chesham Lane, Chalfont St Peter, Buckinghamshire, SL9 0RJ, UK.
⁸ The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
⁹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
¹⁰ Illumina Inc, Great Chesterford, Essex, CB10 1XL, UK.

PMID: 28558813
PMCID: PMC5448149
DOI: 10.1186/s13073-017-0441-1

Abstract

The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.


Figures

**Fig. 1**
The genome analysis pipeline. Note that, for clarity, some steps have been omitted. Figure illustrations are not to scale and are only meant to be illustrative of the differences between short- and long-read sequencing. a Unaligned reads from sequencing machines are stored in the FASTQ file format. This is a text-based format for storing both a DNA sequence and its corresponding quality scores. b Reads are aligned to the genome. Short reads provide deep coverage, whereas reads that have been sequenced from both ends (*blue arrows*) help to orientate unaligned contigs. It is difficult to align short reads confidently across repetitive sequences when the repeating genome sequence is longer than the sequence read. Long-read sequences help to order contigs across larger regions, particularly with repetitive sequences, but do not provide the depth needed to call a base confidently at a given position. Note that there is a large region where there is no read coverage at all. This is indicative of structural variation. Here, the patient has a large deletion with respect to the reference genome. Once the reads have been aligned to the reference genome they are stored in a BAM file. A BAM file (.bam) is the binary version of a sequence alignment map (SAM) file. The latter is a tab-delimited text-based format for storing DNA sequences aligned to a reference sequence. c The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing genetic sequence variations. VCF files are much smaller than FASTQ and BAM files. Note that single-nucleotide variants (SNVs) and small insertions and deletions (‘indels’) are illustrated as *red* and *purple blocks*, whereas a much larger structural variant is indicated by an *orange block*
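
To make the file formats in Fig. 1 more concrete, the following is a minimal Python sketch, not a production parser. It reads four-line FASTQ records, decodes quality characters into Phred scores, and sorts a VCF REF/ALT pair into the three variant classes drawn in panel c. The Phred+33 quality offset, the 50 bp threshold separating indels from structural variants, and the names `parse_fastq` and `classify_variant` are assumptions made for illustration; none of them is taken from the article.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(text: str, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    lines = [ln for ln in text.splitlines() if ln]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        # Each quality character encodes a Phred score as ASCII code minus offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


def classify_variant(ref: str, alt: str, sv_min_length: int = 50) -> str:
    """Roughly classify a VCF REF/ALT pair (threshold is an assumed convention)."""
    if alt.startswith("<"):  # symbolic allele such as <DEL> or <DUP>
        return "structural variant"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    if abs(len(ref) - len(alt)) >= sv_min_length:
        return "structural variant"
    return "indel"


if __name__ == "__main__":
    for record in parse_fastq("@read1\nGATTACA\n+\nIIIIII!\n"):
        print(record.name, record.sequence, record.qualities)
    print(classify_variant("A", "G"), classify_variant("AT", "A"))
```

In practice, established libraries would normally be used to read these formats; the sketch only shows what information each format carries.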

**Fig. 2**
The generic gene model (not to scale). a The exons comprise the untranslated regions (UTRs), which are shown in *red* (the 5′ UTR depicted on the *left* and the 3′ UTR depicted on the *right*) and the coding sequence (CDS), which is shown in *green*. Many important regulatory regions lie outside of the exons of a gene. Intronic regulatory regions are shown in *grey*. Promoters are illustrated as *yellow* intergenic regulatory regions, although some genes have internal transcription start sites. The transcription start site (*TSS*) is positioned at the 5′ end of the UTR, where transcription starts. The 5′ UTRs of genes contain regulatory regions. The CDS start codon is the first codon of a messenger RNA (mRNA) from which a ribosome translates. The genomic sequence around the start codon often has the consensus sequence gccAcc|**AUG**|G (note that the important bases are highlighted here in bold, whereas the most crucial positions are –3 and +4 from the A of the AUG) [197], although, in very rare cases, a non-AUG start codon is used [198]. The stop codon, of which there are three in eukaryotes—UGA, UAG, UAA—is a nucleotide triplet sequence in an mRNA that gives the signal to terminate translation by binding release factors, causing the ribosome to release the peptide chain [199]. The 3′ untranslated region of genes contains regulatory regions. In particular, the 3′ UTR has binding sites for regulatory proteins such as RNA-binding proteins (*RBP*) and microRNAs (*miRNA*). Promoters are DNA sequences, between 100 and 1000 bp in length, where proteins that help control gene transcription bind to DNA [200]. These proteins can contain one or more DNA-binding domains that attach to a specific DNA sequence located next to the relevant gene [201]. Promoters regulate transcriptional machinery by moving it to the right place in the genome, as well as locating the 5′ end of the gene or an internal transcription start site. 
Approximately 40% of human genes have promoters situated in regions of elevated cytosine and guanine content, termed CpG islands [202]. A subset of promoters incorporate the variable TATA box sequence motif, which is found between 25 and 30 bp upstream of the TSS, the position at the 5′ end of the UTR where transcription starts [203]. b–d Pre-mRNA transcribed from DNA contains both introns and exons. An RNA and protein complex called the spliceosome undertakes the splicing out of introns, leaving the constitutive exons. Intronic and exonic splice enhancers and silencers help direct this procedure, such as the branch point (‘A’) and a poly-pyrimidine (*poly-py*) tract. The vast majority of introns have a GT sequence at the 5′ end that the branch point binds to. The intron is then cleaved from the 5′ exon (donor site) and then from the 3′ exon (acceptor site) [204] and a phosphodiester bond joins the exons, whereas the intron is discarded and degraded. During the formation of mature mRNA, the pre-mRNA is cleaved and polyadenylated. Polyadenylation occurs between 10 and 30 bp downstream from a hexamer recognition sequence that is generally AAUAAA, or AUUAAA, although other hexamer signal sequences are known [35] (as depicted in a). A specially modified nucleotide at the 5′ end of the mRNA, called the 5′ cap, helps with mRNA stability while it undergoes translation. This capping process occurs in the nucleus and is a vital procedure that creates the mature mRNA. e The translation of mRNA into protein by ribosomes occurs in the cytosol. Transfer RNAs (tRNAs), which carry specific amino acids, are read by the ribosome and then bound in a complementary manner to the mRNA. The amino acids are joined together into a polypeptide chain to generate the complete protein sequence for the coding sequence of the transcript. (*Light blue background shading* shows processes that occur in the nucleus. *Light yellow background shading* shows processes that occur in the cytosol, such as the translation of mRNAs into protein by ribosomes)
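
Several of the sequence signals described in Fig. 2 can be located with simple string operations. The Python sketch below is illustrative only. It scores the start-codon context at positions –3 and +4 against the consensus quoted in the caption, finds the first in-frame stop codon, and reports the two most common polyadenylation hexamers. The function names, the "weak/adequate/strong" labels and the toy mRNA are assumptions made here, and real annotation pipelines rely on far more evidence than motif matching.

```python
import re
from typing import List, Optional, Tuple

STOP_CODONS = {"UGA", "UAG", "UAA"}


def kozak_strength(mrna: str, aug_index: int) -> str:
    """Score the AUG context at -3 and +4, counting the A of AUG as +1."""
    minus3 = mrna[aug_index - 3] if aug_index >= 3 else ""
    plus4 = mrna[aug_index + 3] if aug_index + 3 < len(mrna) else ""
    # Consensus from the caption: A at -3 and G at +4 (labels are illustrative).
    matches = (minus3 == "A") + (plus4 == "G")
    return ("weak", "adequate", "strong")[matches]


def find_cds(mrna: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the ORF from the first AUG to its in-frame stop.

    Coordinates are 0-based and the end is exclusive; the stop codon is included.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None  # no in-frame stop codon found


def polya_signal_positions(mrna: str) -> List[int]:
    """0-based start positions of AAUAAA or AUUAAA hexamers."""
    return [m.start() for m in re.finditer(r"A[AU]UAAA", mrna)]


if __name__ == "__main__":
    # Toy transcript: 5' UTR, AUG GCU UGA, then a 3' UTR containing AAUAAA.
    mrna = "GCCACCAUGGCUUGACCAAUAAAGG"
    cds = find_cds(mrna)
    print("CDS:", cds)
    print("Start context:", kozak_strength(mrna, cds[0]))
    print("Poly(A) signals at:", polya_signal_positions(mrna))
```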

**Fig. 3**
Alternative splicing transcript variants. Different types of alternative splicing can give rise to transcripts that are functionally distinct from a nominal reference model. *Red* represents the untranslated region (UTR) and *green* represents the coding sequence (CDS). The retained intron is illustrated as non-coding as a retained intron is presumed to represent an immature transcript. Some transcripts can contain exons that are mutually exclusive (*boxed*). All the types of alternative exon splicing events shown here can also occur in non-coding genes. There can also be multiple alternative poly(A) features within the gene models, as seen for the skipped-exon transcript

**Fig. 4**
The nonsense-mediated decay (NMD) pathway. Under normal cellular circumstances, exon–exon junction complexes (*EJCs*) that are in place after splicing are removed by the ribosome during the first round of translation. However, when a transcript contains a premature termination codon (*PTC*), perhaps as a result of a single-nucleotide variant (SNV), indel or inclusion of an out-of-frame exon upstream of one or more EJCs, these EJCs remain in place because the ribosome complex dissociates at the premature stop codon and thus cannot remove the downstream EJC. This triggers the NMD pathway, and the transcript is degraded
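
The rule illustrated in Fig. 4 can be expressed as a short check on transcript coordinates: a transcript is a candidate for NMD if at least one exon–exon junction remains downstream of the stop codon. The Python sketch below is a deliberate simplification of that idea. The function names and the example exon lengths are hypothetical, and annotation pipelines commonly add a distance threshold between the stop codon and the last junction (often quoted as roughly 50 nucleotides), which the caption does not describe and which is omitted here.

```python
from typing import List


def exon_junctions(exon_lengths: List[int]) -> List[int]:
    """Transcript coordinates (0-based) of the first base after each junction."""
    junctions, total = [], 0
    for length in exon_lengths[:-1]:  # the last exon has no downstream junction
        total += length
        junctions.append(total)
    return junctions


def predicted_nmd(exon_lengths: List[int], stop_codon_end: int) -> bool:
    """True if any exon-exon junction lies at or after the end of the stop codon.

    stop_codon_end is the 0-based transcript coordinate just past the stop codon.
    """
    return any(j >= stop_codon_end for j in exon_junctions(exon_lengths))


if __name__ == "__main__":
    exons = [100, 120, 200]  # hypothetical three-exon transcript
    print(predicted_nmd(exons, stop_codon_end=400))  # stop in the last exon
    print(predicted_nmd(exons, stop_codon_end=150))  # premature stop in exon 2
```

With these hypothetical lengths the junctions fall at coordinates 100 and 220, so a stop codon ending at 400 leaves no downstream junction, whereas one ending at 150 leaves the second junction in place.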

**Fig. 5**
The processes involved in the ‘pseudogenisation’ of genes. a Processed pseudogenes are derived from mature mRNA that is reverse-transcribed by the viral L1 repeat enzyme reverse-transcriptase and reintegrated into the genome, and will generally lack introns. Processed pseudogenes are often flanked by direct repeats that might have some function in inserting the pseudogene into the genome and they are often missing sequence compared with their parent. Often they terminate in a series of adenines, which are the remains of the poly(A) tail, which is the site of genomic integration. b Unprocessed pseudogenes—the defunct relatives of functional genes—arise from genomic duplication. Such duplications can be complete or partial with respect to the parent gene

**Fig. 6**
Different classifications of long non-coding RNAs (lncRNAs). The classification of lncRNAs is based on their position with respect to coding genes. lncRNAs are illustrated here with only *red* exons, whereas the coding genes are shown as *red* and *green*. *AS* antisense, *BDP* bi-directional promoter, *lincRNA* long-intergenic RNA (not overlapping a protein-coding locus on either strand), *OS* overlapping sense, *O3′* overlapping 3′, *SI* sense intronic. Figure adapted from Wright 2014 [84]

**Fig. 7**
Examples of genome browsers. a Screenshot of Ensembl genome browser showing the transcript splicing variants for the gene *KCNT1* encoding a potassium channel subunit. *Gold-coloured* transcripts are those that are found by both manual and computational annotation. *Black* transcripts are those that have been identified only through manual annotation. *Blue* transcripts are annotated without a coding sequence (CDS). For example, the *red arrow* highlights an exon that causes a premature stop codon. This transcript has therefore been identified as being subject to nonsense-mediated decay. b Screenshot of the UCSC genome browser also showing *KCNT1*. Comparison of, first, the basic GENCODE gene annotation set (generally full-length coding transcripts based on full-length cDNAs) and, second, RefSeq manually curated genes, which generally have fewer transcripts than GENCODE. The *red boxes* highlight novel transcription start site exons and novel internal exons that are not present in RefSeq

**Fig. 8**
The importance of multiple alternative transcripts for variant interpretation. This hypothetical example of gene ‘*AGENE*’ expressed in brain highlights how the same variant could have different outcomes in different transcripts. We illustrate this further using hypothetical HGVS nomenclature. Note that when there are multiple transcripts for a gene, this can have an effect on amino acid numbering of variants as different transcripts can have different exon combinations, meaning that the same exon in two different transcripts can have a different translation and can also result in different lengths for the amino acid sequence. Note too that the untranslated region is represented by *orange boxes*. *Green boxes* represent the coding sequence (CDS), whereas *purple boxes* represent the CDS of the nonsense-mediated decay (*NMD*) transcript. *Lines* that join exons represent introns. *Asterisks* indicate the positions of the following hypothetical variants. (1) NM_000000001.99(AGENE):c.2041C>T (p.Arg681Ter). This variant might not be of interest to the clinician as it lies in an exon that is not expressed in brain. (2) NM_000000002.99(AGENE):c.4002+2451G>C. The Human Genome Variation Society (HGVS) nomenclature suggests that this variant is intronic, yet, by looking across other transcripts, it is clear that the variant falls in an extended coding exon that is expressed in brain. (3) NC_000000003.99:g.66178947G>T. This variant is intronic to the canonical transcript, but falls in a well-conserved exon that is expressed in brain. (4) ENSP0000000004.1(AGENE):p.Gly276Ala. This variant falls in an exon that induces NMD. The exon is well conserved and expressed in the brain, making it potentially relevant to the clinician. Generally, NMD transcripts have been considered to be non-coding and excluded from sequence analysis. However, such exons are now known to have an important role in gene regulation. For example, Lynch and colleagues [194] reported that variation in the highly conserved exon in *SNRPB* that induces NMD can result in severe developmental disorders
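
The point made in Fig. 8, that one genomic position can be intronic in one transcript and coding in another, can be sketched in Python as follows. This is a toy illustration under stated assumptions: the transcripts are on the forward strand, exons are sorted and given as 1-based inclusive genomic intervals, and every name and coordinate (`Transcript`, `locate`, `AGENE-001`, `AGENE-002`) is hypothetical rather than taken from the article or any annotation set.

```python
from typing import List, NamedTuple, Optional, Tuple


class Transcript(NamedTuple):
    name: str
    exons: List[Tuple[int, int]]    # sorted, 1-based inclusive genomic intervals
    cds: Optional[Tuple[int, int]]  # genomic span of the CDS, None if non-coding


def locate(position: int, transcript: Transcript) -> str:
    """Describe where a genomic position falls relative to one transcript."""
    in_exon = any(start <= position <= end for start, end in transcript.exons)
    if not in_exon:
        first, last = transcript.exons[0][0], transcript.exons[-1][1]
        return "intronic" if first <= position <= last else "outside transcript"
    if transcript.cds and transcript.cds[0] <= position <= transcript.cds[1]:
        return "coding exon"
    return "non-coding exon"  # UTR, or an exon of a non-coding transcript


if __name__ == "__main__":
    canonical = Transcript("AGENE-001", [(100, 200), (500, 600)], (150, 550))
    alternative = Transcript(
        "AGENE-002", [(100, 200), (300, 400), (500, 600)], (150, 550)
    )
    for tx in (canonical, alternative):
        print(tx.name, locate(350, tx))
```

Reporting the result for every annotated transcript, rather than for a single canonical one, is what exposes cases such as variants (2) and (3) in the caption.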


Cited by

Three Rounds of Read Correction Significantly Improve Eukaryotic Protein Detection in ONT Reads.
Safar HA, Alatar F, Mustafa AS. Microorganisms. 2024 Jan 24;12(2):247. doi: 10.3390/microorganisms12020247. PMID: 38399651. Free PMC article.

Genome-Wide Sequencing Modalities for Children with Unexplained Global Developmental Delay and Intellectual Disabilities-A Narrative Review.
Ko MH, Chen HJ. Children (Basel). 2023 Mar 3;10(3):501. doi: 10.3390/children10030501. PMID: 36980059. Free PMC article. Review.

Evolution of Sequence and Structure of SARS-CoV-2 Spike Protein: A Dynamic Perspective.
Sinha A, Sangeet S, Roy S. ACS Omega. 2023 Jun 21;8(26):23283-23304. doi: 10.1021/acsomega.3c00944. eCollection 2023 Jul 4. PMID: 37426203. Free PMC article. Review.

Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing.
Ejigu GF, Jung J. Biology (Basel). 2020 Sep 18;9(9):295. doi: 10.3390/biology9090295. PMID: 32962098. Free PMC article. Review.

The future of cystic fibrosis care: a global perspective.
Bell SC, Mall MA, Gutierrez H, Macek M, Madge S, Davies JC, Burgel PR, Tullis E, Castaños C, Castellani C, Byrnes CA, Cathcart F, Chotirmall SH, Cosgriff R, Eichler I, Fajac I, Goss CH, Drevinek P, Farrell PM, Gravelle AM, Havermans T, Mayer-Hamblett N, Kashirskaya N, Kerem E, Mathew JL, McKone EF, Naehrlich L, Nasr SZ, Oates GR, O'Neill C, Pypops U, Raraigh KS, Rowe SM, Southern KW, Sivam S, Stephenson AL, Zampoli M, Ratjen F. Lancet Respir Med. 2020 Jan;8(1):65-124. doi: 10.1016/S2213-2600(19)30337-6. Epub 2019 Sep 27. PMID: 31570318. Free PMC article. Review.

References

1. EpiPM Consortium. A roadmap for precision medicine in the epilepsies. Lancet Neurol. 2015;14:1219–28. doi: 10.1016/S1474-4422(15)00199-4.
2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–45. doi: 10.1038/nature03001.
4. Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9:e1001091. doi: 10.1371/journal.pbio.1001091.
5. GENCODE. Human GENCODE version 24. 2016. http://www.gencodegenes.org/stats/current.html. Accessed 14 Feb 2017.
