Review
Genome Med. 2017 May 30;9(1):49.
doi: 10.1186/s13073-017-0441-1.

Genome annotation for clinical genomic diagnostics: strengths and weaknesses

Charles A Steward et al. Genome Med.

Abstract

The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.

Figures

Fig. 1
The genome analysis pipeline. Note that, for clarity, some steps have been omitted. Figure illustrations are not to scale and are only meant to illustrate the differences between short- and long-read sequencing. a Unaligned reads from sequencing machines are stored in FASTQ format, a text-based format that holds both a DNA sequence and its corresponding quality scores. b Reads are aligned to the genome. Short reads provide deep coverage, whereas reads that have been sequenced from both ends (blue arrows) help to orientate unaligned contigs. It is difficult to align short reads confidently across repetitive sequences when the repeating genome sequence is longer than the sequence read. Long-read sequences help to order contigs across larger regions, particularly those containing repetitive sequences, but do not provide the depth needed to call a base confidently at a given position. Note that there is a large region with no read coverage at all. This is indicative of structural variation: here, the patient has a large deletion with respect to the reference genome. Once the reads have been aligned to the reference genome, they are stored in a BAM file. A BAM file (.bam) is the binary version of a sequence alignment map (SAM) file, which is a tab-delimited, text-based format for storing DNA sequences aligned to a reference sequence. c The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing genetic sequence variations. VCF files are much smaller than FASTQ and BAM files. Note that single-nucleotide variants (SNVs) and small insertions and deletions (‘indels’) are illustrated as red and purple blocks, whereas a much larger structural variant is indicated by an orange block.
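To make the FASTQ description in panel a concrete, the following is a minimal sketch of reading four-line FASTQ records and decoding their quality scores. It assumes the Phred+33 ASCII encoding (the `offset` parameter), which the caption does not specify, and the names `FastqRecord` and `parse_fastq` are illustrative rather than part of any particular toolkit.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries.

    offset=33 (Phred+33) is an assumption; older data may use other encodings.
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = ["@read1", "GATTACA", "+", "IIIII#!"]
records = list(parse_fastq(example))
```

In this toy record the last two bases carry very low scores, which is the kind of per-base information that is lost once reads are summarised as variant calls.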
Fig. 2
The generic gene model (not to scale). a The exons comprise the untranslated regions (UTRs), which are shown in red (the 5′ UTR depicted on the left and the 3′ UTR depicted on the right) and the coding sequence (CDS), which is shown in green. Many important regulatory regions lie outside of the exons of a gene. Intronic regulatory regions are shown in grey. Promoters are illustrated as yellow intergenic regulatory regions, although some genes have internal transcription start sites. The transcription start site (TSS) is positioned at the 5′ end of the UTR, where transcription starts. The 5′ UTRs of genes contain regulatory regions. The CDS start codon is the first codon of a messenger RNA (mRNA) from which a ribosome translates. The genomic sequence around the start codon often has the consensus sequence gccAcc|AUG|G (note that the important bases are highlighted here in bold, and the most crucial positions are −3 and +4 from the A of the AUG) [197], although, in very rare cases, a non-AUG start codon is used [198]. The stop codon, of which there are three in eukaryotes (UGA, UAG and UAA), is a nucleotide triplet sequence in an mRNA that gives the signal to terminate translation by binding release factors, causing the ribosome to release the peptide chain [199]. The 3′ untranslated region of genes contains regulatory regions. In particular, the 3′ UTR has binding sites for regulatory proteins such as RNA-binding proteins (RBPs) and for microRNAs (miRNAs). Promoters are DNA sequences, between 100 and 1000 bp in length, where proteins that help control gene transcription bind to DNA [200]. These proteins can contain one or more DNA-binding domains that attach to a specific DNA sequence located next to the relevant gene [201]. Promoters regulate transcriptional machinery by moving it to the right place in the genome, as well as locating the 5′ end of the gene or an internal transcription start site.
Approximately 40% of human genes have promoters situated in regions of elevated cytosine and guanine content, termed CpG islands [202]. A subset of promoters incorporate the variable TATA box sequence motif, which is found between 25 and 30 bp upstream of the TSS, the position at the 5′ end of the UTR where transcription starts [203]. b–d Pre-mRNA transcribed from DNA contains both introns and exons. An RNA and protein complex called the spliceosome undertakes the splicing out of introns, leaving the constitutive exons. Intronic and exonic splice enhancers and silencers, together with signals such as the branch point (‘A’) and a poly-pyrimidine (poly-py) tract, help direct this procedure. The vast majority of introns have a GT sequence at the 5′ end that the branch point binds to. The intron is then cleaved from the 5′ exon (donor site) and then from the 3′ exon (acceptor site) [204] and a phosphodiester bond joins the exons, whereas the intron is discarded and degraded. During the formation of mature mRNA, the pre-mRNA is cleaved and polyadenylated. Polyadenylation occurs between 10 and 30 bp downstream from a hexamer recognition sequence that is generally AAUAAA or AUUAAA, although other hexamer signal sequences are known [35] (as depicted in a). A specially modified nucleotide at the 5′ end of the mRNA, called the 5′ cap, helps with mRNA stability while it undergoes translation. This capping process occurs in the nucleus and is a vital procedure that creates the mature mRNA. e The translation of mRNA into protein by ribosomes occurs in the cytosol. Transfer RNAs (tRNAs), which carry specific amino acids, are read by the ribosome and then bound in a complementary manner to the mRNA. The amino acids are joined together into a polypeptide chain to generate the complete protein sequence for the coding sequence of the transcript. (Light blue background shading shows processes that occur in the nucleus.
Light yellow background shading shows processes that occur in the cytosol, such as the translation of mRNAs into protein by ribosomes)
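The start-codon context and stop codons described in panel a can be illustrated with a small sketch. It takes the consensus exactly as written in the caption (A at −3 and G at +4, counting the A of the AUG as +1) and scans in frame for the first of the three stop codons. The function names and the toy sequence are hypothetical, and real annotation considers far more evidence than this.

```python
from typing import Optional, Tuple

STOP_CODONS = {"UAA", "UAG", "UGA"}


def kozak_matches(mrna: str, aug_index: int) -> int:
    """Count how many of the two key positions (-3 and +4) match the
    consensus given in the caption. aug_index is the 0-based index of
    the A of the AUG, which is position +1 (there is no position 0)."""
    matches = 0
    if aug_index >= 3 and mrna[aug_index - 3] == "A":  # position -3
        matches += 1
    if aug_index + 3 < len(mrna) and mrna[aug_index + 3] == "G":  # position +4
        matches += 1
    return matches


def first_orf(mrna: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the open reading frame beginning at the
    first AUG, with end exclusive and including the stop codon, or None
    if there is no AUG or no in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


toy_mrna = "GCCACCAUGGCUUAAGG"  # hypothetical sequence, not from the article
orf = first_orf(toy_mrna)
context_score = kozak_matches(toy_mrna, 6)
```

Here the toy transcript has an AUG at index 6 in a full-consensus context, followed by one codon and a UAA stop.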
Fig. 3
Alternative splicing transcript variants. Different types of alternative splicing can give rise to transcripts that are functionally distinct from a nominal reference model. Red represents the untranslated region (UTR) and green represents the coding sequence (CDS). The retained intron is illustrated as non-coding because a retained intron is presumed to represent an immature transcript. Some transcripts can contain exons that are mutually exclusive (boxed). All the types of alternative exon splicing events shown here can also occur in non-coding genes. There can also be multiple alternative poly(A) features within the gene models, as seen for the skipped-exon transcript.
Fig. 4
The nonsense-mediated decay (NMD) pathway. Under normal cellular circumstances, exon–exon junction complexes (EJCs) that are in place after splicing are removed by the ribosome during the first round of translation. However, when a transcript contains a premature termination codon (PTC) upstream of one or more EJCs, perhaps as a result of a single-nucleotide variant (SNV), an indel or the inclusion of an out-of-frame exon, these EJCs remain in place because the ribosome complex dissociates at the premature stop codon and thus cannot remove the downstream EJCs. This triggers the NMD pathway, and the transcript is degraded.
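The rule described in this caption, that a stop codon with at least one exon–exon junction downstream of it leaves EJCs in place, can be sketched as a simple coordinate check. This is a deliberately simplified illustration: it works in 0-based spliced-transcript coordinates, uses hypothetical function names, and ignores any distance thresholds or other criteria that annotation pipelines may apply.

```python
from typing import List, Sequence


def junction_positions(exon_lengths: Sequence[int]) -> List[int]:
    """Return the transcript coordinate of each exon-exon junction,
    i.e. the cumulative length at the end of every exon except the last."""
    positions = []
    total = 0
    for length in exon_lengths[:-1]:
        total += length
        positions.append(total)
    return positions


def predicted_nmd(exon_lengths: Sequence[int], stop_codon_start: int) -> bool:
    """Return True if any junction lies downstream of the stop codon,
    meaning a terminating ribosome would leave at least one EJC in place."""
    stop_end = stop_codon_start + 3  # first base after the stop codon
    return any(j >= stop_end for j in junction_positions(exon_lengths))


exons = [100, 100, 100]                   # hypothetical three-exon transcript
normal = predicted_nmd(exons, 250)        # stop codon in the final exon
premature = predicted_nmd(exons, 120)     # stop codon in the second exon
```

With these toy values the normal stop codon sits in the last exon and has no downstream junction, whereas the premature one precedes the junction at position 200.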
Fig. 5
The processes involved in the ‘pseudogenisation’ of genes. a Processed pseudogenes are derived from mature mRNA that is reverse-transcribed by the viral L1 repeat enzyme reverse-transcriptase and reintegrated into the genome, and will generally lack introns. Processed pseudogenes are often flanked by direct repeats that might have some function in inserting the pseudogene into the genome, and they are often missing sequence compared with their parent. They often terminate in a series of adenines, the remains of the poly(A) tail, which is the site of genomic integration. b Unprocessed pseudogenes, the defunct relatives of functional genes, arise from genomic duplication. Such duplications can be complete or partial with respect to the parent gene.
Fig. 6
Different classifications of long non-coding RNAs (lncRNAs). The classification of lncRNAs is based on their position with respect to coding genes. lncRNAs are illustrated here with only red exons, whereas the coding genes are shown as red and green. AS antisense, BDP bi-directional promoter, lincRNA long-intergenic RNA (not overlapping a protein-coding locus on either strand), OS overlapping sense, O3′ overlapping 3′, SI sense intronic. Figure adapted from Wright 2014 [84]
Fig. 7
Examples of genome browsers. a Screenshot of Ensembl genome browser showing the transcript splicing variants for the gene KCNT1 encoding a potassium channel subunit. Gold-coloured transcripts are those that are found by both manual and computational annotation. Black transcripts are those that have been identified only through manual annotation. Blue transcripts are annotated without a coding sequence (CDS). For example, the red arrow highlights an exon that causes a premature stop codon. This transcript has therefore been identified as being subject to nonsense-mediated decay. b Screenshot of the UCSC genome browser also showing KCNT1. Comparison of, first, the basic GENCODE gene annotation set (generally full-length coding transcripts based on full-length cDNAs) and, second, RefSeq manually curated genes, which generally have fewer transcripts than GENCODE. The red boxes highlight novel transcription start site exons and novel internal exons that are not present in RefSeq
Fig. 8
The importance of multiple alternative transcripts for variant interpretation. This hypothetical example of gene ‘AGENE’ expressed in brain highlights how the same variant could have different outcomes in different transcripts. We illustrate this further using hypothetical Human Genome Variation Society (HGVS) nomenclature. Note that when there are multiple transcripts for a gene, this can affect the amino acid numbering of variants: different transcripts can have different exon combinations, meaning that the same exon in two different transcripts can have a different translation and can also result in different lengths for the amino acid sequence. Note too that the untranslated region is represented by orange boxes. Green boxes represent the coding sequence (CDS), whereas purple boxes represent the CDS of the nonsense-mediated decay (NMD) transcript. Lines that join exons represent introns. Asterisks indicate the positions of the following hypothetical variants. (1) NM_000000001.99(AGENE):c.2041C>T (p.Arg681Ter). This variant might not be of interest to the clinician as it lies in an exon that is not expressed in brain. (2) NM_000000002.99(AGENE):c.4002+2451G>C. The HGVS nomenclature suggests that this variant is intronic, yet, by looking across other transcripts, it is clear that the variant falls in an extended coding exon that is expressed in brain. (3) NC_000000003.99:g.66178947G>T. This variant is intronic to the canonical transcript, but falls in a well-conserved exon that is expressed in brain. (4) ENSP0000000004.1(AGENE):p.Gly276Ala. This variant falls in an exon that induces NMD. The exon is well conserved and expressed in the brain, making it potentially relevant to the clinician. Generally, NMD transcripts have been considered to be non-coding and excluded from sequence analysis. However, such exons are now known to have an important role in gene regulation.
For example, Lynch and colleagues [194] reported that variation in the highly conserved exon in SNRPB that induces NMD can result in severe developmental disorders
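The point made for variant (2), that an HGVS coding description is only "intronic" relative to the transcript it names, can be seen by pulling the description apart. The sketch below uses a hypothetical regular expression that covers only simple `c.` substitutions with an optional intronic offset; it is not a full HGVS parser, and the accessions are the caption's own made-up examples.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical pattern: accession, optional gene symbol, c. position,
# optional +/- intronic offset, then a single-base substitution.
HGVS_C_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+)"
    r"(?:\((?P<gene>[^)]+)\))?"
    r":c\.(?P<position>\d+)"
    r"(?P<offset>[+-]\d+)?"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_c_substitution(description: str) -> Optional[Dict[str, Union[str, int, bool, None]]]:
    """Parse a simple coding substitution; return None if it does not match."""
    compact = "".join(description.split())  # tolerate stray spaces around '+' and '>'
    match = HGVS_C_SUBSTITUTION.match(compact)
    if match is None:
        return None
    offset = match.group("offset")
    return {
        "accession": match.group("accession"),
        "gene": match.group("gene"),
        "position": int(match.group("position")),
        "offset": int(offset) if offset else 0,
        # Intronic only with respect to the named transcript.
        "intronic_in_this_transcript": offset is not None,
        "ref": match.group("ref"),
        "alt": match.group("alt"),
    }


exonic = parse_c_substitution("NM_000000001.99(AGENE):c.2041C>T")
intronic = parse_c_substitution("NM_000000002.99(AGENE):c.4002 + 2451G > C")
```

The `intronic_in_this_transcript` flag says nothing about other transcripts of the same gene, which is why the same genomic position needs to be checked against the full annotation set.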

References

    1. EpiPM Consortium. A roadmap for precision medicine in the epilepsies. Lancet Neurol. 2015;14:1219–28. doi: 10.1016/S1474-4422(15)00199-4.
    2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
    3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–45. doi: 10.1038/nature03001.
    4. Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9:e1001091. doi: 10.1371/journal.pbio.1001091.
    5. GENCODE. Human GENCODE version 24. 2016. http://www.gencodegenes.org/stats/current.html. Accessed 14 Feb 2017.