Review
Nat Rev Genet. 2018 Sep;19(9):535-548.
doi: 10.1038/s41576-018-0017-y.

Towards a complete map of the human long non-coding RNA transcriptome

Barbara Uszczynska-Ratajczak et al. Nat Rev Genet. 2018 Sep.

Abstract

Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Fig. 1 | Basic concepts of lncRNA annotations.
a | The principal structures of a long non-coding RNA (lncRNA) to be annotated. Annotations are hierarchical: they are composed of gene loci, each of which is composed of one or more partially overlapping transcripts, themselves composed of one or more exons (blue rectangles). b | Positional classification of lncRNAs with respect to the nearest protein-coding gene. Genic lncRNAs overlap a protein-coding gene locus, whereas intergenic lncRNAs, also known as long intergenic non-coding RNAs (lincRNAs), do not. Transcripts that overlap a protein-coding gene on the opposite strand are identified as antisense. PAS, polyadenylation site; TSS, transcription start site.
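The positional classification in part b can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: the `Locus` type, the 0-based half-open coordinates, the label strings and the rule for loci that overlap genes on both strands are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Locus:
    """Hypothetical locus record; coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def spans_overlap(a: Locus, b: Locus) -> bool:
    """True if the two loci share at least one base, ignoring strand."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def classify_lncrna(lnc: Locus, coding_genes: Iterable[Locus]) -> str:
    """Label a lncRNA relative to protein-coding gene loci.

    intergenic      : overlaps no protein-coding locus (a lincRNA)
    genic_antisense : every overlapped coding locus is on the opposite strand
    genic_sense     : at least one overlapped coding locus is on the same strand
    The sense/antisense tie-breaking rule is an assumption of this sketch.
    """
    hits = [g for g in coding_genes if spans_overlap(lnc, g)]
    if not hits:
        return "intergenic"
    if all(g.strand != lnc.strand for g in hits):
        return "genic_antisense"
    return "genic_sense"
```

For example, with a single coding gene at `chr1:100-200` on the plus strand, a lncRNA at `chr1:150-250` on the minus strand is labelled `genic_antisense`, and one at `chr1:300-400` is labelled `intergenic`.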
Fig. 2 | Annotation strategies for lncRNAs.
a | Automatic annotation based on RNA sequencing (RNA-seq) may follow two distinct strategies that differ in how the genome reference is used. The align-then-assemble strategy (left) aligns reads to the reference genome to reveal possible splicing events and then assembles reads into transcript models. The assemble-then-align strategy (right) builds transcript models de novo, directly from the RNA-seq reads, and then aligns them to the reference genome to determine their exon–intron structure. De novo transcriptome assembly has more explorative potential than alignment-based assembly but tends to have worse performance. b | In manual annotation, human annotators employ various sources of data to build transcript models. Expressed sequence tags (ESTs) and cDNA form the primary evidence for transcript models and are often supplemented with RNA-seq reads to validate introns, cap analysis of gene expression (CAGE) clusters to identify 5′ ends and poly(A)-position profiling by sequencing (3P-seq) to identify polyadenylation sites (PASs). A key step in the annotation process is to assess the protein-coding potential of transcripts, usually on the basis of a combination of methods. lncRNA, long non-coding RNA.
Fig. 3 | Comparison of leading lncRNA annotations.
a | Growth of GENCODE long non-coding RNA (lncRNA) collection over time, in terms of gene loci. Only reference releases are included. b | Overlap between annotations at the gene level, based on a medium-stringency definition. Values represent the percentage of gene loci in the annotation of each row that overlap the annotation in each column. Overlap is defined as at least 60% of the span of the shorter gene on the same strand. Only genes with at least one multiexonic transcript were included. See TABLE 1 for details. c | Comparison of quality metrics between annotations. x-axis: comprehensiveness, or the total number of gene loci; y-axis: completeness, or percentage of transcript structures whose start is supported by a robust phase 1/2 Functional Annotation of the Mammalian genome (FANTOM) cap analysis of gene expression (CAGE) cluster (n = 201,802) within ±50 bases and whose end contains a canonical polyadenylation motif within a window of 10–50 bp upstream. Circle diameters reflect exhaustiveness, or mean number of transcripts per gene. GENCODE+ is the union of GENCODE version 20 with non-anchor-merged capture long-read sequencing (CLS) transcript models. Protein-coding is a set of confident GENCODE protein-coding transcripts as described in REF.. d | As for part c, but separately for 5′ and 3′ completeness. e | The distribution of predicted splice junction strength for splice site acceptors and donors in each lncRNA catalogue, as calculated by the GeneID software. The plots show non-redundant splice sites from lncRNA annotations sets (top), confident GENCODE protein-coding transcripts (middle), and 500,000 randomly selected GC|GT donors + AG acceptors with no evidence of splicing in any of the annotation sets under study (bottom). For each non-canonical splice site not scored by GeneID, a random score between −30 and −20 was assigned.
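The gene-level overlap criterion (part b) and the completeness criterion (part c) can be expressed as simple predicates. The sketch below is one possible reading of the caption, not the published pipeline. The function names are hypothetical, CAGE clusters are reduced to single representative positions, the motif set defaults to the hexamer AATAAA only, and "within a window of 10–50 bp upstream" is interpreted as the whole motif lying between 50 and 10 bases before the 3′ end.

```python
from typing import Iterable, Sequence


def genes_overlap(a_start: int, a_end: int, b_start: int, b_end: int,
                  same_strand: bool, min_fraction: float = 0.6) -> bool:
    """Overlap test: shared bases cover >= min_fraction of the shorter gene.

    Coordinates are 0-based, half-open; both genes are assumed to be on
    the same chromosome.
    """
    if not same_strand:
        return False
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return shorter > 0 and shared / shorter >= min_fraction


def has_cage_support(tss: int, cage_positions: Iterable[int],
                     window: int = 50) -> bool:
    """True if any CAGE cluster position lies within +/- window of the TSS."""
    return any(abs(tss - pos) <= window for pos in cage_positions)


def has_polya_motif(transcript_seq: str,
                    motifs: Sequence[str] = ("AATAAA",)) -> bool:
    """True if a motif lies fully between 50 and 10 bases before the 3' end.

    transcript_seq must be in 5'->3' transcript orientation. The motif set
    is an assumption; the caption does not list the motifs used.
    """
    region = transcript_seq.upper()[-50:-10]
    return any(motif in region for motif in motifs)


def is_complete(tss: int, cage_positions: Iterable[int],
                transcript_seq: str) -> bool:
    """A transcript counts as complete when both ends are supported."""
    return has_cage_support(tss, cage_positions) and has_polya_motif(transcript_seq)
```

Under these assumptions, the completeness value plotted on the y-axis would be the percentage of transcripts in an annotation for which `is_complete` returns true.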
Fig. 4 | Integrating capture and long-read sequencing with annotation pipelines.
a | Full-length cDNA libraries are prepared from a variety of tissues across the human lifespan. b | Target annotations are prepared from a variety of known and suspected long non-coding RNA (lncRNA) loci and used to design capture probes (black bars). c | Solution-phase oligonucleotide capture is performed, and enriched cDNA libraries are sequenced by long-read nanopore and short-read Illumina technologies. d | The resulting long reads are collapsed to produce non-redundant transcript models. The completeness and accuracy of these models are assessed using various evidence: introns (blue triangles) by short reads; transcription start site (TSS; green star) by promoter histone modifications, cap analysis of gene expression (CAGE) clusters and DNase I hypersensitivity sites (DHSs); and polyadenylation site (PAS; red star) by long-read-encoded poly(A) tails. e | With this information, transcript models are graded for completeness, checked for protein-coding potential and passed to annotators for either direct incorporation into annotation pipelines (for complete models) or further manual curation (incomplete models).
