RNA. 2010 Aug;16(8):1478-87.
doi: 10.1261/rna.1951310. Epub 2010 Jun 29.

Genome-wide computational identification and manual annotation of human long noncoding RNA genes

Hui Jia et al. RNA. 2010 Aug.

Abstract

Experimental evidence suggests that half or more of the mammalian transcriptome consists of noncoding RNA. Noncoding RNAs are divided into short noncoding RNAs (including microRNAs) and long noncoding RNAs (lncRNAs). We defined complementary DNAs (cDNAs) lacking any positive-strand open reading frames (ORFs) longer than 30 amino acids, as well as cDNAs lacking any evidence of interspecies conservation of their longer-than-30-amino acid ORFs, as noncoding. We have identified 5446 lncRNA genes in the human genome from approximately 24,000 full-length cDNAs, using our new ORF-prediction pipeline. We combined them nonredundantly with lncRNAs from four published sources to derive 6736 lncRNA genes. In an effort to distinguish standalone and antisense lncRNA genes from database artifacts, we stratified our catalog of lncRNAs according to the distance between each lncRNA gene candidate and its nearest known protein-coding gene. We concurrently examined the protein-coding capacity of known genes overlapping with lncRNAs. Remarkably, 62% of known genes with "hypothetical protein" names actually lacked protein-coding capacity. This study has greatly expanded the known human lncRNA catalog, increased its accuracy through manual annotation of cDNA-to-genome alignments, and revealed that a large set of hypothetical-protein genes in GenBank lacks protein-coding capacity. In addition, we have developed, independently of existing NCBI tools, command-line programs with high-throughput ORF-finding and BLASTP-parsing functionality, suitable for future automated assessments of protein-coding capacity of novel transcripts.


Figures

FIGURE 1.
lncRNA discovery with our analytical pipeline and lncRNA import from public databases. (A) A set of lncRNAs was predicted by applying our own ORF-Predictor/BLASTP parsing pipeline to a human genome-wide transcriptional unit (TU) catalog (Engström et al. 2006). (B) We retrieved 534 putative lncRNAs from the H-Invitational Database (H-InvDB), 335 from RNAdb, 512 lncRNAs from an early functional study of conserved lncRNAs (Willingham et al. 2005), and 351 primate-specific transcriptionally active region ncRNAs (Zhang et al. 2007). There were 6736 (= 1732 − 442 + 5446) nonredundant lncRNA genes.
FIGURE 2.
Independent assessment of the protein-coding capacity of our lncRNA catalog. cDNA sequences of ncRNAs were submitted to the Coding Potential Calculator (CPC) (Kong et al. 2007). All sequences are assigned a score based on their estimated protein-coding capacity: <−1, noncoding; from −1 to 0, weakly noncoding; from 0 to 1, weakly coding; and >1, coding. A similar analysis was also carried out for 27,864 coding RefSeq genes, having accession numbers commencing with NM. CPC scores for each transcript are plotted as a continuous distribution in A, and the fractions of transcripts are broken down by classification (our ncRNA genes versus NCBI RefSeq genes) in B, as well as by data source (for ncRNA genes only) in C.
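The score bands in this caption can be written as a small lookup. The following is a minimal sketch, not part of CPC itself: the function name `classify_cpc_score` is hypothetical, and because the caption does not say which band the exact boundary values belong to, the handling of scores of exactly −1, 0, and 1 is an assumption.

```python
def classify_cpc_score(score: float) -> str:
    """Map a CPC score to the four bands named in the Figure 2 caption.

    Assumption: boundary values are assigned as -1 -> weakly noncoding,
    0 -> weakly coding, 1 -> weakly coding. The caption leaves this open.
    """
    if score < -1:
        return "noncoding"
    if score < 0:
        return "weakly noncoding"
    if score <= 1:
        return "weakly coding"
    return "coding"


if __name__ == "__main__":
    # Illustrative scores only; these are not values from the paper.
    for s in (-2.5, -0.4, 0.7, 3.1):
        print(s, classify_cpc_score(s))
```

Applied to a list of per-transcript scores, a function like this would give the per-class fractions of the kind summarized in panels B and C.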
FIGURE 3.
Examples of lncRNA stratification based on genomic context. (A) Flank10k (lncRNA gene maps within 10 kb of a known gene); (B) no overlap (lncRNA gene is >10 kb away from any known gene); and (C) overlap (lncRNA gene is encoded on the same strand of the genome as a known gene, and overlaps at least a part of that known gene). One representative UCSC Genome Browser snapshot for each category is shown. The lncRNA from our list is highlighted by the red arrow, which also indicates the direction of transcription of the lncRNA.
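The three categories can be expressed as a simple interval test. This is a rough sketch under stated assumptions, not the authors' procedure: the `Gene` record, the half-open coordinate convention, and the function names are all hypothetical. An lncRNA that overlaps a known gene only on the opposite strand has distance 0 and is placed in Flank10k here, which is an assumption the caption does not confirm.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Gene:
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive
    strand: str  # "+" or "-"


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def gap(a: Gene, b: Gene) -> int:
    """Bases separating two intervals on one chromosome; 0 if they touch or overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def stratify(lnc: Gene, known_genes: Iterable[Gene], flank: int = 10_000) -> str:
    """Assign an lncRNA candidate to 'overlap', 'flank10k', or 'no overlap'."""
    same_chrom = [g for g in known_genes if g.chrom == lnc.chrom]
    # Overlap: same strand as a known gene and sharing part of its span.
    if any(overlaps(lnc, g) and g.strand == lnc.strand for g in same_chrom):
        return "overlap"
    # Flank10k: within 10 kb of a known gene (either strand, by assumption).
    if any(gap(lnc, g) <= flank for g in same_chrom):
        return "flank10k"
    return "no overlap"


if __name__ == "__main__":
    known = [Gene("chr1", 1_000, 5_000, "+")]  # made-up coordinates
    print(stratify(Gene("chr1", 4_000, 6_000, "+"), known))
    print(stratify(Gene("chr1", 12_000, 13_000, "-"), known))
    print(stratify(Gene("chr1", 20_000, 21_000, "+"), known))
```

The linear scan over known genes keeps the sketch short; a genome-wide run would more likely use a sorted index or an interval tree.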
FIGURE 4.
Protein-coding capacity testing of uninformatively named known genes that overlapped, or were located <10 kb away from, lncRNA genes. This flowchart shows our manual-curation approach to assessing the protein-coding capacity of such genes.
FIGURE 5.
De novo lncRNA identification with our ORF-finding and BLASTP-parsing pipeline. This flowchart illustrates our use of the ORF-Predictor, which we developed, along with NCBI BLASTP to gauge the protein-coding capacity of any cDNA. This is an automated approach.
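The first stage of such a pipeline, the ORF-length screen, might look like the sketch below. It is not the authors' ORF-Predictor. It assumes that an ORF runs from an ATG to the next in-frame stop codon, that only the three forward (positive-strand) frames are scanned, and that ORFs with no stop codon before the end of the sequence are ignored. The function names are hypothetical. The 30-amino-acid threshold comes from the abstract; the BLASTP conservation step is only indicated, not implemented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf(cdna: str) -> int:
    """Length in amino acids of the longest ATG-to-stop ORF in the three forward frames.

    The stop codon itself is not counted. ORFs lacking an in-frame stop are
    ignored (an assumption of this sketch).
    """
    seq = cdna.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def needs_conservation_check(cdna: str, min_aa: int = 30) -> bool:
    """True if the cDNA has a forward ORF longer than min_aa amino acids.

    Such a cDNA would go on to a conservation test (BLASTP in the paper);
    otherwise it is already a noncoding candidate by the ORF-length criterion.
    """
    return longest_forward_orf(cdna) > min_aa


if __name__ == "__main__":
    short = "CCATGGCTTAA"               # toy sequence with a 2-codon ORF
    long_ = "ATG" + "GCT" * 40 + "TAA"  # toy sequence with a 41-codon ORF
    print(longest_forward_orf(short), needs_conservation_check(short))
    print(longest_forward_orf(long_), needs_conservation_check(long_))
```

A cDNA that passes the length threshold is not called coding at this point. Under the definition in the abstract, it is still classed as noncoding if its longer ORFs show no evidence of interspecies conservation.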


