Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2005 Jun;1(1):e10.
doi: 10.1371/journal.pcbi.0010010. Epub 2005 Jun 24.

Extraction of transcript diversity from scientific literature

Affiliations

Extraction of transcript diversity from scientific literature

Parantu K Shah et al. PLoS Comput Biol. 2005 Jun.

Abstract

Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term "alternative splicing" to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.

PubMed Disclaimer

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. Creating Specialized Databases for Events of Interest
A database of physiologically occurring AS events can be generated in two steps. Each step may involve machine learning or rule-based methods. The first step involves the identification of sentences from scientific text. These sentences can be parsed in a second step to extract frequently occurring semantic patterns.
Figure 2
Figure 2. Preference for the Utilization of TD-Generating Mechanisms across Anatomical Systems
Nonredundant instances of AS, DP, and AP are plotted against anatomical systems in which expression was found. The color of each square in the top panel signifies the ratio of the number of events detected for the system to the highest number of events within the row. Total number of nonredundant instances for each mechanism is on the left. The bottom panel shows the negative logarithm of p-values (see Materials and Methods for details). The anatomical systems are as follows: A, cardio vascular system; B, cells; C, connective tissues; D, digestive system; E, fetal/embryonic structures; F, endocrine system; G, exocrine glands; H, genitalia; I, immune system; J, integumentary system; K, musculoskeletal system; L, nervous system; M, respiratory system; N, sense regions; O, urinal system.
Figure 3
Figure 3. Tissue Specificity in AS
The figure shows the body system distribution of differential/specific splicing. The instances were obtained from literature mining (left panel) and analysis of EST data ([33]; right panel). Each square is colored according to the ratio between the corresponding count and the highest count within the panel. Letter codes for anatomical systems are as in Figure 2. P represents a unique transcript.
Figure 4
Figure 4. Assignment of Function Using Database Knowledge
This figure shows a database entry that derives very little functional annotation from sequence databases. Text extraction rules were successful in identifying gene name, tissue, and event mechanism for the Dopachrome tautomerase gene. Multiple transcripts of the gene using SPLICE-POA [37] were produced by utilizing alternative 3′ splice sites and polyadenylation signals as speculated in the research article (bottom panel). Pink rectangles denote the exons, black lines describe constitutive splice sites, and blue lines show alternative splice sites. Black arrows show the different proteins generated via AS.

References

    1. Landry JR, Mager DL, Wilhelm BT. Complex controls: The role of alternative promoters in mammalian genomes. Trends Genet. 2003;19:640–648. - PubMed
    1. Garcia-Blanco MA, Baraniak AP, Lasda EL. Alternative splicing in disease and therapy. Nat Biotechnol. 2004;22:535–546. - PubMed
    1. Modrek B, Lee C. A genomic view of alternative splicing. Nat Genet. 2002;30:13–19. - PubMed
    1. Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem. 2003;72:291–336. - PubMed
    1. Boue S, Letunic I, Bork P. Alternative splicing and evolution. Bioessays. 2003;25:1031–1034. - PubMed