PLoS One. 2007 Sep 26;2(9):e946.
doi: 10.1371/journal.pone.0000946.

Mammalian microRNA prediction through a support vector machine model of sequence and structure

Ying Sheng et al. PLoS One. 2007.

Abstract

Background: MicroRNAs (miRNAs) are endogenous small noncoding RNA gene products, on average 22 nt long, found in a wide variety of organisms. They play important regulatory roles by targeting mRNAs for degradation or translational repression. There are 377 known mouse miRNAs and 475 known human miRNAs in the May 2007 release of the miRBase database, the majority of which are conserved between the two species. A number of recent reports imply that it is likely that many mammalian miRNAs remain to be discovered. The possibility that there are more of them expressed at lower levels or in more specialized expression contexts calls for the exploitation of genome sequence information to accelerate their discovery.

Methodology/principal findings: In this article, we describe a computational method, mirCoS, that uses three support vector machine models sequentially to discover new miRNA candidates in mammalian genomes based on sequence, secondary structure, and conservation. mirCoS can efficiently detect the majority of known miRNAs and predicts an extensive set of hairpin structures based on human-mouse comparisons. In total, 3476 mouse candidates and 3441 human candidates were found. These hairpins are more similar to known miRNAs than to negative controls in several aspects not considered by the prediction algorithm. A significant fraction of predictions is supported by existing expression evidence.

Conclusions/significance: Using a novel approach, mirCoS performs comparably to or better than existing miRNA prediction methods, and contributes a significant number of new candidate miRNAs for experimental verification.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Outline of the mirCoS method.
Figure 2. pre-miRNAs display a characteristic conservation profile.
Typically, pre-miRNAs are highly conserved, but the conservation drops off rapidly at their borders and is often lower in the middle region, which corresponds to the loop. (A) Conservation profile of known pre-miRNA hsa-mir-1-1 in the UCSC Genome browser (http://www.genome.ucsc.edu/). (B) Cumulative conservation profile of known mouse pre-miRNAs (from miRBase 8.2) conserved in human. Pre-miRNA regions were extended by 50 bp on each end and length-normalized to the range [-50,50]. The y-axis shows the fraction of analyzed sequences that are conserved at the position indicated on the x-axis.
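The length normalization and the cumulative profile described for panel (B) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each extended region is given as a per-base list of 0/1 conservation flags and that the whole extended region is rescaled linearly onto the integer positions -50..50 by nearest-neighbour resampling. The function names `normalize_profile` and `cumulative_profile` are hypothetical.

```python
def normalize_profile(flags, lo=-50, hi=50):
    """Resample a per-base 0/1 conservation track onto positions lo..hi.

    Assumption: linear rescaling with nearest-neighbour lookup; the
    caption does not specify the resampling scheme.
    """
    n = len(flags)
    if n == 0:
        raise ValueError("empty profile")
    width = hi - lo
    out = []
    for x in range(lo, hi + 1):
        # Fractional position along the original track, from 0.0 to 1.0.
        t = (x - lo) / width
        idx = min(n - 1, int(round(t * (n - 1))))
        out.append(flags[idx])
    return out


def cumulative_profile(profiles):
    """Fraction of sequences conserved at each normalized position."""
    total = len(profiles)
    if total == 0:
        raise ValueError("no profiles given")
    return [sum(column) / total for column in zip(*profiles)]


# Two toy regions of different lengths (made-up data, for illustration only).
region_a = [0] * 50 + [1] * 80 + [0] * 50
region_b = [1] * 200
y_values = cumulative_profile(
    [normalize_profile(region_a), normalize_profile(region_b)]
)
```

Because every region is mapped onto the same 101 positions, regions of different lengths can be averaged column by column.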
Figure 3. Predicted secondary structures and conservation profiles of five candidate pre-miRNA genes.
The figure shows five examples from our predictions. Black bars indicate the regions of the conservation profiles that correspond to predicted hairpins. Secondary structures of candidate pre-miRNAs were predicted by MFOLD v3.1. Conservation profiles were obtained from the UCSC Genome Browser (http://www.genome.ucsc.edu/). The candidates show canonical secondary structures and conservation profiles.
Figure 4. Candidate pre-miRNAs are conserved in sequence and predicted structure over large evolutionary distances.
We used alignments between mouse and eleven other organisms to assess over what evolutionary distance each mouse region was conserved in both sequence and structure. Bars indicate the fraction of a particular set of regions that is conserved at a given distance. For each region, we only noted the most evolutionarily distant species/clade at which we found it to be conserved. For example, the leftmost gray bar spans 4%, indicating that 4% of known mouse miRNAs were found to be conserved in opossum, but not in chicken, frog or fish. The requirement for conservation was that regions should align over at least 37 nt and that their predicted secondary structures have an RNAdistance score ≤48 (see Methods).
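The conservation rule in this caption (aligned length of at least 37 nt and an RNAdistance score of at most 48) and the "most distant clade" bookkeeping can be sketched as below. This is an illustrative sketch only: the alignment lengths and RNAdistance scores are assumed to have been computed elsewhere, the function names are hypothetical, and the clade ordering is limited to the four groups the caption names.

```python
MIN_ALIGNED_NT = 37    # threshold stated in the caption
MAX_RNADISTANCE = 48   # threshold stated in the caption


def is_conserved(aligned_nt, rnadistance):
    """Apply both thresholds from the caption to one pairwise comparison."""
    return aligned_nt >= MIN_ALIGNED_NT and rnadistance <= MAX_RNADISTANCE


def most_distant_conserved(hits, clades_near_to_far):
    """Return the most distant clade in which a region passes the test.

    hits: dict mapping clade name -> (aligned_nt, rnadistance).
    clades_near_to_far: clade names ordered by increasing distance.
    Returns None if the region is not conserved in any listed clade.
    """
    best = None
    for clade in clades_near_to_far:
        if clade in hits and is_conserved(*hits[clade]):
            best = clade  # later entries are more distant, so overwrite
    return best


# Ordering taken from the caption's example; the scores are made up.
clades = ["opossum", "chicken", "frog", "fish"]
example = {"opossum": (60, 20), "chicken": (30, 10), "frog": (45, 60)}
label = most_distant_conserved(example, clades)
```

In the toy example the region fails the length test in chicken and the structure test in frog, so it is counted under opossum only.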
Figure 5. mirCoS can distinguish pre-miRNAs from highly conserved developmental enhancer regions.
We compared differences in pattern composition among known pre-miRNAs, candidate pre-miRNAs and HCNEs. Each sequence was searched for putative transcription factor binding sites using the familial binding profile for homeobox transcription factors from the JASPAR database at a score threshold of 80%. (A) Sequences were partitioned into four non-overlapping sets (I-IV) as indicated in the Venn diagram. (B) Cumulative distributions of number of predicted binding sites per 100 bp for sequence sets I-IV. The distributions for candidate pre-miRNAs (blue, green) are more similar to the distribution for known pre-miRNAs (red) than to the distribution for HCNEs not predicted to be pre-miRNAs (gray). (C) Solid bars show the average number of predicted sites per 100 bp over each of sequence sets I-IV. Shaded bars show results for corresponding control sets: controls for dinucleotide composition generated by, for each sequence, constructing a first-order Markov chain and using it to generate a new sequence (diagonal shading lines), and controls for single nucleotide composition generated by randomly shuffling the bases in each sequence (vertical shading lines). Error bars indicate 95% confidence intervals.
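The two kinds of control sequences mentioned for panel (C) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the Markov control starts from the first base of the original sequence and falls back to the overall base composition if a base has no observed successor, neither of which is specified in the caption. The function names are hypothetical.

```python
import random
from collections import defaultdict


def shuffle_control(seq, rng):
    """Control for single-nucleotide composition: permute the bases."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def markov_control(seq, rng):
    """Control for dinucleotide composition: sample from a first-order
    Markov chain whose transitions are estimated from `seq` itself."""
    if len(seq) < 2:
        return seq
    successors = defaultdict(list)
    for current, following in zip(seq, seq[1:]):
        successors[current].append(following)
    out = [seq[0]]  # assumption: start from the original first base
    while len(out) < len(seq):
        options = successors.get(out[-1])
        if not options:
            options = seq  # assumption: fall back to base composition
        out.append(rng.choice(options))
    return "".join(out)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = "ACGUUGCAAGGCUUACG"  # made-up sequence
mono_control = shuffle_control(original, rng)
di_control = markov_control(original, rng)
```

Shuffling preserves base counts exactly, whereas the Markov chain preserves dinucleotide frequencies only in expectation.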
Figure 6. CAGE expression data supports miRNA predictions.
Cumulative distribution of number of CAGE tags mapping to known intergenic pre-miRNA genes or within 500 bp upstream (red), and corresponding distributions for predicted intergenic pre-miRNAs (blue), randomly selected intergenic genomic regions of the same size (green) and intergenic regions from the CRS (black). Known and predicted pre-miRNAs tend to have more overlapping or upstream CAGE tags than either of the control sets. The inset shows a magnification for tag counts of 0–40.
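A cumulative distribution of per-region tag counts like the one plotted here can be computed as in the sketch below. It assumes the tag counts per region are already available as a list of integers; the function name `cumulative_distribution` and the example counts are hypothetical.

```python
def cumulative_distribution(counts):
    """Return (value, fraction of regions with count <= value) pairs,
    one pair per distinct value, in increasing order."""
    n = len(counts)
    if n == 0:
        raise ValueError("no counts given")
    result = []
    for rank, value in enumerate(sorted(counts), start=1):
        if result and result[-1][0] == value:
            result[-1] = (value, rank / n)  # extend the step for ties
        else:
            result.append((value, rank / n))
    return result


# Made-up tag counts for four regions.
curve = cumulative_distribution([0, 0, 3, 10])
```

A set of regions with more tags has a curve that rises more slowly, which is how the sets are compared in the figure.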
Figure 7. The overlap between our predictions and those from Berezikov et al. is small.
Venn diagram showing the intersections between human miRNAs predicted by Berezikov et al. (gray rectangle), our human predictions (large open rectangle) and known human miRNAs (horizontal rectangle).
Figure 8. Effects of varying the proportion of positive entries in the training set for SVM1.
Proportions of positive examples that were correctly classified (sensitivity, red line) and negative examples that were incorrectly classified (1-specificity, black line), as functions of the proportion of positive entries in the training set. Ideally, sensitivity should be maximized while 1-specificity should be minimized. The figure shows that 1-specificity increases in three stages. At the beginning of the second stage, sensitivity is already increasing only slowly. We therefore chose this point (x = 50%, vertical line) as the proportion of positive entries to use in the training set.
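The two quantities plotted here follow directly from the confusion counts: sensitivity is TP / (TP + FN) and 1-specificity is FP / (FP + TN). A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function name is hypothetical and the labels are made up:

```python
def sensitivity_and_one_minus_specificity(y_true, y_pred):
    """Return (sensitivity, 1 - specificity) for binary 0/1 labels."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative")
    return tp / (tp + fn), fp / (fp + tn)


# Three positives (two recovered) and four negatives (one misclassified).
truth = [1, 1, 1, 0, 0, 0, 0]
calls = [1, 1, 0, 1, 0, 0, 0]
sens, fpr = sensitivity_and_one_minus_specificity(truth, calls)
```

Evaluating this pair at each candidate proportion of positive training entries would give the two curves in the figure.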
