PLoS One. 2014 Feb 13;9(2):e87670.
doi: 10.1371/journal.pone.0087670. eCollection 2014.

Discriminative motif discovery via simulated evolution and random under-sampling


Tao Song et al. PLoS One.

Abstract

Conserved motifs in biological sequences are closely related to the structure and function of those sequences. Discriminative motif discovery methods have recently attracted growing attention. However, little attention has been devoted to the data imbalance problem, which is one of the main factors degrading the performance of discriminative models. In this article, a simulated evolution method is applied at the data preprocessing stage to address the multi-class imbalance problem, and a random under-sampling method is introduced at the Hidden Markov Model (HMM) training stage to address the imbalance between the positive and negative datasets. In the task of discovering targeting motifs for nine subcellular compartments, the motifs found by our method are more conserved than those found by methods that do not account for data imbalance, and they recover the most known targeting motifs from Minimotif Miner and InterPro. We also use the discovered motifs to predict protein subcellular localization and achieve higher precision and recall for the minority classes.
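The random under-sampling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_under_sample`, the `ratio` parameter, and the toy sequence labels are all assumptions made for illustration. The sketch simply discards randomly chosen examples from the larger of the two sets until it is at most `ratio` times the size of the smaller set.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_under_sample(
    positives: Sequence[T],
    negatives: Sequence[T],
    ratio: float = 1.0,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T]]:
    """Shrink the larger set to at most `ratio` times the smaller set.

    Hypothetical helper: the name, signature, and `ratio` parameter are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)

    # Decide which set is the majority; the minority set is kept whole.
    pos_is_majority = len(pos) > len(neg)
    majority, minority = (pos, neg) if pos_is_majority else (neg, pos)

    # Target size for the majority set, never more than it already has.
    target = min(len(majority), int(round(ratio * len(minority))))

    # Sample without replacement so no sequence is duplicated.
    reduced = rng.sample(majority, target)

    return (reduced, minority) if pos_is_majority else (minority, reduced)


# Toy data: 3 positive sequences against 20 negative ones.
toy_pos = ["P1", "P2", "P3"]
toy_neg = [f"N{i}" for i in range(20)]
balanced_pos, balanced_neg = random_under_sample(toy_pos, toy_neg, seed=0)
```

With the default `ratio` of 1.0 the two returned sets have equal size. Drawing a fresh sample on each training round, rather than once up front, is one common variant, but whether the paper does so is not stated in the abstract.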


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Total accuracy of predictions.
Motifs are discovered by the five methods on the flat and hierarchical (tree) structures, respectively.
Figure 2. Top 3 motif candidates.
These are the motifs most predictive of localization, discovered on the hierarchical compartment structure by (A) Disc and (B) DiscMU. The x-axis title of each HMM logo gives the rank and compartment of the motif.
Figure 3. Number of known motifs recovered by different methods.
The p-values are calculated by generating random motifs.
Figure 4. Percentage of conserved instances of the top 20 candidate motifs.
The p-values are calculated with a hypergeometric test.
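
For readers unfamiliar with the hypergeometric test named in the Figure 4 caption, the following is a minimal sketch of a one-sided (upper-tail) p-value. The captions do not say which counts the authors used, so the parameter names and the toy numbers below are assumptions chosen only to show the calculation.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when `draws` items are taken without replacement from a
    population of size `population` containing `successes` marked items.

    Hypothetical helper: how these counts map onto motif instances is an
    assumption, not something the figure caption specifies.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the exact integer counts first, then divide once.
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / total


# Toy numbers: 10 items, 5 marked, 5 drawn, all 5 drawn items marked.
p_value = hypergeom_upper_tail(5, population=10, successes=5, draws=5)
```

A small p-value here means that seeing at least `k` marked items among the draws would be unlikely if the draws were made at random.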


