Bioinformatics. 2008 Jul 1;24(13):i24-31.
doi: 10.1093/bioinformatics/btn172.

ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles

Thomas Abeel et al. Bioinformatics. 2008.

Abstract

Motivation: More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Because the identification of the transcription initiation region is such a challenging problem, it is not yet a common practice to integrate transcription start site prediction in genome annotation projects. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work.

Results: Comparing the average structural profile based on base stacking energy of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering with self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state of the art. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as the software currently available, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, resulting in a better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, which demonstrates its high precision.

Availability: Predictions for the human genome, the validation datasets and the program (ProSOM) are available upon request.
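As a rough illustration of the structural profiles mentioned in the Results, the sketch below maps a DNA sequence to one value per dinucleotide and averages the resulting profiles position by position. It is not the ProSOM implementation. The function names are hypothetical, and the lookup table holds dummy numbers rather than real base-stacking energies, so a published dinucleotide table would have to be substituted before any real use.

```python
from itertools import product

# ASSUMPTION: placeholder table with dummy values 0.0 .. 15.0 in the order
# AA, AC, AG, AT, CA, ... TT. These are NOT real base-stacking energies.
PLACEHOLDER_TABLE = {
    a + b: float(i) for i, (a, b) in enumerate(product("ACGT", repeat=2))
}


def structural_profile(sequence, table):
    """Convert a DNA sequence into one value per overlapping dinucleotide."""
    seq = sequence.upper()
    # A sequence of length n yields n - 1 dinucleotide steps.
    # Characters outside A, C, G, T raise KeyError in this sketch.
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def average_profile(sequences, table):
    """Position-wise mean profile over a set of equal-length sequences."""
    profiles = [structural_profile(s, table) for s in sequences]
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all sequences must have the same length")
    return [sum(column) / len(profiles) for column in zip(*profiles)]
```

Calling `average_profile` separately on promoter, transcribed and intergenic training sequences would give three curves that can be compared in the way the abstract describes.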

Figures

Fig. 1.
An SOM for clustering structural profiles. Every position in the structural profile is associated with an input node of the SOM. The output represents the different clusters organized into a grid structure.
Fig. 2.
Validation techniques for promoter prediction programs (PPPs). The top figure shows the classic technique, the bottom figure the new one. Genes or tags that have a prediction inside the black area are considered true positives (TPs). Predictions in the white area are false positives (FPs). All predictions inside the gray area are ignored. Note that only the classic technique has ignored regions. For the strict validation the distance is 50 bp.
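One possible reading of the new validation scheme in Fig. 2 (no ignored regions) can be sketched as follows. This is an interpretation of the caption, not the authors' evaluation code. It assumes that positions are plain integer coordinates on a single strand, that an annotated start site counts as a true positive when at least one prediction lies within `max_distance`, and that precision and recall follow their conventional definitions. The function name and dictionary keys are hypothetical.

```python
def score_predictions(tss_positions, predictions, max_distance=50):
    """Count true positives, false positives and false negatives by distance."""

    def near(a, b):
        return abs(a - b) <= max_distance

    # An annotated TSS is a true positive if some prediction is close enough.
    tp = sum(1 for t in tss_positions if any(near(t, p) for p in predictions))
    fn = len(tss_positions) - tp
    # A prediction that is not close to any annotated TSS is a false positive.
    # Nothing is ignored in this variant.
    fp = sum(
        1 for p in predictions if not any(near(p, t) for t in tss_positions)
    )

    # ASSUMPTION: conventional definitions, guarded against empty input.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "recall": recall,
        "precision": precision,
    }
```

The default of 50 bp mirrors the strict distance named in the caption. Any other distance used in the paper is not stated here and would need to be passed in explicitly.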
Fig. 3.
Structural profile of promoter (a), transcribed (b) and intergenic (c) human sequences. The profiles are the averages over all sequences in the respective training sets. We used the base-stacking energy as the physical property. Panel (a) covers the region [−200, 50] around the TSS, while for (b) and (c) there is no reference point and the locations are numbered from 0 to 250.
Fig. 4.
Result of SOM clustering on a 6 × 6 grid. The profile in each graph is the average of all sequences that map to that cluster. The sequences are converted using the base-stacking energy. The X-axis shows the position relative to the putative TSS. The two left-most clusters in the top row and the left-most cluster in the second row are promoter-rich and show the typical core promoter profile. The promoter-rich clusters are displayed with a white background, the others with a gray one. The legend shows the total number, the number of promoter (+) and the number of other (−) sequences.
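The clustering shown in Fig. 1 and Fig. 4 can be approximated with a small self-organizing map written in plain Python. The sketch below is a generic textbook SOM, not ProSOM itself. The initialisation from random training profiles, the linearly decaying learning rate, the Gaussian neighbourhood and the `min_fraction` threshold for calling a cluster promoter-rich are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid coordinates of the node closest to x (squared Euclidean distance)."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(profiles, rows=6, cols=6, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    # ASSUMPTION: each node starts as a copy of a random training profile.
    weights = {
        (r, c): list(rng.choice(profiles))
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = profiles[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = lr * math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Pull the node towards x, more strongly near the winner.
                for i in range(dim):
                    w[i] += h * (x[i] - w[i])
            step += 1
    return weights


def promoter_rich_clusters(weights, profiles, labels, min_fraction=0.5):
    """Nodes where at least min_fraction of the mapped profiles are promoters."""
    counts = {}
    for x, is_promoter in zip(profiles, labels):
        node = best_matching_unit(weights, x)
        pos, total = counts.get(node, (0, 0))
        counts[node] = (pos + int(is_promoter), total + 1)
    return {
        node
        for node, (pos, total) in counts.items()
        if pos / total >= min_fraction
    }
```

Training is unsupervised: the labels are used only afterwards, to count promoter (+) and other (−) sequences per cluster as in the Fig. 4 legend. A new profile would then be scored by checking whether its best matching unit falls in a promoter-rich cluster.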
