Review
Genome Biol. 2006;7 Suppl 1(Suppl 1):S3.1-13.
doi: 10.1186/gb-2006-7-s1-s3. Epub 2006 Aug 7.

Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment

Vladimir B Bajic et al. Genome Biol. 2006.

Abstract

Background: This study analyzes the predictions of a number of promoter predictors on the ENCODE regions of the human genome as part of the ENCODE Genome Annotation Assessment Project (EGASP). The systems analyzed operate on various principles and we assessed the effectiveness of different conceptual strategies used to correlate produced promoter predictions with the manually annotated 5' gene ends.

Results: The predictions were assessed relative to the manual HAVANA annotation of the 5' gene ends. These 5' gene ends were used as the estimated reference transcription start sites. With the maximum allowed distance for predictions of 1,000 nucleotides from the reference transcription start sites, the sensitivity of predictors was in the range 32% to 56%, while the positive predictive value was in the range 79% to 93%. The average distance mismatch of predictions from the reference transcription start sites was in the range 259 to 305 nucleotides. At the same time, using transcription start site estimates from DBTSS and H-Invitational databases as promoter predictions, we obtained a sensitivity of 58%, a positive predictive value of 92%, and an average distance from the annotated transcription start sites of 117 nucleotides. In this experiment, the best performing promoter predictors were those that combined promoter prediction with gene prediction. The main reason for this is the reduced promoter search space that resulted in smaller numbers of false positive predictions.

Conclusion: The main finding, now supported by comprehensive data, is that the accuracy of human promoter predictors for high-throughput annotation purposes can be significantly improved if promoter prediction is combined with gene prediction. Based on the lessons learned in this experiment, we propose a framework for the preparation of the next similar promoter prediction assessment.

Figures

Figure 1
Prediction results for the distance criterion of 1,000 nucleotides. The light blue row shows the results of comparing DBTSS+H-Invitational data to the manual HAVANA annotation. We used this as a reference to enable assessment of promoter predictor performance. The highlighted blue fields denote the score for the best performing promoter predictor. MaxTol is the maximum allowed mismatch between the predictions and the reference TSS locations. The programs with names in red officially participated in the EGASP data submission. The results shown are for MaxTol = 1,000 nucleotides. AE is the average mismatch of predictions relative to the closest TSS location from the HAVANA annotation. It is divided by 1,000 to scale for the graph presentation. DIP1 and DIP2 are two measures representing distance from the ideal predictor as defined in [10]. ASM is the average score measure as defined in [10].
Figure 2
Prediction results for the distance criterion of 250 nucleotides. The light blue row shows the results of comparing DBTSS+H-Invitational data to the manual HAVANA annotation. We used this as a reference to enable assessment of promoter predictor performance. The highlighted blue fields denote the score for the best performing promoter predictor(s). MaxTol is the maximum allowed mismatch between the predictions and the reference TSS locations. AE is the average mismatch of predictions relative to the closest TSS location from the HAVANA annotation. It is divided by 1,000 to scale for the graph presentation. DIP1 and DIP2 are two measures representing distance from the ideal predictor as defined in [10]. ASM is the average score measure as defined in [10]. The programs with names in red officially participated in the EGASP data submission. The results shown are for MaxTol = 250 nucleotides.
Figure 3
The results for different ENCODE regions. The results presented are for the maximum allowed distance of 1,000 nucleotides between the predicted TSS and the reference one. AE is the average mismatch of predictions relative to the closest TSS location from the HAVANA annotation. It is divided by 1,000 to scale for the graph presentation. Results are presented for all ENCODE regions, for the training set, and for the test set. Scores relate to predictor performance as follows: for Se and ppv, the higher the score, the better the performance. The scores for these two measures range from 0 to 1. For AE, the lower the score, the better.
Figure 4
Another set of results for ENCODE regions. The results presented are for the maximum allowed distance of 1,000 nucleotides between the predicted TSS and the reference one. DIP1 and DIP2 are two measures of prediction quality expressed as distances from the ideal predictor [10]. CC is the Pearson correlation coefficient. ASM is the average score measure as defined in [10]. DIP2 and ASM are scaled down to fit into the graph. Results are presented for all ENCODE regions, for the training set, and for the test set. Scores relate to predictor performance as follows: for distances from the ideal predictor (DIP1 and DIP2), as well as for ASM, the lower the score, the better. ASM represents the averaged rank position of the predictor calculated based on the individual measures of success. For CC, the greater the score, the better. CC ranges from -1 to +1.
Figure 5
The counting method for TPs and FPs. All hits to the 'orange' segments count as FPs. Only one hit within A, B, or C counts as a TP for a unique TSS position (for example, three hits within C count as only one TP). Note that all mutually different TSS locations were considered valid reference TSSs, so alternative TSSs were treated as different TSSs, and each of them had to be predicted. If one prediction falls on the intersection of A and B, then that prediction identifies two TSS locations (one corresponding to the TSS related to A, and the other to the TSS related to B). In other words, one prediction correctly identifies all reference TSS locations within the distance criterion.
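The counting rules in the Figure 5 caption can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function name `assess_predictions`, the toy coordinates, and the decision to ignore strand and chromosome are all assumptions made for illustration. The sketch also assumes the standard definitions Se = TP / (TP + FN) and ppv = TP / (TP + FP), and it assumes that AE is averaged only over predictions that fall within the distance criterion, which the captions do not state explicitly.

```python
def assess_predictions(reference_tss, predictions, max_tol):
    """Count TPs/FPs for TSS predictions under a distance criterion.

    reference_tss: iterable of reference TSS positions (integers)
    predictions:   iterable of predicted TSS positions (integers)
    max_tol:       maximum allowed mismatch in nucleotides (MaxTol)
    """
    # Mutually different TSS locations are separate reference TSSs.
    refs = sorted(set(reference_tss))

    hit_refs = set()   # reference TSSs identified by at least one prediction
    fp = 0             # predictions that fall outside every allowed window
    errors = []        # mismatch to the closest reference TSS, per hit

    for p in predictions:
        covered = [r for r in refs if abs(p - r) <= max_tol]
        if covered:
            # One prediction identifies every reference TSS within range;
            # repeated hits on the same TSS add nothing to the set.
            hit_refs.update(covered)
            errors.append(min(abs(p - r) for r in covered))
        else:
            fp += 1

    tp = len(hit_refs)
    fn = len(refs) - tp
    se = tp / (tp + fn) if refs else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumption: AE is averaged over in-range predictions only.
    ae = sum(errors) / len(errors) if errors else None
    return {"tp": tp, "fp": fp, "fn": fn, "se": se, "ppv": ppv, "ae": ae}


# Toy example (made-up coordinates): two nearby alternative TSSs and one
# distant TSS, with two in-range predictions and two out-of-range ones.
result = assess_predictions(
    reference_tss=[1000, 1500, 10000],
    predictions=[1200, 1250, 5000, 20000],
    max_tol=1000,
)
print(result)
```

In this toy case both in-range predictions lie in the overlap of the windows around the TSSs at 1000 and 1500, so each identifies both TSSs, yet the pair contributes only two TPs in total. The predictions at 5000 and 20000 count as FPs, and the TSS at 10000 is missed.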

References

    1. Weinzierl ROJ. Mechanisms of Gene Expression: Structure, Function, and Evolution of the Basal Transcriptional Machinery. London: Imperial College Press; 1999.
    2. Smale ST, Kadonaga JT. The RNA polymerase II core promoter. Annu Rev Biochem. 2003;72:449–479. doi: 10.1146/annurev.biochem.72.121801.161520.
    3. FANTOM Consortium; RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group). The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563. doi: 10.1126/science.1112014.
    4. RIKEN Genome Exploration Research Group, Genome Science Group (Genome Network Project Core Group) and FANTOM Consortium. Antisense transcription in the mammalian transcriptome. Science. 2005;309:1564–1566. doi: 10.1126/science.1112009.
    5. Pedersen AG, Baldi P, Chauvin Y, Brunak S. The biology of eukaryotic promoter prediction - a review. Computers Chem. 1999;23:191–207. doi: 10.1016/S0097-8485(99)00015-7.
