PLoS One. 2013;8(1):e53112.
doi: 10.1371/journal.pone.0053112. Epub 2013 Jan 7.

Automatic peak selection by a Benjamini-Hochberg-based algorithm

Ahmed Abbas et al. PLoS One. 2013.

Abstract

A common issue in bioinformatics is that computational methods often generate a large number of predictions sorted according to certain confidence scores. A key problem is then determining how many predictions must be selected to include most of the true predictions while maintaining reasonably high precision. In nuclear magnetic resonance (NMR)-based protein structure determination, for instance, computational peak picking methods are becoming increasingly common, although expert knowledge remains the method of choice to determine how many peaks among thousands of candidate peaks should be taken into consideration to capture the true peaks. Here, we propose a Benjamini-Hochberg (B-H)-based approach that automatically selects the number of peaks. We formulate the peak selection problem as a multiple testing problem. Given a candidate peak list sorted by either volumes or intensities, we first convert the peaks into p-values and then apply the B-H-based algorithm to automatically select the number of peaks. The proposed approach is tested on state-of-the-art peak picking methods, including WaVPeak [1] and PICKY [2]. Compared with the traditional fixed number-based approach, our approach returns significantly more true peaks. For instance, by combining WaVPeak or PICKY with the proposed method, the missing peak rates are on average reduced by 20% and 26%, respectively, in a benchmark set of 32 spectra extracted from eight proteins. The consensus of the B-H-selected peaks from both WaVPeak and PICKY achieves 88% recall and 83% precision, which significantly outperforms each individual method and the consensus method without using the B-H algorithm. The proposed method can be used as a standard procedure for any peak picking method and straightforwardly applied to some other prediction selection problems in bioinformatics.
The source code, documentation and example data of the proposed method are available at http://sfb.kaust.edu.sa/pages/software.aspx.


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Illustration of the Benjamini-Hochberg procedure.
In this example, the number of hypotheses is 10 and the false discovery proportion is 0.2. The largest index of the hypotheses that is below the line is 6. Therefore, the first six hypotheses are rejected, that is, the first six candidates are selected as the predicted peaks.
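The step-up rule illustrated in Figure 1 can be sketched in a few lines of Python. This is a minimal sketch of the standard Benjamini-Hochberg procedure, not the authors' released code: the function name `benjamini_hochberg_cutoff` and the ten p-values are hypothetical, chosen only so that the outcome matches the figure's setting (10 hypotheses, level 0.2, six rejections). How peak volumes or intensities are converted into p-values is not specified here and is not covered by the sketch.

```python
def benjamini_hochberg_cutoff(p_values, q):
    """Return k, the number of leading hypotheses rejected by the B-H rule.

    p_values must be sorted in ascending order. k is the largest index i
    (1-based) such that p_values[i-1] <= q * i / m, or 0 if no index
    satisfies the condition.
    """
    m = len(p_values)
    k = 0
    for i, p in enumerate(p_values, start=1):
        if p <= q * i / m:
            k = i  # keep the largest index that lies on or below the line
    return k


# Hypothetical sorted p-values for 10 candidate peaks (illustration only).
p_values = [0.005, 0.03, 0.05, 0.09, 0.11, 0.115, 0.20, 0.30, 0.45, 0.60]

k = benjamini_hochberg_cutoff(p_values, q=0.2)
print(k)  # 6: the first six candidates are kept as predicted peaks
```

Note that the fourth and fifth p-values lie above the line, yet they are still selected: the rule keeps every candidate up to the largest index that falls below the line, not only those that individually pass.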
Figure 2. Original volume curves and the corresponding p-value curves.
(a) and (d): sorted volume curve (a) and the corresponding p-value curve (d) of peaks predicted by WaVPeak on the 2D 15N-HSQC spectrum of the protein ATC1776; (b) and (e): sorted volume curve (b) and the corresponding p-value curve (e) of peaks predicted by WaVPeak on the 3D HNCO spectrum of the protein VRAR; (c) and (f): sorted volume curve (c) and the corresponding p-value curve (f) of peaks predicted by WaVPeak on the 3D CBCA(CO)NH spectrum of the protein COILIN. In all figures, true peaks are shown in black and false ones are shown in cyan. In (d), (e) and (f), the decision boundaries of [formula] and the B-H procedure are shown in black and magenta, respectively.
Figure 3. Original intensity curves and the corresponding p-value curves.
(a) and (d): sorted intensity curve (a) and the corresponding p-value curve (d) of peaks predicted by PICKY on the 2D 15N-HSQC spectrum of the protein TM1112; (b) and (e): sorted intensity curve (b) and the corresponding p-value curve (e) of peaks predicted by PICKY on the 3D HNCO spectrum of the protein COILIN; (c) and (f): sorted intensity curve (c) and the corresponding p-value curve (f) of peaks predicted by PICKY on the 3D CBCA(CO)NH spectrum of the protein RP3384. In these figures, true peaks are shown in black and false ones are shown in cyan. In (d), (e) and (f), the decision boundaries of [formula] and the B-H procedure are shown in black and magenta, respectively.
Figure 4. Precision-recall curves for different peak picking methods and sensitivity analysis of B-H WaVPeak.
(a)–(e): precision-recall curves for different methods on 15N-HSQC, HNCO, HNCA, CBCA(CO)NH and NHCACB, respectively. The solid black curves are for the B-H consensus method; the dashed black curves are for the 1.5[formula] consensus method; the solid cyan curves are for B-H WaVPeak; the dashed cyan curves are for the original WaVPeak; the solid magenta curves are for B-H PICKY; and the dashed magenta curves are for the original PICKY. The relative area under the curve (AUC) values, computed as the area under the curve divided by the total area over the region with recall of at least 0.7, are given in the legends. (f): sensitivity analysis for different numbers of peaks. The precision and recall values of B-H WaVPeak are shown when the top [formula], [formula], [formula] and [formula] peaks are used to calculate the p-values.
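For readers unfamiliar with how the points on a precision-recall curve such as those in Figure 4 arise, the following sketch computes precision and recall at every cutoff of a ranked candidate list. It assumes each candidate has already been labelled as a true or false peak; the function name `precision_recall_at_cutoffs` and the example labels are hypothetical and do not come from the benchmark data.

```python
def precision_recall_at_cutoffs(is_true, n_true_total):
    """Return (precision, recall) for each cutoff k = 1..len(is_true).

    is_true lists, in ranked order, whether each candidate is a true peak.
    n_true_total is the number of true peaks in the reference list, which
    may exceed the number of true peaks found among the candidates.
    """
    points = []
    hits = 0
    for k, flag in enumerate(is_true, start=1):
        hits += int(flag)
        points.append((hits / k, hits / n_true_total))
    return points


# Hypothetical ranked labels: 5 candidates, 4 true peaks in the reference.
labels = [True, True, False, True, False]
points = precision_recall_at_cutoffs(labels, n_true_total=4)

for k, (precision, recall) in enumerate(points, start=1):
    print(f"top {k}: precision={precision:.2f}, recall={recall:.2f}")
```

Selecting the cutoff with the B-H procedure corresponds to picking a single point on such a curve, rather than fixing the number of peaks in advance.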

References

    1. Liu Z, Abbas A, Jing B, Gao X (2012) WaVPeak: picking NMR peaks through wavelet-based smoothing and volume-based filtering. Bioinformatics 28: 914–920.
    2. Alipanahi B, Gao X, Karakoc E, Donaldson L, Li M (2009) PICKY: a novel SVD-based NMR spectra peak picking method. Bioinformatics 25: i268–i275.
    3. Wüthrich K (1986) NMR of Proteins and Nucleic Acids. New York: John Wiley and Sons.
    4. Gao X (2009) Towards automating protein structure determination from NMR data. PhD dissertation, University of Waterloo.
    5. Gao X (2012) Mathematical approaches to the NMR peak-picking problem. Journal of Applied and Computational Mathematics 1: 1.
