Bioinformatics. 2012 Feb 15;28(4):573-80.
doi: 10.1093/bioinformatics/btr709. Epub 2012 Jan 12.

Robust rank aggregation for gene list integration and meta-analysis

Raivo Kolde et al. Bioinformatics. 2012.

Abstract

Motivation: Continued progress in technological platforms, the availability of many published experimental datasets, and the variety of statistical methods for analyzing those data make it possible to approach the same research question with several methods simultaneously. To get the best out of all these alternatives, we need to integrate their results in an unbiased manner. Prioritized gene lists are a common way to present results in genomic data analysis. Rank aggregation methods can therefore serve as a useful and general solution for the integration task.

Results: Standard rank aggregation methods are often ill-suited for biological settings where the gene lists are inherently noisy. As a remedy, we propose a novel robust rank aggregation (RRA) method. Our method detects genes that are ranked consistently better than expected under the null hypothesis of uncorrelated inputs and assigns a significance score to each gene. The underlying probabilistic model makes the algorithm parameter-free and robust to outliers, noise and errors. Significance scores also provide a rigorous way to keep only the statistically relevant genes in the final list. These properties make our approach robust and compelling for many settings.
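
The scoring idea can be illustrated with a minimal Python sketch. It is not the RobustRankAggreg implementation. It assumes that each gene's rank in a list is normalized to (0, 1] by dividing by the list length, that under the null hypothesis these normalized ranks behave like independent uniform draws, and that the minimum over order statistics is corrected by multiplying by the number of lists (a Bonferroni-style bound). The function names `beta_score` and `rho_score` are hypothetical and chosen only to mirror the βk,n and ρ notation of Fig. 1.

```python
from math import comb


def beta_score(x: float, k: int, n: int) -> float:
    """Probability that the k-th smallest of n independent Uniform(0, 1)
    draws is <= x. This is a binomial tail: at least k of the n draws
    must fall at or below x."""
    return sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(k, n + 1))


def rho_score(normalized_ranks: list[float]) -> float:
    """Sketch of a rho-style score for one gene.

    normalized_ranks holds the gene's rank in each input list divided by
    that list's length (assumption: the gene appears in every list).
    """
    r = sorted(normalized_ranks)
    n = len(r)
    # One beta score per order statistic: how surprising is it that the
    # k-th best rank is this small if the ranks were purely random?
    betas = [beta_score(r[k - 1], k, n) for k in range(1, n + 1)]
    # The minimum of several p-values is not itself a p-value, so apply a
    # Bonferroni-style correction (assumed here) and cap the result at 1.
    return min(1.0, n * min(betas))


if __name__ == "__main__":
    consistently_top = [0.01, 0.02, 0.01, 0.03, 0.02]
    spread_out = [0.1, 0.3, 0.5, 0.7, 0.9]
    print(rho_score(consistently_top))  # very small: ranked well everywhere
    print(rho_score(spread_out))        # capped at 1: looks like noise
```

A gene that sits near the top of most lists receives a very small score even if one or two lists rank it poorly, because only the best-supported order statistic determines the minimum. This is one way to see why a score of this form tolerates outlier lists.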

Availability: All the methods are implemented as a GNU R package RobustRankAggreg, freely available at the Comprehensive R Archive Network http://cran.r-project.org/.

Figures

Fig. 1.
Visual description of RRA. (A) Shows an example of 20 ranked lists, with the positions of two genes highlighted. The first gene is placed near the top of the lists and the second is distributed uniformly. (B) Shows in detail how the βk,n scores change and how the ρ score is found.
Fig. 2.
Results from a simulation study. (A) Shows significance scores calculated with different methods on 10 lists containing 50 planted elements. The number of true and false positives was computed at an FDR level of 0.05. Both methods based on order statistics (RRA and Stuart) separate planted elements from noise better than average rank. Still, the Stuart method produces many false positives and thus cannot be used for deciding the significance of genes. (B) Shows ROC curves of different methods on noisy data (10 lists with signal, 30 random). Methods based on order statistics outperform average rank considerably. (C) Shows the number of true positives found at different levels of noise. At each level, we simulated 10 datasets. RRA shows much higher resistance to noise than average rank. The Stuart method was excluded from (C) as it failed to distinguish planted elements from noise.
Fig. 3.
The proportion of planted elements that were correctly identified by RRA given different numbers of top elements available in the input rankings. The gray line shows the proportion of planted elements in the inputs. The number of correctly identified elements starts to fall only once almost the whole list has been discarded. Therefore, by using partial instead of full rankings we usually lose very little information.
Fig. 4.
Predicting gene membership in a GO category based on the knockouts of its transcription factors. A gene name on the x-axis corresponds to a knockout and each bubble represents the Fisher's exact test P-value, showing the enrichment of the knockout-affected genes in the GO category. The horizontal line shows the same enrichment P-value for the aggregated list. The size of the bubble corresponds to the number of regulated genes in the knockout and the color shows whether the P-value is significant. The P-values show that the aggregated list is more enriched in the genes related to the corresponding process than most of the inputs.
Fig. 5.
AUC scores when predicting transcription factor targets based on gene co-expression. The gray dots represent the individual results; the black dots and plus signs represent results aggregated with RRA and the Stuart method. These values show that when a signal is present in the inputs, aggregation methods pick it up and outperform most of the inputs. When the signal in the inputs is low (AUC ∼0.5), aggregated results are not considerably better. The results for RRA and the Stuart method are almost identical, since they use very similar criteria for aggregation.
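
For readers less familiar with the AUC values quoted in Fig. 5, the following Python sketch shows one standard way to compute the statistic from a scored gene list. It is an illustration only and does not reproduce the evaluation used in the paper. The function name `auc` and the toy inputs are hypothetical. Scores are assumed to be oriented so that higher means "more likely a true target", so for a ranked list one could use the negated rank as the score.

```python
def auc(scores: list[float], labels: list[bool]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive item scores
    higher than a randomly chosen negative item, with ties counted as 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: ranks 1..6, with true targets at ranks 1, 2 and 4.
    ranks = [1, 2, 3, 4, 5, 6]
    is_target = [True, True, False, True, False, False]
    print(auc([-r for r in ranks], is_target))
```

An AUC near 0.5 means the ordering is no better than random at separating targets from non-targets, which is the low-signal regime mentioned in the caption.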
