Comparative Study
J Comput Biol. 2009 Apr;16(4):565-77.
doi: 10.1089/cmb.2008.0151.

Statistical comparison framework and visualization scheme for ranking-based algorithms in high-throughput genome-wide studies

Waibhav D Tembe et al. J Comput Biol. 2009 Apr.

Abstract

As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling-based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
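
To make the idea of comparing ranked candidate lists concrete, the following is a minimal sketch of the three kinds of coefficient named in the Fig. 1 caption (intersection, Spearman, Kendall), computed on the "top k" entries of two lists. The abstract does not give the exact definitions or normalization used in the paper, so the conventions here are assumptions: items missing from a top-k list are assigned rank k, the Spearman variant is the unnormalized footrule distance, pairs tied in either list are not counted as discordant, and the function names and SNP identifiers are hypothetical.

```python
from itertools import combinations


def intersection_coefficient(list_a, list_b, k):
    """Fraction of the top-k items shared by both lists (higher = more similar)."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k


def _top_k_ranks(ranked, k, universe):
    # Assumption: any item absent from the top k is placed at rank k.
    position = {item: i for i, item in enumerate(ranked[:k])}
    return {item: position.get(item, k) for item in universe}


def spearman_footrule_distance(list_a, list_b, k):
    """Sum of absolute rank differences over the union of both top-k lists
    (lower = more similar)."""
    universe = set(list_a[:k]) | set(list_b[:k])
    ranks_a = _top_k_ranks(list_a, k, universe)
    ranks_b = _top_k_ranks(list_b, k, universe)
    return sum(abs(ranks_a[x] - ranks_b[x]) for x in universe)


def kendall_distance(list_a, list_b, k):
    """Number of item pairs ordered oppositely by the two top-k lists
    (lower = more similar). Pairs tied in either list are not counted."""
    universe = list(set(list_a[:k]) | set(list_b[:k]))
    ranks_a = _top_k_ranks(list_a, k, universe)
    ranks_b = _top_k_ranks(list_b, k, universe)
    discordant = 0
    for x, y in combinations(universe, 2):
        if (ranks_a[x] - ranks_a[y]) * (ranks_b[x] - ranks_b[y]) < 0:
            discordant += 1
    return discordant


if __name__ == "__main__":
    # Hypothetical SNP identifiers, for illustration only.
    method_1 = ["rs1", "rs2", "rs3", "rs4"]
    method_2 = ["rs2", "rs1", "rs3", "rs5"]
    for k in (2, 4):
        print(k,
              intersection_coefficient(method_1, method_2, k),
              spearman_footrule_distance(method_1, method_2, k),
              kendall_distance(method_1, method_2, k))
```

Evaluating these functions over a range of k and plotting the values against k gives curves of the kind the figure captions describe. Two methods whose curves nearly coincide against the reference list would be candidates for redundancy.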


Figures

Fig. 1.
Kendall, Spearman, and intersection coefficients between GenePool-generated top candidate SNP lists and the true list for the CEPH population in HapMap (a–f). The x-axis is the number of “top k” SNPs considered. For Kendall and Spearman, lower y-values mean higher similarity. For intersection, higher y-values mean higher similarity. Note the similarity between c, e, and f, indicating redundant methods generating similar results, and the difference between d and e, indicating non-redundant methods.
Fig. 2.
Pair-wise exhaustive comparison of five methods. The x-axis shows the selected number of “top k” SNPs. The y-axis shows the respective coefficient values. The legend indicates the clustering algorithm and data transformation. For example, Silh-AkB versus Ttest-AkB shows a comparison of candidate lists obtained using the Silhouette and T-test methods, where image intensity values were transformed using the A/(A + kB) transform. Abbreviations used: Silho, Silhouette Index; Centr, Centroid; DunIn, Dunn Index; CoUdr, Consistency Undirectional; ConDir, Consistency Directional; Ttest, T-Test; MdfdT, Modified T-Test.
Fig. 3.
Four inferno plots are shown to visualize the results of the genome-wide study carried out on Affymetrix 500 K genotyping microarrays. (Upper left) Plot shows the locations of the top 5,000 SNPs using the centroid method. The ring of alternating black and white regions represents chromosomes, the central starburst shows the locations and relative magnitudes of individual SNP scores, and the outermost ring of ticks shows the location of the top 100 hits. (Upper right) Plot shows results from the same dataset analyzed using the T-test method. (Lower left) Plot shows a composite of the two upper plots (centroid and T-test results presented on the same plot), obtained by rescaling the scores and plotting the centroid scores inwards from the chromosome ring and the T-test scores outward from the plot origin. This plot is useful for finding regions that are concordant for high-ranking SNPs across both analysis methods. A clear example can be seen in the middle of chromosome 2, where the score lines for a very high-scoring centroid SNP and a very high-scoring T-test SNP almost touch. The outer rings showing the locations of the top 100 SNPs confirm the concordance. (Lower right) Plot shows the same data as the composite plot, except that it has been trimmed to show the results for chromosomes 1–6 only, allowing for higher resolution of the chromosomal location of hits. For convenience, all the scores have been normalized to a user-defined range between zero and five (vertical line going from zero to five and back to zero).
Fig. 4.
Relative allele strength (RAS) values for a case-control association study on Affymetrix genotyping microarrays using three arrays each for cases and controls. Squares correspond to the case cohort and diamonds correspond to the control cohort. The x-axis corresponds to quartets: each SNP is interrogated ten times on each chip, generating ten RAS measurements, one per quartet. The separation between cases and controls indicates the degree of association of the SNP with the phenotype/disease being studied.
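
As a rough illustration of the RAS values in Fig. 4, the sketch below applies the A/(A + kB) transform quoted in the Fig. 2 caption to allele intensity pairs and summarizes how far apart cases and controls sit. Everything beyond the transform itself is an assumption made for illustration: the meaning and default value of k, the use of a difference of means as the separation measure, and all function names and intensity values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_allele_strength(a, b, k=1.0):
    """A/(A + kB) transform of the two allele intensities A and B.

    k is treated here as a generic correction factor (assumption).
    """
    denominator = a + k * b
    if denominator == 0:
        raise ValueError("A + k*B must be non-zero")
    return a / denominator


def mean_ras(intensity_pairs, k=1.0):
    """Average RAS over a collection of (A, B) measurements, e.g. quartets."""
    return mean(relative_allele_strength(a, b, k) for a, b in intensity_pairs)


def case_control_separation(case_pairs, control_pairs, k=1.0):
    """Absolute difference between mean case RAS and mean control RAS.

    This is a simple stand-in for a real test statistic, not the paper's method.
    """
    return abs(mean_ras(case_pairs, k) - mean_ras(control_pairs, k))


if __name__ == "__main__":
    # Hypothetical (A, B) intensities for one SNP.
    cases = [(3.0, 1.0), (3.0, 1.0)]
    controls = [(1.0, 1.0), (1.0, 1.0)]
    print(case_control_separation(cases, controls))
```

A larger separation for a SNP would place it higher in a ranked candidate list under this toy score. The methods named in Fig. 2 (for example the T-test or Silhouette Index) would replace the difference of means in practice.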
