PLoS One. 2011;6(9):e24085.
doi: 10.1371/journal.pone.0024085. Epub 2011 Sep 9.

An exhaustive, non-euclidean, non-parametric data mining tool for unraveling the complexity of biological systems--novel insights into malaria

Cheikh Loucoubar et al. PLoS One. 2011.

Erratum in

  • PLoS One. 2011;6(10). doi:10.1371/annotation/654e34ce-f1cd-4207-b2ac-ebc873b821e9

Abstract

Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992-2003, aged 1-5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.
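
To make the idea of a rule as a region of over density concrete, the following minimal Python sketch scores one rule, expressed as inclusive ranges on several variables, against a set of records. It is not HyperCube®'s algorithm, which is not described here. The records, the variable names (`age`, `year`, `pfa`) and the helper functions are hypothetical, and lift is assumed to follow the conventional rule-mining definition: the proportion of positive outcomes inside the rule divided by the proportion in the whole data set.

```python
from typing import Dict, List, Tuple

# A rule maps each variable name to an inclusive (low, high) range.
Rule = Dict[str, Tuple[float, float]]


def in_rule(record: Dict[str, float], rule: Rule) -> bool:
    """True if the record falls inside every range of the rule."""
    return all(low <= record[var] <= high for var, (low, high) in rule.items())


def rule_stats(records: List[Dict[str, float]], rule: Rule,
               outcome: str = "pfa") -> Dict[str, float]:
    """Size, purity and lift of a rule (assumed conventional definitions)."""
    inside = [r for r in records if in_rule(r, rule)]
    base_rate = sum(r[outcome] for r in records) / len(records)
    purity = sum(r[outcome] for r in inside) / len(inside) if inside else 0.0
    lift = purity / base_rate if base_rate else float("nan")
    return {"size": len(inside), "purity": purity, "lift": lift}


# Hypothetical toy records, invented for illustration only (not study data).
TOY_RECORDS = [
    {"age": 2, "year": 1995, "pfa": 1},
    {"age": 3, "year": 1998, "pfa": 1},
    {"age": 4, "year": 2001, "pfa": 1},
    {"age": 5, "year": 1993, "pfa": 0},
    {"age": 10, "year": 1996, "pfa": 0},
    {"age": 30, "year": 2000, "pfa": 1},
    {"age": 45, "year": 2005, "pfa": 0},
    {"age": 60, "year": 1999, "pfa": 0},
]
TOY_RULE: Rule = {"age": (1, 5), "year": (1992, 2003)}

stats = rule_stats(TOY_RECORDS, TOY_RULE)
print(stats)  # {'size': 4, 'purity': 0.75, 'lift': 1.5}
```

In this toy example half of all records are positive, while three of the four records inside the rule are positive, so the rule is 1.5 times denser in events than the data set as a whole. Here size counts the records inside the rule, which is an assumption about how the indicator is defined.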

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Typical result from HyperCube®.
A) The “Key Indicators” table shows Lift: 1.39; Size: 1,689; Purity: 0.73. B) Graph comparing the proportion of events within the rule with the proportion in the entire population (pink: affected, PFA positive; green: unaffected, PFA negative). Both the pink and the green bars would reach the horizontal red line if the proportion of positive PFA were the same in the rule and in the entire population. C) The “Rule space” table shows the marginal contribution of each variable to the lift. Loss: the partial decrease in lift when each variable (or risk factor) is removed from the rule; Coverage: the percentage of events {PFA = 1} defined by the corresponding variable alone, relative to the total number of events {PFA = 1} in the whole dataset; Size: the increase in the number of events in the rule when the constraint on a variable is cancelled, that is, when the variable is dropped. D) Graphs showing the distribution of each variable (in blue) and the range of values within the rule (in green).
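
The Loss and Size columns of panel C can be read as a leave-one-variable-out calculation. The sketch below illustrates that reading in Python under stated assumptions: lift is taken as the proportion of positives inside the rule divided by the overall proportion, size counts records inside the rule, and the records, variable names and function names are all hypothetical. It is not the tool's own implementation.

```python
from typing import Dict, List, Tuple

Rule = Dict[str, Tuple[float, float]]  # variable -> inclusive (low, high)


def lift_and_size(records: List[Dict[str, float]], rule: Rule,
                  outcome: str = "pfa") -> Tuple[float, int]:
    """Lift and size of a rule; assumes at least one record is inside it."""
    inside = [r for r in records
              if all(lo <= r[v] <= hi for v, (lo, hi) in rule.items())]
    base_rate = sum(r[outcome] for r in records) / len(records)
    purity = sum(r[outcome] for r in inside) / len(inside)
    return purity / base_rate, len(inside)


def marginal_contributions(records: List[Dict[str, float]],
                           rule: Rule) -> Dict[str, Dict[str, float]]:
    """For each variable: lift lost and records gained when it is dropped."""
    full_lift, full_size = lift_and_size(records, rule)
    result = {}
    for var in rule:
        reduced = {v: rng for v, rng in rule.items() if v != var}
        lift, size = lift_and_size(records, reduced)
        result[var] = {"loss": full_lift - lift, "size_gain": size - full_size}
    return result


# Hypothetical toy records, invented for illustration only (not study data).
TOY_RECORDS = [
    {"age": 2, "year": 1995, "pfa": 1},
    {"age": 3, "year": 1998, "pfa": 1},
    {"age": 4, "year": 2001, "pfa": 1},
    {"age": 5, "year": 1993, "pfa": 0},
    {"age": 3, "year": 2006, "pfa": 0},
    {"age": 10, "year": 1996, "pfa": 0},
    {"age": 30, "year": 2000, "pfa": 1},
    {"age": 60, "year": 1999, "pfa": 0},
]
TOY_RULE: Rule = {"age": (1, 5), "year": (1992, 2003)}

contributions = marginal_contributions(TOY_RECORDS, TOY_RULE)
for var, values in contributions.items():
    print(var, round(values["loss"], 3), values["size_gain"])
```

A variable with a large loss and a small size gain is doing most of the work in the rule, since removing it lowers the lift sharply while admitting few additional records.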
Figure 2. Decision tree generated by Classification and Regression Tree (CART) analysis of risk factors determining the occurrence of P. falciparum malaria attacks (PFA) per trimester.
Figure shows the cut-off values identified by CART that divide the dataset into two. At each leaf are given the Relative Risk (RR) and the number of events associated with that leaf.
Figure 3. Effect on relative risk (RR) of modifying the ranges of continuous variables.
Graphs show the RR for all other possible definitions of the risk group on the explanatory variables with a size equal to or greater than that of the HyperCube® rule. The Y-axis indicates the RR. A) Only ranges of Age are modified: 102 choices among 4,851 possible choices had a size equal to or greater than 1,689 (the size of the HyperCube® rule) and are plotted; B) Only ranges of previous PMIs are modified: 35 choices among 1,035 possible; C) Only ranges of Year are modified: 25 choices among 190 possible; D) Ranges of both Age and previous PMIs are modified simultaneously: 25,040 choices among 5,020,785 possible; E) Ranges of both Age and Year are modified simultaneously: 8,912 choices among 921,690 possible; F) Ranges of both previous PMIs and Year are modified simultaneously: 1,110 choices among 196,650 possible. The filled red triangle represents the RR of the HyperCube® rule (the HyperCube® risk group); empty black circles represent the RR of other choices of risk group.
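
The counts of possible choices in this caption are internally consistent, which the short Python check below makes explicit. Each two-variable count is the product of the two single-variable counts, and each single-variable count equals n(n + 1)/2, the number of contiguous ranges [a, b] with a ≤ b over n ordered values. The values n = 98, 45 and 19 are inferred from that arithmetic and are not stated in the caption, so they should be read as one consistent interpretation rather than as the authors' stated procedure.

```python
def n_ranges(n_values: int) -> int:
    """Number of contiguous ranges [a, b] with a <= b over n ordered values."""
    return n_values * (n_values + 1) // 2


# Single-variable counts of possible range choices, as given in the caption.
SINGLE = {"age": 4851, "pmi": 1035, "year": 190}

# Two-variable counts, as given in the caption.
JOINT = {
    ("age", "pmi"): 5020785,
    ("age", "year"): 921690,
    ("pmi", "year"): 196650,
}

# Inferred numbers of distinct values per variable (an assumption, see above).
INFERRED_N = {"age": 98, "pmi": 45, "year": 19}

for var, n in INFERRED_N.items():
    print(var, n_ranges(n) == SINGLE[var])

for (a, b), reported in JOINT.items():
    print(a, b, SINGLE[a] * SINGLE[b] == reported)
```

Every comparison prints `True`, so modifying two variables at once simply pairs every range of the first with every range of the second.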
