Big Data. 2014 Sep 1;2(3):164-176.
doi: 10.1089/big.2014.0023.

Hard Data Analytics Problems Make for Better Data Analysis Algorithms: Bioinformatics as an Example

Jaume Bacardit et al. Big Data.

Abstract

Data mining and knowledge discovery techniques have greatly progressed in the last decade. They are now able to handle larger and larger datasets, process heterogeneous information, integrate complex metadata, and extract and visualize new knowledge. Often these advances were driven by new challenges arising from real-world domains, with biology and biotechnology being a prime source of diverse and hard (e.g., high volume, high throughput, high variety, and high noise) data analytics problems. The aim of this article is to show the broad spectrum of data mining tasks and challenges present in biological data, and how these challenges have driven us over the years to design new data mining and knowledge discovery procedures for biodata. This is illustrated with the help of two kinds of case studies. The first kind is focused on the field of protein structure prediction, where we have contributed in several areas: by designing, through regression, functions that can distinguish between good and bad models of a protein's predicted structure; by creating new measures to characterize aspects of a protein's structure associated with individual positions in a protein's sequence, measures containing information that might be useful for protein structure prediction; and by creating accurate estimators of these structural aspects. The second kind of case study is focused on omics data analytics, a class of biological data characterized by extremely high dimensionality. Our methods were able not only to generate very accurate classification models, but also to discover new biological knowledge that was later confirmed by experimentalists.
Finally, we describe several strategies to tightly integrate knowledge extraction and data mining in order to create a new class of biodata mining algorithms that can natively embrace the complexity of biological data, efficiently generate accurate information in the form of classification/regression models, and extract valuable new knowledge. Thus, a complete data-to-information-to-knowledge pipeline is presented.

Figures

FIG. 1.
Two measures to represent structural aspects of protein amino acids based on geometry (A) and topology (B). (A) Recursive convex hull measure, represented in 2D in an idealized representation and in 3D for real protein 1MUW. Each color identifies a layer of amino acids formally defined as a convex hull of points. The aim of this measure is to quantify the degree of burial of a given amino acid within the 3D structure of the protein to which it belongs. Reproduced from Stout et al. by permission of Oxford University Press. (B) Proximity graphs family of topological structural aspects, represented for real protein 153L. Each amino acid in a protein is represented as a vertex in the graph. Edges connect the amino acids considered to be structural neighbors (in the 3D space) by each of the four topology measures. DT, Delaunay tessellation; GG, Gabriel graph; MST, minimum spanning tree; RNG, relative neighborhood graph. Reproduced from Stout et al. with permission from Springer Science and Business Media.
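The two measures in FIG. 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`convex_hull`, `hull_layers`, `proximity_edges`) and the toy coordinates are hypothetical, and the hull peeling is done in 2D, as in the idealized panel, rather than on real 3D residue coordinates. Part (A) assigns each point the number of the convex hull layer it falls on, so higher numbers mean deeper burial. Part (B) builds two of the four proximity graphs, GG and RNG, from their standard geometric definitions.

```python
from itertools import combinations

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Strict hull vertices of 2D points (tuples), via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    return half(pts)[:-1] + half(reversed(pts))[:-1]

def hull_layers(points):
    """(A) Peel hulls repeatedly; map each point to its layer (1 = outermost)."""
    remaining = set(points)
    layer_of, layer = {}, 1
    while remaining:
        hull = convex_hull(remaining)
        for p in hull:
            layer_of[p] = layer
        remaining -= set(hull)
        layer += 1
    return layer_of

def sqdist(p, q):
    """Squared Euclidean distance; works for 2D or 3D tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def proximity_edges(points, kind):
    """(B) Edges (i, j), i < j, of the Gabriel graph ("GG") or the
    relative neighborhood graph ("RNG") over the given points."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        d_ij = sqdist(points[i], points[j])
        blocked = False
        for k in range(len(points)):
            if k == i or k == j:
                continue
            d_ik = sqdist(points[i], points[k])
            d_jk = sqdist(points[j], points[k])
            if kind == "GG":
                # k lies strictly inside the ball whose diameter is segment ij.
                blocked = d_ik + d_jk < d_ij
            else:
                # k is strictly closer to both i and j than they are to each other.
                blocked = max(d_ik, d_jk) < d_ij
            if blocked:
                break
        if not blocked:
            edges.add((i, j))
    return edges

# Toy data (an assumption for illustration): two nested squares and a center.
toy = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
layers = hull_layers(toy)
gabriel = proximity_edges(toy, "GG")
rng = proximity_edges(toy, "RNG")
```

By construction every RNG edge is also a Gabriel edge, which matches the nesting MST ⊆ RNG ⊆ GG ⊆ DT known for these graphs. A 3D version of part (A) would need a 3D hull routine in place of `convex_hull`; part (B) already accepts 3D tuples.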
FIG. 2.
Structure of our contact map predictors for the CASP9 (left) and CASP11 (right) experiments. Each bubble represents one attribute in the representation of the contact map classification problem, and the size of the bubble indicates its importance in our classification method. Color identifies the source of information. The difference between the CASP9 and CASP11 predictors is the addition of two attributes (top right in light green) that produce a noticeable change in the shape of the overall predictor.
FIG. 3.
(A) Simplified representation of the co-prediction principle for network inference from rule-based machine learning. (B) Co-prediction network generated from a plant seed transcriptomics dataset.
FIG. 4.
Tree map representation of the structure of our CASP9 rule-based contact map predictor. Each box represents an attribute, and the size of the box indicates the attribute's relevance. The color of the box identifies the source of information. Attributes that participate in the rules being activated for a specific protein are connected with edges. Red edges: activations for a particular class of protein structure, an all-alpha protein (code 1ECA). Blue edges: activations for another particular class of protein structure, an all-beta protein (code 1CD8). An edge bundling technique is used to visualize the resulting network. All rules classify instances as belonging to the same class (contact), but they use different strategies depending on the type of protein.

References

    1. Schneider MV, Orchard S. Omics technologies, data and bioinformatics principles. Methods Mol Biol 2011; 719:3–30
    2. Stout M, Bacardit J, Hirst JD, Krasnogor N. Prediction of recursive convex hull class assignments for protein residues. Bioinformatics 2008; 24:916–923
    3. Stout M, Bacardit J, Hirst JD, et al. Prediction of topological contacts in proteins using learning classifier systems. Soft Comput 2009; 13:245–258
    4. Bacardit J, Stout M, Hirst JD, et al. Automated alphabet reduction for protein datasets. BMC Bioinform 2009; 10:6
    5. Widera P, Garibaldi JM, Krasnogor N. GP challenge: evolving energy function for protein structure prediction. Genet Program Evol Mach 2010; 11:61–88
