Review

Adv Exp Med Biol. 2014:844:153-87.

doi: 10.1007/978-1-4939-2095-2_8.

Systems analysis of high-throughput data

Rosemary Braun¹

Affiliation

¹ Biostatistics Division, Department of Preventive Medicine and Northwestern Institute on Complex Systems, Northwestern University, 680 N. Lake Shore Dr., Suite 1400, 60611, Chicago, IL, USA, rbraun@northwestern.edu.

PMID: 25480641
PMCID: PMC4426208
DOI: 10.1007/978-1-4939-2095-2_8

Abstract

Modern high-throughput assays yield detailed characterizations of the genomic, transcriptomic, and proteomic states of biological samples, enabling us to probe the molecular mechanisms that regulate hematopoiesis or give rise to hematological disorders. At the same time, the high dimensionality of the data and the complex nature of biological interaction networks present significant analytical challenges in identifying causal variations and modeling the underlying systems biology. In addition to identifying significantly dysregulated genes and proteins, integrative analysis approaches that allow the investigation of these single genes within a functional context are required. This chapter presents a survey of current computational approaches for the statistical analysis of high-dimensional data and the development of systems-level models of cellular signaling and regulation. Specifically, we focus on multi-gene analysis methods and the integration of expression data with domain knowledge (such as biological pathways) and other gene-wise information (e.g., sequence or methylation data) to identify novel functional modules in the complex cellular interaction network.

Figures

**Fig. 1.1**
Regulatory mechanisms in molecular biology. DNA is transcribed to messenger RNA and then translated into protein. The rate of transcription is controlled by a feedback loop in which the level of transcription factor proteins regulates the activity of the transcriptional complex, and genes can be permanently silenced by methylation of cytosine in CpG promoter regions of the DNA sequence. More recently, it has been discovered that the expression of small non-coding RNA molecules (e.g., microRNAs) can downregulate entire sets of genes by binding to complementary sequences in the messenger RNA.

**Fig. 1.2**
In Principal Components Analysis, the principal components are defined such that the first principal component (PC1) lies along the direction of greatest variation, and each succeeding component (in two dimensions, only PC2) is defined to lie in the orthogonal direction with the highest remaining variance. Geometrically, the PCA space is a rotation of the original axes.
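
The geometric description in this caption can be illustrated with a minimal sketch for the two-dimensional case. It is not taken from the chapter: the function name `pca_2d` and the closed-form rotation angle are illustrative choices, and the code assumes at least two points that are not all identical.

```python
import math

def pca_2d(points):
    """Return ((pc1, pc2), (ratio1, ratio2)) for a list of 2-D points.

    Illustrative sketch only: assumes len(points) >= 2 and non-zero total variance.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Angle of the direction of greatest variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = (c, s)
    pc2 = (-s, c)  # PC1 rotated by 90 degrees, hence orthogonal to it
    # Variance of the data projected onto each principal axis.
    var1 = sxx * c * c + 2.0 * sxy * c * s + syy * s * s
    var2 = sxx * s * s - 2.0 * sxy * c * s + syy * c * c
    total = var1 + var2
    return (pc1, pc2), (var1 / total, var2 / total)

# Points lying on the line y = x: all variance falls along the diagonal.
components, ratios = pca_2d([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)])
```

Because PC2 is simply PC1 rotated by 90 degrees, the new coordinate system is a rotation of the original axes, as the caption states. In higher dimensions the same idea is usually implemented with an eigendecomposition or singular value decomposition of the centered data rather than a closed-form angle.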

**Fig. 1.3**
Comparison of SOM vs. PCA. While the first PC captures only 76.77% of the variance, the first component of the SOM captures 93.14% and provides a better description of the underlying pattern.

**Fig. 1.4**
GPC-Score identifies differential gene–pathway coexpression for the MSH2 (mismatch repair) gene and the RNA polymerase pathway for a subset of prostate tumor samples; these samples corresponded to worse clinical outcomes. (Image: [62])

**Fig. 1.5**
Nearest shrunken centroids classifier. In (a), the nearest centroid classifier in two dimensions is illustrated. There are two classes of samples k, shown as light circles and dark squares. After scaling each gene (here, g1 and g2) to unit variance within each group k, the unknown sample x is classified based upon the nearest centroid μ (in this case, the dark squares). (b)–(d) illustrate the shrinkage of the centroids for a gene g. Centroids $\mu_{gk}$, shown as a black line, are moved in the direction of the center line to a new position $\mu_{gk}'$. In (b), neither centroid crosses the center line, and the new positions are retained. In (c), the centroid for the light circles crosses the center line and is thresholded to 0. In (d), both centroids cross the center line and are thresholded to 0; because the new centroids are equal, the gene no longer contributes to the classification.
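
The shrinkage step in panels (b) through (d) amounts to soft thresholding, which can be sketched as follows. This is a simplified illustration rather than the chapter's implementation: centroids are expressed as offsets from the center line, the per-gene standardization used by the full method is omitted, and the names `soft_threshold`, `shrink_centroids`, and `informative_genes` are hypothetical.

```python
def soft_threshold(offset, delta):
    """Move an offset toward zero by delta; if it would cross zero, return 0."""
    magnitude = abs(offset) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if offset > 0 else -magnitude

def shrink_centroids(offsets_by_class, delta):
    """Shrink every per-gene centroid offset of every class by delta.

    offsets_by_class maps a class label to a list of per-gene offsets
    from the center line (an assumed, simplified representation).
    """
    return {
        label: [soft_threshold(d, delta) for d in offsets]
        for label, offsets in offsets_by_class.items()
    }

def informative_genes(shrunken):
    """Indices of genes with a non-zero shrunken offset in at least one class."""
    n_genes = len(next(iter(shrunken.values())))
    return [
        g for g in range(n_genes)
        if any(offsets[g] != 0.0 for offsets in shrunken.values())
    ]

# Three genes mirroring panels (b), (c), and (d) of the figure.
offsets = {
    "circles": [1.5, 0.4, 0.2],
    "squares": [-2.0, -1.2, -0.3],
}
shrunken = shrink_centroids(offsets, delta=0.5)
kept = informative_genes(shrunken)
```

In this toy input, the first gene keeps both centroids, the second loses only the centroid for the circles, and the third has both offsets set to zero, so it drops out of the classifier. Raising `delta` removes more genes in the same way.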

**Fig. 1.6**
Application of the nearest shrunken centroids classifier to distinguish cytogenetically normal cases (“NEG”) from those with BCR/ABL fusion based on gene expression profiles of patients with acute lymphoblastic leukemia (ALL). The overall misclassification error is shown on the left, while the misclassification error for the known groups is shown on the right. As the shrinkage parameter Δ increases, fewer genes remain in the model. Initially, the removal of genes improves the accuracy as “noisy” genes are removed. Optimal values of Δ, corresponding to the smallest error observed in the cross-validation, are obtained at Δ = 2.272 (115 genes) and Δ = 2.796 (40 genes). Increasing Δ beyond 3 removes informative genes (only 20 remain at Δ = 3), causing a dramatic increase in the error rate, particularly amongst BCR/ABL cases.

**Fig. 1.7**
Genetic distance metric (Eqs. 1.5–1.6). If Y is closer to G than to F for locus i, $D_{Y,i}$ is positive. If $D_{Y,i}$ is consistently positive across all l loci, $S_Y$ will be so as well, indicating a tendency for Y to have more “G-like” patterns of genetic variation.

**Fig. 1.8**
Schematic of a decision tree. At each step, a variable and a threshold are chosen to optimally partition the samples based on known labels. The decision rules may operate on continuous variables (like color here, with blue and red coming closest to the mauve and periwinkle ideals, respectively), categorical variables (like column, which can take on values 1–5), or booleans (like “bottom,” which is either true or false). The partitioning stops when the nodes are pure. Variables may be used in multiple places in the tree (such as color here), so long as they are not used along the same branch twice.
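
A single partitioning step of the kind this caption describes can be sketched as below. The caption does not say which criterion defines the "optimal" partition, so the use of Gini impurity here is an assumption, as are the function names `gini` and `best_split`. Only numeric features are handled; booleans could be coded as 0 and 1.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split.

    Tries every midpoint between consecutive distinct values of every
    feature and keeps the split with the lowest weighted impurity.
    Returns None when no feature has two distinct values.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best

# Four samples, two features; only the first feature separates the classes.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
```

A full tree would apply `best_split` recursively to the left and right subsets until every node is pure, which matches the stopping rule in the caption. Each split is a threshold on one variable, so every partition boundary is parallel to an axis.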

**Fig. 1.9**
Possible and impossible decision tree partitions. On the left, a possible partition (at levels 1, 2, and 3 in the decision tree) is shown; on the right, a partition that cannot be achieved with the classical decision tree algorithm.

**Fig. 1.10**
Expression levels for three oscillatory yeast cell-cycle genes from two different treatments: +, elutriation-synchronized samples; Δ, CDC-28 synchronized samples. The samples have different amplitudes of expression oscillation, leading to a “bullseye” pattern (note that the means for each gene in the two groups are approximately the same). Cluster assignment for each sample is shown by color for linear k-means clustering (red/black) above the diagonal, and non-linear spectral clustering (blue/green) below the diagonal. Note the difference in accuracy. (Image: [30])

**Fig. 1.11**
Multi-layered, highly accurate unsupervised class discovery using PDM. Left, two “layers” of clusters correspond to the radiation exposure (UV light, ionizing radiation, mock) and the case (high-RS) group (vs. three control groups) in a radiation sensitivity study. The number of clusters in each layer is determined by the PDM itself from the data, yielding three clusters in the first layer (top left panel) and two in the second (center left panel); the resulting classification gives near-perfect discrimination of both phenotype and exposure (bottom left panel). Right, we see the clustering for leukemia data from [64]. The PDM automatically detects three clusters; in the top panel, comparison against the provided labels (AML/ALL) shows that the ALL group has been split by PDM; in the lower panel, it is revealed that this corresponds to a subtype difference (ALL-B, ALL-T), demonstrating PDM’s ability to identify sample subtypes even when they may be unknown or unannotated in the data. (Image: [30])

References

1. van den Akker-van Marle ME, Gurwitz D, Detmar SB, Enzing CM, Hopkins MM, de Mesa EG, Ibarreta D. Cost-effectiveness of pharmacogenomics in clinical practice: a case study of thiopurine methyltransferase genotyping in acute lymphoblastic leukemia in Europe. Pharmacogenomics. 2006;7(5):783–92.
2. Karajannis M, Vincent L, Direnzo R, Shmelkov S, Zhang F, Feldman E, Bohlen P, Zhu Z, Sun H, Kussie P, Rafii S. Activation of fgfr1beta signaling pathway promotes survival, migration and resistance to chemotherapy in acute myeloid leukemia cells. Leukemia. 2006.
3. Savageau MA, Rosen R. Biochemical systems analysis: a study of function and design in molecular biology. Vol. 725. Addison-Wesley; Reading, MA: 1976.
4. Von Bertalanffy L. In: Modern theories of development: An introduction to theoretical biology. Woodger JH, translator. Oxford University Press; 1933. Originally published 1928.
5. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–264.
