PLoS One. 2013 Jun 24;8(6):e66341.
doi: 10.1371/journal.pone.0066341. Print 2013.

Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data

Thomas A Lasko et al. PLoS One. 2013.

Erratum in

  • PLoS One. 2013;8(8). doi: 10.1371/annotation/0c88e0d5-dade-4376-8ee1-49ed4ff238e2

Abstract

Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don't think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data - Electronic Medical Records - typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. 
We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
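
The step the abstract describes, turning sparse and irregularly timed measurements into a longitudinal probability density with Gaussian process regression, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the squared-exponential kernel, the hyperparameter values (`signal_var`, `length_scale`, `noise_var`), the constant prior mean, and the uric acid values in the usage note are all assumptions made for the example, since the abstract does not specify them.

```python
import math

def sq_exp_kernel(t1, t2, signal_var, length_scale):
    """Squared-exponential covariance between two time points (assumed kernel)."""
    return signal_var * math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def solve_lower(low, b):
    """Solve low @ x = b by forward substitution."""
    x = [0.0] * len(b)
    for i in range(len(b)):
        x[i] = (b[i] - sum(low[i][k] * x[k] for k in range(i))) / low[i][i]
    return x

def solve_lower_transposed(low, b):
    """Solve low.T @ x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_posterior(times, values, query_times,
                 signal_var=1.0, length_scale=30.0, noise_var=0.25):
    """Return (posterior mean, posterior std) of the latent curve at each query time.

    The prior mean is taken to be the sample mean of the observations,
    which is a simplifying assumption for this sketch.
    """
    n = len(times)
    prior_mean = sum(values) / n
    # Covariance of the noisy observations: kernel matrix plus noise on the diagonal.
    cov = [[sq_exp_kernel(times[i], times[j], signal_var, length_scale)
            + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    residuals = [v - prior_mean for v in values]
    alpha = solve_lower_transposed(low, solve_lower(low, residuals))
    posterior = []
    for tq in query_times:
        k_star = [sq_exp_kernel(tq, t, signal_var, length_scale) for t in times]
        mean = prior_mean + sum(k * a for k, a in zip(k_star, alpha))
        v = solve_lower(low, k_star)
        var = signal_var - sum(x * x for x in v)
        posterior.append((mean, math.sqrt(max(var, 0.0))))
    return posterior

if __name__ == "__main__":
    # Hypothetical uric acid readings (mg/dL) at irregular times (days).
    days = [0.0, 3.0, 4.0, 40.0, 95.0, 100.0]
    uric_acid = [8.1, 7.6, 7.9, 6.2, 9.0, 8.7]
    for day, (mean, std) in zip([3.0, 70.0], gp_posterior(days, uric_acid, [3.0, 70.0])):
        print(f"day {day:5.1f}: mean {mean:.2f}, std {std:.2f}")
```

Evaluating `gp_posterior` on a dense, regular grid of query times gives a mean and a standard deviation at every time point, that is, a Gaussian density per time step. The standard deviation is small near clusters of observations and grows in the gaps between them, so the uncertainty left by sparse sampling is carried forward rather than hidden.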

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Gaussian process regression transforms noisy, irregular, and sparse observations to a longitudinal probability distribution.
A cross section at any point in time in these plots is a proper Gaussian probability density centered at the posterior mean, with spread given by the posterior standard deviation. The top panel is a selected leukemia sequence, the bottom panel a selected gout sequence. Black dots: observed values. Dark blue line: posterior mean. Light blue lines: posterior standard deviation.
Figure 2. First-layer learned features are simple functional element detectors, in various combinations and phases.
For example, uphill- and downhill-ramp detectors (blue), single- and multiple-spot detectors/Fourier components (red), short- and long-edge detectors (green), and mixed-element detectors (grey). These features are visualized directly as the normalized rows [formula not recovered].
Figure 3. Second-layer learned features are complex nonlinear combinations of first-layer features.
Because second-layer features cannot be visualized directly, each feature in this set is represented as the confluence of the 100 input patches that most strongly activate it (those with the highest activation values for that feature).
Figure 4. Learned features form a distributed representation and interact via constructive and destructive interference.
The interference, as well as the autoencoder’s use of ramp detectors in blocks, are manifest in the confluence of features of this waterfall display. Thick blue lines: selected reconstructions of 30-day patches from the top panel in Figure 1. Stacked thin black lines: all 100 first-layer features, scaled and sorted by the magnitude of their contribution to the reconstruction.
Figure 5. Data distribution in the learned feature spaces suggests disease subpopulations.
A: First-layer features. B: Second-layer features. C: Expert engineered features. These two-dimensional embeddings using t-SNE suggest several subpopulations of gout (red) and leukemia (blue) in both learned feature spaces. We suspect that these subpopulations largely represent differences in treatment approach, but they may also be illuminating pathophysiologic differences. The engineered feature space separates the two known phenotypes adequately for a discrimination task, but offers only weak suggestions of subpopulations: without the colors corresponding to known phenotypes, it would be difficult to identify more than a single large cluster in this space. The t-SNE algorithm preserves near neighbor distances at the expense of far neighbor distances, so we cannot draw conclusions from the macro-scale shape or relative orientation of the clusters, only their number and substructure.
