Lab Chip. 2021 Aug 7;21(15):2922-2931.
doi: 10.1039/d0lc01148g. Epub 2021 Jun 10.

Machine learning-aided protein identification from multidimensional signatures

Yuewen Zhang et al. Lab Chip.

Abstract

The ability to determine the identity of specific proteins is a critical challenge in many areas of cellular and molecular biology, and in medical diagnostics. Here, we present a machine learning-aided microfluidic protein characterisation strategy that, within a few minutes, generates a three-dimensional fingerprint of a protein sample indicative of its amino acid composition and size and thereby creates a unique signature for the protein. By acquiring such multidimensional fingerprints for a set of ten proteins and using machine learning approaches to classify the fingerprints, we demonstrate that this strategy allows proteins to be classified at a high accuracy, even though classification using a single dimension is not possible. Moreover, we show that the acquired fingerprints correlate with the amino acid content of the samples, which makes it possible to identify proteins directly from their sequence without requiring any prior knowledge about the fingerprints. These findings suggest that such a multidimensional profiling strategy can lead to the development of a novel method for protein identification in a microfluidic format.

Conflict of interest statement

Parts of this work have been the subject of a patent application filed by Cambridge Enterprise Limited, a fully owned subsidiary of the University of Cambridge.

Figures

Fig. 1. Protein identification from multidimensional signatures on a microfluidic platform. (a) The proteins and (b) the microfluidic device used in this study. The device allows obtaining multi-dimensional fingerprints of protein samples that include information about their tryptophan, tyrosine and lysine content as well as the hydrodynamic radius, Rh.
Fig. 2. The platform used for obtaining multidimensional signatures for proteins. (a) The microfluidic device used in this study allowed extracting a multidimensional characteristic signature of each analysed sample describing its tryptophan (Trp) and tyrosine (Tyr) residues (yellow highlighted region), its hydrodynamic radius Rh obtained by monitoring the diffusion of the sample molecules into a co-flowing buffer (blue highlighted region) and its lysine (Lys) content (pink highlighted region). The scale bars on all insets are 200 μm. (b) Schematic representation of the home-built inverted fluorescence microscope used. The two light sources (280 nm and 365 nm) and emission filters can be switched readily to record the characteristic fluorescent signals.
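The caption states that Rh is obtained from the diffusion of the sample into a co-flowing buffer, but it does not give the conversion. A minimal sketch, assuming the standard Stokes-Einstein relation Rh = kB·T / (6·π·η·D) and assuming water-like buffer at 25 °C (both the default temperature and viscosity below are illustrative assumptions, not values from the paper):

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant


def hydrodynamic_radius(diffusion_m2_per_s, temperature_k=298.15, viscosity_pa_s=8.9e-4):
    """Stokes-Einstein estimate of the hydrodynamic radius in metres.

    The default temperature and viscosity are assumptions (water at about 25 C);
    the source does not state the buffer conditions.
    """
    if diffusion_m2_per_s <= 0:
        raise ValueError("diffusion coefficient must be positive")
    return BOLTZMANN_J_PER_K * temperature_k / (
        6 * math.pi * viscosity_pa_s * diffusion_m2_per_s
    )


# Hypothetical diffusion coefficient, chosen only to illustrate the scale.
radius_m = hydrodynamic_radius(1e-10)
print(f"Rh is roughly {radius_m * 1e9:.2f} nm")
```

Because Rh scales as 1/D, a molecule that diffuses twice as fast is assigned half the radius, which is why this dimension carries size information.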
Fig. 3. Protein classification from their multidimensional fingerprints. A set of ten proteins was profiled by acquiring their three-dimensional fingerprints described by (a) the ratio of the signals measured at the wavelengths where tyrosine and OPA fluoresce (Materials and methods; dimension 1), (b) the ratio of the signals measured at the wavelengths where tryptophan and OPA fluoresce (dimension 2) and (c) the hydrodynamic radius, Rh (dimension 3). All these parameters are concentration independent. (d) Multidimensional signatures of the proteins in a 3D space. The radii of the ellipsoids correspond to one standard deviation. (e) The likelihoods of protein identification and misidentification in the 3D space shown in panel (d), assuming a multivariate Gaussian model. (f) The confidence levels of the identification process using a random forest classifier approach that assumes no underlying data distribution. The models correctly identified 82% (multivariate Gaussian) and 90% (random forest classifier) of the tested samples.
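The Gaussian classification in panel (e) can be illustrated with a short sketch. It is not the authors' implementation: it assumes independent dimensions (a diagonal covariance, whereas the paper's multivariate model may use a full one) and equal priors for all proteins. The labels and fingerprint values are made-up placeholders, not measured data.

```python
import math
from statistics import mean, stdev


def fit_gaussians(training):
    """Fit a per-dimension (mean, std) pair for every protein label.

    training maps label -> list of (dimension 1, dimension 2, Rh) tuples.
    """
    return {
        label: [(mean(dim), stdev(dim)) for dim in zip(*rows)]
        for label, rows in training.items()
    }


def log_likelihood(params, fingerprint):
    """Log density of a fingerprint under an axis-aligned Gaussian."""
    total = 0.0
    for (mu, sigma), value in zip(params, fingerprint):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (value - mu) ** 2 / (2 * sigma ** 2)
    return total


def posteriors(model, fingerprint):
    """Normalised likelihoods over all labels, assuming equal priors."""
    logs = {label: log_likelihood(p, fingerprint) for label, p in model.items()}
    peak = max(logs.values())  # subtract the maximum for numerical stability
    weights = {label: math.exp(v - peak) for label, v in logs.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


def classify(model, fingerprint):
    """Return the label with the highest posterior probability."""
    post = posteriors(model, fingerprint)
    return max(post, key=post.get)


# Placeholder fingerprints (hypothetical values for two hypothetical proteins).
training = {
    "protein_A": [(0.10, 0.50, 2.0), (0.12, 0.48, 2.1), (0.11, 0.52, 1.9), (0.09, 0.51, 2.0)],
    "protein_B": [(0.30, 0.20, 3.5), (0.28, 0.22, 3.6), (0.31, 0.19, 3.4), (0.29, 0.21, 3.5)],
}
model = fit_gaussians(training)
print(classify(model, (0.11, 0.50, 2.0)))
```

The random forest approach in panel (f) drops the Gaussian assumption altogether, so it is not covered by this sketch.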
Fig. 4. Protein identification from their sequence. The correlations between the measured signals and the sequences of the analysed proteins, specifically (a) the ratio of the measured tryptophan and OPA signals as a function of the tryptophan and lysine composition of the proteins, (b) the ratio of the measured tyrosine and OPA fluorescence signals as a function of their tyrosine and lysine composition and (c) the measured hydrodynamic radius, Rh, as a function of the molecular weight. In all cases, the dotted line shows the best fit linear regression function with the intercept set to 0. We note that the fits shown here included all the proteins that were part of the study. The identification of an unseen protein was performed by excluding all the proteins of that particular sample, so that a slightly different fit was obtained each time. (d) The measured signals for each of the ten samples (A–J) were converted to estimates of their sequence-composition using the relationships outlined in panels (a)–(c) and the latter estimates were used to evaluate the probabilities of each of the ten samples being any one of the ten proteins in our dataset by using Gaussian mixture models. The data are shown such that the correct sample appears on the diagonal of the matrix. Individual samples were identified correctly on 21 out of 40 occasions. When averaging the results over n = 4 repeats, 7 out of 10 proteins were identified correctly.
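The zero-intercept fits and the hold-out procedure described in this caption can be sketched as follows. For a line through the origin, the least-squares slope is sum(x·y) / sum(x²). The function names and the toy points are hypothetical; refitting with every measurement of the held-out protein removed is one reading of the caption, not a confirmed detail of the authors' code.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope of y = slope * x with the intercept fixed at 0."""
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("need at least one non-zero x value")
    return sum(x * y for x, y in zip(xs, ys)) / sum_xx


def leave_one_protein_out_slopes(points):
    """Refit the line once per protein, excluding that protein's own points.

    points maps label -> list of (sequence-derived x, measured y) pairs.
    Returns label -> slope fitted on all the other proteins.
    """
    slopes = {}
    for held_out in points:
        xs, ys = [], []
        for label, pairs in points.items():
            if label == held_out:
                continue
            for x, y in pairs:
                xs.append(x)
                ys.append(y)
        slopes[held_out] = fit_through_origin(xs, ys)
    return slopes


# Toy points (hypothetical): A and B lie on y = 2x, C does not.
toy_points = {"A": [(1, 2)], "B": [(2, 4)], "C": [(3, 9)]}
slopes = leave_one_protein_out_slopes(toy_points)
print(slopes["C"])
```

Holding out a protein keeps its own measurements from shaping the calibration used to identify it, which is why each identification uses a slightly different fit.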
Fig. 5. Comparison of the performance of the protein classification (Fig. 3) and identification (Fig. 4) strategies. When identifying a measured sample directly from its sequence, samples were identified correctly on 53% of the occasions or on 70% of the occasions when the results were averaged across the four repeats performed on each sample. When pre-determined fingerprints were used, proteins were classified correctly on 83% of the occasions or on 100% of the occasions when the results were averaged across the repeats. The red dotted line corresponds to the case where the classification or identification was performed by a process of random guessing.
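Two pieces of arithmetic sit behind this comparison. With ten candidate proteins, uniform random guessing succeeds 1 time in 10, and 21 correct calls out of 40 is 52.5%, in line with the roughly 53% quoted. The sketch below also shows one plausible way that averaging across repeats can raise accuracy: the per-repeat probabilities are averaged before picking the best label. That averaging scheme is an assumption, since the caption does not specify it, and the probabilities are invented for illustration.

```python
def chance_level(n_classes):
    """Expected accuracy of uniform random guessing among n_classes labels."""
    return 1.0 / n_classes


def average_then_identify(repeat_probabilities):
    """Average label probabilities over repeats, then pick the top label.

    repeat_probabilities is a list of dicts (label -> probability), one per repeat.
    Returns (best label, averaged probabilities).
    """
    n_repeats = len(repeat_probabilities)
    labels = repeat_probabilities[0].keys()
    averaged = {
        label: sum(rep[label] for rep in repeat_probabilities) / n_repeats
        for label in labels
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical repeats for a sample whose true identity is "A":
# two of the four individual repeats favour "B", yet the average favours "A".
repeats = [
    {"A": 0.4, "B": 0.6},
    {"A": 0.9, "B": 0.1},
    {"A": 0.6, "B": 0.4},
    {"A": 0.3, "B": 0.7},
]
best_label, averaged = average_then_identify(repeats)
print(best_label, chance_level(10))
```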
