Affinity regression predicts the recognition code of nucleic acid-binding proteins

Raphael Pelossof¹, Irtisha Singh^{1,2}, Julie L Yang^{1,2}, Matthew T Weirauch^{3,4,5}, Timothy R Hughes⁵, Christina S Leslie¹

Affiliations

¹ Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, New York, USA.
² Tri-I Program in Computational Biology and Medicine, Weill Cornell Graduate College, New York, New York, USA.
³ Center for Autoimmune Genomics and Etiology (CAGE), Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.
⁴ Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.
⁵ Donnelly Centre, University of Toronto, Toronto, ON, Canada.

PMID: 26571099
PMCID: PMC4871164
DOI: 10.1038/nbt.3343

Nat Biotechnol. 2015 Dec;33(12):1242-1249. doi: 10.1038/nbt.3343. Epub 2015 Nov 16.

Abstract

Predicting the affinity profiles of nucleic acid-binding proteins directly from the protein sequence is a challenging problem. We present a statistical approach for learning the recognition code of a family of transcription factors or RNA-binding proteins (RBPs) from high-throughput binding data. Our method, called affinity regression, trains on protein binding microarray (PBM) or RNAcompete data to learn an interaction model between proteins and nucleic acids using only protein domain and probe sequences as inputs. When trained on mouse homeodomain PBM profiles, our model correctly identifies residues that confer DNA-binding specificity and accurately predicts binding motifs for an independent set of divergent homeodomains. Similarly, when trained on RNAcompete profiles for diverse RBPs, our model correctly predicts the binding affinities of held-out proteins and identifies key RNA-binding residues, despite the high level of sequence divergence across RBPs. We expect that the method will be broadly applicable to modeling and predicting paired macromolecular interactions in settings where high-throughput affinity data are available.

Figures

**Figure 1. Affinity regression learns highly accurate models of transcription factor-DNA binding interactions from protein binding microarray experiments**
a) Affinity regression decomposes the binding intensity for each TF and DNA probe as a weighted interaction between the k-mer features of the probe and the K-mer features of the TF amino acid sequence. Training the interaction model involves solving a regularized bilinear regression to minimize errors in reconstructing the probe intensity data across all TFs and probes. The model is represented by the interaction matrix W, whereas P and D represent the K-mer features of protein sequences and the k-mer features of DNA probes, respectively. b) Lowering the number of equations by left multiplication with *Y^T* makes the problem computationally feasible on a standard computer, and the matrix *Y^TD* is amenable to low rank approximation. c) Full-dimensional probe intensity profile prediction is achieved by mapping the lower dimensional solution back into the span of the training probe intensity profiles. d) Predicted probe intensities (y-axis) are plotted against experimental probe intensities (x-axis) for the homeodomain Cart1, using a model trained on 90% of the mouse homeodomain PBM data set with Cart1 among the held-out proteins. Probes containing the three most enriched 8-mers are correctly predicted to have high intensities. e) Replicate experimental probe intensities (black) and predicted probe intensities (blue) are both plotted against Cart1 experimental probe intensities, showing that the prediction method has a similar level of variation as replicate noise. f) Probe correlation performance on held-out homeodomains for affinity regression (y-axis) versus BLOSUM nearest neighbor (x-axis). Each point is the Spearman correlation between the predicted and actual probe intensities, reporting results on held-out TFs using 10-fold cross-validation. 
g) The bar plots show prediction performance measured by Spearman correlation of probe intensities (left) and AUPR (area under precision-recall curve) for detection of the top 1% of probes (right) for affinity regression, BLOSUM nearest neighbor, nearest neighbor, and an ‘oracle’ method that chooses the training example with optimal performance for the evaluation metric (best possible neighbor). ‘BLOSUM nearest neighbor’ uses local alignment scores with the BLOSUM50 substitution matrix to compute the nearest neighbor; ‘nearest neighbor’ uses Euclidean distance in the k-mer vector space to identify the nearest neighbor. Error bars represent the standard error of the mean across 10 folds. Affinity regression is significantly better than both nearest neighbor methods, and there is no significant difference between affinity regression and the ‘oracle’ method.
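
The training and prediction steps described in panels a to c can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the model to be Y ≈ D W Pᵀ (probes × TFs), left-multiplies by Yᵀ as in panel b, and solves the resulting ridge-regularized bilinear problem in closed form via two SVDs. The function names, the ridge penalty `lam`, the least-squares back-mapping, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def fit_affinity_regression(Y, D, P, lam=1e-6):
    """Ridge solution of  Y^T D W P^T ~ Y^T Y  for the interaction matrix W.

    Y: (n_probes, n_tfs) probe intensities
    D: (n_probes, n_dna_kmers) probe k-mer features
    P: (n_tfs, n_protein_kmers) protein K-mer features
    """
    A = Y.T @ D                      # reduced system: n_tfs equations per TF
    C = Y.T @ Y
    Ua, sa, Vat = np.linalg.svd(A, full_matrices=False)
    Ub, sb, Vbt = np.linalg.svd(P, full_matrices=False)
    C_rot = Ua.T @ C @ Ub            # target expressed in the singular bases
    S = np.outer(sa, sb)             # products of singular values
    W_rot = S * C_rot / (S**2 + lam) # elementwise ridge shrinkage
    return Vat.T @ W_rot @ Vbt       # rotate back: (n_dna_kmers, n_protein_kmers)

def predict_profile(Y, D, W, p_new):
    """Predict a full probe profile for a protein with K-mer features p_new."""
    z = Y.T @ D @ W @ p_new          # predicted similarity to training profiles
    # Assumption: map back by taking the minimum-norm y with Y^T y = z,
    # which lies in the span of the training profiles.
    y_hat, *_ = np.linalg.lstsq(Y.T, z, rcond=1e-8)
    return y_hat

# Toy, noise-free data (purely synthetic, not from the paper).
rng = np.random.default_rng(0)
D = rng.normal(size=(200, 12))
P_all = rng.normal(size=(31, 8))
W_true = rng.normal(size=(12, 8))
Y_all = D @ W_true @ P_all.T
Y_train, P_train = Y_all[:, :30], P_all[:30]

W_hat = fit_affinity_regression(Y_train, D, P_train)
y_hat = predict_profile(Y_train, D, W_hat, P_all[30])
r = np.corrcoef(y_hat, Y_all[:, 30])[0, 1]   # agreement on the held-out protein
```

The caption also notes that *Y^TD* is amenable to low-rank approximation; in this sketch that would correspond to keeping only the leading singular values in `sa`, which is omitted here for brevity.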

**Figure 2. Affinity regression identifies key residues that contribute to homeodomain-DNA binding specificity**
a) Mapping the experimental or predicted PBM intensity profile through the model produces a weighting over amino acid K-mers, which is used to compute a positional importance profile over the residues of the TF sequence. b) Sequence conservation of the homeodomain family (top track) and the predicted binding importance profiles across members of the homeodomain family (bottom map) are shown. Binding importance profiles are computed from K-mer weights via *y^TDW* and mapped to each TF sequence. The brightest band of columns corresponds to the core DNA-contacting residues. Binding-specificity features particular to groups of homeodomains are also correctly identified, such as the PYP sequence corresponding to the TALE domain. For Hoxa9 and Pknox1, 4-mers with positional importance score satisfying a 5% FDR threshold are shown with red boxes (see **Supplementary Fig. 4** for all mouse homeodomains). c) Actual mapped amino acid positional importance scores are shown for human PKNOX1 (TALE homeodomain) and mouse Hoxa9. A local peak can be seen for PKNOX1 at the TALE domain (PYP) that does not appear for Hoxa9. Statistically significant positional 4-mers are shown in boldface on the sequences at the bottom of the panel. d,e) Statistically significant 4-mers from the positional importance maps for Hoxa9 and Pknox1 are highlighted on known structures from PDB. For Hoxa9, the PDB co-crystal structure is shown; for PKNOX1, the homeodomain structure is aligned to the previous co-crystal structure. The protein is shown in yellow, and the predicted residues that contact DNA are in red. In Hoxa9, identified components of two salt bridges that stabilize the binding conformation are in cyan; in PKNOX1, a significant region potentially contributing to the hydrophobic core is shown in green; predicted residues without a known role in binding specificity are indicated in orange. See methods and materials for highlighted residues.
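
As a rough illustration of panel a, the sketch below turns a set of K-mer weights into a per-residue importance profile. It assumes the weights have already been computed as the vector *y^TDW* and stored in a dictionary keyed by K-mer; the aggregation rule (each residue receives the summed weight of every K-mer window covering it) and the function name are assumptions made for the example, since the caption does not spell out the exact mapping.

```python
import numpy as np

def positional_importance(seq, kmer_weights, k=4):
    """Spread K-mer weights over the residues each K-mer covers.

    seq          : amino acid sequence of the binding domain
    kmer_weights : dict mapping K-mer string -> weight (e.g. entries of y^T D W)
    k            : K-mer length (4 here, matching the 4-mers in the caption)
    """
    profile = np.zeros(len(seq))
    for i in range(len(seq) - k + 1):
        weight = kmer_weights.get(seq[i:i + k], 0.0)  # unseen K-mers count as 0
        profile[i:i + k] += weight                    # credit all covered residues
    return profile

# Toy example with made-up weights (not real homeodomain data).
toy_profile = positional_importance("ABCDE", {"ABCD": 1.0, "BCDE": 2.0}, k=4)
```

Residues covered by several high-weight K-mers accumulate the largest scores, which is one simple way a band of important positions could emerge across aligned family members.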

**Figure 3. DNA binding profiles predicted by affinity regression generate accurate binding motifs for diverse homeodomains**
a) In 10-fold cross-validation, for each test TF we predicted probe intensities, generated PSSMs using Seed-and-Wobble, and compared these predicted motifs to PSSMs estimated directly from the experimental data. We used the log₂ Kullback-Leibler divergence (D_KL) to compare motifs; these scores are shifted by adding the min D_KL to all values, so that the adjusted scores are all positive and small values correspond to good detection of the target motif. The gray regions correspond to motif detection that is as good as or better than the (adjusted) median log(D_KL) between motifs from replicate experiments. For most TFs, affinity regression and nearest neighbor produce PSSMs in a similar score range, with no statistically significant difference between their performance (p > 0.05, one-sided KS tests). b) Examples of predicted PSSMs are presented with corresponding target PSSMs (derived from experimental PBM data). c) Example of predicted Z-scores from the Z-score affinity regression model, trained on 75 non-redundant mouse homeodomains, versus experimental Z-scores for SNAPOd2T00005194001, one of the diverse homeodomains assayed by Weirauch et al. Binding motifs generated by PWM-Align-Z based on the top 100 8-mers predicted by affinity regression and the top 100 8-mers based on actual Z-scores are shown. d) Performance comparison of the Z-score affinity regression model versus the ‘oracle’ nearest neighbor, BLOSUM nearest neighbor, and nearest neighbor in 4-mer space. Error bars represent the standard error of the mean across 10 folds. e) Motif accuracy of affinity regression predicted motifs, generated by running PWM-Align-Z on the top 100 predicted 8-mers, versus phylogenetic distance from the nearest training set homeodomain for all 218 Weirauch et al. homeodomains, based on the phylogenetic tree shown in **Supplementary Fig. 8**. Motif accuracy is reported as log(*D_KL*) – min log(*D_KL*) relative to ground truth motifs generated by PWM-Align-Z; motif scores < 5 are shown in the green region and indicate accurate motifs, while those above this threshold are in the red region. f) Examples of predicted and ground truth motifs based on PWM-Align-Z motif extraction.
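
The motif score used in this figure can be illustrated with a short sketch. It assumes the two PSSMs are already aligned, have the same length, and are stored as arrays with one row per position and one column per nucleotide; it sums the per-position Kullback-Leibler divergence in bits and then applies the log(*D_KL*) – min log(*D_KL*) shift. The pseudocount, the use of base 2 throughout, and the function names are assumptions for the example rather than details confirmed by the caption.

```python
import numpy as np

def pssm_kl_divergence(p, q, pseudocount=1e-9):
    """Summed per-position KL divergence D_KL(p || q) in bits.

    p, q: arrays of shape (motif_length, 4), rows are nucleotide distributions.
    """
    p = np.asarray(p, dtype=float) + pseudocount   # avoid log(0)
    q = np.asarray(q, dtype=float) + pseudocount
    p /= p.sum(axis=1, keepdims=True)              # renormalize each position
    q /= q.sum(axis=1, keepdims=True)
    return float(np.sum(p * np.log2(p / q)))

def adjusted_log_scores(divergences):
    """Shift log2 divergences so the best motif scores 0 and all scores are >= 0."""
    logs = np.log2(np.asarray(divergences, dtype=float))
    return logs - logs.min()

# Toy check: a two-nucleotide column versus a uniform column differs by 1 bit.
kl_example = pssm_kl_divergence([[0.5, 0.5, 0.0, 0.0]], [[0.25, 0.25, 0.25, 0.25]])
shifted = adjusted_log_scores([0.5, 1.0, 4.0])
```

With this shift, small adjusted scores correspond to predicted motifs that are close to the target motif, matching the reading of the green region in panel e.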

**Figure 4. Affinity regression learns a predictive model of RBP-RNA interactions from RNAcompete experiments**
a) Test probe correlation comparison between BLOSUM nearest neighbor and affinity regression for 130 RBPs, using 10-fold cross-validation and showing performance for held-out proteins. Each point is the Spearman correlation between the predicted and actual RNAcompete probe intensities. b) The bar plots show performance on held-out RBPs using 10-fold cross-validation for affinity regression, nearest neighbor methods, and an oracle that returns the optimal training example as neighbor. Error bars represent the standard error of the mean across 10 folds. Affinity regression performs significantly better than both BLOSUM nearest neighbor and nearest neighbor, and there is no significant difference in comparison to the ‘oracle’ neighbor for probe intensity Spearman correlation and top 1% probe prediction AUROC. c) Predicted binding importance profiles across a subset of RRM proteins (see Supplementary Note for KH domains), computed by mapping K-mer weights *y^TDW* onto each RRM. RBPs that have multiple RRM binding domains are represented as multiple rows. The learned model finds several amino acid K-mers that are correlated with binding. For specific RBPs, amino acid 4-mers with positional importance score satisfying a 5% FDR threshold are shown with red boxes (see **Supplementary Fig. 12** for all RBPs). d) The co-crystal structure shows human splicing factor RBFOX1, one of the RRM RBPs in the heatmap, in complex with the RNA sequence UGCAUGU; identified in red are significant positional K-mers corresponding to the sequence GFGFVT, containing two phenylalanines critical for RNA-binding within a beta sheet contacting the RNA, as well as the RNA-proximal K-mer (EIIF). e) Predicted PSSMs for protein subfamilies with the RRM and KH domains. The inner PSSM wheel shows the PWM-Align-Z PSSM for the actual RNAcompete experiment, while the outer wheel shows the affinity regression predicted motif on unseen TFs in a 10-fold cross-validation setting.
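
The two kinds of evaluation metric reported for held-out proteins, rank correlation of probe intensities and detection of the top 1% of probes, can be sketched as follows. This is an illustrative NumPy-only version under assumptions: the Spearman correlation is computed from simple ranks without tie correction, and the top-probe metric is average precision, used here as a common estimate of the area under the precision-recall curve. Function names and the toy data are hypothetical.

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (ties not averaged)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def top_fraction_aupr(y_true, y_pred, fraction=0.01):
    """Average precision for recovering the top `fraction` of probes by true intensity."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    n_top = max(1, int(round(fraction * n)))
    is_top = np.zeros(n, dtype=bool)
    is_top[np.argsort(y_true)[-n_top:]] = True     # label the brightest probes
    hits = is_top[np.argsort(-y_pred)]             # walk down the predicted ranking
    precision = np.cumsum(hits) / np.arange(1, n + 1)
    return float(precision[hits].mean())           # mean precision at each true hit

# Toy example: a monotone transform keeps ranks, so both metrics are perfect.
truth = np.arange(200.0)
rho_perfect = spearman(truth, truth ** 3)
aupr_perfect = top_fraction_aupr(truth, truth ** 3)
aupr_reversed = top_fraction_aupr(truth, -truth)
```

In a cross-validation setting, these functions would be applied to each held-out protein's predicted and measured profiles, and the per-fold means compared across methods.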

