This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Mar 28:2023.01.25.525572.

doi: 10.1101/2023.01.25.525572.

An intrinsically interpretable neural network architecture for sequence to function learning

Ali Tuğrul Balcı¹,², Mark Maher Ebeid¹,², Panayiotis V Benos³, Dennis Kostka¹,², Maria Chikina¹,²

Affiliations

¹ Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Institution, Pittsburgh, 15213, United States.
² Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States.
³ Department of Epidemiology, University of Florida, Gainesville, 32610, United States.

PMID: 36747873
PMCID: PMC9900791
DOI: 10.1101/2023.01.25.525572

Ali Tuğrul Balcı et al. bioRxiv. 2023.

Update in

An intrinsically interpretable neural network architecture for sequence-to-function learning.
Balcı AT, Ebeid MM, Benos PV, Kostka D, Chikina M. Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i413-i422. doi: 10.1093/bioinformatics/btad271. PMID: 37387140.

Abstract

Motivation: Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one often cannot explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called tiSFM (totally interpretable sequence to function model). tiSFM improves upon the performance of standard multi-layer convolutional models while using fewer parameters. Additionally, although tiSFM is itself technically a multi-layer neural network, its internal model parameters are intrinsically interpretable in terms of relevant sequence motifs.

Results: We analyze published open chromatin measurements across hematopoietic lineage cell-types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on the complex task of predicting the change in epigenetic state as a function of developmental transition.

Availability and implementation: The source code, including scripts for the analysis of key findings, can be found at https://github.com/boooooogey/ATAConv, implemented in Python.

Figures

**Fig. 1.**
tiSFM improves on prediction accuracy when compared to the current state-of-the-art. (A) A graphical display of the architecture of tiSFM. (B) The improvement of tiSFM, measured via the change in R² on the full model. Additionally, different iterations of tiSFM were tested with certain model components included or excluded in order to weigh their contribution to the overall performance. (C) Similar to (B), but without kernel fine-tuning.

**Fig. 2.**
tiSFM consistently finds motifs that are relevant to immune cell differentiation. (A) The final layer of a single fully trained model, after the rows were normalized to [−1, 1] while preserving the sparsity. The 5 motifs with the highest absolute weight for each cell-type are shown. (B) Inclusion ratio was defined as the number of times, across the 10 folds of our 10-fold cross-validation procedure, that a motif appeared among the 10 motifs with the highest weights in the final layer for each cell-type. The motifs included in more than 50% of the models are shown here.
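The two quantities in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names `normalize_rows` and `inclusion_ratio` are hypothetical, the normalization is assumed to divide each row by its largest absolute value (which keeps zeros at zero), and motifs are assumed to be ranked by absolute weight. The ratio is returned as a fraction of folds so that it can be compared against the 50% cutoff.

```python
import numpy as np

def normalize_rows(weights):
    """Scale each row by its largest absolute value so entries lie in [-1, 1].

    Assumption: this is one way to normalize rows while keeping zero
    entries at zero; the caption does not spell out the exact procedure.
    """
    weights = np.asarray(weights, dtype=float)
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    max_abs[max_abs == 0] = 1.0  # leave all-zero rows unchanged
    return weights / max_abs

def inclusion_ratio(fold_weights, top_k=10):
    """Fraction of folds in which each motif ranks among the top_k weights.

    fold_weights: array of shape (n_folds, n_motifs) holding the final-layer
    weights of one cell type, one row per cross-validation fold.
    Assumption: ranking uses absolute weight.
    """
    fold_weights = np.asarray(fold_weights, dtype=float)
    n_folds, n_motifs = fold_weights.shape
    counts = np.zeros(n_motifs)
    for w in fold_weights:
        top = np.argsort(-np.abs(w))[:top_k]  # indices of the top_k motifs
        counts[top] += 1
    return counts / n_folds
```

Motifs passing the cutoff could then be picked with `np.flatnonzero(inclusion_ratio(fold_weights) > 0.5)`.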

**Fig. 3.**
(A) A scatter plot of pooling parameters vs. the information content (IC) calculated from motif PWMs. The color annotation of the points corresponds to the final layer coefficients for B-cell OCR activity prediction. The 5 motifs with the highest absolute weight for B cells are annotated (Ebf1, Pax5, Mef2a, Pou2f1, Bbx) along with some outliers to the pooling/IC trend (Pax4, Bcl6, Rela, Ctcf). (B) A scatter plot contrasting the coefficients in the final linear layer with the total contribution to the interaction layer. The linear coefficients are pooled from the absolute values of the cell-type prediction coefficients corresponding to the same TF and then normalized to the range [−1, 1], preserving the sparsity. The two are notably different: some TFs contribute far more to the interaction matrix but have near-zero linear coefficients, and only one TF influences the other TFs while also contributing significantly to the cell-type prediction.

**Fig. 4.**
Sparsity constraints improve the performance for most cell-types and increase the interpretability of tiSFM. (A) MSE vs. the MCP hyperparameter (the second hyperparameter is fixed at 3). The lines for every cell-type and the overall performance are each assigned a color. The vertical line indicates the hyperparameter that resulted in the best performance for the corresponding cell-type. (B) Path plot of coefficients vs. the MCP hyperparameter. The best model in terms of MSE is marked by a dashed vertical line. The 9 motifs with the highest absolute coefficients in the best model are colored, and the black lines are the motifs with a coefficient of 0 in the best model. (C) Median motif redundancy as a function of regularization. We observe that, as expected, increasing regularization decreases motif redundancy (see Methods for motif similarity calculation). Path plots for other cell-types are in Figure S3.
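For readers unfamiliar with the penalty, here is a minimal sketch of its standard form. It assumes that "MCP" refers to the minimax concave penalty and that the hyperparameter fixed at 3 is its concavity parameter, called `gamma` here; the caption states neither explicitly. The function name `mcp_penalty` is hypothetical and not taken from the authors' repository.

```python
import numpy as np

def mcp_penalty(theta, lam, gamma=3.0):
    """Minimax concave penalty, evaluated elementwise.

    For |theta| <= gamma * lam the penalty is lam*|theta| - theta^2 / (2*gamma);
    beyond that it is constant at gamma * lam^2 / 2, so large coefficients
    are not shrunk further (unlike an L1 penalty).
    """
    abs_theta = np.abs(np.asarray(theta, dtype=float))
    inner = lam * abs_theta - abs_theta**2 / (2.0 * gamma)
    outer = 0.5 * gamma * lam**2
    return np.where(abs_theta <= gamma * lam, inner, outer)
```

Under this form, increasing `lam` pushes more coefficients to exactly zero, which is consistent with the path plot in (B) and the falling redundancy in (C).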

**Fig. 5.**
tiSFM predictions and TF contributions applied to the problem of predicting differentiation transitions. (A) The model predicts the output corresponding to each edge along the differentiation tree, computed as the difference between child and parent. Edges are scaled according to R², and those with a value > 0.05 are selected for the TF contribution analysis depicted in the color-matched heatmap (B). Highly predictable transitions in the lymphocyte lineage are highlighted on the lineage tree with cell-type identifiers. The heatmap includes motifs that are among the top 5 with the highest absolute coefficients for at least one target.
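The edge targets and the selection step described in (A) might be computed along the following lines. This is a hedged sketch and not the authors' implementation: the function names are hypothetical, and R² is assumed to be the coefficient of determination, which the caption does not specify.

```python
import numpy as np

def edge_targets(activity, edges):
    """Per-region target for each (parent, child) edge: child minus parent.

    activity: dict mapping a cell-type label to an array of per-region values.
    edges: list of (parent, child) label pairs from the differentiation tree.
    """
    return {
        (parent, child): np.asarray(activity[child], dtype=float)
        - np.asarray(activity[parent], dtype=float)
        for parent, child in edges
    }

def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def select_edges(targets, predictions, threshold=0.05):
    """Keep edges whose predictions reach R^2 above the threshold."""
    return [
        edge for edge in targets
        if r_squared(targets[edge], predictions[edge]) > threshold
    ]
```

Here `predictions` is assumed to be a dict with the same edge keys as `targets`, holding the model output for each transition.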
