BMC Bioinformatics. 2024 Apr 30;25(1):170.

doi: 10.1186/s12859-024-05787-6.

Assessing the reliability of point mutation as data augmentation for deep learning with genomic data

Hyunjung Lee^#¹, Utku Ozbulak^#², Homin Park²,³, Stephen Depuydt⁴, Wesley De Neve²,³, Joris Vankerschaver⁵,⁶

Affiliations

¹ Korea University, Seoul, South Korea.
² Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea.
³ IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium.
⁴ Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium.
⁵ Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea. joris.vankerschaver@ghent.ac.kr.
⁶ Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium. joris.vankerschaver@ghent.ac.kr.

^# Contributed equally.

PMID: 38689247
PMCID: PMC11059627
DOI: 10.1186/s12859-024-05787-6

Hyunjung Lee et al. BMC Bioinformatics. 2024.

Abstract

Background: Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data.

Results: Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection.
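To make the idea of codon-level point mutations concrete, the following is a minimal sketch of silent-mutation augmentation. It is not the authors' implementation: the function names `silent_mutations` and `translate` are hypothetical, and the sketch assumes an uppercase DNA coding sequence whose reading frame starts at index 0 and is translated with the standard genetic code. Each selected codon is replaced by a synonymous codon that differs in exactly one base, so the encoded protein is unchanged.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate complete codons of seq (frame assumed to start at index 0)."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def synonymous_alternatives(codon):
    """Codons encoding the same amino acid that differ from codon in one base."""
    amino_acid = CODON_TABLE[codon]
    return [
        other
        for other, aa in CODON_TABLE.items()
        if aa == amino_acid and sum(x != y for x, y in zip(other, codon)) == 1
    ]


def silent_mutations(seq, n_mutations, rng):
    """Apply up to n_mutations silent point mutations, each in a distinct codon."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    tail = seq[usable:]  # incomplete trailing codon is left untouched
    # Only codons with at least one single-base synonymous alternative qualify;
    # codons with ambiguous bases (e.g. 'N') are skipped.
    candidates = [
        i for i, c in enumerate(codons)
        if c in CODON_TABLE and synonymous_alternatives(c)
    ]
    for i in rng.sample(candidates, min(n_mutations, len(candidates))):
        codons[i] = rng.choice(synonymous_alternatives(codons[i]))
    return "".join(codons) + tail


original = "ATGGCTGAAAAACTG"  # encodes M-A-E-K-L
augmented = silent_mutations(original, 2, random.Random(0))
print(original, augmented, translate(augmented))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which matters when an experiment is repeated with different seeds.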

Conclusion: Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
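The distinction between silent, missense, and nonsense mutations can be illustrated with a short sketch that classifies a single-base substitution within one codon. The function name `classify_point_mutation` and the extra `"stop_lost"` label are assumptions made for illustration, and the standard genetic code is assumed; this is not code from the paper.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_point_mutation(codon, position, new_base):
    """Classify substituting new_base at position (0, 1, or 2) of codon."""
    if new_base == codon[position]:
        raise ValueError("new_base must differ from the original base")
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "silent"      # same amino acid (or stop codon stays a stop codon)
    if after == "*":
        return "nonsense"    # amino acid codon becomes a premature stop codon
    if before == "*":
        return "stop_lost"   # stop codon becomes an amino acid codon
    return "missense"        # one amino acid is replaced by another


print(classify_point_mutation("GAA", 2, "G"))  # GAA (E) -> GAG (E): silent
print(classify_point_mutation("GAA", 0, "A"))  # GAA (E) -> AAA (K): missense
print(classify_point_mutation("GAA", 0, "T"))  # GAA (E) -> TAA (stop): nonsense
```

A filter built on such a classifier would let an augmentation pipeline keep silent and missense substitutions while rejecting nonsense ones, in line with the conclusion above.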

Keywords: Data augmentation; Deep learning; Point mutations; Splicing; Translation initiation.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

**Fig. 1**
Arrangement of coding and non-coding regions in the six datasets used for TIS or splice site detection. Each sequence consists of either a TIS or a splice site, flanked by a coding or non-coding region of fixed length, as indicated in the figure

**Fig. 2**
Changes in neural network performance for each of the five datasets after introducing up to 10 point mutations of different types in the coding region. Applying a moderate number of silent and missense mutations improves the performance, while large numbers of missense and nonsense mutations are generally detrimental

**Fig. 3**
Changes in neural network performance for each of the five datasets after introducing up to 10 random mutations in the non-coding region. Applying random mutations generally has a detrimental effect on performance

**Fig. 4**
Effect of applying different mutation types for each dataset under comparison. Up to three mutations are applied, since higher mutation counts generally have a detrimental effect. The vertical dashed line indicates the median accuracy for the baseline case, in which no mutations are applied. Dots in the scatter plot indicate repetitions of the same experiment, with a different random seed, as explained in the body of the text

Grants and funding

BOF/STA/202109/039/Universiteit Gent
