3Cnet: pathogenicity prediction of human variants using multitask learning with evolutionary constraints

Dhong-Gun Won¹, Dong-Wook Kim¹, Junwoo Woo¹, Kyoungyeul Lee¹

PMID: 34270679
PMCID: PMC8665754
DOI: 10.1093/bioinformatics/btab529

Dhong-Gun Won et al. Bioinformatics. 2021 Dec 11;37(24):4626-4634. doi: 10.1093/bioinformatics/btab529.

Affiliation

¹ Research and Development Center, 3billion, Seoul 06193, Republic of Korea.

Abstract

Motivation: Improvements in next-generation sequencing have enabled genome-based diagnosis for patients with genetic diseases. However, accurate interpretation of human variants requires knowledge from a number of clinical cases. In addition, manual analysis of each variant detected in a patient's genome requires enormous time and effort. To reduce the cost of diagnosis, various computational tools have been developed to predict the pathogenicity of human variants, but the shortage and bias of available clinical data can lead to overfitting of algorithms.

Results: We developed a pathogenicity predictor, 3Cnet, that uses recurrent neural networks to analyze the amino acid context of human variants. As 3Cnet is trained on simulated variants reflecting evolutionary conservation and clinical data, it can find disease-causing variants in patient genomes with 2.2 times greater sensitivity than currently available tools, more effectively discovering pathogenic variants and thereby improving diagnosis rates.

Availability and implementation: Codes (https://github.com/KyoungYeulLee/3Cnet/) and data (https://zenodo.org/record/4716879#.YIO-xqkzZH1) are freely available to non-commercial users.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Architecture of the recurrent neural network trained to predict variant pathogenicity from protein sequences. (a) The feature extractor contains two parallel layers of long short-term memory (LSTM) networks. The features of the wild-type sequence, the mutated sequence and the multiple sequence alignment (MSA) are merged and processed into a single vector, referred to as the extracted features. (b) The pathogenicity classifier is composed of two fully connected layers that determine the binary pathogenicity of a variant from the extracted features. (c) The pathogenicity classifier uses both SNVBox features and the extracted features, combining the two through sigmoid activation followed by concatenation.
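Panel (c) can be read as a simple two-step merge. The sketch below is one possible reading, not the authors' implementation: it assumes each feature vector is passed element-wise through a sigmoid before the two are concatenated. The function name `combine_features` and the example values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def combine_features(extracted, snvbox):
    """Assumed reading of Fig. 1c: squash each vector, then concatenate.

    `extracted` stands in for the LSTM-derived feature vector and
    `snvbox` for the SNVBox feature vector; both are plain lists here.
    """
    squashed_extracted = [sigmoid(v) for v in extracted]
    squashed_snvbox = [sigmoid(v) for v in snvbox]
    return squashed_extracted + squashed_snvbox

# Made-up illustrative values, not real features.
combined = combine_features([0.0, 2.0], [-1.0])
```

Squashing both inputs to the same (0, 1) range before concatenation keeps features from two differently scaled sources comparable when they reach the fully connected layers.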

**Fig. 2.**
Multi-task learning between clinical data and conservation data. The feature extractor is trained on all data sources, including clinical data from the ClinVar database, common variants from the gnomAD database, and conservation data generated from the UniRef database. The extracted features therefore become common features shared across the different types of data. In contrast, the pathogenicity classifiers are separate for each data type: one classifier is trained on the clinical data and the common variants, while the other is trained on the conservation data. After training, the pathogenicity of a variant is determined by the classifier trained on clinical data.
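The training scheme in this caption corresponds to what is usually called hard parameter sharing: one shared extractor, one classifier head per data type. The toy below is a minimal sketch of that idea only, assuming a single scalar weight in place of the LSTM extractor and one-parameter heads in place of the fully connected classifiers. All names (`TinyMultiTask`, `SOURCE_TO_HEAD`, the source labels) and the data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMultiTask:
    """Toy model: one shared extractor weight, one (weight, bias) head per task."""

    def __init__(self):
        self.w = 0.5  # shared "feature extractor" parameter
        self.heads = {"clinical": [0.1, 0.0], "conservation": [0.1, 0.0]}

    def forward(self, x, task):
        h = math.tanh(self.w * x)      # shared extracted feature
        v, c = self.heads[task]
        return h, sigmoid(v * h + c)   # task-specific prediction

    def sgd_step(self, x, y, task, lr=0.1):
        h, p = self.forward(x, task)
        v, c = self.heads[task]
        err = p - y                    # gradient of cross-entropy w.r.t. the logit
        # Only the head matching the example's data type is updated.
        self.heads[task] = [v - lr * err * h, c - lr * err]
        # The shared extractor receives gradient from every task.
        self.w -= lr * err * v * (1.0 - h * h) * x

# ClinVar and common gnomAD variants feed the clinical head;
# simulated conservation data feeds the other head.
SOURCE_TO_HEAD = {"clinvar": "clinical", "gnomad_common": "clinical",
                  "conservation": "conservation"}

def train(model, examples, epochs=50):
    for _ in range(epochs):
        for source, x, y in examples:
            model.sgd_step(x, y, SOURCE_TO_HEAD[source])

def predict_pathogenicity(model, x):
    # After training, only the clinical head is used for scoring.
    return model.forward(x, "clinical")[1]

# Made-up one-dimensional examples: (source, input, label).
examples = [("clinvar", 2.0, 1), ("gnomad_common", -2.0, 0),
            ("conservation", 1.5, 1), ("conservation", -1.5, 0)]
model = TinyMultiTask()
train(model, examples)
```

The point of the design is visible in `sgd_step`: conservation examples never touch the clinical head, yet they still shape the shared parameter that the clinical head reads from.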

**Fig. 3.**
Cross-validation on internal ClinVar variants for different models using the recurrent neural network. (a) PR curve for cross-validation. Conservation (solid yellow) indicates the model trained on conservation data. ClinVar (dashed blue) indicates the model trained using only clinical data from the ClinVar database. ClinVar+Common (dotted blue) indicates the model trained on ClinVar data together with the common variants. The multi-task (solid blue) model is trained by multi-task learning between clinical data and conservation data. 3Cnet (solid magenta) is the model trained by multi-task learning with pathogenicity classifiers that use SNVBox features. (b) ROC curve for cross-validation.

**Fig. 4.**
Validation performance on external ClinVar variants and comparison with other pathogenicity prediction tools. (a) PR curve for external validation. 3Cnet showed the best performance on the independent clinical data. REVEL, an ensemble model using scores from many prediction tools, was second best, followed by VEST4, which uses SNVBox features for prediction. PrimateAI performed best among previous deep learning-based algorithms. (b) ROC curve for external validation.

**Fig. 5.**
External validation performance for non-synonymous variants including start lost, stop gain, deletion and frameshift variants. (a) PR curve for non-synonymous variants. (b) ROC curve for non-synonymous variants. (c) ROC curve for non-synonymous variants except for missense variants.

**Fig. 6.**
Discriminating disease-causing variants from other missense variants in the patient genome. (a) PR curve for classifying disease-causing variants and non-causal variants. (b) The top-k recall rate is the probability of finding the true disease-causing variant(s) among the k top-ranked variants when variants are ranked by prediction score. This number is important because the diagnosis rate of patients is closely related to the recall rate. (c) Score distribution of different algorithms for disease-causing variants and non-causal variants. A smaller uncertain area, where disease-causing and non-causal variants receive similar scores, indicates a higher resolution of the scoring scheme for separating these variants.
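The top-k recall rate in panel (b) can be computed per patient and then averaged. The sketch below is a minimal illustration under stated assumptions: each patient is represented as a dictionary of variant scores plus a set of known causal variant IDs, and a patient counts as a hit when at least one causal variant falls within the top k. The function name, the data layout and the example values are hypothetical and are not taken from the paper.

```python
def top_k_recall(patients, k):
    """Fraction of patients with a causal variant ranked in the top k by score.

    `patients` is a list of (scores, causal) pairs, where `scores` maps a
    variant ID to its predicted pathogenicity score and `causal` is the set
    of variant IDs known to cause that patient's disease.
    """
    hits = 0
    for scores, causal in patients:
        # Rank variant IDs from highest to lowest predicted score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if any(variant in causal for variant in ranked[:k]):
            hits += 1
    return hits / len(patients)

# Made-up scores for two hypothetical patients.
patients = [
    ({"v1": 0.9, "v2": 0.2, "v3": 0.5}, {"v1"}),
    ({"a": 0.3, "b": 0.8, "c": 0.6}, {"c"}),
]
recall_at_1 = top_k_recall(patients, 1)  # 0.5: only the first patient is a hit
recall_at_2 = top_k_recall(patients, 2)  # 1.0: both causal variants are in the top 2
```

Because a clinician typically reviews only the highest-ranked candidates, recall at small k tracks how often the causal variant would actually be seen during review.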
