Bioinformatics. 2021 Dec 11;37(24):4626-4634.
doi: 10.1093/bioinformatics/btab529.

3Cnet: pathogenicity prediction of human variants using multitask learning with evolutionary constraints

Dhong-Gun Won et al. Bioinformatics. 2021.

Abstract

Motivation: Improvements in next-generation sequencing have enabled genome-based diagnosis for patients with genetic diseases. However, accurate interpretation of human variants requires knowledge from a number of clinical cases. In addition, manual analysis of each variant detected in a patient's genome requires enormous time and effort. To reduce the cost of diagnosis, various computational tools have been developed to predict the pathogenicity of human variants, but the shortage and bias of available clinical data can lead to overfitting of algorithms.

Results: We developed a pathogenicity predictor, 3Cnet, that uses recurrent neural networks to analyze the amino acid context of human variants. As 3Cnet is trained on simulated variants reflecting evolutionary conservation and clinical data, it can find disease-causing variants in patient genomes with 2.2 times greater sensitivity than currently available tools, more effectively discovering pathogenic variants and thereby improving diagnosis rates.

Availability and implementation: Code (https://github.com/KyoungYeulLee/3Cnet/) and data (https://zenodo.org/record/4716879#.YIO-xqkzZH1) are freely available to non-commercial users.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
The architecture of the recurrent neural network that learns variant pathogenicity from protein sequences. (a) The feature extractor contains two parallel layers of long short-term memory (LSTM) networks. The features of the wild-type sequence, the mutated sequence and the MSA are merged and processed into a single vector, referred to as the extracted features. (b) The pathogenicity classifier is composed of two fully connected layers that determine the binary pathogenicity of a variant from the extracted features. (c) The pathogenicity classifier uses both SNVBox features and the extracted features, combining them through sigmoid activation followed by concatenation.
Fig. 2.
Multi-task learning between clinical data and conservation data. The feature extractor is trained by all data sources, including clinical data from the ClinVar database, common variants from the gnomAD database, and conservation data generated based on the UniRef database. Therefore, the extracted features become common features for different types of data. In contrast, pathogenicity classifiers are separated for specific data types. The clinical data and the common variants are used to train the pathogenicity classifier, while the other pathogenicity classifier is trained by conservation data. After training, the pathogenicity of a variant is determined by the classifier trained by clinical data
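The routing described in this caption (one shared feature extractor, two source-specific classifiers, and only the clinical classifier used after training) can be illustrated with a minimal sketch. It is not the 3Cnet implementation: the toy tanh layer stands in for the LSTM feature extractor, training is omitted, and every name here (SharedExtractor, Head, MultiTaskModel, HEAD_FOR_SOURCE, the weights) is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, x))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class SharedExtractor:
    """Toy stand-in for the shared feature extractor (assumption: one tanh layer)."""

    def __init__(self, weights: List[List[float]]):
        self.weights = weights

    def __call__(self, x: Sequence[float]) -> List[float]:
        return [math.tanh(dot(row, x)) for row in self.weights]


class Head:
    """Toy pathogenicity classifier: a single logistic unit on the extracted features."""

    def __init__(self, weights: List[float], bias: float):
        self.weights = weights
        self.bias = bias

    def __call__(self, feats: Sequence[float]) -> float:
        return sigmoid(dot(self.weights, feats) + self.bias)


# Hypothetical mapping from data source to classifier, following the caption:
# ClinVar and common (gnomAD) variants share one head; conservation data has its own.
HEAD_FOR_SOURCE: Dict[str, str] = {
    "clinvar": "clinical",
    "gnomad_common": "clinical",
    "conservation": "conservation",
}


class MultiTaskModel:
    def __init__(self, extractor: SharedExtractor, heads: Dict[str, Head]):
        self.extractor = extractor
        self.heads = heads

    def forward(self, x: Sequence[float], source: str) -> float:
        # Every source passes through the same extractor, then its own head.
        feats = self.extractor(x)
        return self.heads[HEAD_FOR_SOURCE[source]](feats)

    def predict(self, x: Sequence[float]) -> float:
        # After training, only the clinical head produces the final score.
        return self.heads["clinical"](self.extractor(x))


model = MultiTaskModel(
    SharedExtractor([[0.5, -0.2], [0.1, 0.3]]),
    {"clinical": Head([1.0, -1.0], 0.0), "conservation": Head([0.5, 0.5], 0.1)},
)
x = [1.0, 2.0]
print(model.forward(x, "conservation"), model.predict(x))
```

In a real training loop, the loss from either head would update the shared extractor, which is what makes the extracted features common to all data types.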
Fig. 3.
Cross-validation of internal ClinVar variants for different models using the recurrent neural network. (a) PR curve for cross-validation. Conservation (solid yellow) indicates the model trained by conservation data. ClinVar (dashed blue) indicates the model trained using only clinical data from the ClinVar database. ClinVar+Common (dotted blue) indicates the model trained by ClinVar data along with the common variant. The multi-task (solid blue) model is trained by multi-task learning between clinical data and conservation data. 3Cnet (solid magenta) is the model trained by multi-task learning with pathogenicity classifiers that utilize SNVBox features. (b) ROC curve for cross-validation
Fig. 4.
Validation performance of external ClinVar variants and comparison with other pathogenicity prediction tools. (a) PR curve for external validation. 3Cnet showed the best performance for the independent clinical data. REVEL, an ensemble model using scores from many prediction tools, was the second best followed by VEST4, which utilizes SNVBox features for prediction. The performance of PrimateAI was the best among previous deep learning-based algorithms. (b) ROC curve for external validation
Fig. 5.
External validation performance for non-synonymous variants including start lost, stop gain, deletion and frameshift variants. (a) PR curve for non-synonymous variants. (b) ROC curve for non-synonymous variants. (c) ROC curve for non-synonymous variants except for missense variants
Fig. 6.
Discriminating disease-causing variants from other missense variants in the patient genome. (a) PR curve for classifying disease-causing variants and non-causal variants. (b) The top-k recall rate is the probability of finding the true disease-causing variant(s) among the k top-ranked variants when variants are ordered by prediction score. This number is important because the diagnosis rate of patients is closely related to the recall rate. (c) Score distribution of different algorithms for disease-causing variants and non-causal variants. A smaller uncertain area, where disease-causing and non-causal variants receive similar scores, indicates a higher resolution of the scoring scheme for separating these variants.
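The top-k recall rate in panel (b) of Fig. 6 can be made concrete with a short sketch. It assumes one plausible reading of the caption: a patient counts as a hit when at least one true disease-causing variant is among that patient's k highest-scoring variants. The function name, data layout and toy scores below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, TypedDict


class Patient(TypedDict):
    scores: Dict[str, float]  # variant id -> predicted pathogenicity score
    causal: Set[str]          # ids of the true disease-causing variant(s)


def top_k_recall(patients: List[Patient], k: int) -> float:
    """Fraction of patients with a causal variant among their k top-scored variants."""
    hits = 0
    for patient in patients:
        # Rank this patient's variants from highest to lowest score.
        ranked = sorted(patient["scores"], key=patient["scores"].get, reverse=True)
        if set(ranked[:k]) & patient["causal"]:
            hits += 1
    return hits / len(patients)


# Toy data (made up for illustration only).
patients: List[Patient] = [
    {"scores": {"v1": 0.91, "v2": 0.40, "v3": 0.15}, "causal": {"v1"}},
    {"scores": {"v4": 0.55, "v5": 0.72, "v6": 0.30}, "causal": {"v4"}},
    {"scores": {"v7": 0.20, "v8": 0.65, "v9": 0.80}, "causal": {"v7"}},
]

for k in (1, 2, 3):
    print(k, top_k_recall(patients, k))
```

On this toy data the causal variant is ranked first for one patient, second for another and third for the last, so recall rises with k until every patient is covered at k = 3.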