Genome Med. 2024 Jan 8;16(1):3.

doi: 10.1186/s13073-023-01274-4.

MAGPIE: accurate pathogenic prediction for multiple variant types using machine learning approach

Yicheng Liu^#¹ ², Tianyun Zhang^#¹, Ningyuan You¹, Sai Wu³ ⁴, Ning Shen⁵

Affiliations

¹ Department of Hepatobiliary and Pancreatic Surgery, First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China.
² College of Computer Science, Zhejiang University, Yuquan Campus, Zhejiang University, Rd Zheda 38, Xihu District, Hangzhou, 310007, China.
³ Department of Hepatobiliary and Pancreatic Surgery, First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China. wusai@zju.edu.cn.
⁴ College of Computer Science, Zhejiang University, Yuquan Campus, Zhejiang University, Rd Zheda 38, Xihu District, Hangzhou, 310007, China. wusai@zju.edu.cn.
⁵ Department of Hepatobiliary and Pancreatic Surgery, First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China. shenningzju@zju.edu.cn.

^# Contributed equally.

PMID: 38185709
PMCID: PMC10773112
DOI: 10.1186/s13073-023-01274-4

Yicheng Liu et al. Genome Med. 2024.

Erratum in

Correction: Genome Med 15, 115 & Genome Med 16, 3.
Shen N. Genome Med. 2024 May 14;16(1):68. doi: 10.1186/s13073-024-01343-2. PMID: 38745249.

Abstract

Identifying pathogenic variants among the vast amount of nucleotide variation remains a challenge. We present the Multimodal Annotation Generated Pathogenic Impact Evaluator (MAGPIE), a method that predicts the pathogenicity of multiple variant types. MAGPIE is trained on the ClinVar dataset and demonstrates superior performance on both the independent test set and multiple orthogonal validation datasets. Notably, MAGPIE performs best in predicting the pathogenicity of rare variants and on highly imbalanced datasets. Overall, the results underline the robustness of MAGPIE as a tool for predicting pathogenicity across various types of human genome variation. MAGPIE is available at https://github.com/shenlab-genomics/magpie.

Keywords: Genomic variation; Machine learning; Multimodal annotation; Pathogenic prediction.

Conflict of interest statement

The authors have submitted a patent application for the method. Other than this, the authors declare that they do not have any competing interests.

Figures

**Fig. 1**
Framework of MAGPIE. The model was trained to predict pathogenic scores of multi-type variants and included three steps. First, candidate variants were annotated with high-dimensional features covering six different modalities. Second, automatic feature engineering and separated feature selection were undertaken step by step. Finally, a gradient boosting method with controllable tuning was implemented to train the model and obtain predictions for the pathogenicity of variants
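
The final step of the framework, gradient boosting, can be illustrated with a toy sketch. The code below is not MAGPIE's implementation: it is a minimal pure-Python gradient boosting classifier built from one-split decision stumps, and the two features (`conservation`, `allele_frequency`), the toy values, and all function names are hypothetical. It omits the annotation, feature engineering, and feature selection steps entirely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Find the single-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between adjacent values
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    _, j, t, lm, rm = best
    return j, t, lm, rm

def train_gbm(X, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on the negative gradient of the logistic loss."""
    p = sum(y) / len(y)
    base = math.log(p / (1.0 - p))  # start from the log-odds of the base rate
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # For logistic loss the negative gradient is (label - predicted probability).
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        scores = [s + learning_rate * (lm if row[j] <= t else rm)
                  for row, s in zip(X, scores)]
    return base, learning_rate, stumps

def predict_proba(model, row):
    """Return the predicted probability that a variant is pathogenic."""
    base, lr, stumps = model
    s = base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)
    return sigmoid(s)

# Hypothetical toy data: [conservation, allele_frequency]; 1 = pathogenic, 0 = benign.
X = [[0.90, 0.0001], [0.80, 0.0005], [0.85, 0.001],
     [0.20, 0.2], [0.10, 0.3], [0.30, 0.05]]
y = [1, 1, 1, 0, 0, 0]

model = train_gbm(X, y)
p_pathogenic_like = predict_proba(model, [0.88, 0.0002])
p_benign_like = predict_proba(model, [0.15, 0.25])
```

Each round fits a stump to the current residuals and adds a damped copy of it to the running score, which is the core idea behind gradient boosting; production libraries add regularization, deeper trees, and missing-value handling on top of this.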

**Fig. 2**
Feature importance and correlation. A Correlation between the features used to train MAGPIE. B Hierarchical diagram of the relationship between features and the categories they belong to. Each dot in the outermost layer represents a distinct feature, and the size of the dot indicates its importance. The second layer depicts feature categories, with size reflecting the summed importance of the subordinate features. C The bar plot shows the contribution of each feature after feature selection. Some of the add-on features are automatically removed during the training process

**Fig. 3**
MAGPIE makes accurate predictions. A The pie chart shows the proportion of pathogenic and benign variants in the independent test set, and the bar plot shows the percentages of the different variant types in the dataset. B Receiver operating characteristic curves of MAGPIE and 14 other prediction tools on the independent test set. The area under the curve (AUC) scores are shown in the bar plot. C Precision-recall curves of MAGPIE and 14 other prediction tools on the ClinVarTest dataset. D Missing rate comparison of MAGPIE and 14 other prediction tools on the independent test set. A higher missing rate means that a tool fails to produce a pathogenic score for a larger number of candidate variants. E AUC comparison of MAGPIE and 14 other prediction tools on ClinVarRare, which includes only variants with AF < 0.01. F AUBPRC comparison of MAGPIE and 14 other prediction tools on ClinVarRare, which includes only variants with AF < 0.01. G Percentages of predictable variants across different variant types for the various tools. H Violin plots show the distributions of pathogenic scores, and bar plots show the precision in each category of pathogenic and benign variants
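
The quantities named in this caption (missing rate, the AF < 0.01 rare-variant filter, and AUC) can be made concrete with a small sketch. The records and field names below are invented for illustration, and scoring each tool only on the variants it could predict is an assumed convention here, not necessarily the one used in the paper.

```python
def missing_rate(scores):
    """Fraction of variants for which a tool returned no score (None)."""
    return sum(s is None for s in scores) / len(scores)

def roc_auc(labels, scores):
    """AUC as the probability that a random pathogenic variant outscores
    a random benign one, with ties counted as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical variants: label 1 = pathogenic, 0 = benign; af = allele frequency;
# score = a tool's pathogenicity score, or None if the tool could not score it.
variants = [
    {"label": 1, "af": 0.0001, "score": 0.95},
    {"label": 1, "af": 0.002,  "score": 0.70},
    {"label": 0, "af": 0.005,  "score": 0.80},
    {"label": 0, "af": 0.15,   "score": 0.10},
    {"label": 1, "af": 0.0003, "score": None},
    {"label": 0, "af": 0.004,  "score": 0.20},
]

# Keep only rare variants, mirroring the AF < 0.01 cut-off in the caption.
rare = [v for v in variants if v["af"] < 0.01]

# Missing rate over all rare variants; AUC over the scored subset only.
rare_missing = missing_rate([v["score"] for v in rare])
scored = [v for v in rare if v["score"] is not None]
rare_auc = roc_auc([v["label"] for v in scored], [v["score"] for v in scored])

print(f"rare variants: {len(rare)}, missing rate: {rare_missing:.2f}, AUC: {rare_auc:.2f}")
# prints "rare variants: 5, missing rate: 0.20, AUC: 0.75"
```

Reporting the missing rate alongside AUC matters because a tool that scores only the easy variants can show a high AUC while leaving many candidates unpredicted.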

**Fig. 4**
MAGPIE outperforms other models on the orthogonal validation set and the ACMG-guided dataset. A The pie chart shows the proportion of pathogenic and benign variants in the orthogonal validation set, and the bar plot shows the percentages of the different variant types in the dataset. B Receiver operating characteristic curves of MAGPIE and 14 other prediction tools on the orthogonal validation set. C Precision-recall curves of MAGPIE and 14 other prediction tools on the orthogonal validation set. D Missing rate comparison of MAGPIE and 14 other prediction tools on the orthogonal validation set. A higher missing rate means that a tool fails to produce a pathogenic score for a larger number of candidate variants. E AUC comparison of MAGPIE and 14 other prediction tools on SwissProtRare, which includes only variants with AF < 0.01. F AUBPRC comparison of MAGPIE and 14 other prediction tools on SwissProtRare, which includes only variants with AF < 0.01. G The pie chart shows the proportion of pathogenic and benign variants in the ACMG-guided dataset, and the bar plot shows the percentages of the different variant types in the dataset. H Performance comparison of MAGPIE and 14 other prediction tools on the ACMG-guided dataset. I Precision-recall curves of MAGPIE and 14 other prediction tools on the ACMG-guided dataset

**Fig. 5**
MAGPIE detects most variants in pathogenic genes. A Comparison of the number of pathogenic variants detected in four well-known pathogenic genes by MAGPIE and 14 other prediction tools. B Density plots show the distributions of pathogenic scores predicted by MAGPIE, MutationTaster, and VEST4. The pie charts show the proportion of predictable and unpredictable variants for each tool
